Tag Archives: storage

Automating AWS service logs table creation and querying them with Amazon Athena

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/automating-aws-service-logs-table-creation-and-querying-them-with-amazon-athena/

I was working with a customer who was just getting started using AWS, and they wanted to understand how to query their AWS service logs that were being delivered to Amazon Simple Storage Service (Amazon S3). I introduced them to Amazon Athena, a serverless, interactive query service that allows you to easily analyze data in Amazon S3 and other sources. Together, we used Athena to query service logs, and were able to create tables for AWS CloudTrail logs, Amazon S3 access logs, and VPC flow logs. As I was walking the customer through the documentation and creating tables and partitions for each service log in Athena, I thought there had to be an easier and faster way to allow customers to query their logs in Amazon S3, which is the focus of this post.

This post demonstrates how to use AWS CloudFormation to automatically create AWS service log tables, partitions, and example queries in Athena. We also use the SQL query editor in Athena to query the AWS service log tables that AWS CloudFormation created.

Athena best practices

This solution is appropriate for ad hoc use and queries the raw log files. These raw files can range from compressed JSON to uncompressed text formats, depending on how they were configured to be sent to Amazon S3. If you need to query over hundreds of GBs or TBs of data per day in Amazon S3, performing ETL on your raw files and transforming them to a columnar file format like Apache Parquet can lead to increased performance and cost savings. You can save on your Amazon S3 storage costs by using snappy compression for Parquet files stored in Amazon S3. To learn more about Athena best practices, see Top 10 Performance Tuning Tips for Amazon Athena.

Table partition strategies

There are a few important considerations when deciding how to define your table partitions. Mainly you should ask: what types of queries will I be writing against my data in Amazon S3? Do I only need to query data for that day and for a single account, or do I need to query across months of data and multiple accounts? In this post, we talk about how to query across a single, partitioned account.

By partitioning data, you can restrict the amount of data scanned per query, thereby improving performance and reducing cost. When creating a table schema in Athena, you set the location of where the files reside in Amazon S3, and you can also define how the table is partitioned. The location is a bucket path that leads to the desired files. If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only for that partition. For more information, see Table Location in Amazon S3 and Partitioning Data. You can then define partitions in Athena that map to the data residing in Amazon S3.

Let’s look at an example to see how defining a location and partitioning our table can improve performance and reduce costs. In the following tree diagram, we’ve outlined what the bucket path may look like as logs are delivered to your S3 bucket, starting from the bucket name and going all the way down to the day.

In the following tree diagram, we’ve outlined what the bucket path may look like as logs are delivered to your S3 bucket

Outlined in red is where we set the location for our table schema, and Athena then scans everything after the CloudTrail folder. We then outlined our partitions in blue. This is where we can specify the granularity of our queries. In this case, we partition our table down to the day, which is very granular because we can tell Athena exactly where to look for our data. This is also the most performant and cost-effective option because it results in scanning only the required data and nothing else.

If you have to query multiple accounts and Regions, you should back off the location to AWSLogs and then create a non-partitioned CloudTrail table. This allows you to write queries across all your accounts and Regions, but the trade-off is that your queries take much longer and are more expensive due to Athena having to scan all the data that comes after AWSLogs every query. However, querying multiple accounts is beyond the scope of this post.

Prerequisites

Before you get started, you should have the following prerequisites:

  • Service logs already being delivered to Amazon S3
  • An AWS account with access to your service logs

Deploying the automated solution in your AWS account

The following steps walk you through deploying a CloudFormation template that creates saved queries for you to run (Create Table, Create Partition, and example queries for each service log).

  1. Choose Launch Stack:

  1. Choose Next.
  2. For Stack name, enter a name for your stack.

You don’t need to have every AWS service log that the template asks for. If you don’t have CloudFront logs for example, you can leave the PathParameter as is. If you need CloudFront logs in the future, you can simply update the Create Table statement with the correct Amazon S3 location in Athena.

  1. For each service log table you want to create, follow the steps below:
  • Replace <_BUCKET_NAME> with the name of your S3 bucket that holds each AWS service log. You can use the same bucket name if it’s used to hold more than one type of service log.
  • Replace <Prefix> with your own folder prefix in Amazon S3. If you don’t have a prefix, make sure to remove it from the path parameters.
  • Replace <ACCOUNT-ID> and <REGION> with desired account and region.

Choose Next.

  1. Choose Next.
  2. Enter any tags you wish to assign to the stack.
  3. Choose Next.
  4. Verify parameters are correct and choose Create stack at the bottom.

Verify the stack has been created successfully. The stack takes about 1 minute to create the resources.

Querying your tables

You’re now ready to start querying your service logs.

  1. On the Athena console, on the Saved queries tab, search for the service log you want to interact with.

On the Athena console, on the Saved queries tab, search for the service log you want to interact with.

  1. Choose Create Table – CloudTrail Logs to run the SQL statement in the Athena query editor.

Make sure the location for Amazon S3 is correct in your SQL statement and verify you have the correct database selected.

  1. Choose Run query or press Tab+Enter to run the query.

Choose Run query or press Tab+Enter to run the query.

The table cloudtrail_logs is created in the selected database. You can repeat this process to create other service log tables.

For partitioned tables like cloudtrail_logs, you must add partitions to your table before querying.

  1. On the Saved queries tab, choose Create Partition – CloudTrail.
  2. Update the Region, year, month, and day you want to partition. Choose Run query or press Tab+Enter to run the query.

Choose Run query or press Tab+Enter to run the query.

After you run the query, you have successfully added a partition to your cloudtrail_logs table. Let’s look at some of the example queries we can run now.

  1. On the Saved queries tab, choose Query – CloudTrail Logs.

This is a base template included to begin querying your CloudTrail logs.

  1. Highlight the query and choose Run query.

You can see the base query template uses the WHERE clause to leverage partitions that have been loaded.

You can see the base query template uses the WHERE clause to leverage partitions that have been loaded.

Let’s say we have a spike in API calls from AWS Lambda and we want to see the users that the calls were coming from in a specific time range as well as the count for each user. Our query looks like the following code:

SELECT useridentity.sessioncontext.sessionissuer.username as "User",
       count(eventname) as "Lambda API Calls"
FROM cloudtrail_logs
WHERE eventsource = 'lambda.amazonaws.com'
       AND eventtime BETWEEN '2020-11-24T18:00:00Z' AND '2020-11-24T21:00:00Z' 
group by useridentity.sessioncontext.sessionissuer.username
order by count(eventname) desc

Or if we wanted to check our S3 Access Logs to make sure only authorized users are accessing certain prefixes:

SELECT *
FROM s3_access_logs
WHERE key='prefix/images/example.jpg'
        AND requester != 'arn:aws:iam::accountid:user/username'

Cost of solution and cleaning up

Deploying the CloudFormation template doesn’t cost anything. You’re only charged for the amount of data scanned by Athena. Remember to use the best practices we discussed earlier when querying your data in Amazon S3. For more pricing information, see Amazon Athena pricing and Amazon S3 pricing.

To clean up the resources that were created, delete the CloudFormation stack you created earlier. This also deletes the saved queries in Athena.

Summary

In this post, we discussed how we can use AWS CloudFormation to easily create AWS service log tables, partitions, and starter queries in Athena by entering bucket paths as parameters. We used CloudTrail and Amazon S3 access logs as examples, but you can replicate these steps for other service logs that you may need to query by visiting the Saved queries tab in Athena. Feel free to check out the video as well, where I go over how we store logs in Amazon S3 and then give a quick demo on how to deploy the solution.

For more information about service logs, see Easily query AWS service logs using Amazon Athena.


About the Author

Michael Hamilton is a Solutions Architect at Amazon Web Services and is based out of Charlotte, NC. He has a focus in analytics and enjoys helping customers solve their unique use cases. When he’s not working, he loves going hiking with his wife, kids, and a 2-year-old German shepherd.

AWS PrivateLink for Amazon S3 is Now Generally Available

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/aws-privatelink-for-amazon-s3-now-available/

At AWS re:Invent, we pre-announced that AWS PrivateLink for Amazon S3 was coming soon, and soon has arrived — this new feature is now generally available. AWS PrivateLink provides private connectivity between Amazon Simple Storage Service (S3) and on-premises resources using private IPs from your virtual network.

Way back in 2015, S3 was the first service to add a VPC endpoint; these endpoints provide a secure connection to S3 that does not require a gateway or NAT instances. Our customers welcomed this new flexibility but also told us they needed to access S3 from on-premises applications privately over secure connections provided by AWS Direct Connect or AWS VPN.

Our customers are very resourceful and by setting up proxy servers with private IP addresses in their Amazon Virtual Private Clouds and using gateway endpoints for S3, they found a way to solve this problem. While this solution works, proxy servers typically constrain performance, add additional points of failure, and increase operational complexity.

We looked at how we could solve this problem for our customers without these drawbacks and PrivateLink for S3 is the result.

With this feature you can now access S3 directly as a private endpoint within your secure, virtual network using a new interface VPC endpoint in your Virtual Private Cloud. This extends the functionality of existing gateway endpoints by enabling you to access S3 using private IP addresses. API requests and HTTPS requests to S3 from your on-premises applications are automatically directed through interface endpoints, which connect to S3 securely and privately through PrivateLink.

Interface endpoints simplify your network architecture when connecting to S3 from on-premises applications by eliminating the need to configure firewall rules or an internet gateway. You can also gain additional visibility into network traffic with the ability to capture and monitor flow logs in your VPC. Additionally, you can set security groups and access control policies on your interface endpoints.

Available Now
PrivateLink for S3 is available in all AWS Regions. AWS PrivateLink is available at a low per-GB charge for data processed and a low hourly charge for interface VPC endpoints. We hope you enjoy using this new feature and look forward to receiving your feedback. To learn more, check out the PrivateLink for S3 documentation.

Try out AWS PrivateLink for Amazon S3 today, and happy storing.

— Martin

Ingesting Jira data into Amazon S3

Post Syndicated from Vishwa Gupta original https://aws.amazon.com/blogs/big-data/ingesting-jira-data-into-amazon-s3/

Consolidating data from a work management tool like Jira and integrating this data with other data sources like ServiceNow, GitHub, Jenkins, and Time Entry Systems enables end-to-end visibility of different aspects of the software development lifecycle and helps keep your projects on schedule and within budget.

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. Many of our customers choose to build their data lakes on Amazon S3. They find the flexible, pay-as-you-go, cloud model ideal when dealing with vast amounts of heterogeneous data.

This post discusses some of the use cases for ingesting Jira data into an Amazon S3 data lake, the ingestion data flow, and a conceptional approach to ingesting data. We also provide the relevant Python code.

Use cases

Business use cases for Jira data ingestion range from proactive project monitoring, detection and resolution of project effort and cost variances, and non-compliance of SDLC processes. In this section, we provide an inclusive but not exhaustive list of use cases and their benefits.

Cognitive project monitoring

Cognitive project monitoring use cases include the following:

  • Automated analytics – You can prevent and reduce project schedule and budget variances by proactively monitoring metrics by combining data from Jira, GitHub, Jenkins, and Time Entry Systems.
  • Automated status reporting – You can use Amazon SageMaker machine learning (ML) models to derive prescriptive metrics by looking at data across various sources. This could reduce a project manager’s time spent stitching data and generating reports, and provide a holistic view of project-tracking metrics.

Automated project compliance and governance

You can analyze user behavior to detect potentially suspicious patterns by building a baseline of user activity. You create this based on primary data from HR (such as role, location, and work hours) and IT infrastructure (such as an assigned asset’s IP address).

Possible business outcomes include the following:

  • Proactively identify user IDs and passwords being shared with other users
  • Detect insider threats, such as abnormal login times, unauthorized access to Jira, and incorrect access permissions in Jira, GitHub, Jenkins, or Time Entry Systems
  • Identify compromised accounts based on frequent logins from unassigned assets or unusual successive authentications
  • Identify theft of corporate IPs based on unusual printing volume, printing project-related documents, and emailing organization-related documents and code to external accounts

Accelerated migration from Jira to another project management product

You can also use AWS Glue and its Data Catalog metadata to map between two products. This could increase your data migration.

Overview of solution

One of the most common approaches to ingest data from Jira into AWS is to create a Python module, which is used in AWS Glue or AWS Lambda. The following diagram shows the high-level approach for an end-to-end solution. In this solution, Glue Development Endpoint and respective SageMaker Jupyter Notebook instance are used to create the Jira Python module to facilitate Jupyter notebook experience, interactive testing and debugging capability. The scope of this post is limited to the following steps:

  • Setting up access for Jira
  • Using the Python model with AWS Lambda or AWS Glue
  • Incrementally pulling changed data from Jira with JQL (Jira Query Language)
  • Ingesting data to the AWS serverless data lake

Ingesting data from Jira into Amazon S3

The Jira server exposes data using REST APIs and open authorization (OAuth) authentication methods. It uses a three-legged OAuth approach (also called the OAuth dance) to acquire access to the resources served by the APIs. For more information about the following steps, see OAuth for REST APIs.

Generating an RSA public/private key pair

Consumer key and consumer secret details are required for interacting with API endpoints. You store the details inside encrypted SSM parameters.

To use macOS or Linux, run the following OpenSSL commands in the terminal (anywhere in the file system):

openssl genrsa -out jira_privatekey.pem 1024
openssl req -newkey rsa:1024 -x509 -key jira_privatekey.pem -out jira_publickey.cer -days 365
openssl pkcs8 -topk8 -nocrypt -in jira_privatekey.pem -out jira_privatekey.pcks8
openssl x509 -pubkey -noout -in jira_publickey.cer > jira_publickey.pem

To use Windows, download OpenSSL and run it using the path to the bin folder. Create a new environment variable named OPENSSL_CONF and the value "path_to"\openssl.cnf. Run the command as admin:

"path_to_openssl"\bin\openssl genrsa -out jira_privatekey.pem 1024
"path_to_openssl"\bin\openssl req -newkey rsa:1024 -x509 -key jira_privatekey.pem -out jira_publickey.cer -days 365
"path_to_openssl"\bin\openssl pkcs8 -topk8 -nocrypt -in jira_privatekey.pem -out jira_privatekey.pcks8
"path_to_openssl"\bin\openssl x509 -pubkey -noout -in jira_publickey.cer > jira_publickey.pem

Configuring a REST API-based consumer in Jira

For full instructions on configuring your REST API-based consumer, see Step 2: Configure your client application as an OAuth consumer in OAuth for REST APIs. Be sure to complete the following steps:

  1. In the Link applications section, select Create incoming link.
  2. For Public key, enter the public key you created earlier.

Performing the OAuth dance

In this step, you go through the process of getting the access token from the resource so the consumer can access the resource.

  1. Create the following parameters in the AWS Systems Manager Parameter Store:
    1. jira_access_private_key – Stores the private key in AWS Systems Manager as a parameter.
    2. jira_access_urls – Stores URLs to access Jira. These URLs are constructed based on display URLs defined in JIRA by adding the additional tags:
      • request_token_urlhttps://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
      • access_token_urlhttps://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
      • authorize_urlhttps://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
      • data_urlhttps://jiratoawss3.atlassian.net/rest/api/2/search
    3. jira_access_secrets – Stores secrets to access Jira. Initially, only two values are present in this SSM parameter; it’s updated later with access_token. You need the following two parameters to start:
      1. consumer_key
      2. consumer_secret
  2. Download the notebook file and upload it to SageMaker notebook instance of AWS Glue Development Endpoint.:
    1. Below are the steps to set-up AWS Glue Development Endpoint
      1. In the AWS Glue console, choose Dev endpoints. Choose Add endpoint.
      2. Specify an endpoint name, such as demo-endpoint.
      3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Create an IAM Role for AWS Glue. Choose Next.
      4. In Networking, leave Skip networking information selected, and choose Next.
      5. In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-keygen (do not use an Amazon EC2 key pair). The generated public key will be imported into your development endpoint. Save the corresponding private key to later connect to the development endpoint using SSH. Choose Next. For more information, see ssh-keygen in Wikipedia.b. Once status of AWS Glue Development Endpoint is ready follow steps to set-up SageMaker notebooks with in your development endpoint.
    2. Once status of AWS Glue Development Endpoint is ready follow
    3. Once status shows Ready, open the notebook and upload the downloaded notebook file.

You now run the following cells.

  1. Install and import dependent modules with the following code:
    !pip install tlslite
    !pip install oauth2
    
    import urllib
    import oauth2 as oauth
    from tlslite.utils import keyfactory
    import json
    import sys
    import os
    import base64
    import boto3
    from boto3.dynamodb.conditions import Key, Attr
    import datetime
    import logging
    import pprint
    import time
    from pytz import timezone
    
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    

  2. Create an SSM client to connect parameters defined in Systems Manager (update the Region if it’s different than us-east-1):
    ssm = boto3.client("ssm", region_name='us-east-1')

  3. Define the signature class to sign the Jira REST API requests:
    class SignatureMethod_RSA_SHA1(oauth.SignatureMethod):
        name = 'RSA-SHA1'
    
        def signing_base(self, request, consumer, token):
            if not hasattr(request, 'normalized_url') or request.normalized_url is None:
                raise ValueError("Base URL for request is not set.")
    
            sig = (
                oauth.escape(request.method),
                oauth.escape(request.normalized_url),
                oauth.escape(request.get_normalized_parameters()),
            )
    
            key = '%s&' % oauth.escape(consumer.secret)
            if token:
                key += oauth.escape(token.secret)
            raw = '&'.join(sig)
            return key, raw
    
        def sign(self, request, consumer, token):
    
            key, raw = self.signing_base(request, consumer, token)
    
            # SSM support to fetch private key
            ssm_param = ssm.get_parameter(Name='jira_access_private_key', WithDecryption=True)
            jira_private_key_str = ssm_param['Parameter']['Value']
    
            privateKeyString = jira_private_key_str.strip()
    
            privatekey = keyfactory.parsePrivateKey(privateKeyString)
    
            # Used encode() to convert to bytes
            signature = privatekey.hashAndSign(raw.encode())
            return base64.b64encode(signature) 
    

  4. Get the consumer_key and consumer_secret from the SSM parameter that you defined earlier:
    jira_secrets = json.loads(ssm.get_parameter(Name='jira_access_secrets_1', WithDecryption=True)['Parameter']['Value'])
    jira_secrets
    
    consumer_key = jira_secrets["consumer_key"]
    consumer_key
    
    consumer_secret = jira_secrets["consumer_secret"]
    consumer_secret

  5. Define the URLs for request_token_url, access_token_url, and authorize_url:
    request_token_url = 'input_here'
    access_token_url = 'input_here'
    authorize_url = 'input_here'
    

    These URLs are defined while setting up the Rest API endpoint in Jira. It includes the following components:

    • request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
    • access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
    • authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
  6. Generate your request token:
    # Create Consumer using consumer_key and consumer_secret
    consumer = oauth.Consumer(consumer_key, consumer_secret)
    
    # Use Consumer to create oauth client
    client = oauth.Client(consumer)
    
    # Add Signature Method to the client
    client.set_signature_method(SignatureMethod_RSA_SHA1())
    
    # Get response from request token URL using the client
    resp, content = client.request(request_token_url, "POST")
    
    # Convert the content received from previous step into a Dictionary
    request_token = dict(urllib.parse.parse_qsl(content))
    
    # request token has two components oauth_token and oauth_token_secret
    request_token
    

You only need to do this one time every five years (the default setting in Jira).

The following is an example request token value:

b'oauth_token=oFUFV5cqOuoWycnaCXYrkcioHuRw2TbV&oauth_token_secret=CzhMoEsozCV3xFZ179YQoLzRu4DYQHlR'

The following are example values after the URL parse and converted to dict:

  • {b'oauth_token': b'oFUFV5cqOuoWycnaCXYrkcioHuRw2TbV ',
  • b'oauth_token_secret': b'CzhMoEsozCV3xFZ179YQoLzRu4DYQHlR'}
  1. Manually approve the request token by opening the following user in a browser:
    authorize_url + '?oauth_token=' + request_token[b'oauth_token'].decode()

An example value of the final authorized user is:

  • https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize?oauth_token=wYLlIxmcsnZTHgTy2ZpUmBakqzmqSbww.

When you go to the URL in your output, you see the following screenshot.

  1. Use an approved request token to generate an access token:
    # Create an oauth token using components of request token
    token = oauth.Token(request_token[b'oauth_token'], request_token[b'oauth_token_secret'])
    
    # Use Consumer and token to create oauth client
    client = oauth.Client(consumer, token)
    
    # Add Signature Method to the client
    client.set_signature_method(SignatureMethod_RSA_SHA1())
    
    # Get response from access token URL using the client
    access_token_resp, access_token_content = client.request(access_token_url, "POST")
    
    access_token_content

An example access token is:

b'oauth_token=Ym3UDrs1iYnLUZ1t0TkT1PinfJNN3RLj&oauth_token_secret=FYQfGjLLhbCJg3DXZFaKsE6wsURVfebN&oauth_expires_in=157680000&oauth_session_handle=BulouCOypjssDS3GzeY7Ldi30h0ERWDo&oauth_authorization_expires_in=160272000'

The following are example access token values after URL parse and converted to dict:

  • {b'oauth_token': b'Ym3UDrs1iYnLUZ1t0TkT1PinfJNN3RLj',
  • b'oauth_token_secret': b'FYQfGjLLhbCJg3DXZFaKsE6wsURVfebN',
  • b'oauth_expires_in': b'157680000',
  • b'oauth_session_handle': b'BulouCOypjssDS3GzeY7Ldi30h0ERWDo',
  • b'oauth_authorization_expires_in': b'160272000'}

oauth_authorization_expires_in states when the token expires (in seconds), which is generally 5 years.

  1. Update the access_token key in the SSM parameter jira_access_secrets with the value for access_token_content.

This access token is valid for 5 years (the expires_in key of access_token_content states when the token expires in seconds). Rotating the access key depends on your organization’s security policy and is out of scope for this post.

Using the access token, querying data from Jira, and storing data in Amazon S3

The following are the important points in this step:

  • The Jira REST API returns 50 records at a time, but gives a total record count, which is used to paginate through the result set.
  • The Jira REST API endpoint needs to be updated with JQL filters. JQL allows you to only pick changed records.
  • Data returned from the REST API endpoint is serialized to Amazon S3. The Python code batches the records from Jira pages and commits after every four pages (which is configurable) have been fetched from Jira.
  • JQL is appended to the data_url (defined earlier). When pulling data from Jira, it’s good practice to do a one-time bulk load and get incremental loads by maintaining the last data pull date in an Amazon DynamoDB The following screenshot shows an example of tracking dates in DynamoDB by projects. Because it’s an hourly batch, last_ingest_date is rounded up to the hour.
  • The key attributes of JQL used for constructing JQL and pulling data from Jira are:
    • project – Loop through the projects from DynamoDB and pull data for one project at a time.
    • updated – Last update date for Jira story or task. Data pull from Jira is based on this date.
      • For bulk loads, updated is less than or equal to the batch run date, rounded up to the hour.
      • For incremental loads, updated is greater than or equal to last_ingest_date from DynamoDB, and less than or equal to the batch run date, rounded up to the hour.
      • To use date in JQL, we need to parse the date before we use it.
    • startAt – Jira generally paginates the results every 50 records. This attribute is used to loop through the complete data. For example, if a project has 500 records and page size is 50 records, this attribute is incremented by page size in every iteration, and it takes 10 iterations to get complete data.
    • maxResults – This is the page size set up in Jira (maximum number of records Jira returns in every API call).
  • Use the provided notebook to perform the OAuth dance. The sample code pulls data from Jira based on the approach you specified earlier. The purpose of this code is to accelerate implementing data ingestion from Jira.

Cleaning up

To avoid incurring future charges, delete the resources set up as part of this post:

  • AWS Glue Development Endpoint
  • DynamoDB table
  • Systems Manager parameters
  • S3 bucket

Next steps

To extend the usability scope of Jira data in S3 buckets, you can crawl the location to create AWS Glue Data Catalog database tables. Registering the locations with AWS Lake Formation helps simplify permission management and allows you to implement fine-grained access control. You can also use Amazon Athena, Amazon Redshift, Amazon SageMaker, and Amazon QuickSight for data analysis, ML, and reporting services.

Conclusion

This post aims to simplify and accelerate the steps to ingest Jira data into Amazon S3. The solution includes Jira configuration, performing the three-legged OAuth dance, JQL-based attributes for data selection, and Python-based data extraction into Amazon S3.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Glue forum.


About the Authors

Vishwa Gupta is a Data and ML Engineer with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics platform and solutions. Outside of work, he enjoys spending time with family, traveling, and playing badminton.

 

 

 

Sreeram Thoom is a Data Architect at Amazon Web Services.

 

 

 

 

 

New – Amazon S3 Replication Adds Support for Multiple Destination Buckets

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/new-amazon-s3-replication-adds-support-for-multiple-destination-buckets/

Amazon Simple Storage Service (S3) supports many types of replication, including S3 Same-Region Replication (SRR), which launched in 2019 and S3 Cross-Region Replication (CRR), which has been around since 2015. Today, we are happy to announce S3 Replication support for multiple destination buckets. S3 Replication now gives you the ability to replicate data from one source bucket to multiple destination buckets. With S3 Replication (multi-destination) you can replicate data in the same AWS Regions using S3 SRR or across different AWS Regions by using S3 CRR, or a combination of both.

Before this launch, if you needed to have multiple copies of your data in different S3 buckets, you had to build your own S3 replication service by monitoring S3 events, identifying created objects, and using AWS Lambda functions to copy objects to each destination bucket.

This launch removes the need for you to develop your own solutions to replicate the data across multiple destinations. You can use the flexibility of S3 Replication (multi-destination) to store multiple copies of your data in different storage classes, with different encryption types, or across different accounts depending on its intended use. Additionally, when replicating to multiple destinations, you can use CloudWatch metrics to track replication progress for each region pair.

S3 Replication (multi-destination) is an extension to S3 Replication, and it supports all existing S3 Replication features like Replication Time Control (RTC) and delete marker replication. If you need a predictable replication time backed by a Service Level Agreement, you can use RTC to replicate objects in less than 15 minutes.

How to Get Started With S3 Replication (multi-destination)
In order to get S3 Replication working, all the buckets involved in the replication (source and destinations) must have bucket versioning enabled.

To setup S3 Replication (multi-destination), you need to define replication rules. You can create a new rule in the bucket Management page, under Replication Rules.

Screenshot of adding a rule

When creating a new replication rule, one very important step is to set up permissions for replication, as S3 will need to replicate objects on your behalf. To do that, you can follow the instructions available in the S3 documentation page.

To create the replication rule, just follow the steps in the console. You can specify to which objects of the bucket this rule applies, the destination bucket, if you want to change the storage class of the replicated objects and many other preferences for your replicated objects.

Screenshot configuring the replication rule

One thing to have in mind when activating a rule is that the replication will start for all new objects added to the bucket from that moment. Objects uploaded to the bucket before the rule was created need to be copied using one time operations like S3 batch operations or S3 copy.

If you want to monitor the progress of your replication using CloudWatch metrics, don’t forget to click the Replication metrics and notifications checkbox.

Screenshot of configuring replication rules metrics

Now that we support multiple destinations for replication, rule priorities are used when there are two or more rules with the same destination. When that happens, the rule with the highest priority will be applied. For the same destination bucket, a lower priority rule will not be applied when the replication configuration has two or more rules with overlapping scope. If there are two or more rules with the same scope and different destinations, both rules will be applied.

You can see a summary of all your rules in the Replication rules listing under the bucket Management page.

Screenshot of replication rules listing

Monitoring Replication
When you have all the rules configured, you can start uploading objects to the source bucket and monitor how they get replicated in all the different destinations.

To know the replication status of an object in the source bucket, you can see the Replication status in the object Details. The status types are:

  • COMPLETED: The replication was successful in all the destinations.
  • PENDING: The replication is still in progress.
  • FAILED: The replication failed to replicate in at least one of the destinations. When there is a failure in replication, the only way to fix it is by uploading the object again.

screenshot of object metadata

For replicated objects, you will see the REPLICA status under the Replication status.

You can also use CloudWatch metrics to monitor the replication. First, you need to enable metrics for each of the rules. And then in the bucket Metrics, you can choose which rules you want to see the metrics of and see the charts for each of them; the metrics are also available in the CloudWatch console.

Screenshot of replication metrics

Availability
S3 Replication (multi-destination) is available today in all AWS Regions. To get started, you can use the AWS Management Console, SDKs, S3 API, or AWS CloudFormation to create replication rules from one source bucket to multiple destination buckets.

Pricing for S3 Replication (multi-destination) applies for each rule. For pricing information, please visit the Amazon S3 pricing page.

For more information about this new feature visit the S3 Replication page.

Marcia

 

New – Amazon EBS gp3 Volume Lets You Provision Performance Apart From Capacity

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-ebs-gp3-volume-lets-you-provision-performance-separate-from-capacity-and-offers-20-lower-price/

Amazon Elastic Block Store (EBS) is an easy-to-use, high-performance block storage service designed for use with Amazon EC2 instances for both throughput and transaction-intensive workloads of all sizes. Using existing general purpose solid state drive (SSD) gp2 volumes, performance scales with storage capacity. By provisioning larger storage volume sizes, you can improve application input / output operations per second (IOPS) and throughput.

However some applications, such as MySQL, Cassandra, and Hadoop clusters, require high performance but not high storage capacity. Customers want to meet the performance requirements of these types of applications without paying for more storage volumes than they need.

Today I would like to tell you about gp3, a new type of SSD EBS volume that lets you provision performance independent of storage capacity, and offers a 20% lower price than existing gp2 volume types.

New gp3 Volume Type

With EBS, customers can choose from multiple volume types based on the unique needs of their applications. We introduced general purpose SSD gp2 volumes in 2014 to offer SSD performance at a very low price. gp2 provides an easy and cost-effective way to meet the performance and throughput requirements of many applications our customers use such as virtual desktops, medium-sized databases such as SQLServer and OracleDB, and development and testing environments.

That said, some customers need higher performance. Because the basic idea behind gp2 is that the larger the capacity, the faster the IOPS, customers may end up provisioning more storage capacity than desired. Even though gp2 offers a low price point, customers end up paying for storage they don’t need.

The new gp3 is the 7th variation of EBS volume types. It lets customers independently increase IOPS and throughput without having to provision additional block storage capacity, paying only for the resources they need.

gp3 is designed to provide predictable 3,000 IOPS baseline performance and 125 MiB/s regardless of volume size. It is ideal for applications that require high performance at a low cost such as MySQL, Cassandra, virtual desktops and Hadoop analytics. Customers looking for higher performance can scale up to 16,000 IOPS and 1,000 MiB/s for an additional fee. The top performance of gp3 is 4 times faster than max throughput of gp2 volumes.

How to Switch From gp2 to gp3

If you’re currently using gp2, you can easily migrate your EBS volumes to gp3 using Amazon EBS Elastic Volumes, an existing feature of Amazon EBS. Elastic Volumes allows you to modify the volume type, IOPS, and throughput of your existing EBS volumes without interrupting your Amazon EC2 instances. Also, when you create a new Amazon EBS volume, Amazon EC2 instance, or Amazon Machine Image (AMI), you can choose the gp3 volume type. New AWS customers receive 30GiB of gp3 storage with the baseline performance at no charge for 12 months.

Available Today

The gp3 volume type is available for all AWS Regions. You can access the AWS Management Console to launch your first gp3 volume.

For more information, see Amazon Elastic Block Store and get started with gp3 today.

– Kame

 

New – Amazon EC2 R5b Instances Provide 3x Higher EBS Performance

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-amazon-ec2-r5b-instances-providing-3x-higher-ebs-performance/

In July 2018, we announced memory-optimized R5 instances for the Amazon Elastic Compute Cloud (Amazon EC2). R5 instances are designed for memory-intensive applications such as high-performance databases, distributed web scale in-memory caches, in-memory databases, real time big data analytics, and other enterprise applications.

R5 instances offer two different block storage options. R5d instances offer up to 3.6TB of NMVe instance storage for applications that need access to high-speed, low latency local storage. In addition, all R5b instances work with Amazon Elastic Block Store. Amazon EBS is an easy-to-use, high-performance and highly available block storage service designed for use with Amazon EC2 for both throughput- and transaction-intensive workloads at any scale. A broad range of workloads, such as relational and non-relational databases, enterprise applications, containerized applications, big data analytics engines, file systems, and media workflows are widely deployed on Amazon EBS.

Today, we are happy to announce the availability of R5b, a new addition to the R5 instance family. The new R5b instance is powered by the AWS Nitro System to provide the best network-attached storage performance available on EC2. This new instance offers up to 60Gbps of EBS bandwidth and 260,000 I/O operations per second (IOPS).

Amazon EC2 R5b Instance
Many customers use R5 instances with EBS for large relational database workloads such as commerce platforms, ERP systems, and health record systems, and they rely on EBS to provide scalable, durable, and high availability block storage. These instances provide sufficient storage performance for many use cases, but some customers require higher EBS performance on EC2.

R5 instances provide bandwidth up to 19Gbps and maximum EBS performance of 80K IOPS, while the new R5b instances support bandwidth up to 60Gbps and EBS performance of 260K IOPS, providing 3x higher EBS-Optimized performance compared to R5 instances, enabling customers to lift and shift large relational databases applications to AWS. R5b and R5 vCPU to memory ratio and network performance are the same.

Instance NamevCPUsMemoryEBS Optimized Bandwidth (Mbps)EBS Optimized [email protected] (IO/s)
r5b.large216 GiBUp to 10,000Up to 43,333
r5b.xlarge432 GiBUp to 10,000Up to 43,333
r5b.2xlarge864 GiBUp to 10,000Up to 43,333
r5b.4xlarge16128 GiB10,00043,333
r5b.8xlarge32256 GiB20,00086,667
r5b.12xlarge48384 GiB30,000130,000
r5b.16xlarge64512 GiB40,000173,333
r5b.24xlarge96768 GiB60,000260,000
r5b.metal96768 GiB60,000260,000

Customers operating storage performance sensitive workloads can migrate from R5 to R5b to consolidate their existing workloads into fewer or smaller instances. This can reduce the cost of both infrastructure and licensed commercial software working on those instances. R5b instances are supported by Amazon RDS for Oracle and Amazon RDS for SQL Server, simplifying the migration path for large commercial database applications and improving storage performance for current RDS customers by up to 3x.

All Nitro compatible AMIs support R5b instances, and the EBS-backed HVM AMI must have NVMe 1.0e and ENA drivers installed at R5b instance launch. R5b supports io1, io2 Block Express (in preview), gp2, gp3, sc1, st1 and standard volumes. R5b does not support io2 volumes and io1 volumes that have multi-attach enabled, which are coming soon.

Available Today

R5b instances are available in the following regions: US West (Oregon), Asia Pacific (Tokyo), US East (N. Virginia), US East (Ohio), Asia Pacific (Singapore), and Europe (Frankfurt). RDS on r5b is available in US East (Ohio), Asia Pacific (Singapore), and Europe (Frankfurt), and support in other regions is coming soon.

Learn more about EC2 R5 instances and get started with Amazon EC2 today.

– Kame;

re:Invent 2020 Liveblog: Andy Jassy Keynote

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/reinvent-2020-liveblog-andy-jassy-keynote/

I’m always ready to try something new! This year, I am going to liveblog Andy Jassy‘s AWS re:Invent keynote address, which takes place from 8 a.m. to 11 a.m. on Tuesday, December 1 (PST). I’ll be updating this post every couple of minutes as I watch Andy’s address from the comfort of my home office. Stay tuned!

Jeff;


 

 

Introducing Amazon S3 Storage Lens – Organization-wide Visibility Into Object Storage

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/s3-storage-lens/

When starting out in the cloud, a customer’s storage requirements might consist of a handful of S3 buckets, but as they grow, migrate more applications and realize the power of the cloud, things can become more complicated. A customer may have tens or even hundreds of accounts and have multiple S3 buckets across numerous AWS Regions. Customers managing these sorts of environments have told us that they find it difficult to understand how storage is used across their organization, optimize their costs, and improve security posture.

Drawing from more than 14 years of experience helping customers optimize their storage, the S3 team has built a new feature called Amazon S3 Storage Lens. This is the first cloud storage analytics solution to give you organization-wide visibility into object storage, with point-in-time metrics and trend lines as well as actionable recommendations. All these things combined will help you discover anomalies, identify cost efficiencies and apply data protection best practices.

With S3 Storage Lens , you can understand, analyze, and optimize storage with 29+ usage and activity metrics and interactive dashboards to aggregate data for your entire organization, specific accounts, regions, buckets, or prefixes. All of this data is accessible in the S3 Management Console or as raw data in an S3 bucket.

Every Customer Gets a Default Dashboard

S3 Storage Lens includes an interactive dashboard which you can find in the S3 console. The dashboard gives you the ability to perform filtering and drill-down into your metrics to really understand how your storage is being used. The metrics are organized into categories like data protection and cost efficiency, to allow you to easily find relevant metrics.

For ease of use all customers receive a default dashboard. If you are like many customers, this maybe the only dashboard that you need, but if you want to, you can make changes. For example, you could configure the dashboard to export the data daily to an S3 bucket for analysis with another tool (Amazon QuickSight, Amazon Athena, Amazon Redshift, etc.) or you could upgrade to receive advanced metrics and recommendations.

Creating a Dashboard
You can also create your own dashboards from scratch, to do this I head over to the S3 console and click on the Dashboards menu item inside the Storage Lens section. Secondly, I click the Create dashboard button.

Screenshot of the console

I give my dashboard the name s3-lens-demo and select a home Region. The home Region is where the metrics data for your dashboard will be stored. I choose to enable the dashboard, meaning that it will be updated daily with new metrics.

A dashboard can analyze storage across accounts, Regions, buckets, and prefix. I choose to include buckets from all accounts in my organization and across all regions in the Dashboard scope section.

S3 Storage Lens has two tiers: Free Metrics, which is free of charge, automatically available for all S3 customers and contains 15 usage related metrics; and Advanced metrics and recommendations, which has an additional charge, but includes all 29 usage and activity metrics with 15-month data retention, and contextual recommendations. For this demo, I select Advanced metrics and recommendations.

Screenshot of Management Console

Finally, I can configure the dashboard metrics to be exported daily to a specific S3 bucket. The metrics can be exported to either CSV or Apache Parquet format for further analysis outside of the console.

An alert pops up to tell me that my dashboard has been created, but it can take up to 48 hours to generate my initial metrics.

What does a Dashboard Show?

Once my dashboard has been created, I can start to explore the data. I can filter by Accounts, Regions, Storage classes, Buckets, and Prefixes at the top of the dashboard.

The next section is a snapshot of the metrics such as the Total storage and Object count, and I can see a trendline that shows the trend on each metric over the last 30 days and a percentage change. The number in the % change column shows by default the Day/day percentage change, but I can select to compare by Week/week or Month/month.

I can toggle between different Metric groups by selecting either Summary, Cost efficiency, Data protection, or Activity.

There are some metrics here that are pretty typical like total storage and object counts, and you can already receive these in a few places in the S3 console and in Amazon CloudWatch – but in S3 Storage Lens you can receive these metrics in aggregate across your organization or account, or at the prefix level, which was not previously possible.

There are some other metrics you might not expect, like metrics that pertain to S3 feature utilization. For example we can break out the % of objects that are using encryption, or the number of objects that are non-current versions. These metrics help you understand how your storage is configured, and allows you to identify discrepancies, and then drill in for details.

The dashboard provides contextual recommendations alongside your metrics to indicate actions you can take based on the metric, for example ways to improve cost efficiency, or apply data protection best practices. Any recommendations are shown in the Recommendation column. A few days ago I took the screenshot below which shows a recommendation on one of my dashboards that suggests I should check my buckets’ default encryption configuration.

The dashboard trends and distribution section allows me to compare two metrics over time in more detail. Here I have selected Total storage as my Primary metric and Object Count as my Secondary metric.

These two metrics are now plotted on a graph, and I can select a date range to view the trend over time.

The dashboard also shows me those two metrics and how they are distributed across Storage class and Regions.

I can click on any value in this graph and Drill down to filter the entire dashboard on that value, or select Anayze by to navigate to a new dashboard view for that dimension.

The last section of the dashboard allows me to perform a Top N analysis of a metric over a date range, where N is between 1 and 25. In the example below, I have selected the top 3 items in descending order for the Total storage metric.

I can then see the top three accounts (note: there are only two accounts in my organization) and the Total storage metric for each account.

It also shows the top 3 regions for the Total storage metric, and I can see that 51.15% of my data is stored in US East (N. Virginia)

Lastly, the dashboard contains information about the top 3 buckets and prefixes and the associated trends.

As I have shown, S3 Storage Lens delivers more than 29 individual metrics on S3 storage usage and activity for all accounts in your organization. These metrics are available in the S3 console to visualize storage usage and activity trends in a dashboard, with contextual recommendations that make it easy to take immediate action. In addition to the dashboard in the S3 console, you can export metrics in CSV or Parquet format to an S3 bucket of your choice for further analysis with other tools including Amazon QuickSight, Amazon Athena, or Amazon Redshift to name a few.

Video Walkthrough

If you would like a more indepth look at S3 Storage Lens the team have recorded the following video to explain how this new feature works.

Available Now

S3 Storage Lens is available in all commercial AWS Regions. You can use S3 Storage Lens with the Amazon S3 API, CLI, or in the S3 Console. For pricing information, regarding S3 Storage Lens advanced metrics and recommendations, check out the Amazon S3 pricing page. If you’d like to dive a little deeper, then you should check out the documentation or the S3 Storage Lens webpage.

Happy Storing

— Martin

 

S3 Intelligent-Tiering Adds Archive Access Tiers

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/s3-intelligent-tiering-adds-archive-access-tiers/

We launched S3 Intelligent-Tiering two years ago, which added the capability to take advantage of S3 without needing to have a deep understanding of your data access patterns. Today we are launching two new optimizations for S3 Intelligent-Tiering that will automatically archive objects that are rarely accessed. These new optimizations will reduce the amount of […]

Handling data erasure requests in your data lake with Amazon S3 Find and Forget

Post Syndicated from Chris Deigan original https://aws.amazon.com/blogs/big-data/handling-data-erasure-requests-in-your-data-lake-with-amazon-s3-find-and-forget/

Data lakes are a popular choice for organizations to store data around their business activities. Best practice design of data lakes impose that data is immutable once stored, but new regulations such as the European General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), and others have created new obligations that operators now need to be able to erase private data from their data lake when requested.

When asked to erase an individual’s private data, as a data lake operator you have to find all the objects in your Amazon Simple Storage Service (Amazon S3) buckets that contain data relating to that individual. This can be complex because data lakes contain many S3 objects (each of which may contain multiple rows), as shown in the following diagram. You often can’t predict which objects contain data relating to an individual, so you need to check each object. For example, if the user mary34 asks to be removed, you need to check each object to determine if it contains data relating to mary34. This is the first challenge operators face: identifying which objects contain data of interest.

After you identify objects containing data of interest, you face a second challenge: you need to retrieve the object from the S3 bucket, remove relevant rows from the file, put a new version of the object into S3, and make sure you delete any older versions.

Locating and removing data manually can be time-consuming and prone to mistakes, considering the large number of objects typically in data lakes.

Amazon S3 Find and Forget solves these challenges with ready-to-use automations. It allows you to remove records from data lakes of any size that are in AWS Glue Data Catalog. The solution includes a web user interface that you can use and an API that you can use to integrate with your own applications.

Solution overview

Amazon S3 Find and Forget enables you to find and delete records automatically in data lakes on Amazon S3. Using the solution, you can:

  • Define which tables from your AWS Glue Data Catalog contain data you want to erase
  • Manage a queue of identifiers (such as unique customer identifiers) to erase
  • Erase rows from your data lake matching the queued record identifiers
  • Access a log of all actions taken by the solution

You can use Amazon S3 Find and Forget to work with data lakes stored on Amazon S3 in a supported file format.

The solution is developed and distributed as open-source software that you deploy and run inside your own AWS account. When deploying this solution, you only pay for the AWS services consumed to run it. We recommend reviewing the Cost Estimate guide and creating Amazon CloudWatch Billing Alarms to monitor charges before deploying the solution in your own account.

When you handle requests to remove data, you add the identifiers through the web interface or API to a Deletion Queue. The identifiers remain in the queue until you start a Deletion Job. The Deletion Job processes the queue and removes matching rows from objects in your data lake.

Where your requirements allow it, batching deletions can provide significant cost savings by minimizing the number of times the data lake needs to be re-scanned and processed. For example, you could start a Deletion Job once a week to process all requests received in the preceding week.

Solution demonstration

This section provides a demonstration of using Amazon S3 Find and Forget’s main features. To deploy the solution in your own account, refer to the User Guide.

For this demonstration, I have prepared in advance:

The first step is to deploy the solution using AWS CloudFormation by following the instructions in the User Guide. The CloudFormation stack can take 20-30 minutes to deploy depending on the options chosen when deploying.

Once deployed, I visit the web user interface by going to the address in the WebUIUrl CloudFormation stack output. Using a temporary password emailed to the address I provided in my CloudFormation parameters, I login and set a password for future use. I then see a dashboard with some base metrics for my Amazon S3 Find and Forget deployment:

I now need to create a Data Mapper so that Amazon S3 Find and Forget can find my data lake. To do this, I select Data Mappers, then Create Data Mapper:

On this screen, I give my Data Mapper a name, choose the AWS Glue database and table in my account that I want to operate on, and the columns that I want my deletions to match. In this demonstration, I’m using a copy of the Amazon Customer Reviews Dataset that I copied to my own S3 bucket. I’ll be using the customer_id column to remove data. In the dataset, this field contains a unique identifier for each customer who has created a product review.

I then specify the IAM role to be used when modifying the objects in S3. I also choose whether I want the old S3 object versions to be deleted for me. I can turn this off if I want to implement my own strategy to manage deleting old object versions, such as by using S3 lifecycle policies.

After choosing Create Data Mapper the Data Mapper is created, and I am prompted to grant permissions for S3 Find and Forget to operate in my bucket. In the Data Mapper list, I select my new Data Mapper, then choose Generate Access Policies. The interface displays a sample bucket policy that I copy and paste into the bucket policy for my S3 bucket in the AWS Management Console.

With the Data Mapper set up, I’m now able to add the customers who have requested to have their data deleted to the Deletion Queue. Using their Customer IDs, I go to the Deletion Queue section and select Add Match to the Deletion Queue.

I’ve chosen to delete from all the available Data Mappers, but I can also choose specific ones. Once I’ve added my matches, I can see a list of them on Deletion Queue page:

I can now run a deletion job that will cause the matches to be deleted from the data lake. To do this, I select Deletion Jobs then Start a Deletion Job.

After a few minutes the Deletion Job completes, and I can see metrics collected during the job including that the job took just over two-and-a-half minutes:

There is an Export to JSON option that includes all the metrics shown, more granular information about the Deletion Job, and which S3 objects were modified.

At this point the Deletion Queue is empty, and ready for me to use for future requests.

Solution design

This section includes a brief introduction to how the solution works. More comprehensive design documentation is available in the Amazon S3 Find and Forget GitHub repository.

The following diagram illustrates the architecture of this solution.

Amazon S3 Find and Forget uses AWS Serverless services to optimize for cost and scalability. The user interface and API are built using Amazon S3, Amazon Cognito, AWS Lambda, Amazon DynamoDB, and Amazon API Gateway, which automatically scale down when not in use so that there is no expensive baseline cost just for having the solution installed. These AWS services are always available and scale in concert with when the solution is used with a pay-for-what-you-use price model.

The Deletion Job workflow is coordinated using AWS Step Functions, Lambda, and Amazon Simple Queue Service (Amazon SQS). The solution uses Step Functions for high-level coordination and state tracking in the workflow, Lambda functions for discrete computation tasks, and Amazon SQS to store queues of repetitive work.

A deletion job has two phases: Find and Forget. In the Find phase, the solution uses Amazon Athena to scan the data lake for objects containing rows matching the identifiers in the deletion queue. For this to work at scale, we built a query planner Lambda function that uses the partition list in the AWS Glue Data Catalog for each data mapper to run an Athena query on each partition, returning the path to S3 objects that contain matches with the identifiers in the Deletion Queue. The object keys are then added to an SQS queue that we refer to as the Object Deletion Queue.

In the Forget phase, deletion workers are started as a service running on AWS Fargate. These workers process each object in the Object Deletion Queue by downloading the objects from the S3 bucket into memory, deleting the rows that contain matched identifiers, then putting a new version of the object to the S3 bucket using the same key. By default, older versions of the object are then deleted from the S3 bucket to make the deletion irreversible. You can alternatively disable this feature to implement your own strategy for deleting old object versions, such as by using an S3 Lifecycle policy.

Note that during the Forget phase, affected S3 objects are replaced at the time they are processed and are subject to the Amazon S3 data consistency model. We recommend that you avoid running a Deletion Job in parallel to a workload that reads from the data lake unless it has been designed to handle temporary inconsistencies between objects.

When the object deletion queue is empty, the Forget phase is complete and a final status is determined for the Deletion Job based on whether any errors occurred (for example, due to missing permissions for S3 objects).

Logs are generated for all actions throughout the Deletion Job, which you can use for reporting or troubleshooting. These are stored in DynamoDB, along with other persistent data including the Data Mappers and Deletion Queue.

Conclusion

In this post, we introduced the Amazon S3 Find and Forget solution, which assists data lake operators to handle data erasure requests they may receive pursuant to regulations such as GDPR, CCPA, and others. We then described features of the solution and how to use it for a basic use case.

You can get started today by deploying the solution from the GitHub repository, where you can also find more documentation of how the solution works, its features, and limits. We are continuing to develop the solution and welcome you to send feedback, feature requests, or questions through GitHub Issues.

 


About the Authors

Chris Deigan is an AWS Solution Engineer in London, UK. Chris works with AWS Solution Architects to create standardized tools, code samples, demonstrations, and quick starts.

 

 

 

Matteo Figus is an AWS Solution Engineer based in the UK. Matteo works with the AWS Solution Architects to create standardized tools, code samples, demonstrations and quickstarts. He is passionate about open-source software and in his spare time he likes to cook and play the piano.

 

 

 

Nick Lee is an AWS Solution Engineer based in the UK. Nick works with the AWS Solution Architects to create standardized tools, code samples, demonstrations and quickstarts. In his spare time he enjoys playing football and squash, and binge-watching TV shows.

 

 

 

Adir Sharabi is a Solutions Architect with Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud. He is also passionate about Data and helping customers to get the most out of it.

 

 

 

Cristina Fuia is a Specialist Solutions Architect for Analytics at AWS. She works with customers across EMEA helping them to solve complex problems, design and build data architectures so that they can get business value from analyzing their data.

 

Analyzing Amazon S3 server access logs using Amazon ES

Post Syndicated from Mahesh Goyal original https://aws.amazon.com/blogs/big-data/analyzing-amazon-s3-server-access-logs-using-amazon-es/

When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. With the advent of increased data storage needs, you can rely on Amazon S3 for a range of use cases and simultaneously looking for ways to analyze your logs to ensure compliance, perform the audit, and discover risks.

Amazon S3 lets you monitor the traffic using the server access logging feature. With server access logging, you can capture and monitor the traffic to your S3 bucket at any time, with detailed information about the source of the request. The logs are stored in the S3 bucket you own in the same Region. This addresses the security and compliance requirements of most organizations. The logs are critical for establishing baselines, analyzing access patterns, and identifying trends. For example, the logs could answer a financial organization’s question about how many requests are made to a bucket and who is making what type of access requests to the objects.

You can discover insights from server access logs through several different methods. One common option is by using Amazon Athena or Amazon Redshift Spectrum and query the log files stored in Amazon S3. However, this solution poses high latency with an exponential growth in volume. It requires further integration with Amazon QuickSight to add visualization capabilities.

You can address this by using Amazon Elasticsearch Service (Amazon ES). Amazon ES is a managed service that makes it easier to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with other AWS services such as Amazon S3 and Amazon Kinesis for loading streaming data into Amazon ES.

This post walks you through automating ingestion of server access logs from Amazon S3 into Amazon ES using AWS Lambda and visualizing the data in Kibana.

Architecture overview

Server access logging is enabled on source buckets, and logs are delivered to access log bucket. The access log bucket is configured to send an event to the Lambda function when a log file is created. On an event trigger, the Lambda function reads the file, processes the access log, and sends it to Amazon ES. When the logs are available, you can use Kibana to create interactive visuals and analyze the logs over a time period.

When designing a log analytics solution for high-frequency incoming data, you should consider buffering layers to avoid instability in the system. Buffering helps you streamline processes for unpredictable incoming log data. For such use cases, you can take advantage of managed services like Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Streaming services buffer data before delivering it to Amazon ES. This helps you avoid overwhelming your cluster with spiky ingestion events. Kinesis Data Firehose can reliably load data into Amazon ES. Kinesis Data Firehose lets you choose a buffer size of 1–100 MiBs and a buffer interval of 60–900 seconds when Amazon ES is selected as the destination. Kinesis Data Firehose also scales automatically to match the throughput of your data and requires no ongoing administration. For more information, see Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose.

The following diagram illustrates the solution architecture.

Prerequisites

Before creating resources in AWS CloudFormation, you must enable server access logging on the source bucket. Open the S3 bucket properties and look for Amazon S3 access and delivery bucket. See the following screenshot.

You also need an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and related AWS services. The user must have access to create IAM roles and policies via the CloudFormation template.

Setting up the resources with AWS CloudFormation

First, deploy the CloudFormation template to create the core components of the architecture. AWS CloudFormation automates the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

  1. Sign in to the console and choose the Region of the bucket storing the access log. For this post, I use us-east-1.
  2. Launch the stack:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. On the Parameters page, enter the following parameters:
    1. VPC Configuration – Select any VPC that has at least two private subnets. The template deploys the Amazon ES service domain and Lambda within the VPC.
    2. Private subnets – Select two private subnets of the VPC. The route tables associated with subnets must have a NAT gateway configuration and VPC endpoint for Amazon S3 to privately connect the bucket from Lambda.
    3. Access log S3 bucket – Enter the S3 bucket where access logs are delivered. The template configures event notification on the bucket to trigger the Lambda function.
    4. Amazon ES domain name – Specify the Amazon ES domain name to be deployed through the template.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Acknowledge resource creation under Capabilities and transforms and choose Create.

The stack takes about 10–15 minutes to complete. The CloudFormation stack does the following:

  • Creates an Amazon ES domain with fine-grained access control enabled on it. Fine-grained access control is configured with a primary user in the internal user database.
  • Creates IAM role for the Lambda function with required permission to read from S3 bucket and write to Amazon ES.
  • Creates Lambda within the same VPC of Amazon ES elastic network interfaces (ENI). Amazon ES places an ENI in the VPC for each of your data nodes. The communication from Lambda to the Amazon ES domain is via this ENI.
  • Configures file create event notification on Access log S3 bucket to trigger the Lambda function. The function code segments are discussed in detail in this GitHub project.

You must make several considerations before you proceed with a production-grade deployment. For this post, I use one primary shard with no replicas. As a best practice, we recommend deploying your domain into three Availability Zones with at least two replicas. This configuration lets Amazon ES distribute replica shards to different Availability Zones than their corresponding primary shards and improves the availability of your domain. For more information about sizing your Amazon ES, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

We recommend setting the shard count based on your estimated index size, using 50 GB as a maximum target shard size. You should also define an index template to set the primary and replica shard counts before index creation. For more information about best practices, see Best practices for configuring your Amazon Elasticsearch Service domain.

For high-frequency incoming data, you can rotate indexes either per day or per week depending on the size of data being generated. You can use Index State Management to define custom management policies to automate routine tasks and apply them to indexes and index patterns.

Creating the Kibana user

With Amazon ES, you can configure fine-grained users to control access to your data. Fine-grained access control adds multiple capabilities to give you tighter control over your data. This feature includes the ability to use roles to define granular permissions for indexes, documents, or fields and to extend Kibana with read-only views and secure multi-tenant support. For more information on granular access control, see Fine-Grained Access Control in Amazon Elasticsearch Service.

For this post, you create a fine-grained role for Kibana access and map it to a user.

  1. Navigate to Kibana and enter the primary user credentials:
    1. User nameadminuser01
    2. Password[email protected]

To access Kibana, you must have access to the VPC. For more information about accessing Kibana, see Controlling Access to Kibana.

  1. Choose Security, Roles.
  2. For Role name, enter kibana_only_role.
  3. For Cluster-wide permissions, choose cluster_composite_ops_ro.
  4. For Index patterns, enter access-log and kibana.
  5. For Permissions: Action Groups, choose read, delete, index, and manage.
  6. Choose Save Role Definition.
  7. Choose Security, Internal User Database, and Create a New User.
  8. For Open Distro Security Roles, choose Kibana_only_role (created earlier).
  9. Choose Submit.

The user kibanauser01 now has full access to Kibana and access-logs indexes. You can log in to Kibana with this user and create the visuals and dashboards.

Building dashboards

You can use Kibana to build interactive visuals and analyze the trends and combine the visuals for different use cases in a dashboard. For example, you may want to see the number of requests made to the buckets in the last two days.

  1. Log in to Kibana using kibanauser01.
  2. Create an index pattern and set the time range
  3. On the Visualize section of your Kibana dashboard, add a new visualization.
  4. Choose Vertical Bar.

You can select any time range and visual based on your requirements.

  1. Choose the index pattern and then configure your graph options.
  2. In the Metrics pane, expand Y-Axis.
  3. For Aggregation, choose Count.
  4. For Custom Label, enter Request Count.
  5. Expand the X-Axis
  6. For Aggregation, choose Terms.
  7. For Field, choose bucket.
  8. For Order By, choose metric: Request Count.
  9. Choose Apply changes.
  10. Choose Add sub-bucket and expand the Split Series
  11. For Sub Aggregation, choose Date Histogram.
  12. For Field, choose requestdatetime.
  13. For Interval, choose Daily.
  14. Apply the changes by choosing the play icon at the top of the page.

You should see the visual on the right side, similar to the following screenshot.

You can combine graphs of different use cases into a dashboard. I have built some example graphs for general use cases like the number of operations per bucket, user action breakdown for buckets, HTTPS status rate, top users, and tabular formatted error details. See the following screenshots.

Cleaning up

Delete all the resources deployed through the CloudFormation template to avoid any unintended costs.

  1. Disable the access log on source bucket.
  2. On to the CloudFormation console, identify the stacks appropriately, and delete

Summary

This post detailed a solution to visualize and monitor Amazon S3 access logs using Amazon ES to ensure compliance, perform security audits, and discover risks and patterns at scale with minimal latency. To learn about best practices of Amazon ES, see Amazon Elasticsearch Service Best Practices. To learn how to analyze and create a dashboard of data stored in Amazon ES, see the AWS Security Blog.


About the Authors

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.

 

 

 

 

New – AWS Fargate for Amazon EKS now supports Amazon EFS

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-aws-fargate-for-amazon-eks-now-supports-amazon-efs/

AWS Fargate is a serverless compute engine for containers available with both Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (ECS). With Fargate, developers are able to focus on building applications, eliminating the need to manage the infrastructure related undifferentiated heavy lifting.

Developers specify resources for each Kubernetes pod, and are charged only for provisioned compute resource. When using Fargate, each EKS pod runs in its own kernel runtime environment and CPU, memory, storage, and network resources are never shared with other pods, providing workload isolation and increased security.

Containers are ephemeral in nature. They are dynamically scaled in and out, and their saved state or data is cleared on exit. We’ve had many requirements from our customers about data persistence and shared storage of containerized applications since launching EKS support for Fargate in 2019, and announced Amazon Elastic File System(EFS) support for Fargate on ECS in April 2020. Now many customers are operating stateful workloads on it, and others have requested support for EFS with Fargate when used with EKS. Today we are happy to announce this EFS support.

EFS provides a simple, scalable, and fully managed shared file system for use with AWS cloud services, and can also help Kubernetes applications be highly available because all data written to EFS is written to multiple AWS Availability Zones. EFS is built for on-demand petabyte growth without application interruption, and it automatically grows and shrinks as files are added and removed, eliminating the need to provision and manage capacity to accommodate growth. EFS Access Points is also ideal for security sensitive workloads as it can encrypt data in the file system and data in transit.

Kubernetes supports “Container Storage Interface (CSI)” which is a standard for exposing block and file storage systems to containerized workloads. The EFS CSI driver makes it simple to configure elastic file storage for Kubernetes clusters, and before this update customers could to use EFS via Amazon EC2 worker nodes connected to a cluster. Now customers can also configure their pods running on Fargate to access an EFS file system using standard Kubernetes APIs. With this update, customers can run stateful workloads that require highly available file systems as well as workloads that require access to shared storage. Using the EFS CSI driver, all data in transit is encrypted by default.

We released a generally available version of the Amazon EFS CSI driver for EKS in July 2020. The Amazon EFS CSI driver makes it easy to configure elastic file storage for both EKS and self-managed Kubernetes clusters running on AWS using standard Kubernetes interfaces. If a Kubernetes pod is terminated and relaunched, the CSI driver reconnects the EFS file system, even if the pod is relaunched in a different AWS Availability Zone.​ When using standard EC2 worker nodes, the EFS CSI driver needs to be deployed as a set of pods and DaemonSets. With this new update, for Fargate this step is not required and you do not need to install the EFS CSI driver, as it is installed in the Fargate stack and support for EFS is provided out of the box. Customers can use EFS with Fargate for EKS without spending the time and resources to install and update the CSI driver.

How to configure the Fargate/EKS and EFS integration?

You need to use three Kubernetes settings to mount EFS to Farfgate on EKS. Those are StorageClass, PersistentVolume (PV), and PersistentVolumeClaim (PVC). Configuring StorageClass and PVs are steps that an administrator (or similar) would do to make EFS file systems available for application developers. PVCs are used to allocate PVs from the pool of existing PVs as needed to deploy applications.

The StorageClass object provides a way for a Kubernetes administrator to register a specific storage type (e.g. EFS or EBS) and configuration (e.g. throughput, backup policy). Once a StorageClass is defined the PV object is used to create actual storage volumes inside that class. StorageClass and PV are the Kubernetes mechanisms that allow actual storage subsystems to be abstracted and decoupled from the way they are consumed by Kubernetes users. For example, while a Kubernetes administrator needs to know how exactly to configure a specific storage configuration from a particular storage service, Kubernetes users do not because they only see their volumes within abstract classes of storage.

The last step is the binding: Kubernetes users requests access to said volumes via the PVC object and related API. These volumes can be created dynamically when the user requests them via the PVC or they need to be statically pre-created by an administrator for later consumption by a Kubernetes user. The current implementation of the EFS CSI driver requires the volumes to be statically pre-created for the PVC binding to work.

If you are new to Kubernetes persistent volumes and want to know more about how they work, please refer to this page in the Kubernetes documentation that has all the details.

Let’s see this in action. First, you need to create your own EFS file system in the same AWS Region. If you are not familiar with EFS this EFS getting start guide is a good resource you can start with.

Once you create an EFS file system, you get your file system ID. You can configure the mount settings using a Kubernetes StorageClass and PersistentVolume. Here is an example of the YAML files:

CSIDriver Object

apiVersion: storage.k8s.io/v1beta1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false

For now you need to add the EFS CSIDriver object shown above to your cluster so Kubernetes can discover the driver that Fargate automatically installs. In the future, this manifest will be added by default to EKS clusters.

Storage Class

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

PersistentVolume(PV)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: <EFS filesystem ID>

The volumeHandle is returned by the EFS service when you create a file system, and you need to use it to configure the CSI driver to create the PV. You can obtain EFS filesystem ID from the AWS management console or the below command by AWS CLI.

aws efs describe-file-systems – query "FileSystems[*].FileSystemId" – output text

Now that you have created a PV by applying the manifest above, you configure Kurbenetes pods to access the EFS file system by including a PersistentVolumeClaim in the pod manifest. These are two manifest examples that do that:

PersistentVolumeClaim(PVC)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi

Pod manifest

apiVersion: v1
kind: Pod
metadata:
  name: app1
spec:
  containers:
  - name: app1 image: busybox command: ["/bin/sh"] args: ["-c", "while true; do echo $(date -u) >> /data/out1.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim

Available Today

Today, this feature update is available for newly created EKS clusters with Kubernetes version 1.17, and we are planning to roll out support for this feature with additional Kubernetes versions on EKS in the coming weeks. This update is available in all AWS regions where Fargate with EKS is available. You can check our latest documentation for more detail.

– Kame;

New – High-Performance HDD Storage for Amazon FSx for Lustre File Systems

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/new-high-performance-hdd-storage-for-amazon-fsx-for-lustre-file-systems/

Many workloads, such as genome analysis, training of machine learning models, High Performance Computing (HPC), and analytics applications depend on multiple compute instances accessing the same set of data. For these workloads, clusters of compute instances are commonly connected to a high-performance shared file system. Amazon FSx for Lustre makes it easy and cost-effective to launch and run the world’s most popular high-performance shared file system. And today we’re announcing new HDD storage options for FSx for Lustre that reduce storage costs by up to 80% for throughput-intensive workloads that don’t require the sub-millisecond latencies of SSD storage.

Customers can achieve up to tens of gigabytes of throughput per second while lowering their storage costs for workloads where throughput is the dominant performance attribute. Video rendering and financial simulations are two examples of these throughput-intensive workloads.

This announcement includes two new HDD-based storage options which are optimized for reading and writing sequential file data. One offers 12 MB/sec of baseline throughput per TiB of storage and the other offers 40 MB/sec of baseline throughput per TiB of storage, and both allow you to burst to six times those throughput levels. To increase performance for frequently accessed files, you can also provision an SSD cache that is automatically sized to 20% of your HDD file system storage capacity. On file systems that are provisioned with an SSD cache, files read from the cache are served with sub-millisecond latencies.

The new FSx file systems are comprised of multiple HDD-based storage servers and a single SSD-based metadata server. The SSD storage on the metadata servers ensures that all metadata operations, which represent the majority of file system operations, are delivered with sub-millisecond latencies.

HDD performance increases with storage capacity making it easy to scale out your storage solution without encountering file system bottlenecks. Here’s a summary of the performance specifications for both the new HDD storage options and the existing SSD storage options.

Quick Guide

Traditionally, operating and scaling high performance file systems was costly and time consuming. Now with just a few clicks anyone can use FSx for Lustre for any compute workload. Launching the HDD-based file system is easy. Simply open the management console and click the Create file system button.

Chose FSx for Lustre and click Next.

FSx for Lustre offers two deployment types – Persistent and Scratch. HDD storage is available on persistent mode which is designed for longer-term storage and workloads. On persistent file systems, data is replicated and file servers are replaced if they fail whereas the scratch type are ideal for temporary storage and shorter-term processing of data. On scratch file systems, data is not replicated and does not persist if a file server fails. You can can find more detail on the difference between the two deployment options in this blog article.

Once you choose HDD as the Storage Type, you can select 12 or 40 MB/s per TiB for the Throughput per unit of storage. You can also add the SSD cache to accelerate file access by choosing “Read-only SSD cache” as Drive Cache Type.

You can also create a file system by CLI.

fsx create-file-system \
--storage-capacity <capacity> – storage-type HDD \
--file-system-type LUSTRE \
--subnet-ids subnet-<your vpc subnet id>85b2c0ce – lustre-configuration \
DeploymentType=PERSISTENT_1,PerUnitStorageThroughput=<12 or 40>\,DriveCacheType=<NONE or READ>

For PerUnitStorageThroughput=12, acceptable values of storage capacity are multiples of 6000.
For PerUnitStorageThroughput=40, acceptable values of storage capacity are multiples of 1800.

Available Today

The new HDD storage options are available for all AWS regions where Amazon FSx for Lustre is available. Please visit our web site for more details.

–  Kame;

 

New – Using Amazon GuardDuty to Protect Your S3 Buckets

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-amazon-guardduty-to-protect-your-s3-buckets/

As we anticipated in this post, the anomaly and threat detection for Amazon Simple Storage Service (S3) activities that was previously available in Amazon Macie has now been enhanced and reduced in cost by over 80% as part of Amazon GuardDuty. This expands GuardDuty threat detection coverage beyond workloads and AWS accounts to also help you protect your data stored in S3.

This new capability enables GuardDuty to continuously monitor and profile S3 data access events (usually referred to data plane operations) and S3 configurations (control plane APIs) to detect suspicious activities such as requests coming from an unusual geo-location, disabling of preventative controls such as S3 block public access, or API call patterns consistent with an attempt to discover misconfigured bucket permissions. To detect possibly malicious behavior, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. For your reference, here’s the full list of GuardDuty S3 threat detections.

When threats are detected, GuardDuty produces detailed security findings to the console and to Amazon EventBridge, making alerts actionable and easy to integrate into existing event management and workflow systems, or trigger automated remediation actions using AWS Lambda. You can optionally deliver findings to an S3 bucket to aggregate findings from multiple regions, and to integrate with third party security analysis tools.

If you are not using GuardDuty yet, S3 protection will be on by default when you enable the service. If you are using GuardDuty, you can simply enable this new capability with one-click in the GuardDuty console or through the API. For simplicity, and to optimize your costs, GuardDuty has now been integrated directly with S3. In this way, you don’t need to manually enable or configure S3 data event logging in AWS CloudTrail to take advantage of this new capability. GuardDuty also intelligently processes only the data events that can be used to generate threat detections, significantly reducing the number of events processed and lowering your costs.

If you are part of a centralized security team that manages GuardDuty across your entire organization, you can manage all accounts from a single account using the integration with AWS Organizations.

Enabling S3 Protection for an AWS Account
I already have GuardDuty enabled for my AWS account in this region. Now, I want to add threat detection for my S3 buckets. In the GuardDuty console, I select S3 Protection and then Enable. That’s it. To be more protected, I repeat this process for all regions enabled in my account.

After a few minutes, I start seeing new findings related to my S3 buckets. I can select each finding to get more information on the possible threat, including details on the source actor and the target action.

After a few days, I select the Usage section of the console to monitor the estimated monthly costs of GuardDuty in my account, including the new S3 protection. I can also find which are the S3 buckets contributing more to the costs. Well, it turns out I didn’t have lots of traffic on my buckets recently.

Enabling S3 Protection for an AWS Organization
To simplify management of multiple accounts, GuardDuty uses its integration with AWS Organizations to allow you to delegate an account to be the administrator for GuardDuty for the whole organization.

Now, the delegated administrator can enable GuardDuty for all accounts in the organization in a region with one click. You can also set Auto-enable to ON to automatically include new accounts in the organization. If you prefer, you can add accounts by invitation. You can then go to the S3 Protection page under Settings to enable S3 protection for their entire organization.

When selecting Auto-enable, the delegated administrator can also choose to enable S3 protection automatically for new member accounts.

Available Now
As always, with Amazon GuardDuty, you only pay for the quantity of logs and events processed to detect threats. This includes API control plane events captured in CloudTrail, network flow captured in VPC Flow Logs, DNS request and response logs, and with S3 protection enabled, S3 data plane events. These sources are ingested by GuardDuty through internal integrations when you enable the service, so you don’t need to configure any of these sources directly. The service continually optimizes logs and events processed to reduce your cost, and displays your usage split by source in the console. If configured in multi-account, usage is also split by account.

There is a 30-day free trial for the new S3 threat detection capabilities. This applies as well to accounts that already have GuardDuty enabled, and add the new S3 protection capability. During the trial, the estimated cost based on your S3 data event volume is calculated in the GuardDuty console Usage tab. In this way, while you evaluate these new capabilities at no cost, you can understand what would be your monthly spend.

GuardDuty for S3 protection is available in all regions where GuardDuty is offered. For regional availability, please see the AWS Region Table. To learn more, please see the documentation.

Danilo

Field Notes: Building a Disaster Recovery site on AWS for your Azure Workload

Post Syndicated from Ebrahim (EB) Khiyami original https://aws.amazon.com/blogs/architecture/field-notes-building-a-disaster-recovery-site-on-aws-for-your-azure-workload/

Customers running workloads on other clouds, like Azure or GCP, can increase resilience and meet compliance requirements by using AWS as their disaster recovery site. CloudEndure Disaster Recovery provides an easy cross-cloud solution for replicating and recovering workloads from other cloud providers to AWS. It automatically converts your source machines so that they boot and run natively on AWS.

In this blog post, I show you how to use CloudEndure Disaster Recovery to build a DR site on AWS if your primary workload is on Azure. I will build connectivity between Azure and AWS according to CloudEndure requirements, install CloudEndure Agent, failover a server from Azure to AWS, and finally return to normal operations by failing back from AWS to Azure.

When using CloudEndure Disaster Recovery, your source location can be AWS (for Region to Region replication within AWS), VCenter, or ‘Other Infrastructure’. Other Infrastructure includes physical servers, virtual machines on private cloud, or virtual machines on public cloud providers like Azure or GCP. Using CloudEndure Disaster Recovery, you can achieve a Recovery Point Objectives (RPO) of 1 second, and a Recovery Time Objective (RTO) of 5-20 minutes (depending on the operating system used).

The failover process to AWS is the same regardless of source infrastructure. It starts by setting up your CloudEndure Disaster Recovery project, installing CloudEndure Agent and completing the initial replication from source (Azure) to target (AWS), and then performing a failover to target (AWS). However, the failback process (discussed later), will vary depending on your source infrastructure. During failback, CloudEndure DR reverses the replication to the opposite direction, in our case from AWS to Azure, so that the data written on your recovered servers during the disaster can be restored to the original source environment (Azure).

The following architectural diagram depicts the solution I build in this post.

Architectural Diagram

Figure 1 – Architectural Diagram

For the purpose of this post, I’m using a sample VM that runs Windows Server 2019 datacenter on Azure in East US 2 Region. I choose US-East-1 as a recovery target Region in AWS. The process is identical for any number of servers and any other supported operating systems.

I perform the following steps:

  1. Establish connectivity between Azure and AWS
  2. Complete the networking requirements for CloudEndure Disaster Recovery
  3. Register for a CloudEndure DR account, create a DR project and get an agent installation token.
  4. Install CloudEndure Agent on Azure VM
  5. Configure Blueprint for target server in the CloudEndure DR User Console
  6. Perform failover from Azure to AWS
  7. Perform failback from AWS to Azure
  8. Return to normal operations

Let’s review the details of each step.

1-      Establish connectivity between Azure and AWS

CloudEndure supports connectivity using public internet, VPN, or DirectConnect. For this demo, I will create a VPN connection between the two sites. Creating a VPN connectivity between Azure and AWS is no different than a normal Site-to-Site VPN connection setup between AWS and on-premises. I have to do the configurations in parallel on both sides so that both ends of the connection will be synced.

Assuming you have created a Virtual Network and a Virtual Private Cloud (VPC) on Azure and AWS sides respectively, following are the high-level steps to create a VPN connection between the two.  For more details review, “How do I create a secure connection between my office network and Amazon Virtual Private Cloud?”, and “Create a Site-to-Site connection in the Azure portal.”

On Azure

1-1-           Create Virtual Network Gateway with static routing

1-2-           Take a note of the public IP of the gateway created. We will use it in the next step.

On AWS

1-3-           Create a Virtual Private Gateway (VPG). Select Amazon default ASN for ASN.

Virtual Private Gateway

Figure 2 – Create a Virtual Private Gateway

1-4             Attach the VPG created to the VPC you plan to use to host your DR.

Attach a Virtual Private Gateway

Figure 3 – Attach a Virtual Private Gateway

1-5-           Create a Customer Gateway (CGW). Select Static for Routing and provide the public IP address for the Virtual Network Gateway on Azure side noted in step 1-2. For Static IP Address enter the network address of your primary site on Azure.

Create Customer Gateway

Figure 4 – Create Customer Gateway

1-6-           Create Site-to-Site VPN Connection. Select VGW and customer gateway created in the preceding step. Select Static routing and provide the IP address for VNet network on Azure.

Create VPN Connection

Figure 5 – Create VPN Connection

1-7-           Download the Configuration Template

VPN Configuration Template

Figure 6 – VPN Configuration Template

1-8-           We must locate the Pre-Shared Key, and the Virtual Private Gateway IP address in the configuration file.  Review the file carefully to understand its content. See my example in the following image.

Figure 7- VPN Configuration - Pre-shared key

Figure 7- VPN Configuration – Pre-shared key

 

Figure 8 - VPN Configuration VPG IP

Figure 8 – VPN Configuration VPG IP

Let’s go back to the Azure Portal

1-9-           Create a Local Network Gateway. Provide the Outside IP address noted in step 1-8. In the Address space field provide the DR VPC network address on AWS side.

Figure 9 - Local Network Gateway

Figure 9 – Local Network Gateway

1-10-           Create a connection. You must provide the pre-shared key noted from configuration file in step 1-8.

Figure 10 - Azure VPN Connection

Figure 10 – Azure VPN Connection

 

1-11-           Finally, remember to modify the routing table of the subnet on AWS side so that it routes VPN traffic to VGW.

Figure 11 - AWS Routing Table

Figure 11 – AWS Routing Table

The VPN connection we created on AWS side will have 2 tunnels. I recommend you setup both tunnels for high availability. You must follow the same steps to configure the second tunnel.

At this point, you have a VPN connection between Azure and AWS. In the next step we must add the necessary ports needed for CloudEndure management and replication between Azure and AWS.

2-       Complete the networking requirements for CloudEndure

Adjust the security settings on both sides to allow the ports necessary for CloudEndure DR replication and authentication.

Communication over TCP Port 443:

  • Between the Source Machines and the CloudEndure Service Manager.
  • Between the Staging Area and the CloudEndure Service Manager.

Communication over TCP Port 1500:

  • Between the Source Machines and the Staging Area

3-       Register for a CloudEndure DR account, create a DR project, and get an agent installation token

Before using CloudEndure DR you need to subscribe to the product on AWS Marketplace. The subscription process is as follows:

The first step in using CloudEndure is creating a project. You have two types of projects:

  1. Migration
  2. Disaster Recovery

Select Disaster Recovery.

Figure 12- CloudEndure Project

Figure 12- CloudEndure Project

After you create a project you must setup source and destination environments, replication settings and retrieve an authentication token.  You use the token to install CloudEndure Agent on source environment servers, in this case Azure VM. You do that by providing credentials (Access key and Secret access key) for an Identity and Access Management (IAM) user in your AWS account that has the necessary permissions to perform CloudEndure related API. The details are listed in the CloudEndure IAM policy.

Figure 13 - Project Credentials

Figure 13 – Project Credentials

Next step is to set your disaster recovery source and target. Source can be AWS Region, if you want to perform a Region to Region disaster recovery within AWS. The source is Other Infrastructure, if your source environment is a physical datacenter, virtual environment or any other public cloud.

For Azure, I choose Other Infrastructure. You also must provide the subnet in which the replication server on AWS will be launched in the staging area and the security group it will use. I keep the instance type for the replication server as default.

Figure 14 - Replication Settings

Figure 14 – Replication Settings

After you complete the setup in the replication setting screen you are presented with a How to Add Machines page that will give you the installation instructions for CloudEndure Agent and the installation token. Take a note of these details and get ready to switch to the Azure portal.

Figure 15 - CloudEndure How to Add Machines

Figure 15 – CloudEndure How to Add Machines

4-       Install CloudEndure Agent

My source environment is a VM that runs Windows 2019 datacenter and has two disks of size 8 and 15 GB. As shown in the preceding screenshot, I download the agent installer for Windows and then RDP to the VM. I run the command to install the agent. This is the output from my screen:

Figure 16- CloudEndure Agent Installation

Figure 16- CloudEndure Agent Installation

After installation has completed successfully, CloudEndure will start creating the replication server in the staging area using configurations you set up in the Blueprint (more about Blueprint in the next step). It will authenticate the replication server with the CloudEndure Service Manager.

  • Download the replication software
  • Create staging disks
  • Pair CloudEndure Agent on the source VM with the replication server,
  • Establish communication between the CloudEndure Agent and the replication server

The replication traffic is encrypted in transit using AES 256 bit, and is carried on TCP port 1500.  Make sure you have this port open on Azure side in the egress direction to AWS. CloudEndure then will start to initiate replication and when it finishes, the source VM will show in the dashboard with Continuous Data Protection CDP state in the Data Replication Progress column. The VM at this point is ready to be tested for disaster recovery.

Figure 17- Initial Replication Completed

Figure 17- Initial Replication Completed

5-       Set up the Blueprint

While I’m waiting for the initial data sync to complete and enter Continuous Data Protection state, I choose Machines in the CloudEndure dashboard to go to the Blueprint page.

The Blueprint is a set of instructions on how to launch a target machine. You can define Blueprint properties such as instance type, subnet, security group, IP address and others.

I select my target VM to run on a machine type that’s a copy of source. This means CloudEndure will match an instance type on AWS that provides at least the compute and memory properties on the Azure VM. I also select a subnet that I had already created in my disaster recovery VPC and choose a new IP to be assigned to the target server from that subnet.

Figure 18- CloudEndure Blueprint

Figure 18- CloudEndure Blueprint

6-       Perform Failover from Azure to AWS

The overall process for failing over starts with the replication. As we saw in the previous step, the replication starts with initial sync and then it enters the Continuous data protection state.

Multiple Point in Time (PIT) snapshots will then be created. Before you perform an actual failover, I recommend that you launch your target server in Test mode first.

You do that by selecting the machine and then select Launch Target Machine> Test Mode. You then must select the snapshot you want to recover from. CloudEndure DR by default allows you select from 60 snapshots per disk per month. I select Latest and then Recover.

After the Test machine has been created successfully, the Disaster Recovery Lifecycle column will show “Tested Recently” with a green bar to the left of the machine name. This means the machine is now ready to be failed over.

Figure 19- CloudEndure DR Snapshots

Figure 19- CloudEndure DR Snapshots

Figure 20- VM Ready for Failover

Figure 20- VM Ready for Failover

Steps for failover

  • Launch the machine in Recovery mode. This is similar to launching machines in Test Mode. Select the latest snapshot.
  • Perform the actual Failover. In this step you direct your users’ traffic from Azure to AWS. The steps here vary depending on the tool you use for directing traffic.
  • CloudEndure will create the final failed-over machine using the configurations you provided in Blueprint in step 5. At this point, the Disaster Recovery Lifecycle column will show “Failed Over”.
Figure 21 - Target Machine Created

Figure 21 – Target Machine Created

7-       Perform Failback from Azure to AWS

Whether you perform the disaster recovery failover as part of an annual DR testing exercise, or as part of an actual disaster, you may want to failback to the source environment after the outage/disaster is over. This functionality is called failback.

CloudEndure DR performs failback by resetting your project and reversing the replication direction from AWS to Azure. All data that was written to your recovery machine (on AWS) during the disaster, will be copied back to machines in your original source infrastructure (Azure).

The actual steps for failback depend on the source infrastructure. In case your source environment is AWS or VMWare, the steps are fully orchestrated. If your source infrastructure is physical or other public cloud providers, a Failback client will be necessary.

Steps for Failback

  • Prepare the project for failback by selecting the Prepare for Failback option in the Project Actions menu in the User Console.
Figure 22- Prepare Project for Failback

Figure 22- Prepare Project for Failback

 

Figure 22- Prepare Project for Failback

The project will show “Preparing for failback to original Source” next to the project type. After a few minutes, the Data Replication Progress column will show “Pair the CloudEndure Agent with the Replication Server”. This state will change only after you complete the installation of the failback client on Azure side in the next step.

Figure 23- Prepare for Failback

Figure 23- Prepare for Failback

Download the Failback Client from Replication Settings. The failback client is an ubuntu ISO image that acts as a standalone replication server. You must install it on the source VM and boot from it so that it starts to initiate the connection necessary to start the replication in the reversed direction, from AWS to Azure.

Figure 24- Download Failback Client

Figure 24- Download Failback Client

Prepare Failback client on Azure

Azure currently supports creating virtual machines from VHD. In order to use CloudEndure failback client, you need first to convert it from ISO to VHD format. The detailed steps to do this are outside the scope of this post, but, at a high-level, this is what you do:

  • Deploy the failback ISO image to a VMWare workstation.
  • Once the installation is complete, power off the VM and copy the resulted VMDK file
  • Use the following PowerShell script to convert the VMDK file into VHD. (You must change file locations to reflect your actual environments)
1.  Import-Module 'C:\Program Files\Microsoft Virtual Machine Converter\MvmcCmdlet.psd1<br />
2.  ConvertTo-MvmcVirtualHardDisk -SourceLiteralPath 'C:\Users\Administrator\Documents\Virtual Machines\failbackdvd\failbackdvd.vmdk' -VhdType FixedHardDisk -VhdFormat Vhd -DestinationLiteralPath C:\DR

 

Figure 25- Converting Failback Client

Figure 25- Converting Failback Client

Figure 26- Failback Client Converted

Figure 26- Failback Client Converted

  • Upload the resulted VHD file to your Storage account on Azure using the steps in Create an Azure Storage Account
  • Create an image using the VHD file you uploaded
  • Create a new VM using the image you created. This will act as the replication server. Later I will use it to restore the machine data from AWS to Azure after the replication completes. Make sure you attached the same disks to it as in the source machine on AWS described in step 4.
  • When I start the newly created VM, I see the following screen indicating that CloudEndure failback client has been loaded and is waiting to for CloudEndure authentication token to authenticate with CloudEndure Console. Details on where to retrieve your authentication token from CloudEndure Console is available in Installing the Agents.
Figure 27- Failback Client Starts

Figure 27- Failback Client Starts

I enter the authentication token and then the ID of the machine I want to replicate. Details to check a machine ID in CloudEndure Console are available in the Machines Page. The replication server will authenticate with CloudEndure Console, and will connect to the source machine on AWS on port TCP 1500 to initiate the replication. Make sure you have a security group to allow to allow this connectivity on Azure and AWS side.

Figure 28- Failback Client Authenticated

Figure 28- Failback Client Authenticated

I wait for the replication to complete and enter Continuous Data Protection state on the CloudEndure Console. This indicates the replication direction has been reversed successfully.

Figure 29- Replication Reversed Successfully

Figure 29- Replication Reversed Successfully

I go back to CloudEndure and follow the same steps as in step 6 to launch a Test and then a Recovery instance. One thing to consider here is when launching Test/Recovery instance, the failback client will reboot as part of the launch process. Make sure to manually detach the volume that contains the failback client.

The CloudEndure Console will now show “Pair with CloudEndure Agent”. This status will change only when you return your project to normal operation in the next step.

8-       Return to normal operation

At this point, we failed over from Azure to AWS as part of a disaster recovery exercise or an actual disaster. We then performed a failback from AWS to Azure by reversing the replication direction so that all data written during the outage/disaster on AWS side will be replicated back to Azure.

The last step in the process is to return back to normal operations. This will again reverse the replication from the original primary location in Azure to the original recovery target in AWS. You do so by selecting Project Actions> Return to Normal Operation.

Figure 30- Return to Normal Operation Completes

Figure 30- Return to Normal Operation Completes

The failback has now completed and your replication has been reversed back to its original direction from Azure to AWS.

Clean Up

By the end of this exercise, please make sure you delete the resources you created on both AWS and Azure sides, so you don’t incur charges in the future.

Conclusion

In this blog post, we built a demo environment to demonstrate CloudEndure DR functionality between Azure as a DR source site, and AWS as a DR target. I showed you how to create a VPN tunnel between the sites, start replication, failover from Azure to AWS and then failback from AWS to Azure (using CloudEndure failback client). Finally, I showed you how to return your project to its normal operation by reverting the replication to its original direction that we started with from Azure to AWS. More details can be found in CloudEndure Disaster Recovery. 

If you have questions, feel free to ask in the comments. I look forward to hearing from you.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

New – A Shared File System for Your Lambda Functions

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/

I am very happy to announce that AWS Lambda functions can now mount an Amazon Elastic File System (EFS), a scalable and elastic NFS file system storing data within and across multiple availability zones (AZ) for high availability and durability. In this way, you can use a familiar file system interface to store and share data across all concurrent execution environments of one, or more, Lambda functions. EFS supports full file system access semantics, such as strong consistency and file locking.

To connect an EFS file system with a Lambda function, you use an EFS access point, an application-specific entry point into an EFS file system that includes the operating system user and group to use when accessing the file system, file system permissions, and can limit access to a specific path in the file system. This helps keeping file system configuration decoupled from the application code.

You can access the same EFS file system from multiple functions, using the same or different access points. For example, using different EFS access points, each Lambda function can access different paths in a file system, or use different file system permissions.

You can share the same EFS file system with Amazon Elastic Compute Cloud (EC2) instances, containerized applications using Amazon ECS and AWS Fargate, and on-premises servers. Following this approach, you can use different computing architectures (functions, containers, virtual servers) to process the same files. For example, a Lambda function reacting to an event can update a configuration file that is read by an application running on containers. Or you can use a Lambda function to process files uploaded by a web application running on EC2.

In this way, some use cases are much easier to implement with Lambda functions. For example:

  • Processing or loading data larger than the space available in /tmp (512MB).
  • Loading the most updated version of files that change frequently.
  • Using data science packages that require storage space to load models and other dependencies.
  • Saving function state across invocations (using unique file names, or file system locks).
  • Building applications requiring access to large amounts of reference data.
  • Migrating legacy applications to serverless architectures.
  • Interacting with data intensive workloads designed for file system access.
  • Partially updating files (using file system locks for concurrent access).
  • Moving a directory and all its content within a file system with an atomic operation.

Creating an EFS File System
To mount an EFS file system, your Lambda functions must be connected to an Amazon Virtual Private Cloud that can reach the EFS mount targets. For simplicity, I am using here the default VPC that is automatically created in each AWS Region.

Note that, when connecting Lambda functions to a VPC, networking works differently. If your Lambda functions are using Amazon Simple Storage Service (S3) or Amazon DynamoDB, you should create a gateway VPC endpoint for those services. If your Lambda functions need to access the public internet, for example to call an external API, you need to configure a NAT Gateway. I usually don’t change the configuration of my default VPCs. If I have specific requirements, I create a new VPC with private and public subnets using the AWS Cloud Development Kit, or use one of these AWS CloudFormation sample templates. In this way, I can manage networking as code.

In the EFS console, I select Create file system and make sure that the default VPC and its subnets are selected. For all subnets, I use the default security group that gives network access to other resources in the VPC using the same security group.

In the next step, I give the file system a Name tag and leave all other options to their default values.

Then, I select Add access point. I use 1001 for the user and group IDs and limit access to the /message path. In the Owner section, used to create the folder automatically when first connecting to the access point, I use the same user and group IDs as before, and 750 for permissions. With this permissions, the owner can read, write, and execute files. Users in the same group can only read. Other users have no access.

I go on, and complete the creation of the file system.

Using EFS with Lambda Functions
To start with a simple use case, let’s build a Lambda function implementing a MessageWall API to add, read, or delete text messages. Messages are stored in a file on EFS so that all concurrent execution environments of that Lambda function see the same content.

In the Lambda console, I create a new MessageWall function and select the Python 3.8 runtime. In the Permissions section, I leave the default. This will create a new AWS Identity and Access Management (IAM) role with basic permissions.

When the function is created, in the Permissions tab I click on the IAM role name to open the role in the IAM console. Here, I select Attach policies to add the AWSLambdaVPCAccessExecutionRole and AmazonElasticFileSystemClientReadWriteAccess AWS managed policies. In a production environment, you can restrict access to a specific VPC and EFS access point.

Back in the Lambda console, I edit the VPC configuration to connect the MessageWall function to all subnets in the default VPC, using the same default security group I used for the EFS mount points.

Now, I select Add file system in the new File system section of the function configuration. Here, I choose the EFS file system and accesss point I created before. For the local mount point, I use /mnt/msg and Save. This is the path where the access point will be mounted, and corresponds to the /message folder in my EFS file system.

In the Function code editor of the Lambda console, I paste the following code and Save.

import os
import fcntl

MSG_FILE_PATH = '/mnt/msg/content'


def get_messages():
    try:
        with open(MSG_FILE_PATH, 'r') as msg_file:
            fcntl.flock(msg_file, fcntl.LOCK_SH)
            messages = msg_file.read()
            fcntl.flock(msg_file, fcntl.LOCK_UN)
    except:
        messages = 'No message yet.'
    return messages


def add_message(new_message):
    with open(MSG_FILE_PATH, 'a') as msg_file:
        fcntl.flock(msg_file, fcntl.LOCK_EX)
        msg_file.write(new_message + "\n")
        fcntl.flock(msg_file, fcntl.LOCK_UN)


def delete_messages():
    try:
        os.remove(MSG_FILE_PATH)
    except:
        pass


def lambda_handler(event, context):
    method = event['requestContext']['http']['method']
    if method == 'GET':
        messages = get_messages()
    elif method == 'POST':
        new_message = event['body']
        add_message(new_message)
        messages = get_messages()
    elif method == 'DELETE':
        delete_messages()
        messages = 'Messages deleted.'
    else:
        messages = 'Method unsupported.'
    return messages

I select Add trigger and in the configuration I select the Amazon API Gateway. I create a new HTTP API. For simplicity, I leave my API endpoint open.

With the API Gateway trigger selected, I copy the endpoint of the new API I just created.

I can now use curl to test the API:

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
No message yet.
$ curl -X POST -H "Content-Type: text/plain" -d 'Hello from EFS!' https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!

$ curl -X POST -H "Content-Type: text/plain" -d 'Hello again :)' https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!
Hello again :)

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!
Hello again :)

$ curl -X DELETE https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Messages deleted.

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
No message yet.

It would be relatively easy to add unique file names (or specific subdirectories) for different users and extend this simple example into a more complete messaging application. As a developer, I appreciate the simplicity of using a familiar file system interface in my code. However, depending on your requirements, EFS throughput configuration must be taken into account. See the section Understanding EFS performance later in the post for more information.

Now, let’s use the new EFS file system support in AWS Lambda to build something more interesting. For example, let’s use the additional space available with EFS to build a machine learning inference API processing images.

Building a Serverless Machine Learning Inference API
To create a Lambda function implementing machine learning inference, I need to be able, in my code, to import the necessary libraries and load the machine learning model. Often, when doing so, the overall size of those dependencies goes beyond the current AWS Lambda limits in the deployment package size. One way of solving this is to accurately minimize the libraries to ship with the function code, and then download the model from an S3 bucket straight to memory (up to 3 GB, including the memory required for processing the model) or to /tmp (up 512 MB). This custom minimization and download of the model has never been easy to implement. Now, I can use an EFS file system.

The Lambda function I am building this time needs access to the public internet to download a pre-trained model and the images to run inference on. So I create a new VPC with public and private subnets, and configure a NAT Gateway and the route table used by the the private subnets to give access to the public internet. Using the AWS Cloud Development Kit, it’s just a few lines of code.

I create a new EFS file system and an access point in the new VPC using similar configurations as before. This time, I use /ml for the access point path.

Then, I create a new MLInference Lambda function with the same set up as before for permissions and connect the function to the private subnets of the new VPC. Machine learning inference is quite a heavy workload, so I select 3 GB for memory and 5 minutes for timeout. In the File system configuration, I add the new access point and mount it under /mnt/inference.

The machine learning framework I am using for this function is PyTorch, and I need to put the libraries required to run inference in the EFS file system. I launch an Amazon Linux EC2 instance in a public subnet of the new VPC. In the instance details, I select one of the availability zones where I have an EFS mount point, and then Add file system to automatically mount the same EFS file system I am using for the function. For the security groups of the EC2 instance, I select the default security group (to be able to mount the EFS file system) and one that gives inbound access to SSH (to be able to connect to the instance).

I connect to the instance using SSH and create a requirements.txt file containing the dependencies I need:

torch
torchvision
numpy

The EFS file system is automatically mounted by EC2 under /mnt/efs/fs1. There, I create the /ml directory and change the owner of the path to the user and group I am using now that I am connected (ec2-user).

$ sudo mkdir /mnt/efs/fs1/ml
$ sudo chown ec2-user:ec2-user /mnt/efs/fs1/ml

I install Python 3 and use pip to install the dependencies in the /mnt/efs/fs1/ml/lib path:

$ sudo yum install python3
$ pip3 install -t /mnt/efs/fs1/ml/lib -r requirements.txt

Finally, I give ownership of the whole /ml path to the user and group I used for the EFS access point:

$ sudo chown -R 1001:1001 /mnt/efs/fs1/ml

Overall, the dependencies in my EFS file system are using about 1.5 GB of storage.

I go back to the MLInference Lambda function configuration. Depending on the runtime you use, you need to find a way to tell where to look for dependencies if they are not included with the deployment package or in a layer. In the case of Python, I set the PYTHONPATH environment variable to /mnt/inference/lib.

I am going to use PyTorch Hub to download this pre-trained machine learning model to recognize the kind of bird in a picture. The model I am using for this example is relatively small, about 200 MB. To cache the model on the EFS file system, I set the TORCH_HOME environment variable to /mnt/inference/model.

All dependencies are now in the file system mounted by the function, and I can type my code straight in the Function code editor. I paste the following code to have a machine learning inference API:

import urllib
import json
import os

import torch
from PIL import Image
from torchvision import transforms

transform_test = transforms.Compose([
    transforms.Resize((600, 600), Image.BILINEAR),
    transforms.CenterCrop((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

model = torch.hub.load('nicolalandro/ntsnet-cub200', 'ntsnet', pretrained=True,
                       **{'topN': 6, 'device': 'cpu', 'num_classes': 200})
model.eval()


def lambda_handler(event, context):
    url = event['queryStringParameters']['url']

    img = Image.open(urllib.request.urlopen(url))
    scaled_img = transform_test(img)
    torch_images = scaled_img.unsqueeze(0)

    with torch.no_grad():
        top_n_coordinates, concat_out, raw_logits, concat_logits, part_logits, top_n_index, top_n_prob = model(torch_images)

        _, predict = torch.max(concat_logits, 1)
        pred_id = predict.item()
        bird_class = model.bird_classes[pred_id]
        print('bird_class:', bird_class)

    return json.dumps({
        "bird_class": bird_class,
    })

I add the API Gateway as trigger, similarly to what I did before for the MessageWall function. Now, I can use the serverless API I just created to analyze pictures of birds. I am not really an expert in the field, so I looked for a couple of interesting images on Wikipedia:

I call the API to get a prediction for these two pictures:

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MLInference?url=https://path/to/image/atlantic-puffin.jpg

{"bird_class": "106.Horned_Puffin"}

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MLInference?url=https://path/to/image/western-grebe.jpg

{"bird_class": "053.Western_Grebe"}

It works! Looking at Amazon CloudWatch Logs for the Lambda function, I see that the first invocation, when the function loads and prepares the pre-trained model for inference on CPUs, takes about 30 seconds. To avoid a slow response, or a timeout from the API Gateway, I use Provisioned Concurrency to keep the function ready. The next invocations take about 1.8 seconds.

Understanding EFS Performance
When using EFS with your Lambda function, is very important to understand how EFS performance works. For throughput, each file system can be configured to use bursting or provisioned mode.

When using bursting mode, all EFS file systems, regardless of size, can burst at least to 100 MiB/s of throughput. Those over 1 TiB in the standard storage class can burst to 100 MiB/s per TiB of data stored in the file system. EFS uses a credit system to determine when file systems can burst. Each file system earns credits over time at a baseline rate that is determined by the size of the file system that is stored in the standard storage class. A file system uses credits whenever it reads or writes data. The baseline rate is 50 KiB/s per GiB of storage.

You can monitor the use of credits in CloudWatch, each EFS file system has a BurstCreditBalance metric. If you see that you are consuming all credits, and the BurstCreditBalance metric is going to zero, you should enable provisioned throughput mode for the file system, from 1 to 1024 MiB/s. There is an additional cost when using provisioned throughput, based on how much throughput you are adding on top of the baseline rate.

To avoid running out of credits, you should think of the throughput as the average you need during the day. For example, if you have a 10GB file system, you have 500 KiB/s of baseline rate, and every day you can read/write 500 KiB/s * 3600 seconds * 24 hours = 43.2 GiB.

If the libraries and everything you function needs to load during initialization are about 2 GiB, and you only access the EFS file system during function initialization, like in the MLInference Lambda function above, that means you can initialize your function (for example because of updates or scaling up activities) about 20 times per day. That’s not a lot, and you would probably need to configure provisioned throughput for the EFS file system.

If you have 10 MiB/s of provisioned throughput, then every day you have 10 MiB/s * 3600 seconds * 24 hours = 864 GiB to read or write. If you only use the EFS file system at function initialization to read about 2 GB of dependencies, it means that you can have 400 initializations per day. That may be enough for your use case.

In the Lambda function configuration, you can also use the reserve concurrency control to limit the maximum number of execution environments used by a function.

If, by mistake, the BurstCreditBalance goes down to zero, and the file system is relatively small (for example, a few GiBs), there is the possibility that your function gets stuck and can’t execute fast enough before reaching the timeout. In that case, you should enable (or increase) provisioned throughput for the EFS file system, or throttle your function by setting the reserved concurrency to zero to avoid all invocations until the EFS file system has enough credits.

Understanding Security Controls
When using EFS file systems with AWS Lambda, you have multiple levels of security controls. I’m doing a quick recap here because they should all be considered during the design and implementation of your serverless applications. You can find more info on using IAM authorization and access points with EFS in this post.

To connect a Lambda function to an EFS file system, you need:

  • Network visibility in terms of VPC routing/peering and security group.
  • IAM permissions for the Lambda function to access the VPC and mount (read only or read/write) the EFS file system.
  • You can specify in the IAM policy conditions which EFS access point the Lambda function can use.
  • The EFS access point can limit access to a specific path in the file system.
  • File system security (user ID, group ID, permissions) can limit read, write, or executable access for each file or directory mounted by a Lambda function.

The Lambda function execution environment and the EFS mount point uses industry standard Transport Layer Security (TLS) 1.2 to encrypt data in transit. You can provision Amazon EFS to encrypt data at rest. Data encrypted at rest is transparently encrypted while being written, and transparently decrypted while being read, so you don’t have to modify your applications. Encryption keys are managed by the AWS Key Management Service (KMS), eliminating the need to build and maintain a secure key management infrastructure.

Available Now
This new feature is offered in all regions where AWS Lambda and Amazon EFS are available, with the exception of the regions in China, where we are working to make this integration available as soon as possible. For more information on availability, please see the AWS Region table. To learn more, please see the documentation.

EFS for Lambda can be configured using the console, the AWS Command Line Interface (CLI), the AWS SDKs, and the Serverless Application Model. This feature allows you to build data intensive applications that need to process large files. For example, you can now unzip a 1.5 GB file in a few lines of code, or process a 10 GB JSON document. You can also load libraries or packages that are larger than the 250 MB package deployment size limit of AWS Lambda, enabling new machine learning, data modelling, financial analysis, and ETL jobs scenarios.

Amazon EFS for Lambda is supported at launch in AWS Partner Network solutions, including Epsagon, Lumigo, Datadog, HashiCorp Terraform, and Pulumi.

There is no additional charge for using EFS from Lambda functions. You pay the standard price for AWS Lambda and Amazon EFS. Lambda execution environments always connect to the right mount target in an AZ and not across AZs. You can connect to EFS in the same AZ via cross account VPC but there can be data transfer costs for that. We do not support cross region, or cross AZ connectivity between EFS and Lambda.

Danilo

UtahFS: Encrypted File Storage

Post Syndicated from Brendan McMillion original https://blog.cloudflare.com/utahfs/

UtahFS: Encrypted File Storage

Encryption is one of the most powerful technologies that everyone uses on a daily basis without realizing it. Transport-layer encryption, which protects data as it’s sent across the Internet to its intended destination, is now ubiquitous because it’s a fundamental tool for creating a trustworthy Internet. Disk encryption, which protects data while it’s sitting idly on your phone or laptop’s hard drive, is also becoming ubiquitous because it prevents anybody who steals your device from also being able to see what’s on your desktop or read your email.

The next improvement on this technology that’s starting to gain popularity is end-to-end encryption, which refers to a system where only the end-users are able to access their data — not any intermediate service providers. Some of the most popular examples of this type of encryption are chat apps like WhatsApp and Signal. End-to-end encryption significantly reduces the likelihood of a user’s data being maliciously stolen from, or otherwise mishandled by a service provider. This is because even if the service provider loses the data, nobody will have the keys to decrypt it!

Several months ago, I realized that I had a lot of sensitive files on my computer (my diary, if you must know) that I was afraid of losing, but I didn’t feel comfortable putting them in something like Google Drive or Dropbox. While Google and Dropbox are absolutely trustworthy companies, they don’t offer encryption and this is a case where I would really wanted complete control of my data.

From looking around, it was hard for me to find something that met all of my requirements:

  1. Would both encrypt and authenticate the directory structure, meaning that file names are hidden and it’s not possible for others to move or rename files.
  2. Viewing/changing part of a large file doesn’t require downloading and decrypting the entire file.
  3. Is open-source and has a documented protocol.

So I set out to build such a system! The end result is called UtahFS, and the code for it is available here. Keep in mind that this system is not used in production at Cloudflare: it’s a proof-of-concept that I built while working on our Research Team. The rest of this blog post describes why I built it as I did, but there’s documentation in the repository on actually using it if you want to skip to that.

Storage Layer

The first and most important part of a storage system is… the storage. For this, I used Object Storage, because it’s one of the cheapest and most reliable ways to store data on somebody else’s hard drives. Object storage is nothing more than a key-value database hosted by a cloud provider, often tuned for storing values around a few kilobytes in size. There are a ton of different providers with different pricing schemes like Amazon S3, Backblaze B2, and Wasabi. All of them are capable of storing terabytes of data, and many also offer geographic redundancy.

Data Layer

One of the requirements that was important to me was that it shouldn’t be necessary to download and decrypt an entire file before being able to read a part of it. One place where this matters is audio and video files, because it enables playback to start quickly. Another case is ZIP files: a lot of file browsers have the ability to explore compressed archives, like ZIP files, without decompressing them. To enable this functionality, the browser needs to be able to read a specific part of the archive file, decompress just that part, and then move somewhere else.

Internally, UtahFS never stores objects that are larger than a configured size (32 kilobytes, by default). If a file has more than that amount of data, the file is broken into multiple objects which are connected by a skip list. A skip list is a slightly more complicated version of a linked list that allows a reader to move to a random position quickly by storing additional pointers in each block that point further than just one hop ahead.

When blocks in a skip list are no longer needed, because a file was deleted or truncated, they’re added to a special “trash” linked list. Elements of the trash list can then be recycled when blocks are needed somewhere else, to create a new file or write more data to the end of an existing file, for example. This maximizes reuse and means new blocks only need to be created when the trash list is empty. Some readers might recognize this as the Linked Allocation strategy described in The Art of Computer Programming: Volume 1, section 2.2.3!

UtahFS: Encrypted File Storage

The reason for using Linked Allocation is fundamentally that it’s the most efficient for most operations. But also, it’s the approach for allocating memory that’s going to be most compatible with the cryptography we talk about in the next three sections.

Encryption Layer

Now that we’ve talked about how files are broken into blocks and connected by a skip list, we can talk about how the data is actually protected. There are two aspects to this:

The first is confidentiality, which hides the contents of each block from the storage provider. This is achieved simply by encrypting each block with AES-GCM, with a key derived from the user’s password.

While simple, this scheme doesn’t provide forward secrecy or post-compromise security. Forward Secrecy means that if the user’s device was compromised, an attacker wouldn’t be able to read deleted files. Post-Compromise Security means that once the user’s device is no longer compromised, an attacker wouldn’t be able to read new files. Unfortunately, providing either of these guarantees means storing cryptographic keys on the user’s device that would need to be synchronized between devices and, if lost, would render the archive unreadable.

This scheme also doesn’t protect against offline password cracking, because an attacker can take any of the encrypted blocks and keep guessing passwords until they find one that works. This is somewhat mitigated by using Argon2, which makes guessing passwords more expensive, and by recommending that users choose strong passwords.

I’m definitely open to improving the encryption scheme in the future, but considered the security properties listed above too difficult and fragile for the initial release.

Integrity Layer

The second aspect of data protection is integrity, which ensures the storage provider hasn’t changed or deleted anything. This is achieved by building a Merkle Tree over the user’s data. Merkle Trees are described in-depth in our blog post about Certificate Transparency. The root hash of the Merkle Tree is associated with a version number that’s incremented with each change, and both the root hash and the version number are authenticated with a key derived from the user’s password. This data is stored in two places: under a special key in the object storage database, and in a file on the user’s device.

Whenever the user wants to read a block of data from the storage provider, they first request the root stored remotely and check that it’s either the same as what they have on disk, or has a greater version number than what’s on disk. Checking the version number prevents the storage provider from reverting the archive to a previous (valid) state undetected. Any data which is read can then be verified against the most recent root hash, which prevents any other types of modifications or deletions.

Using a Merkle Tree here has the same benefit as it does for Certificate Transparency: it allows us to verify individual pieces of data without needing to download and verify everything at once. Another common tool used for data integrity is called a Message Authentication Code (or MAC), and while it’s a lot simpler and more efficient, it doesn’t have a way to do only partial verification.

The one thing our use of Merkle Trees doesn’t protect against is forking, where the storage provider shows different versions of the archive to different users. However, detecting forks would require some kind of gossip between users, which is beyond the scope of the initial implementation for now.

Hiding Access Patterns

Oblivious RAM, or ORAM, is a cryptographic technique for reading and writing to random-access memory in a way that hides which operation was performed (a read, or a write) and to which part of memory the operation was performed, from the memory itself! In our case, the ‘memory’ is our object storage provider, which means we’re hiding from them which pieces of data we’re accessing and why. This is valuable for defending against traffic analysis attacks, where an adversary with detailed knowledge of a system like UtahFS can look at the requests it makes, and infer the contents of encrypted data. For example, they might see that you upload data at regular intervals and almost never download, and infer that you’re storing automated backups.

The simplest implementation of ORAM would consist of always reading the entire memory space and then rewriting the entire memory space with all new values, any time you want to read or write an individual value. An adversary looking at the pattern of memory accesses wouldn’t be able to tell which value you actually wanted, because you always touch everything. This would be incredibly inefficient, however.

The construction we actually use, which is called Path ORAM, abstracts this simple scheme a little bit to make it more efficient. First, it organizes the blocks of memory into a binary tree, and second, it keeps a client-side table that maps application-level pointers to random leafs in the binary tree. The trick is that a value is allowed to live in any block of memory that’s on the path between its assigned leaf and the root of the binary tree.

Now, when we want to lookup the value that a pointer goes to, we look in our table for its assigned leaf, and read all the nodes on the path between the root and that leaf. The value we’re looking for should be on this path, so we already have what we need! And in the absence of any other information, all the adversary saw is that we read a random path from the tree.

UtahFS: Encrypted File Storage
What looks like a random path is read from the tree, that ends up containing the data we’re looking for.

However, we still need to hide whether we’re reading or writing, and to re-randomize some memory to ensure this lookup can’t be linked with others we make in the future. So to re-randomize, we assign the pointer we just read to a new leaf and move the value from whichever block it was stored in before to a block that’s a parent of both the new and old leaves. (In the worst case, we can use the root block since the root is a parent of everything.) Once the value is moved to a suitable block and done being consumed/modified by the application, we re-encrypt all the blocks we fetched and write them back to memory. This puts the value in the path between the root and its new leaf, while only changing the blocks of memory we’ve already fetched.

UtahFS: Encrypted File Storage

This construction is great because we’ve only had to touch the memory assigned to a single random path in a binary tree, which is a logarithmic amount of work relative to the total size of our memory. But even if we read the same value again and again, we’ll touch completely random paths from the tree each time! There’s still a performance penalty caused by the additional memory lookups though, which is why ORAM support is optional.

Wrapping Up

Working on this project has been really rewarding for me because while a lot of the individual layers of the system seem simple, they’re the result of a lot of refinement and build up into something complex quickly. It was difficult though, in that I had to implement a lot of functionality myself instead of reuse other people’s code. This is because building end-to-end encrypted systems requires carefully integrating security into every feature, and the only good way to do that is from the start. I hope UtahFS is useful for others interested in secure storage.

Running a high-performance SAS Grid Manager cluster on AWS with Amazon FSx for Lustre

Post Syndicated from Neelam original https://aws.amazon.com/blogs/big-data/running-a-high-performance-sas-grid-manager-cluster-on-aws-with-amazon-fsx-for-lustre/

SAS® is a software provider of data science and analytics used by enterprises and government organizations. SAS Grid is a highly available, fast processing analytics platform that offers centralized management that balances workloads across different compute nodes. This application suite is capable of data management, visual analytics, governance and security, forecasting and text mining, statistical analysis, and environment management. SAS and AWS recently performed testing using the Amazon FSx for Lustre shared file system to determine how well a standard workload performs on AWS using SAS Grid Manager. For more information about the results, see the whitepaper Accelerating SAS Using High-Performing File Systems on Amazon Web Services.

In this post, we take a look at an approach to deploy underlying AWS infrastructure to run SAS Grid with FSx for Lustre that you can also apply to similar applications with demanding I/O requirements.

System design overview

Running high-performance workloads that use throughput heavily, with sensitivity to network latency, requires approaches outside of typical applications. AWS generally recommends that applications span multiple Availability Zones for high availability. In the case of latency sensitivity, high throughput applications traffic should be local for optimal performance. To maximize throughput, you can do the following:

  • Run in a virtual private cloud (VPC), using instance types that support enhanced networking
  • Run instances in the same Availability Zone
  • Run instances within a placement group

The following diagram illustrates the SAS Grid with FSx for Lustre architecture on AWS.

SAS Grid architecture consists of mid-tier nodes, metadata servers, and Grid compute nodes. The mid-tier nodes are responsible for running the Platform Web Services (PWS) and Load Sharing Facility (LSF) components. These components dispatch jobs submitted and return the status of each job.

To effectively run PWS and LSF on mid-tier nodes, you need Amazon Elastic Compute Cloud (Amazon EC2) instances with high memory. For this use case, the r5 instance family would meet this requirement.

Metadata servers contain the metadata repository that stores the metadata definitions of all SAS Grid manager products, which the r5 instance family can also serve effectively. We recommend either meeting or exceeding the recommended memory requirement of 24 GB of RAM or 8 MB per physical core (whichever is larger). Metadata servers don’t need compute-intensive resources or high I/O bandwidth; therefore, you can choose the r5 instance family for a balance of price and performance.

SAS Grid nodes are responsible for executing the jobs received by the grid, and EC2 instances capable of handling these jobs depend on the size, complexity, and volume of the work the grid performs. To meet the minimum requirements of SAS Grid workloads, we recommend having a minimum of 8 GB of physical RAM per core and a robust I/O throughput of 100–125 MB/second per physical core. For this use case, EC2 instance families of m5n and r5n suffice in meeting the RAM and throughput requirements. You can host SASDATA, SASWORK, and UTILLOC libraries in a shared file system. If you choose to offload SASWORK to instance storage, the i3en instance family meets this need because they support instance storage over 1.2 TB. In the next section, we take a look at how throughput testing was performed to arrive at the EC2 instance recommendations with FSx for Lustre.

Steps to maximize storage I/O performance

SAS Grid requires a shared file system, and we wanted to benchmark the performance of FSx for Lustre as the chosen shared file system against various EC2 instance families that meet the minimum requirements of 8 GB of physical RAM per core and 100–125 MB/second throughput per physical core.

FSx for Lustre is a fully managed file storage service designed for applications that require fast storage. As a POSIX-compliant file system, you can use FSx for Lustre with current Linux-based applications without having to make any changes. Although FSx for Lustre offers a choice between scratch and persistent type file systems, we recommend for SAS Grid to use persistent type FSx for Lustre file system because you need to store the SASWORK, SASDATA, and UTILLOC data and libraries for longer periods with high availability and data durability. To meet I/O throughput, make sure to select the appropriate storage capacity for throughput per unit of storage to achieve the desired range of 100–125 MB/second.

After setting up the file system, we recommend mounting FSx for Lustre with the flock mount option. The following code example is a mount command and mount option for FSx for Lustre:

$ sudo mount -t lustre -o noatime,flock fs-0123456789abcd.fsx.us-west- [email protected]:/za3atbmv /fsx
$ mount -t lustre
[email protected]:/za3atbmv on /fsx type lustre

(rw,noatime,seclabel,flock,lazystatfs)

Throughput testing and results

To select the best-placed EC2 instances for running SAS Grid with FSx for Lustre, we ran a series of highly parallel network throughput tests from individual EC2 instances against a 100.8 TiB persistent file system that had an aggregate throughput capacity of 19.688 GB/second. We ran these tests in multiple regions using multiple EC2 instance families (c5, c5n, i3, i3en, m5, m5a, m5ad, m5n, m5dn, r5, r5a, r5ad, r5n, and r5dn). The tests ran for 3 hours for each instance, and the DataWriteBytes metric of the file system was recorded every 1 minute. Only one instance was accessing the file system at a time, and the p99.9 results were captured. The metrics were consistent across all four Regions.

We observed that the i3en, m5n, m5dn, r5n, and r5dn EC2 instance families meet or exceed the minimum network performance and memory recommendations. For more information about the performance results, see the whitepaper Accelerating SAS Using High-Performing File Systems on Amazon Web Services. The i3 instance family is just shy of meeting the minimum network performance. If you want to use the instance storage for SASWORK and UTILLOC libraries, you can consider i3en instances.

M5n and r5n are a good blend of price and performance, and we recommend the m5n instance family for SAS Grid nodes. However, if your workload is memory bound, consider using r5n instances, which provide higher memory per physical core for a higher price point than m5n instances.

We also ran rhel_iotest.sh, which is available from the SAS technical support samples tool repository (SASTSST), using the same FSx for Lustre configuration as mentioned earlier. The following table shows the read and write performance per physical core for a variety of instances sizes in the m5n and r5n families.

Instance Type

Variable Network Performance Peak per Physical Core
Read (MB/second)Write (MB/second)
m5n.large850.20357.07
m5n.xlarge519.46386.25
m5n.2xlarge283.01446.84
m5n.4xlarge202.89376.57
m5n.8xlarge154.98297.71
r5n.large906.88429.93
r5n.xlarge488.36455.76
r5n.2xlarge256.96471.65
r5n.4xlarge203.31390.03
r5n.8xlarge149.63299.45

To take advantage of the elasticity, scalability, and flexibility of the cloud, we recommend spreading the SAS Grid and compute workload over a larger number of smaller instances versus using a smaller number of larger instances. For mid-tier, use a minimum of two instances, and for metadata servers, we recommend a minimum of three instances for the SAS Grid architecture.

Conclusions

Before FSx for Lustre file system, you either had to use Amazon Elastic File System (Amazon EFS) or a third-party file system from AWS Marketplace and Amazon Elastic Block Store (Amazon EBS) for the SASWORK, SASDATA, and UTILLOC libraries and storage data. Each storage option came with its own settings and limitations, which caused loss in performance. With FSx for Lustre, you have a single solution for all SAS Grid storage requirements, which allows you to focus on running your business instead of maintaining a file system. We recommend that SAS admin deploy SAS Grid with m5n and r5n instances for SAS Grid compute nodes when accessing FSx for Lustre file system.

If you have questions or suggestions, please leave a comment.

Amazon FSx for Windows File Server – Storage Size and Throughput Capacity Scaling

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-fsx-for-windows-file-server-storage-size-and-throughput-capacity-scaling/

Amazon FSx for Windows File Server provides fully managed, highly reliable file storage that is accessible over the Server Message Block (SMB) protocol. It is built on Windows Server, delivering a wide range of administrative features such as user quotas, end-user file restore, and Microsoft Active Directory integration, consistent with operating an on-premises Microsoft Windows file server. Today, we are happy to announce two new features: storage capacity scaling and throughput capacity scaling. The storage capacity scaling allows you to increase your file system size as your data set increases, and throughput capacity is bidirectional letting you can adjust throughput up or down dynamically to help fine-tune performance and reduce costs. With the capability to grow storage capacity, you can adjust your storage size as your data sets grow, so you don’t need to worry about growing data sets when creating the file system. With the capability to change throughput capacity, you can dynamically adjust throughput capacity for cyclical workloads or for one-time bursts to achieve a time-sensitive goal such as data migration.

When we create a file system, we specify Storage Capacity and Throughput Capacity.

The storage capacity of SSD can be specified between 32 GiB and 65,536 GiB, and the capacity of HDD can be specified between 2,000 GiB and 65,536 GiB. With throughput capacity, every Amazon FSx file system has a throughput capacity that you configure when the file system is created. The throughput capacity determines the speed at which the file server hosting your file system can serve file data to clients accessing it. Higher levels of throughput capacity also come with more memory for caching data on the file server and support higher levels of IOPS.

With this release, you can scale up storage capacity and can scale up / down throughput capacity on your file system with the click of a button within the AWS Management Console, or you can use the AWS Software Development Kit (SDK) or Command Line Interface (CLI) tools. The file system is available online while scaling is in progress and you’ll have full access to it for storage scaling. During scaling throughput, Amazon FSx for Windows switches out the file servers on your file system, so you’ll see an automatic failover and failback on multi-AZ file systems.

So, let’s have a little trip through the new feature. We’ll look at the AWS Management Console at first.

Operation by AWS Management Console

Before we begin, we assume AWS Managed Microsoft AD by AWS Directory Service and Amazon FSx for Windows File Server are already set up. You can obtain a walkthrough guide here. With Actions drop down, we can select Update storage capacity and Update throughput capacity

We can assign new storage capacity by Percentage or Absolute value.

With throughput scaling, we can select the desired capacity from the drop down list.

Then, Status is changed to In Progress, and you still have access to the file system.

Scaling Storage Capacity and Throughput Capacity via CLI

First, we need a CLI environment. I prefer to work on AWS Cloud9, but you can use whatever you want. We need to know the file system ID to scale it. Type in the command below:

aws fsx – endpoint-url <endpoint> describe-file-systems

The endpoint differs among AWS Regions, and you can get a full list here. We’ll get a return, which is long and detailed. The file system ID is at the top of the return.

Let’s change Storage Capacity. The command below is the one to change it:

aws fsx – endpoint-url <endpoint> update-file-system – file-system-id=<FileSystemId> – storage-capacity <new capacity>

The <new capacity> should be a number up to 65536, and the new assigned capacity should be at least 10% larger than the current capacity. Once we type in the command, the new capacity is available for use within minutes. Once the new storage capacity is available on our file system, Amazon FSx begins storage optimization, which is the process of migrating the file system’s data to the new, larger disks. If needed, we can accelerate the storage optimization process at any time by temporarily increasing the file system’s throughput capacity. There is minimal performance impact while Amazon FSx performs these operations in the background, and we always have full access to our file system.

If you enter the following command, you’ll see that file system update is in “IN_PROGESS” and storage optimization is in “PENDING” at the bottom part of the log return.

aws fsx – endpoint-url <endpoint> describe-file-systems

After the storage optimization process begins:

We can also go further and run throughput scaling at the same time. Type the command below:

aws fsx – endpoint-url <endpoint> update-file-system – file-system-id=<FileSystemId> – windows-configuration ThroughputCapacity=<new capacity>

The “new capacity” should be <8 or 16 or 32 or 64 or 128 or 256 or 512 or 1024 or 2048> and should be larger than the current capacity.

Now, we can see that throughput scaling and storage optimization are both in progress. Again, we still have full access to the file system.

With throughput scaling, we can select the desired capacity from the drop down list.

When we need further large capacity more than 65,536 GiB, we can use Microsoft’s Distributed File System (DFS) Namespaces to group multiple file systems under a single namespace.

Available Today

Storage capacity scaling and throughput capacity scaling are available today for all AWS Regions where Amazon FSx for Windows File Server is available. This support is available for new file systems starting today, and will be expanded to all file system in the coming weeks. Check our documentation for more details.

– Kame;