Tag Archives: AWS Storage Gateway

Designing a hybrid AI/ML data access strategy with Amazon SageMaker

Post Syndicated from Franklin Aguinaldo original https://aws.amazon.com/blogs/architecture/designing-a-hybrid-ai-ml-data-access-strategy-with-amazon-sagemaker/

Over time, many enterprises have built an on-premises cluster of servers, accumulating data, and then procuring more servers and storage. They often begin their ML journey by experimenting locally on their laptops. Investment in artificial intelligence (AI) is at a different stage in every business organization. Some remain completely on-premises, others are hybrid (both on-premises and cloud), and the remaining have moved completely into the cloud for their AI and machine learning (ML) workloads.

These enterprises are also researching or have started using the cloud to augment their on-premises systems for several reasons. As technology improves, both the size and quantity of data increases over time. The amount of data captured and the number of datapoints continues to expand, which presents a challenge to manage on-premises. Many enterprises are distributed, with offices in different geographic regions, continents, and time zones. While it is possible to increase the on-premises footprint and network pipes, there are still hidden costs to consider for maintenance and upkeep. These organizations are looking to the cloud to shift some of that effort and enable them to burst and use the rich AI and ML features on the cloud.

Defining a hybrid data access strategy

Moving ML workloads into the cloud calls for a robust hybrid data strategy describing how and when you will connect your on-premises data stores to the cloud. For most, it makes sense to make the cloud the source of truth, while still permitting your teams to use and curate datasets on-premises. Defining the cloud as source of truth for your datasets means the primary copy will be in the cloud and any dataset generated will be stored in the same location in the cloud. This ensures that requests for data is served from the primary copy and any derived copies.

A hybrid data access strategy should address the following:

Understand your current and future storage footprint for ML on-premises. Create a map of your ML workloads, along with performance and access requirements for testing and training.
Define connectivity across on-premises locations and the cloud. This includes east-west and north-south traffic to support interconnectivity between sites, required bandwidth, and throughput for the data movement workload requirements.
Define your single source of truth (SSOT)[1] and where the ML datasets will primarily live. Consider how dated, new, hot, and cold data will be stored.
Define your storage performance requirements, mapping them to the appropriate cloud storage services. This will give you the ability to take advantage of cloud-native ML with Amazon SageMaker.

Hybrid data access strategy architecture

To help address these challenges, we worked on outlining an end-to-end system architecture in Figure 1 that defines: 1) connectivity between on-premises data centers and AWS Regions; 2) mappings for on-premises data to the cloud; and 3) Aligning Amazon SageMaker to appropriate storage, based on ML requirements.

AI/ML hybrid data access strategy reference architecture

Figure 1. AI/ML hybrid data access strategy reference architecture

Let’s explore this architecture step by step.

  1. On-premises connectivity to the AWS Cloud runs through AWS Direct Connect for high transfer speeds.
  2. AWS DataSync is used for migrating large datasets into Amazon Simple Storage Service (Amazon S3). AWS DataSync agent is installed on-premises.
  3. On-premises network file system (NFS) or server message block (SMB) data is bridged to the cloud through Amazon S3 File Gateway, using either a virtual machine (VM) or hardware appliance.
  4. AWS Storage Gateway uploads data into Amazon S3 and caches it on-premises.
  5. Amazon S3 is the source of truth for ML assets stored on the cloud.
  6. Download S3 data for experimentation to Amazon SageMaker Studio.
  7. Amazon SageMaker notebooks instances can access data through S3, Amazon FSx for Lustre, and Amazon Elastic File System. Use Amazon File Cache for high-speed caching for access to on-premises data, and Amazon FSx for NetApp ONTAP for cloud bursting.
  8. SageMaker training jobs can use data in Amazon S3, EFS, and FSx for Lustre. S3 data is accessed via File, Fast File, or Pipe mode, and pre-loaded or lazy-loaded when using FSx for Lustre as training job input. Any existing data on EFS can also be made available to training jobs as well.
  9. Leverage Amazon S3 Glacier for archiving data and reducing storage costs.

ML workloads using Amazon SageMaker

Let’s go deeper into how SageMaker can help you with your ML workloads.

To start mapping ML workloads to the cloud, consider which AWS storage services work with Amazon SageMaker. Amazon S3 typically serves as the central storage location for both structured and unstructured data that is used for ML. This includes raw data coming from upstream applications, and also curated datasets that are organized and stored as part of a Feature Store.

In the initial phases of development, a SageMaker Studio user will leverage S3 APIs to download data from S3 to their private home directory. This home directory is backed by a SageMaker-managed EFS file system. Studio users then point their notebook code (also stored in the home directory) to the local dataset and begin their development tasks.

To scale up and automate model training, SageMaker users can launch training jobs that run outside of the SageMaker Studio notebook environment. There are several options for making data available to a SageMaker training job.

  1. Amazon S3. Users can specify the S3 location of the training dataset. When using S3 as a data source, there are three input modes to choose from:
    • File mode. This is the default input mode, where SageMaker copies the data from S3 to the training instance storage. This storage is either a SageMaker-provisioned Amazon Elastic Block Store (Amazon EBS) volume or an NVMe SSD that is included with specific instance types. Training only starts after the dataset has been downloaded to the storage, and there must be enough storage space to fit the entire dataset.
    • Fast file mode. Fast file mode exposes S3 objects as a POSIX file system on the training instance. Dataset files are streamed from S3 on demand, as the training script reads them. This means that training can start sooner and require less disk space. Fast file mode also does not require changes to the training code.
    • Pipe mode. Pipe input also streams data in S3 as the training script reads it, but requires code changes. Pipe input mode is largely replaced by the newer and easier-to-use Fast File mode.
  2. FSx for Lustre. Users can specify a FSx for Lustre file system, which SageMaker will mount to the training instance and run the training code. When the FSx for Lustre file system is linked to a S3 bucket, the data can be lazily loaded from S3 during the first training job. Subsequent training jobs on the same dataset can then access it with low latency. Users can also choose to pre-load the file system with S3 data using hsm_restore commands.
  3. Amazon EFS. Users can specify an EFS file system that already contains their training data. SageMaker will mount the file system on the training instance and run the training code.
    Find out how to Choose the best data source for your SageMaker training job.

Conclusion

With this reference architecture, you can develop and deliver ML workloads that run either on-premises or in the cloud. Your enterprise can continue using its on-premises storage and compute for particular ML workloads, while also taking advantage of the cloud, using Amazon SageMaker. The scale available on the cloud allows your enterprise to conduct experiments without worrying about capacity. Start defining your hybrid data strategy on AWS today!

Additional resources:

[1] The practice of aggregating data from many sources to a single source or location.

Deploying low-latency hybrid cloud storage on AWS Local Zones using AWS Storage Gateway

Post Syndicated from maceneff original https://aws.amazon.com/blogs/compute/deploying-low-latency-hybrid-cloud-storage-on-aws-local-zones-using-aws-storage-gateway/

This blog post is written by Ruchi Nigam, Senior Cloud Support Engineer and Sumit Menaria, Senior Hybrid SA.

AWS Local Zones are a type of infrastructure deployment that places compute, storage, database, and other select AWS services close to large population and industry centers. With Local Zones close to large population centers in metro areas, customers can achieve the low latency required for use cases like video analytics, online gaming, virtual workstations, live streaming, remote healthcare, and augmented and virtual reality. They can also help customers operating in regulated sectors like healthcare, financial services, mining and resources, and public sector that might have preferences or requirements to keep data within a geographic boundary. In addition to low-latency and residency benefits, Local Zones can help organizations migrate additional workloads to AWS, supporting a hybrid cloud migration strategy and simplifying IT operations.

Your hybrid cloud migration strategy may involve storage requirements for data coming in from various on-premises sources, file sharing within the organization or backup on-premises files. These storage requirements can be met by using Amazon FSx for a feature-rich, high performance file system. You can deploy your workload in the nearest Local Zones and use Amazon FSx in the parent AWS Region for a cost-effective solution with four widely-used file systems: NetApp ONTAP, OpenZFS, Windows File Server, and Lustre.

If your workloads need low-latency access to your storage solution and operate in locations which are not close to an AWS Region, then you can consider AWS Storage Gateway as a set of hybrid cloud storage services to get access to virtually unlimited cloud storage in the region. There are options to deploy Storage Gateway directly in your on-premises environment as a virtual machine (VM) (VMware ESXi, Microsoft Hyper-V, Linux KVM) or as a pre-configured standalone hardware appliance. But you can also deploy it on an Amazon Elastic Compute Cloud (Amazon EC2) instance in Local Zones or the Region, depending where your data sources and users are. Deploying Storage Gateway on Amazon EC2 in Local Zones provides low latency access via local cache for your applications while taking away the undifferentiated heavy lifting of management of the power, space, and hardware for deploying it in the on-premises environment. Before choosing an appropriate location, you must note any data residency requirements with which you must comply. There may be situations where the Local Zone’s parent Region is in the same country. However, it is recommended to work with your compliance and security teams for confirmation, as the objects are stored in the Amazon S3 service in the Region.

Depending on your use cases you can choose among four different deployment options: Amazon S3 File Gateway, Amazon FSx File Gateway , Tape Gateway, and Volume Gateway.

Name

Interface

Use Case

S3 File Gateway

NFS, SMB

Allow on-premises or EC2 instances to store files as objects in Amazon S3 and access them via NFS or SMB mount points

Volume Gateway Stored Mode

iSCSI

Asynchronous replication of on-premises data to Amazon S3

Volume Gateway Cached Mode

iSCSI

Primary data stored in Amazon S3 with frequently accessed data cached locally on-premises

Tape Gateway

ISCSI

Replace on-premises physical tapes with AWS-backed virtual tapes. Provides virtual media changer and tape drives to use with existing backup applications

FSx File Gateway

SMB

Low-latency, efficient connection for remote users when moving on-premises Windows file systems into the cloud.

We expand on how you can deploy Amazon S3 File Gateway in a Local Zones specific setup. However, a similar approach can be used for other deployment options.

Amazon S3 File Gateway on Local Zones

Amazon S3 File Gateway provides a seamless way to connect to the cloud to store application data files and backup images as durable objects in Amazon S3 cloud storage. Amazon S3 File Gateway supports a file interface into Amazon S3 and combines a service and a virtual software appliance. The gateway offers Server Message Block (SMB) or Network File System (NFS)-based access to data in Amazon S3 with local caching. This can be used for both on-premises and data-intensive Amazon EC2-based applications in Local Zones that require file protocol access to Amazon S3 object storage.

Architecture

Amazon S3 File Gateway on Local Zones architecture setup

In the previous architecture, Client connects to the Storage Gateway EC2 instance over a private/public connection. Storage Gateway EC2 instance can access the S3 bucket in the Region via the Storage Gateway service endpoint. The File Share associated with the Storage Gateway presents the S3 bucket as a locally mounted drive for the client to use.

There are few things we must note while deploying file gateway on Amazon EC2 in a Local Zone.

  • Since there are selected EC2 instance types available in the Local Zones, identify the instance types available in your desired Local Zone and select the appropriate one which meets the file gateway requirements from a memory perspective.

For example, to list the EC2 instance types offered in the ‘us-east-1-bos-la’ Availability Zone (AZ), use the following command:

aws ec2 describe-instance-type-offerings —location-type
"availability-zone" —filters Name=location,Values=us-east-1-bos-
1a —region us-east-1
  • Choose a supported instance type and EBS volumes in the Local Zone.
  • Add another 150GiB storage apart from the root volume for cache storage.
  • Review and make sure that the Security Group has correct firewall ports open – SMB/NFS ports, HTTP port (for activation) are open in ingress.
  • For activation, if you must access the Storage Gateway over the Public network, then you must assign a Public IP address to the EC2 instance. If you plan to use an Elastic IP address, then make sure that you select the network-border group specific to the Local Zone.
  • For private connectivity, you can use an AWS Direct Connect connection at the supported Local Zones and also enable VPC endpoint for connectivity between Storage Gateway and service endpoints.

Setting up Amazon S3 File Gateway

1.      Navigate to the Storage Gateway console and select the Create Gateway button. In the Gateway options, select the Gateway type as Amazon S3 File Gateway.

Step-1 Select the Gateway type

2.      Under Platform options, select Amazon EC2 and select the option to Customize your settings.

Step-2 Platform options and customize settings

Then, select the Launch instance button and complete launching the EC2 instance to be used as the Storage Gateway. Navigating to the launch instance wizard picks up the verified file gateway Amazon Machine Image (AMI) available in the Region. However, you can also find the AMI using the following AWS Command Line Interface (AWS CLI) command:

aws --region us-east-1 ssm get-parameter --name
/aws/service/storagegateway/ami/FILE_S3/latest

3.      After launching the EC2 instance, check Confirm set up gateway and select Next.

Step-3 Confirm set up gateway

4.      Under Gateway connection options, choose the IP address radio button and enter the Public IP of the EC2 instance launched in Step 2.

Step-4 Gateway connection options

5.      For the Storage Gateway Service endpoint connection, you can create a VPC endpoint for Storage Gateway and specify the VPC endpoint ID from the dropdown selections for a private connection between the gateway and AWS Storage Services. Alternatively, you can choose the Publicly accessible option.

Step-5 Service endpoint connection options

6.      Review and activate the storage gateway.

Step-6 Review and activate

7.      Once the gateway is activated, you can allocate cache storage from the local disks. It is recommended to only use Amazon Elastic Block Store (Amazon EBS) volumes for the gateway storage.

Step-7 Configure cache storage

Once the gateway is configured, the next steps show how to create a file share that can be accessed using the NFS or the SMB protocol.

8.      A File Gateway can host multiple NFS and SMB file shares. For this example, we configure the NFS file share type. You can also select the corresponding S3 bucket in the Region which is going to be used for storing the data.

Step 7 - Create file share

Once the file share is created, you can see the list of mount commands to be used on different clients.

Example mount commands for different clients

On a Linux Client, use the following steps to mount the previously created NFS file share. Make sure you replace the IP address, S3 bucket name, and mount path with names specific to your configuration.

sudo mount -t nfs -o nolock,hard 10.0.32.151:/my-s3-bucket /my-
mount-path

You can verify that the file share has been mounted by running the following command:

$ df -TH
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 497M 0 497M 0% /dev
tmpfs tmpfs 506M 476k 506M 1% /run
tmpfs tmpfs 506M 0 506M 0% /sys/fs/cgroup
/dev/xvda1 xfs 8.6G 7.3G 1.4G 85% /
10.0.32.151:/my-s3-bucket nfs4 9.3E 0 9.3E 0% /your-mount-path
tmpfs tmpfs 102M 0 102M 0% /run/user/0
tmpfs tmpfs 102M 0 102M 0% /run/user/1000

Now you can also list the S3 objects as files on the locally mounted drive.

Terminal output listing files on Storage Gateway

For reference, here are the objects stored in the S3 bucket in the Region.

Objects stored in S3 bucket in the Region

To see a recently added object in the S3 bucket, select Refresh cache under the Actions options of the file share.

Depending on the client location, performance for access to the cached files is better as compared to direct access to the files in the parent Region. The clients can be either in your on-premises and accessed via Direct Connect to the Local Zone, or workload within the Local Zone, which can mount the file gateway for local access from the VPC.

Furthermore, you can look at Amazon S3 File Gateway performance for clients to select the appropriate EC2 instance type and EBS volume size and monitor Cache hit, Read/Write Time, and other performance metrics of the storage gateway by using CloudWatch Metrics.

Cleaning Up

  1. Unmount the File Gateway from the local machine: unmount /your-mount-path
  2. Delete the Storage Gateway from the Storage Gateway console
  3. Delete the VPC Endpoint created for Storage Gateway service
  4. Delete the EC2 instance from the Amazon EC2 console
  5. Delete the files added to the S3 bucket from the Amazon S3 console

Conclusion

By deploying Amazon Storage Gateway on Local Zones, you can utilize the scalability, security, and cost-effectiveness of the AWS cloud, and simultaneously provide low-latency and high-performance access for on-premises applications and users. This can accelerate the migration your storage workloads to cloud while providing your users with low latency access via Local Zones in a truly hybrid manner. Read more about AWS Storage Gateway and AWS Local Zones in their respective documentation.

AWS Week in Review – October 10, 2022

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-10-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

I had an amazing start to the week last week as I was speaking at the AWS Community Day NL. This event had 500 attendees and over 70 speakers, and Dr. Werner Vogels, Amazon CTO, delivered the keynote. AWS Community Days are community-led conferences organized by local communities, with a variety of workshops and sessions. I recommend checking your region for any of these events.

Community Day NL

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon S3 Object Lambda now supports using your own code to change the results of HEAD and LIST requests, besides GET (which we launched last year). This feature now enables more capabilities for what you can do with S3 Object Lambda. Danilo made a Twitter thread with lots of use cases for this new launch.

Amazon SageMaker Clarify now can provide near real-time explanations for ML predictions. SageMaker Clarify is a service that provides explainability by ML models individual predictions. These explanations are important for developers to get visibility into their training data and models to identify potential bias.

AWS Storage Gateway now supports 15 TiB tapes. It increased the maximum supported virtual tape size on Tape Gateway from 5 TiB to 15 TiB, so you can store more data on a single virtual tape, and you can reduce the number of tapes you need to manage.

Amazon Aurora Serverless v2 now supports AWS CloudFormation. Early this year, we announced the general availability of Aurora Serverless v2, and now you can use AWS CloudFormation Templates to deploy and change the database along with the rest of your infrastructure.

AWS Config now supports 15 new resource types, including AWS DataSync, Amazon GuardDuty, Amazon Simple Email Service (Amazon SES), AWS AppSync, AWS Cloud Map, Amazon EC2, and AWS AppConfig. With this launch, you can use AWS Config to monitor configuration data for the supported resource types in your AWS account, and you can see how the configuration changes.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

This week an article about how AWS is leading a pilot project to turn the Greek island of Naxos into a smart island caught my attention. The project introduces smart solutions for mobility, primary healthcare, and the transport of goods. The solution has been built based on four pillars that were important for the island: sustainability, telehealth, leisure, and digital skills. Check out the whole article to learn what they are doing.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish, and every other week there is a new episode. The podcast is meant for builders, and it shares stories about how customers implemented and learned AWS services, how to architect applications, and how to use new services. You can listen to all the episodes directly from your favorite podcast app or at AWS Podcasts en español.

AWS open-source news and updates – This is a newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent reserved seating opens on October 11. If you are planning to attend, book a spot in advance for your favorite sessions. AWS re:Invent is our biggest conference of the year, it happens in Las Vegas from November 28 to December 2, and registrations are open. Many writers of this blog have sessions at re:Invent, and you can search the event agenda using our names.

I started the post talking about AWS Community Days, and there is one in Warsaw, Poland, on October 14. If you are around Warsaw during this week, you can first check out the AWS Pop-up Hub in Warsaw that runs October 10-14 and then join for the Community Day.

On October 20, there is a virtual event for modernizing .NET workloads with Windows containers on AWS, You can register for free.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

New – Offline Tape Migration Using AWS Snowball Edge

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-offline-tape-migration-using-aws-snowball-edge/

Over the years, we have given you a succession of increasingly powerful tools to help you migrate your data to the AWS Cloud. Starting with AWS Import/Export back in 2009, followed by Snowball in 2015, Snowmobile and Snowball Edge in 2016, and Snowcone in 2020, each new device has given you additional features to simplify and expedite the migration process. All of the devices are designed to operate in environments that suffer from network constraints such as limited bandwidth, high connections costs, or high latency.

Offline Tape Migration
Today, we are taking another step forward by making it easier for you to migrate data stored offline on physical tapes. You can get rid of your large and expensive storage facility, send your tape robots out to pasture, and eliminate all of the time & effort involved in moving archived data to new formats and mediums every few years, all while retaining your existing tape-centric backup & recovery utilities and workflows.

This launch brings a tape migration capability to AWS Snowball Edge devices, and allows you to migrate up to 80 TB of data per device, making it suitable for your petabyte-scale migration efforts. Tapes can be stored in the Amazon S3 Glacier Flexible Retrieval or Amazon S3 Glacier Deep Archive storage classes, and then accessed from on-premises and cloud-based backup and recovery utilities.

Back in 2013 I showed you how to Create a Virtual Tape Library Using the AWS Storage Gateway. Today’s launch builds on that capability in two different ways. First, you create a Virtual Tape Library (VTL) on a Snowball Edge and copy your physical tapes to it. Second, after your tapes are in the cloud, you create a VTL on a Storage Gateway and use it to access your virtual tapes.

Getting Started
To get started, I open the Snow Family Console and create a new job. Then I select Import virtual tapes into AWS Storage Gateway and click Next:

Then I go through the remainder of the ordering sequence (enter my shipping address, name my job, choose a KMS key, and set up notification preferences), and place my order. I can track the status of the job in the console:

When my device arrives I tell the somewhat perplexed delivery person about data transfer, carry it down to my basement office, and ask Luna to check it out:

Back in the Snow Family console, I download the manifest file and copy the unlock code:

I connect the Snowball Edge to my “corporate” network:

Then I install AWS OpsHub for Snow Family on my laptop, power on the Snowball Edge, and wait for it to obtain & display an IP address:

I launch OpsHub, sign in, and accept the default name for my device:

I confirm that OpsHub has access to my device, and that the device is unlocked:

I view the list of services running on the device, and note that Tape Gateway is not running:

Before I start Tape Gateway, I create a Virtual Network Interface (VNI):

And then I start the Tape Gateway service on the Snow device:

Now that the service is running on the device, I am ready to create the Storage Gateway. I click Open Storage Gateway console from within OpsHub:

I select Snowball Edge as my host platform:

Then I give my gateway a name (MyTapeGateway), select my backup application (Veeam Backup & Replication in this case), and click Activate Gateway:

Then I configure CloudWatch logging:

And finally, I review the settings and click Finish to activate my new gateway:

The activation process takes a few minutes, just enough time to take Luna for a quick walk. When I return, the console shows that the gateway is activated and running, and I am all set:

Creating Tapes
The next step is to create some virtual tapes. I click Create tapes and enter the requested information, including the pool (Deep Archive or Glacier), and click Create tapes:

The next step is to copy data from my physical tapes to the Snowball Edge. I don’t have a data center and I don’t have any tapes, so I can’t show you how to do this part. The data is stored on the device, and my Internet connection is used only for management traffic between the Snowball Edge device and AWS. To learn more about this part of the process, check out our new animated explainer.

After I have copied the desired tapes to the device, I prepare it for shipment to AWS. I make sure that all of the virtual tapes in the Storage Gateway Console have the status In Transit to VTS (Virtual Tape Shelf), and then I power down the device.

The display on the device updates to show the shipping address, and I wait for the shipping company to pick up the device.

When the device arrives at AWS, the virtual tapes are imported, stored in the S3 storage class associated with the pool that I chose earlier, and can be accessed by retrieving them using an online tape gateway. The gateway can be deployed as a virtual machine or a hardware appliance.

Now Available
You can use AWS Snowball Edge for offline tape migration in the US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Sydney) Regions. Start migrating petabytes of your physical tape data to AWS, today!

Jeff;

Connect Amazon S3 File Gateway using AWS PrivateLink for Amazon S3

Post Syndicated from Xiaozang Li original https://aws.amazon.com/blogs/architecture/connect-amazon-s3-file-gateway-using-aws-privatelink-for-amazon-s3/

AWS Storage Gateway is a set of services that provides on-premises access to virtually unlimited cloud storage. You can extend your on-premises storage capacity, and move on-premises backups and archives to the cloud. It provides low-latency access to cloud storage by caching frequently accessed data on premises, while storing data securely and durably in the cloud. This simplifies storage management and reduces costs for hybrid cloud storage use.

You may have privacy and security concerns with sending and receiving data across the public internet. In this case, you can use AWS PrivateLink, which provides private connectivity between Amazon Virtual Private Cloud (VPC) and other AWS services.

In this blog post, we will demonstrate how to take advantage of Amazon S3 interface endpoints to connect your on-premises Amazon S3 File Gateway directly to AWS over a private connection. We will also review the steps for implementation using the AWS Management Console.

AWS Storage Gateway on-premises

Storage Gateway offers four different types of gateways to connect on-premises applications with cloud storage.

  • Amazon S3 File Gateway Provides a file interface for applications to seamlessly store files as objects in Amazon S3. These files can be accessed using open standard file protocols.
  • Amazon FSx File Gateway Optimizes on-premises access to Windows file shares on Amazon FSx.
  • Tape Gateway Replaces on-premises physical tapes with virtual tapes in AWS without changing existing backup workflows.
  • Volume Gateway –  Presents cloud-backed iSCSI block storage volumes to your on-premises applications.

We will illustrate the use of Amazon S3 File Gateway in this blog.

VPC endpoints for Amazon S3

AWS PrivateLink provides two types of VPC endpoints that you can use to connect to Amazon S3; Interface endpoints and Gateway endpoints. An interface endpoint is an elastic network interface with a private IP address. It serves as an entry point for traffic destined to a supported AWS service or a VPC endpoint service. A gateway VPC endpoint uses prefix lists as the IP route target in a VPC route table and supports routing traffic privately to Amazon S3 or Amazon DynamoDB. Both these endpoints securely connect to Amazon S3 over the Amazon network, and your network traffic does not traverse the internet.

Solution architecture for PrivateLink connectivity between AWS Storage Gateway and Amazon S3

Previously, AWS Storage Gateway did not support PrivateLink for Amazon S3 and Amazon S3 Access Points. Customers had to build and manage an HTTP proxy infrastructure within their VPC to connect their on-premises applications privately to S3 (see Figure 1). This infrastructure acted as a proxy for all the traffic originating from on-premises gateways to Amazon S3 through Amazon S3 Gateway endpoints. This setup would result in additional configuration and operational overhead. The HTTP proxy could also become a network performance bottleneck.

Figure 1. Connect to Amazon S3 Gateway endpoint using an HTTP proxy

Figure 1. Connect to Amazon S3 Gateway endpoint using an HTTP proxy

AWS Storage Gateway recently added support for AWS PrivateLink for Amazon S3 and Amazon S3 Access Points. Customers can now connect their on-premises Amazon S3 File Gateway directly to Amazon S3 through a private connection. This uses an Amazon S3 interface endpoint and doesn’t require an HTTP proxy. Additionally, customers can use Amazon S3 Access Points instead of bucket names to map file shares. This enables more granular access controls for applications connecting to AWS Storage Gateway file shares (see Figure 2).

Figure 2. AWS Storage Gateway now supports AWS PrivateLink for Amazon S3 endpoints and Amazon S3 Access Points

Figure 2. AWS Storage Gateway now supports AWS PrivateLink for Amazon S3 endpoints and Amazon S3 Access Points

Implement AWS PrivateLink between AWS Storage Gateway and an Amazon S3 endpoint

Let’s look at how to create an Amazon S3 File Gateway file share, which is associated with a Storage Gateway. This file share stores data in an Amazon S3 bucket. It uses AWS PrivateLink for Amazon S3 to transfer data to the S3 endpoint.

  1. Create an Amazon S3 bucket in your preferred Region.
  2. Create and configure an Amazon S3 File Gateway.
  3. Create an Interface endpoint for Amazon S3. Ensure that the S3 interface endpoint is created in the same Region as the S3 bucket.
  4. Customize the File share settings (see Figure 3).
Figure 3. Create file share using VPC endpoints for Amazon S3

Figure 3. Create file share using VPC endpoints for Amazon S3

Best practices:

  • Select the AWS Region where the Amazon S3 bucket is located. This ensures that the VPC endpoint and the storage bucket are in the same Region.
  • When creating the file share with PrivateLink for S3 enabled, you can either select the S3 VPC endpoint ID from the dropdown menu, or manually input the S3 VPC endpoint DNS name.
  • Note that the dropdown list of VPC endpoint IDs only contains the VPCs created by the current AWS account administrator. If you are using a shared VPC in an AWS Organization, you can manually enter the DNS name of the VPC endpoint created in the management account.

Be aware of PrivateLink pricing when using an S3 interface endpoint. The cost for each interface endpoint is based on usage per hour, the number of Availability Zones used, and the volume of data transferred over the endpoint. Additionally, each Amazon S3 VPC interface endpoint can be shared among multiple S3 File Gateways. Each file share associated with the Storage Gateway can be configured with or without PrivateLink. For workloads that do not need the private network connectivity, you can save on interface endpoints costs by creating a file share without PrivateLink.

Verify PrivateLink communication

Once you have set up an S3 File Gateway file share using PrivateLink for S3, you can verify that traffic is flowing over your private connectivity as follows:

1. Enable VPC Flow Log for the VPC hosting the S3 Interface endpoint. This also hosts the Virtual Private Gateway (VGW), which connects to the on-premises environment.

2. From your workstation, connect to your on-premises File Gateway over SMB or NFS protocol and upload a new file (see Figure 4).

Figure 4. Upload a sample file to on-premises Storage Gateway

Figure 4. Upload a sample file to on-premises Storage Gateway

3. Navigate to the S3 bucket associated with the file share.  After a few seconds, you should see that the new file has been successfully uploaded and appears in the S3 bucket (see Figure 5).

Figure 5. Verify that the sample file is uploaded to storage bucket

Figure 5. Verify that the sample file is uploaded to storage bucket

4. On the VPC flow log, look for the generated log events. You’ll see your S3 interface endpoint elastic network interface, your file gateway IP, Amazon S3 private IP, and port number, as shown in Figure 6. This verifies that the file was transferred over the private connection. If you do not see an entry, verify if the VPC Flow Logs have been enabled on the correct VPC and elastic network interface.

Figure 6. VPC Flow Log entry to verify connectivity using Private IPs

Figure 6. VPC Flow Log entry to verify connectivity using Private IPs

Summary

In this blog post, we have demonstrated how to use Amazon S3 File Gateway to transfer files to Amazon S3 buckets over AWS PrivateLink. Use this solution to securely copy your application data and files to cloud storage. This will also provide low latency access to that data from your on-premises applications.

Thanks for reading this blog post. If you have any feedback or questions, please add them in the comments section.

Further Reading: