Enterprise customers have to maintain, protect, and provide access to the petabytes of data they produce in their data centers every day. Traditionally, this involves a set of complex, interrelated systems that store the raw data on Network Attached Storage (NAS), Storage Area Networks (SAN), or Direct Attached Storage (DAS), then transform it and load it into relational databases to support querying and analysis. This process is commonly known as Extract, Transform, and Load (ETL).
Each of these systems must be separately maintained, often by separate teams: DBAs for the databases, systems engineers for the underlying physical infrastructure, and others. At AWS, we’re constantly looking at ways to invent and simplify on behalf of our customers. This post looks at using a combination of AWS technology that can be deployed in customers’ data centers (AWS Storage Gateway) and serverless, cloud-native technology (Amazon Athena) to simplify the process of querying critical data generated on-premises.
Customers using popular enterprise analysis tools, such as Tableau, to analyze their data rely on ODBC or JDBC to connect to and run queries against their data. Conversely, file systems use protocols like SMB or NFS to read and write files. Until now, it’s often been necessary to translate data from its raw format (often text files) into a relational database in order to allow analysis on it. Enter: AWS Storage Gateway and Amazon Athena.
In this blog post, I use this architecture to demonstrate the combined capabilities of Storage Gateway and Athena. AWS Storage Gateway is a hybrid storage service that enables your on-premises applications to seamlessly use AWS cloud storage. The File Gateway configuration of the AWS Storage Gateway offers you a seamless way to connect to the cloud in order to store application data files and backup images as durable objects on Amazon S3 cloud storage. File gateway offers SMB or NFS-based access to data in Amazon S3 with local caching, and files are stored and billed as S3 objects. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Let’s walk through an example with ACME Corp. ACME is a fictitious but representative enterprise that wants to store, protect, and analyze the data it receives from millions of IoT sensors around the world.
The figure below gives a high-level view of how data flows between each step in ACME’s workflow in the proposed solution. Once this solution is configured, the flow of data into the hands of ACME’s analysts is fully automated with no manual intervention required.
Today, ACME receives a daily file from each sensor via FTP in text (comma-separated) format. These files share a common set of columns, and the files are stored on an enterprise NAS device behind the FTP server. The NAS device is replicated to a secondary facility for disaster recovery purposes on a daily basis.
At the end of each day, an ETL process runs, reads each text file, and loads it to a relational database table with a similar column structure. ACME analysts receive an email in their in-boxes when the load process is complete, allowing them to begin their analysis of the previous day’s activities provided there were no issues with the load. In the event of a load issue, operations staff are paged, which can delay the start of the analysts’ day while the problem is resolved.
In the event of a NAS failure, the prior day’s data must be replayed into the FTP server, which is a costly and time-consuming process. ACME’s hypothetical Recovery Time Objective for the analysis activities in the event of a database failure is four hours; their Recovery Point Objective for the data is up to one day. Operations personnel must maintain the FTP servers, the NAS environment, and the database servers.
Without making changes to ACME’s FTP process, which they wish to maintain in its current state, our first step is to deploy a File Gateway on their VMware infrastructure to replace ACME’s existing NAS. Let me quickly demonstrate how you can set up File Gateway for testing purposes in your own Amazon EC2 environment.
Step 1: From the AWS Management Console, select “Storage Gateway,” then select “Create Gateway”:
Step 2: Select the “File Gateway” gateway type and choose “Next”:
Step 3: Under “Select host platform,” choose “Amazon EC2” and follow the on-screen instructions to launch a Gateway instance:
After the gateway has been configured and tested, its file share is mounted on the FTP server in place of the existing NAS. Here’s ACME’s S3 bucket, where we can see that the data from the IoT sensors is now appearing in Amazon S3:
Here we can see the contents of the configured S3 bucket with the object keys presented as files to the Windows machine, and hence accessible in Windows Explorer:
Here’s what the File Gateway configuration looks like in ACME’s account. We can see that the gateway we created, AthenaGateway, is up and running, up to date, and mapped to the file share storage resource:
More information on configuring a File Gateway is available here: Creating a File Gateway.
The next step is to configure Amazon Athena. Using the AWS Console, we create a new Athena database and table pointing to ACME’s S3 bucket to which File Gateway is writing, with a table definition representing the columns in the data.
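As a rough illustration of what such a table definition could look like, here is a minimal sketch that builds Athena DDL for an external table over comma-separated files. The database, table, and column names and the bucket path are assumptions made up for this example; ACME’s real column set is not given in this post.

```python
def build_create_table_ddl(database, table, columns, s3_location):
    """Return Athena DDL for an external table over comma-separated text files.

    `columns` is a list of (name, athena_type) pairs.
    """
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical names: none of these come from ACME's actual environment.
ddl = build_create_table_ddl(
    database="acme_iot",
    table="sensor_readings",
    columns=[("sensor_id", "string"), ("city", "string"), ("value", "double")],
    s3_location="s3://example-acme-sensor-bucket/sensors/",
)
print(ddl)
```

The resulting statement can be pasted into the Athena query editor. The `LOCATION` should point at the bucket and prefix that File Gateway writes to.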
ACME’s policies call for the data to be encrypted at rest, and File Gateway supports encryption via KMS when writing data to the S3 bucket. Athena supports a range of Amazon S3 encryption options, both for encrypted datasets in Amazon S3 and for encrypted query results.
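For readers who script this setup rather than using the console, the sketch below shows how the two encryption settings might be expressed as request parameters for the AWS SDK for Python (boto3). It only builds the parameter dictionaries; all ARNs, bucket names, and the choice of an NFS file share are placeholder assumptions, and the encryption parameters should be checked against the current Storage Gateway and Athena API references.

```python
def nfs_file_share_params(client_token, gateway_arn, role_arn, bucket_arn, kms_key_arn):
    """Parameters for storagegateway.create_nfs_file_share with KMS encryption."""
    return {
        "ClientToken": client_token,   # idempotency token chosen by the caller
        "GatewayARN": gateway_arn,
        "Role": role_arn,              # IAM role the gateway assumes to reach S3
        "LocationARN": bucket_arn,     # ARN of the backing S3 bucket
        "KMSEncrypted": True,          # objects written to S3 are encrypted with KMS
        "KMSKey": kms_key_arn,
    }


def athena_query_params(sql, database, output_location, kms_key_arn):
    """Parameters for athena.start_query_execution with encrypted query results."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            "OutputLocation": output_location,
            "EncryptionConfiguration": {
                "EncryptionOption": "SSE_KMS",
                "KmsKey": kms_key_arn,
            },
        },
    }


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("storagegateway").create_nfs_file_share(**nfs_file_share_params(...))
# boto3.client("athena").start_query_execution(**athena_query_params(...))
```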
These options encrypt data at rest in Amazon S3. Regardless of whether you use these options, Transport Layer Security (TLS) encrypts objects in transit between Athena resources and between Athena and Amazon S3. Query results stream to JDBC clients as plain text and are encrypted using TLS.
We then run a test query in the Athena console to verify that data is being returned correctly. As new data is received by the File Gateway, it is automatically added to S3 and automatically included in Athena’s query scope.
Now, we are going to create an Athena database using AWS Glue. This makes the IoT device data that File Gateway writes to S3 accessible for querying via Athena.
Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage.
First, we open the Glue home page in the AWS Management Console, then select “Add tables using a crawler,” and follow the steps described, referencing your S3 bucket and prefix when asked. Documentation on configuring a Glue crawler is here:
Once the crawler is configured, run it. It will crawl your data in S3 and report when it has completed:
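The same crawler steps can also be scripted. The following is a minimal sketch, assuming a boto3 Glue client is passed in; the crawler name, role, database, and S3 path are placeholders, and the polling loop is a simplification with no timeout or error handling.

```python
import time


def run_crawler(glue, name, role_arn, database, s3_path, poll_seconds=15):
    """Create a Glue crawler over an S3 path, run it, and wait for it to finish.

    `glue` is expected to behave like boto3.client("glue").
    Returns the status of the last crawl as reported by Glue, if present.
    """
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=name)

    # The crawler returns to the READY state once the run has finished.
    while True:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials; all names are hypothetical):
# import boto3
# run_crawler(boto3.client("glue"), "acme-sensor-crawler",
#             "arn:aws:iam::123456789012:role/ExampleGlueRole",
#             "acme_iot", "s3://example-acme-sensor-bucket/sensors/")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.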
Next, open the Athena home page in the AWS Management Console:
In the Athena home page, you’ll now see the database and tables created by Glue. Here is Athena, configured to point to the sensor data in S3 and running a test query against it. The test query we will use is as follows:
This sample query scans all ACME’s data to count the top 100 cities with sensors that have emitted values in the range between 50 and 60, and reports how many such data points have been emitted.
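The query text itself is not reproduced in this post, so the following is only a sketch of a query matching the description above. The table name `sensor_readings` and the columns `city` and `value` are assumptions. To keep the example runnable without an AWS account, it executes the SQL against an in-memory SQLite table as a local stand-in; the statement uses only standard SQL that Athena also accepts.

```python
import sqlite3

# Hypothetical table and column names; ACME's real schema is not shown in the post.
TOP_CITIES_SQL = """
SELECT city, COUNT(*) AS readings
FROM sensor_readings
WHERE value BETWEEN 50 AND 60
GROUP BY city
ORDER BY readings DESC
LIMIT 100
"""

# Local stand-in for the Athena table, with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (sensor_id TEXT, city TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [
        ("s1", "Seattle", 55.0),
        ("s2", "Seattle", 51.2),
        ("s3", "Dublin", 59.9),
        ("s4", "Dublin", 72.0),   # outside the 50-60 range, so not counted
        ("s5", "Tokyo", 12.5),    # outside the range, so Tokyo does not appear
    ],
)

rows = conn.execute(TOP_CITIES_SQL).fetchall()
for city, readings in rows:
    print(city, readings)
```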
The final step is to redirect ACME’s Tableau environment to point to Athena’s ODBC endpoint. Tableau’s ODBC configuration is managed centrally by ACME, and the necessary details are swapped to point to Athena in place of the existing on-premises relational database.
When you start Tableau, under “Connect,” you can see the file and database types that are supported by Tableau Desktop. Select “More” to see the complete list. Tableau considers ODBC (Open Database Connectivity) as a standard way to connect to a database. You can connect Tableau to your data using the ODBC driver for Amazon Athena and the Tableau Other Databases (ODBC) connector. Tableau’s complete documentation for connecting to ODBC is available here.
Let’s recap what we’ve changed, and the technology and end-user impact.
- We have replaced ACME’s on-premises NAS with AWS Storage Gateway backed by an S3 bucket, and configured their FTP server to use the File Gateway’s file share in place of their existing one.
- We have configured Storage Gateway as a File Gateway to provide access to the customer S3 bucket as a NAS. Their data is now in S3.
- We have configured a serverless Amazon Athena database to mimic the previous relational database, and exposed an ODBC endpoint to this database.
- We have re-configured ACME’s Tableau environment to point to this ODBC endpoint. Since the relational database in this scenario was only used to service ad-hoc SQL queries, it is no longer needed.
If there are no other dependencies, ACME can now decommission the on-premises ETL, relational database, and NAS infrastructure that were dedicated to supporting this scenario. Aside from the FTP servers and the Storage Gateway Virtual Machine hosts, there are now no servers to manage that support this scenario either.
End-user analysts working with this data no longer need to wait until start of day to begin their analysis. New sensor data arrives in the Athena S3 folder shortly after FTP delivery from the sensors, and is available for query immediately. The removal of the ETL and relational database infrastructure reduces the potential points of failure in the architecture, and in the event of a disaster, an Athena endpoint in a second AWS Region (backed by S3 Cross Region Replication) makes the data available to Tableau as soon as replication completes. Because S3 has the ability to trigger events when new data arrives, analysts can now be notified when data from particular groups of sensors becomes available, allowing them to begin their work at the earliest possible moment.
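As one possible way to wire up such notifications, the sketch below builds an S3 notification configuration that publishes to an SNS topic whenever an object is created under a given key prefix. The topic ARN, the bucket name, and the idea of grouping sensors by prefix are assumptions for illustration only.

```python
def sensor_group_notification(topic_arn, prefix):
    """Notification configuration for new objects under `prefix`.

    Intended for s3.put_bucket_notification_configuration, which replaces the
    bucket's entire existing notification configuration.
    """
    return {
        "TopicConfigurations": [
            {
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": prefix}]}
                },
            }
        ]
    }


# Hypothetical topic and prefix for one group of sensors.
config = sensor_group_notification(
    "arn:aws:sns:us-east-1:123456789012:example-emea-sensors",
    "sensors/emea/",
)

# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="example-acme-sensor-bucket", NotificationConfiguration=config)
```

Analysts subscribed to the topic would then receive a message for each new file in that group. The SNS topic policy must also allow S3 to publish to it.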
Data remains cached on the local gateway, allowing for extremely rapid access by other on-premises high-performance computing, big data, or other applications. For high availability, ACME has the ability to rapidly launch a second storage gateway instance on their existing VMware infrastructure should the primary fail. A further refinement would be to use the NotifyWhenUploaded functionality in AWS Storage Gateway to provide CloudWatch Events when groups of data are uploaded to enable batch processing.
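A minimal sketch of requesting such a notification through the AWS SDK for Python follows. It assumes a boto3 Storage Gateway client is passed in and that the file share ARN is known; the ARN shown in the usage comment is a placeholder.

```python
def request_upload_notification(storagegateway, file_share_arn):
    """Ask the gateway to emit an event once files written so far are uploaded.

    `storagegateway` is expected to behave like boto3.client("storagegateway").
    Returns the notification ID from the response, which can be used to match
    the later upload event to this request.
    """
    response = storagegateway.notify_when_uploaded(FileShareARN=file_share_arn)
    return response["NotificationId"]


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
# import boto3
# notification_id = request_upload_notification(
#     boto3.client("storagegateway"),
#     "arn:aws:storagegateway:us-east-1:123456789012:share/share-EXAMPLE",
# )
```

A batch job could call this after the last file of the day is written, then start processing when the matching event arrives.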
And that’s it!
For our many enterprise customers who deal with complex architectures in these types of hybrid cloud scenarios, the combination of AWS Storage Gateway and Amazon Athena can help simplify operations and lower costs while enabling on-premises, cloud-native, and hybrid scenarios across their application portfolios.
If you have any feedback or questions, please feel free to leave a comment.
About the Author
James Forrester is Head of Technology for AWS Global Accounts. He works with customers around the world to provide thought leadership on the transformative value, applicability and usage of the full breadth of AWS services.