AWS Glue provides enhanced support for working with datasets that are organized into Hive-style partitions. AWS Glue crawlers automatically identify partitions in your Amazon S3 data. The AWS Glue ETL (extract, transform, and load) library natively supports partitions when you work with DynamicFrames. DynamicFrames represent a distributed collection of data without requiring you to specify a schema. You can now push down predicates when creating DynamicFrames to filter out partitions and avoid costly calls to S3. We have also added support for writing DynamicFrames directly into partitioned directories without converting them to Apache Spark DataFrames.
Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For example, you might decide to partition your application logs in Amazon S3 by date—broken down by year, month, and day. Files corresponding to a single day’s worth of data would then be placed under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/.
Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value without making unnecessary calls to Amazon S3. This can significantly improve the performance of applications that need to read only a few partitions.
In this post, we show you how to efficiently process partitioned datasets using AWS Glue. First, we cover how to set up a crawler to automatically scan your partitioned dataset and create a table and partitions in the AWS Glue Data Catalog. Then, we introduce some features of the AWS Glue ETL library for working with partitioned data. You can now filter partitions using SQL expressions or user-defined functions to avoid listing and reading unnecessary data from Amazon S3. We’ve also added support in the ETL library for writing AWS Glue DynamicFrames directly into partitions without relying on Spark SQL DataFrames.
Let’s get started!
Crawling partitioned data
In this example, we use the same GitHub archive dataset that we introduced in a previous post about Scala support in AWS Glue. This data, which is publicly available from the GitHub archive, contains a JSON record for every API request made to the GitHub service. A sample dataset containing one month of activity from January 2017 is available at the following location:
Here you can replace <region> with the AWS Region in which you are working, for example, us-east-1. This dataset is partitioned by year, month, and day, so an actual file will be at a path like the following:
To crawl this data, you can either follow the instructions in the AWS Glue Developer Guide or use the provided AWS CloudFormation template. This template creates a stack that contains the following:
An IAM role with permissions to access AWS Glue resources
A database in the AWS Glue Data Catalog named githubarchive_month
A crawler set up to crawl the GitHub dataset
An AWS Glue development endpoint (which is used in the next section to transform the data)
To run this template, you must provide an S3 bucket and prefix where you can write output data in the next section. The role that this template creates will have permission to write to this bucket only. You also need to provide a public SSH key for connecting to the development endpoint. For more information about creating an SSH key, see our Development Endpoint tutorial. After you create the AWS CloudFormation stack, you can run the crawler from the AWS Glue console.
In addition to inferring file types and schemas, crawlers automatically identify the partition structure of your dataset and populate the AWS Glue Data Catalog. This ensures that your data is correctly grouped into logical tables and makes the partition columns available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.
After you crawl the table, you can view the partitions by navigating to the table in the AWS Glue console and choosing View partitions. The partitions should look like the following:
For Hive-style partitioned paths of the form key=val, crawlers automatically populate the column name. In this case, because the GitHub data is stored in directories of the form 2017/01/01, the crawler uses default names like partition_0, partition_1, and so on. You can easily change these names on the AWS Glue console: Navigate to the table, choose Edit schema, and rename partition_0 to year, partition_1 to month, and partition_2 to day:
Now that you’ve crawled the dataset and named your partitions appropriately, let’s see how to work with partitioned data in an AWS Glue ETL job.
Transforming and filtering the data
To get started with the AWS Glue ETL libraries, you can use an AWS Glue development endpoint and an Apache Zeppelin notebook. AWS Glue development endpoints provide an interactive environment to build and run scripts using Apache Spark and the AWS Glue ETL library. They are great for debugging and exploratory analysis, and can be used to develop and test scripts before migrating them to a recurring job.
If you ran the AWS CloudFormation template in the previous section, then you already have a development endpoint named partition-endpoint in your account. Otherwise, you can follow the instructions in this development endpoint tutorial. In either case, you need to set up an Apache Zeppelin notebook, either locally, or on an EC2 instance. You can find more information about development endpoints and notebooks in the AWS Glue Developer Guide.
The following examples are all written in the Scala programming language, but they can all be implemented in Python with minimal changes.
Reading a partitioned dataset
To get started, let’s read the dataset and see how the partitions are reflected in the schema. First, you import some classes that you will need for this example and set up a GlueContext, which is the main class that you will use to read and write data.
Execute the following in a Zeppelin paragraph, which is a unit of executable code:
%spark
import com.amazonaws.services.glue.DynamicFrame
import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import java.util.Calendar
import java.util.GregorianCalendar
import scala.collection.JavaConversions._
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
This is straightforward with two caveats: First, each paragraph must start with the line %spark to indicate that the paragraph is Scala. Second, the spark variable must be marked @transient to avoid serialization issues. This is only necessary when running in a Zeppelin notebook.
Next, read the GitHub data into a DynamicFrame, which is the primary data structure that is used in AWS Glue scripts to represent a distributed collection of data. A DynamicFrame is similar to a Spark DataFrame, except that it has additional enhancements for ETL transformations. DynamicFrames are discussed further in the post AWS Glue Now Supports Scala Scripts, and in the AWS Glue API documentation.
The following snippet creates a DynamicFrame by referencing the Data Catalog table that you just crawled and then prints the schema:
%spark
val githubEvents: DynamicFrame = glueContext.getCatalogSource(
database = "githubarchive_month",
tableName = "data"
).getDynamicFrame()
githubEvents.schema.asFieldList.foreach { field =>
println(s"${field.getName}: ${field.getType.getType.getName}")
}
You could also print the full schema using githubEvents.printSchema(). But in this case, the full schema is quite large, so I’ve printed only the top-level columns. This paragraph takes about 5 minutes to run on a standard size AWS Glue development endpoint. After it runs, you should see the following output:
Note that the partition columns year, month, and day were automatically added to each record.
Filtering by partition columns
One of the primary reasons for partitioning data is to make it easier to operate on a subset of the partitions, so now let’s see how to filter data by the partition columns. In particular, let’s find out what people are building in their free time by looking at GitHub activity on the weekends. One way to accomplish this is to use the filter transformation on the githubEvents DynamicFrame that you created earlier to select the appropriate events:
%spark
def filterWeekend(rec: DynamicRecord): Boolean = {
def getAsInt(field: String): Int = {
rec.getField(field) match {
case Some(strVal: String) => strVal.toInt
// The filter transformation will catch exceptions and mark the record as an error.
case _ => throw new IllegalArgumentException(s"Unable to extract field $field")
}
}
val (year, month, day) = (getAsInt("year"), getAsInt("month"), getAsInt("day"))
val cal = new GregorianCalendar(year, month - 1, day) // Calendar months start at 0.
val dayOfWeek = cal.get(Calendar.DAY_OF_WEEK)
dayOfWeek == Calendar.SATURDAY || dayOfWeek == Calendar.SUNDAY
}
val filteredEvents = githubEvents.filter(filterWeekend)
filteredEvents.count
This snippet defines the filterWeekend function that uses the Java Calendar class to identify those records where the partition columns (year, month, and day) fall on a weekend. If you run this code, you see that there were 6,303,480 GitHub events falling on the weekend in January 2017, out of a total of 29,160,561 events. This seems reasonable—about 22 percent of the events fell on the weekend, and about 29 percent of the days that month fell on the weekend (9 out of 31). So people are using GitHub slightly less on the weekends, but there is still a lot of activity!
Predicate pushdowns for partition columns
The main downside to using the filter transformation in this way is that you have to list and read all files in the entire dataset from Amazon S3 even though you need only a small fraction of them. This is manageable when dealing with a single month’s worth of data. But as you try to process more data, you will spend an increasing amount of time reading records only to immediately discard them.
To address this issue, we recently released support for pushing down predicates on partition columns that are specified in the AWS Glue Data Catalog. Instead of reading the data and filtering the DynamicFrame at executors in the cluster, you apply the filter directly on the partition metadata available from the catalog. Then you list and read only the partitions from S3 that you need to process.
To accomplish this, you can specify a Spark SQL predicate as an additional parameter to the getCatalogSource method. This predicate can be any SQL expression or user-defined function as long as it uses only the partition columns for filtering. Remember that you are applying this to the metadata stored in the catalog, so you don’t have access to other fields in the schema.
The following snippet shows how to use this functionality to read only those partitions occurring on a weekend:
%spark
val partitionPredicate =
"date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"
val pushdownEvents = glueContext.getCatalogSource(
database = "githubarchive_month",
tableName = "data",
pushDownPredicate = partitionPredicate).getDynamicFrame()
Here you use the SparkSQL string concat function to construct a date string. You use the to_date function to convert it to a date object, and the date_format function with the ‘E’ pattern to convert the date to a three-character day of the week (for example, Mon, Tue, and so on). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL documentation and list of functions.
Note that the pushDownPredicate parameter is also available in Python. The corresponding call in Python is as follows:
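A sketch of that call, assuming a GlueContext named glue_context; the parameter names follow the AWS Glue Python API, and the predicate string is the same one used above:
partition_predicate = "date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"
pushdown_events = glue_context.create_dynamic_frame.from_catalog(
    database="githubarchive_month",
    table_name="data",
    push_down_predicate=partition_predicate)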
You can observe the performance impact of pushing down predicates by looking at the execution time reported for each Zeppelin paragraph. The initial approach using a Scala filter function took 2.5 minutes:
Because the version using a pushdown lists and reads much less data, it takes only 24 seconds to complete, more than a 6X improvement!
Of course, the exact benefit that you see depends on the selectivity of your filter. The more partitions that you exclude, the more improvement you will see.
In addition to Hive-style partitioning for Amazon S3 paths, the Parquet and ORC file formats further partition each file into blocks of data that represent column values. Each block also stores statistics for the records that it contains, such as min/max column values. AWS Glue supports pushdown predicates for both Hive-style partitions and block partitions in these formats. While reading data, it prunes unnecessary S3 partitions and also uses the column statistics in Parquet and ORC files to skip blocks that don't need to be read.
Additional transformations
Now that you’ve read and filtered your dataset, you can apply any additional transformations to clean or modify the data. For example, you could augment it with sentiment analysis as described in the previous AWS Glue post.
To keep things simple, you can just pick out some columns from the dataset using the ApplyMapping transformation:
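As a rough sketch in Scala, the mapping might look like the following; each tuple is (source field, source type, target field, target type), and the exact field list is up to you:
%spark
val projectedEvents = pushdownEvents.applyMapping(Seq(
  ("id", "string", "id", "long"),
  ("type", "string", "type", "string"),
  ("actor.login", "string", "actor", "string"),
  ("year", "string", "year", "int"),
  ("month", "string", "month", "int"),
  ("day", "string", "day", "int")
))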
ApplyMapping is a flexible transformation for performing projection and type-casting. In this example, we use it to unnest several fields, such as actor.login, which we map to the top-level actor field. We also cast the id column to a long and the partition columns to integers.
Writing out partitioned data
The final step is to write out your transformed dataset to Amazon S3 so that you can process it with other systems like Amazon Athena. By default, when you write out a DynamicFrame, it is not partitioned—all the output files are written at the top level under the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it into a Spark SQL DataFrame before writing. We are excited to share that DynamicFrames now support native partitioning by a sequence of keys.
You can accomplish this by passing the additional partitionKeys option when creating a sink. For example, the following code writes out the dataset that you created earlier in Parquet format to S3 in directories partitioned by the type field.
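A sketch of such a sink in Scala, writing the mapped projectedEvents frame from the previous step; the connection type, format, and option names follow the AWS Glue Scala API:
%spark
glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(Map("path" -> "$outpath", "partitionKeys" -> Seq("type"))),
  format = "parquet"
).writeDynamicFrame(projectedEvents)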
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
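A sketch of the corresponding Python call; glue_context is assumed as before, and projected_events stands in for the mapped frame:
glue_context.write_dynamic_frame.from_options(
    frame=projected_events,
    connection_type="s3",
    connection_options={"path": "$outpath", "partitionKeys": ["type"]},
    format="parquet")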
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure. To demonstrate this, you can list the output path using the aws s3 ls command from the AWS CLI:
PRE type=CommitCommentEvent/
PRE type=CreateEvent/
PRE type=DeleteEvent/
PRE type=ForkEvent/
PRE type=GollumEvent/
PRE type=IssueCommentEvent/
PRE type=IssuesEvent/
PRE type=MemberEvent/
PRE type=PublicEvent/
PRE type=PullRequestEvent/
PRE type=PullRequestReviewCommentEvent/
PRE type=PushEvent/
PRE type=ReleaseEvent/
PRE type=WatchEvent/
As expected, there is a partition for each distinct event type. In this example, we partitioned by a single value, but this is by no means required. For example, if you want to preserve the original partitioning by year, month, and day, you could simply set the partitionKeys option to be Seq("year", "month", "day").
Conclusion
In this post, we showed you how to work with partitioned data in AWS Glue. Partitioning is a crucial technique for getting the most out of your large datasets. Many tools in the AWS big data ecosystem, including Amazon Athena and Amazon Redshift Spectrum, take advantage of partitions to accelerate query processing. AWS Glue provides mechanisms to crawl, filter, and write partitioned data so that you can structure your data in Amazon S3 however you want, to get the best performance out of your big data applications.
Ben Sowell is a senior software development engineer at AWS Glue. He has worked for more than 5 years on ETL systems to help users unlock the potential of their data. In his free time, he enjoys reading and exploring the Bay Area.
Mohit Saxena is a senior software development engineer at AWS Glue. His passion is building scalable distributed systems for efficiently managing data on cloud. He also enjoys watching movies and reading about the latest technology.
If you build (or want to build) data-driven web and mobile apps and need real-time updates and the ability to work offline, you should take a look at AWS AppSync. Announced in preview form at AWS re:Invent 2017 and described in depth here, AWS AppSync is designed for use in iOS, Android, JavaScript, and React Native apps. AWS AppSync is built around GraphQL, an open, standardized query language that makes it easy for your applications to request the precise data that they need from the cloud.
I’m happy to announce that the preview period is over and that AWS AppSync is now generally available and production-ready, with six new features that will simplify and streamline your application development process:
Console Log Access – You can now see the CloudWatch Logs entries that are created when you test your GraphQL queries, mutations, and subscriptions from within the AWS AppSync Console.
Console Testing with Mock Data – You can now create and use mock context objects in the console for testing purposes.
Subscription Resolvers – You can now create resolvers for AWS AppSync subscription requests, just as you can already do for query and mutation requests.
Batch GraphQL Operations for DynamoDB – You can now make use of DynamoDB's batch operations (BatchGetItem and BatchWriteItem) across one or more tables in your resolver functions.
CloudWatch Support – You can now use Amazon CloudWatch Metrics and CloudWatch Logs to monitor calls to the AWS AppSync APIs.
CloudFormation Support – You can now define your schemas, data sources, and resolvers using AWS CloudFormation templates.
A Brief AppSync Review
Before diving into the new features, let's review the process of creating an AWS AppSync API, starting from the console. I click Create API to begin:
I enter a name for my API and (for demo purposes) choose to use the Sample schema:
The schema defines a collection of GraphQL object types. Each object type has a set of fields, with optional arguments:
If I was creating an API of my own I would enter my schema at this point. Since I am using the sample, I don’t need to do this. Either way, I click on Create to proceed:
The GraphQL schema type defines the entry points for the operations on the data. All of the data stored on behalf of a particular schema must be accessible using a path that begins at one of these entry points. The console provides me with an endpoint and key for my API:
It also provides me with guidance and a set of fully functional sample apps that I can clone:
When I clicked Create, AWS AppSync created a pair of Amazon DynamoDB tables for me. I can click Data Sources to see them:
I can also see and modify my schema, issue queries, and modify an assortment of settings for my API.
Let’s take a quick look at each new feature…
Console Log Access
The AWS AppSync Console already allows me to issue queries and to see the results, and now provides access to relevant log entries. In order to see the entries, I must enable logs (as detailed below), open up the LOGS section, and check the checkbox. Here's a simple mutation query that adds a new event. I enter the query and click the arrow to test it:
I can click VIEW IN CLOUDWATCH for a more detailed view:
Console Testing with Mock Data
You can now create a context object in the console that will be passed to one of your resolvers for testing purposes. I'll add a testResolver item to my schema:
Then I locate it on the right-hand side of the Schema page and click Attach:
I choose a data source (this is for testing and the actual source will not be accessed), and use the Put item mapping template:
Then I click Select test context, choose Create New Context, assign a name to my test content, and click Save (as you can see, the test context contains the arguments from the query along with values to be returned for each field of the result):
After I save the new Resolver, I click Test to see the request and the response:
Subscription Resolvers
Your AWS AppSync application can monitor changes to any data source using the @aws_subscribe GraphQL schema directive and defining a Subscription type. The AWS AppSync client SDK connects to AWS AppSync using MQTT over WebSockets, and the application is notified after each mutation. You can now attach resolvers (which convert GraphQL payloads into the protocol needed by the underlying storage system) to your subscription fields and perform authorization checks when clients attempt to connect. This allows you to perform the same fine-grained authorization routines across queries, mutations, and subscriptions.
Batch GraphQL Operations
Your resolvers can now make use of DynamoDB batch operations that span one or more tables in a region. This allows you to use a list of keys in a single query, read records from multiple tables, write records in bulk to multiple tables, and conditionally write or delete related records across multiple tables.
In order to use this feature, the IAM role that you use to access your tables must grant access to DynamoDB's BatchGetItem and BatchWriteItem operations.
CloudWatch Logs Support
You can now tell AWS AppSync to log API requests to CloudWatch Logs. Click on Settings and Enable logs, then choose the IAM role and the log level:
CloudFormation Support
You can use the following CloudFormation resource types in your templates to define AWS AppSync resources:
AWS::AppSync::GraphQLApi – Defines an AppSync API, including its name and authentication type.
AWS::AppSync::ApiKey – Defines an API key that clients can use to access the API.
AWS::AppSync::GraphQLSchema – Defines a GraphQL schema.
AWS::AppSync::DataSource – Defines a data source.
AWS::AppSync::Resolver – Defines a resolver by referencing a schema and a data source, and includes a mapping template for requests.
Here’s a simple schema definition in YAML form:
AppSyncSchema:
Type: "AWS::AppSync::GraphQLSchema"
DependsOn:
- AppSyncGraphQLApi
Properties:
ApiId: !GetAtt AppSyncGraphQLApi.ApiId
Definition: |
schema {
query: Query
mutation: Mutation
}
type Query {
singlePost(id: ID!): Post
allPosts: [Post]
}
type Mutation {
putPost(id: ID!, title: String!): Post
}
type Post {
id: ID!
title: String!
}
Available Now
These new features are available now and you can start using them today! Here are a couple of blog posts and other resources that you might find to be of interest:
As a follow-up to our earlier work to provide longer IDs for a small set of essential EC2 resources, we are now doing the same for the remaining EC2 resources, with a migration deadline of July 2018. You can opt in on a per-user, per-region, per-type basis and verify that your code, regular expressions, database schemas, and database queries work as expected.
If you have code that recognizes, processes, or stores IDs for any type of EC2 resources, please read this post with care! Here’s what you need to know:
Migration Deadline – You have until July 2018 to make sure that your code and your schemas can process and store the new, longer IDs. After that, longer IDs will be assigned by default for all newly created resources. IDs for existing resources will remain as-is and will continue to work.
More Resource Types – Longer IDs are now supported for all types of EC2 resources, and you can opt in as desired:
I would like to encourage you to opt in, starting with your test accounts, as soon as possible. This will give you time to thoroughly test your code and to make any necessary changes before promoting the code to production.
More Regions – The longer IDs are now available in the AWS China (Beijing) and AWS China (Ningxia) Regions.
Test AMIs – We have published AMIs with longer IDs that you can use for testing (search for testlongids to find them in the Public images):
Today we'd like to walk you through AWS Identity and Access Management (IAM), federated sign-in through Active Directory (AD), and Active Directory Federation Services (ADFS). With IAM, you can centrally manage users, security credentials such as access keys, and permissions that control which resources users can access. Customers have the option of creating users and group objects within IAM, or they can utilize a third-party federation service to assign external directory users access to AWS resources. To streamline the administration of user access in AWS, organizations can utilize a federated solution with an external directory, allowing them to minimize administrative overhead. Benefits of this approach include the ability to leverage existing passwords, password policies, groups, and roles. This guide provides a walk-through of how to automate the federation setup across multiple accounts/roles with an Active Directory backing identity store. This establishes the minimum baseline for the authentication architecture, including the initial IdP deployment and elements for federation.
ADFS Federated Authentication Process
The following describes the process a user follows to authenticate to AWS using Active Directory as the identity provider and ADFS as the identity broker:
Corporate user accesses the corporate Active Directory Federation Services portal sign-in page and provides Active Directory authentication credentials.
AD FS authenticates the user against Active Directory.
Active Directory returns the user’s information, including AD group membership information.
AD FS dynamically builds ARNs by using Active Directory group memberships for the IAM roles and user attributes for the AWS account IDs, and sends a signed assertion to the user's browser with a redirect to post the assertion to AWS STS.
Temporary credentials are returned using STS AssumeRoleWithSAML.
The user is authenticated and provided access to the AWS management console.
Configuration Steps
Configuration requires setup in the identity provider store (for example, Active Directory), the identity broker (for example, Active Directory Federation Services), and AWS. It is possible to configure AWS to federate authentication using a variety of third-party SAML 2.0-compliant identity providers; more information can be found here.
AWS Configuration
The configuration steps outlined in this document can be completed to enable federated access to multiple AWS accounts, facilitating a single sign on process across a multi-account AWS environment. Access can also be provided to multiple roles in each AWS account. The roles available to a user are based on their group memberships in the identity provider (IdP). In a multi-role and/or multi-account scenario, role assumption requires the user to select the account and role they wish to assume during the authentication process.
Identity Provider
A SAML 2.0 identity provider is an IAM resource that describes an identity provider (IdP) service that supports the SAML 2.0 (Security Assertion Markup Language 2.0) standard. AWS SAML identity provider configurations can be used to establish trust between AWS and SAML-compatible identity providers, such as Shibboleth or Microsoft Active Directory Federation Services. These enable users in an organization to access AWS resources using existing credentials from the identity provider.
A SAML identity provider can be configured using the AWS console by completing the following steps.
1. In the IAM console, choose Identity providers, and then choose Create Provider.
2. Select SAML for the provider type. Enter a provider name of your choosing (this will become the logical name used in the identity provider ARN). Lastly, download the FederationMetadata.xml file from your ADFS server to your client system (https://yourADFSserverFQDN/FederationMetadata/2007-06/FederationMetadata.xml). Click "Choose File" and upload it to AWS.
3. Click “Next Step” and then verify the information you have entered. Click “Create” to complete the AWS identity provider configuration process.
IAM Role Naming Convention for User Access
Once the AWS identity provider configuration is complete, it is necessary to create the roles in AWS that federated users can assume via SAML 2.0. An IAM role is an AWS identity with permission policies that determine what the identity can and cannot do in AWS. In a federated authentication scenario, users (as defined in the IdP) assume an AWS role during the sign-in process. A role should be defined for each access delineation that you wish to define. For example, create a role for each line of business (LOB), or each function within a LOB. Each role will then be assigned a set of policies that define what privileges the users who will be assuming that role will have.
The following steps detail how to create a single role. These steps should be completed multiple times to enable assumption of different roles within AWS, as required.
2. Select “SAML” as the trusted entity type. Click Next Step.
3. Select your previously created identity provider. Click Next: Permissions.
4. The next step requires selection of policies that represent the desired permissions the user should obtain in AWS, once they have authenticated and successfully assumed the role. This can be either a custom policy or preferably an AWS managed policy. AWS recommends leveraging existing AWS access policies for job functions for common levels of access. For example, the “Billing” AWS Managed policy should be utilized to provide financial analyst access to AWS billing and cost information.
5. Provide a name for your role. All roles should be named with the prefix ADFS- (for example, ADFS-<rolename>) to simplify the identification of roles in AWS that are accessed through the federated authentication process. Next, click "Create Role".
Active Directory Configuration
Determining how you will create and delineate your AD groups and IAM roles in AWS is crucial to how you secure access to your account and manage resources. SAML assertions to the AWS environment and the respective IAM role access will be managed through regular expression (regex) matching between your on-premises AD group name to an AWS IAM role.
One approach for creating the AD groups that uniquely identify the AWS IAM role mapping is to select a common group naming convention. For example, your AD groups would start with an identifier, such as AWS-, to distinguish your AWS groups from others within the organization. Next, include the 12-digit AWS account number. Finally, add the matching role name within the AWS account. For example, a group named AWS-123456789012-ADFS-Production would map its members to the ADFS-Production role in AWS account 123456789012 (the account number here is a placeholder).
You should do this for each role and corresponding AWS account you wish to support with federated access. Users in Active Directory can subsequently be added to the groups, providing the ability to assume access to the corresponding roles in AWS. If a user is associated with multiple Active Directory groups and AWS accounts, they will see a list of roles by AWS account and will have the option to choose which role to assume. A user will not be able to assume more than one role at a time, but has the ability to switch between them as needed.
Note: Microsoft imposes a limit on the number of groups a user can be a member of (approximately 1,015 groups) due to the size limit for the access token that is created for each security principal. This limitation, however, is not affected by how the groups may or may not be nested.
Active Directory Federation Services Configuration
ADFS federation occurs with the participation of two parties: the identity or claims provider (in this case the owner of the identity repository, Active Directory) and the relying party, which is another application that wishes to outsource authentication to the identity provider, in this case AWS Security Token Service (STS). The relying party is a federation partner that is represented by a claims provider trust in the federation service.
Relying Party
You can configure a new relying party in Active Directory Federation Services by doing the following.
1. From the ADFS Management Console, right-click ADFS and select Add Relying Party Trust.
2. In the Add Relying Party Trust Wizard, click Start.
3. Check Import data about the relying party published online or on a local network, enter https://signin.aws.amazon.com/static/saml-metadata.xml, and then click Next. The metadata XML file is a standard SAML metadata document that describes AWS as a relying party.
Note: SAML federations use metadata documents to maintain information about the public keys and certificates that each party utilizes. At run time, each member of the federation can then use this information to validate that the cryptographic elements of the distributed transactions come from the expected actors and haven’t been tampered with. Since these metadata documents do not contain any sensitive cryptographic material, AWS publishes federation metadata at https://signin.aws.amazon.com/static/saml-metadata.xml
4. Set the display name for the relying party and then click Next.
5. We will not choose to enable/configure the MFA settings at this time.
6. Select “Permit all users to access this relying party” and click Next.
7. Review your settings and then click Next.
8. Choose Close on the Finish page to complete the Add Relying Party Trust Wizard. AWS is now configured as a relying party.
Custom Claim Rules
Microsoft Active Directory Federation Services (AD FS) uses Claims Rule Language to issue and transform claims between claims providers and relying parties. A claim is information about a user from a trusted source. The trusted source is asserting that the information is true, and that source has authenticated the user in some manner. The claims provider is the source of the claim. This can be information pulled from an attribute store such as Active Directory (AD). The relying party is the destination for the claims, in this case AWS.
AD FS provides administrators with the option to define custom rules that they can use to determine the behavior of identity claims with the claim rule language. The Active Directory Federation Services (AD FS) claim rule language acts as the administrative building block to help manage the behavior of incoming and outgoing claims. There are four claim rules that need to be created to effectively enable Active Directory users to assume roles in AWS based on group membership in Active Directory.
Right-click on the relying party (in this case Amazon Web Services) and then click Edit Claim Rules
Here are the steps used to create the claim rules for NameId, RoleSessionName, Get AD Groups and Roles.
1. NameId
a) In the Edit Claim Rules for <relying party> dialog box, click Add Rule.
b) Select Transform an Incoming Claim and then click Next.
c) Use the following settings:
i) Claim rule name: NameId
ii) Incoming claim type: Windows Account Name
iii) Outgoing claim type: Name ID
iv) Outgoing name ID format: Persistent Identifier
v) Pass through all claim values: checked
3. Get AD Groups
a) Click Add Rule.
b) In the Claim rule template list, select Send Claims Using a Custom Rule and then click Next.
c) For Claim Rule Name, enter Get AD Groups, and then in Custom rule, enter the following:
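A sketch of that rule, following the pattern AWS documents for retrieving an authenticated user's token groups from Active Directory (verify the claim type and issuer values against your own environment):
c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"]
 => add(store = "Active Directory", types = ("http://temp/variable"), query = ";tokenGroups;{0}", param = c.Value);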
This custom rule uses a script in the claim rule language that retrieves all the groups the authenticated user is a member of and places them into a temporary claim named http://temp/variable. Think of this as a variable you can access later.
Note: Ensure there’s no trailing whitespace to avoid unexpected results.
4. Role Attributes
a) Unlike the previous claims, here we use custom rules to send role attributes. This is done by retrieving all the authenticated user's AD groups and then matching the groups that start with AWS- to IAM roles of a similar name. I used the names of these groups to create Amazon Resource Names (ARNs) of IAM roles in my AWS account (i.e., those that start with AWS-). Sending role attributes requires two custom rules. The first rule retrieves all the authenticated user's AD group memberships and the second rule performs the transformation to the roles claim.
i) Click Add Rule.
ii) In the Claim rule template list, select Send Claims Using a Custom Rule and then click Next.
iii) For Claim Rule Name, enter Roles, and then in Custom rule, enter the following:
Rule language: c:[Type == "http://temp/variable", Value =~ "(?i)^AWS-([\d]{12})"] => issue(Type = "https://aws.amazon.com/SAML/Attributes/Role", Value = RegExReplace(c.Value, "AWS-([\d]{12})-", "arn:aws:iam::$1:saml-provider/idp1,arn:aws:iam::$1:role/"));
This custom rule uses regular expressions to transform each group membership of the form AWS-<Account Number>-<Role Name> into the IAM role ARN and IAM federation provider ARN pair that AWS expects.
Note: In the example rule language above idp1 represents the logical name given to the SAML identity provider in the AWS identity provider setup. Please change this based on the logical name you chose in the IAM console for your identity provider.
Adjusting Session Duration
By default, the temporary credentials that are issued by AWS IAM for SAML federation are valid for an hour. Depending on your organization's security stance, you may wish to adjust this. You can allow your federated users to work in the AWS Management Console for up to 12 hours. This can be accomplished by adding another claim rule in your ADFS configuration. To add the rule, do the following:
1. Access the ADFS Management Tool on your ADFS server.
2. Choose Relying Party Trusts, then select your AWS relying party configuration.
3. Choose Edit Claim Rules.
4. Choose Add Rule to configure a new rule, and then choose Send claims using a custom rule. Finally, choose Next.
5. Name your rule "Session Duration" and add the following rule syntax.
6. Adjust the value of 28800 seconds (8 hours) as appropriate.
Rule language: => issue(Type = "https://aws.amazon.com/SAML/Attributes/SessionDuration", Value = "28800");
Note: AD FS 2012 R2 and AD FS 2016 tokens have a sixty-minute validity period by default. This value is configurable on a per-relying party trust basis. In addition to adding the “Session Duration” claim rule, you will also need to update the security token created by AD FS. To update this value, run the following command:
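A sketch of that PowerShell command, assuming the relying party trust display name you chose earlier was "Amazon Web Services" and an eight-hour token lifetime:
Set-AdfsRelyingPartyTrust -TargetName "Amazon Web Services" -TokenLifetime 480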
The -TokenLifetime parameter determines the lifetime in minutes. In this example, we set the lifetime to 480 minutes (eight hours).
These are the main settings related to session lifetimes and user authentication. Once updated, any new console session your federated users initiate will be valid for the duration specified in the SessionDuration claim.
API/CLI Access
Access to the AWS API and command-line tools using federated access can be accomplished using the techniques in the following blog article:
This will enable your users to access your AWS environment using their domain credentials through the AWS CLI or one of the AWS SDKs.
Conclusion
In this post, I've shown you how to provide identity federation, and thus SSO, to the AWS Management Console for multiple accounts using SAML assertions. With this approach, the AWS Security Token Service (STS) provides temporary credentials (via SAML) for the user to assume a role (one they have access to use, as denoted by AD group membership) that has specific permissions associated with it, as opposed to providing long-term access credentials to the AWS resources. By adopting this model, you will have a secure and robust IAM approach for accessing AWS resources that aligns with AWS security best practices.
An ETL (Extract, Transform, Load) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near-real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.
Amazon Redshift is a fast, petabyte-scale data warehouse that enables you to easily make data-driven decisions. With Amazon Redshift, you can get insights into your big data in a cost-effective fashion using standard SQL. You can set up any type of data model, from star and snowflake schemas to simple denormalized tables, for running any analytical queries.
To operate a robust ETL platform and deliver data to Amazon Redshift in a timely manner, design your ETL processes to take account of Amazon Redshift’s architecture. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues long term. This post guides you through the following best practices for ensuring optimal, consistent runtimes for your ETL processes:
COPY data from multiple, evenly sized files.
Use workload management to improve ETL runtimes.
Perform table maintenance regularly.
Perform multiple steps in a single transaction.
Load data in bulk.
Use UNLOAD to extract large result sets.
Use Amazon Redshift Spectrum for ad hoc ETL processing.
Monitor daily ETL health using diagnostic queries.
1. COPY data from multiple, evenly sized files
Amazon Redshift is an MPP (massively parallel processing) database, where all the compute nodes divide and parallelize the work of ingesting data. Each node is further subdivided into slices, with each slice having one or more dedicated cores, equally dividing the processing capacity. The number of slices per node depends on the node type of the cluster. For example, each DS2.XLARGE compute node has two slices, whereas each DS2.8XLARGE compute node has 16 slices.
When you load data into Amazon Redshift, you should aim to have each slice do an equal amount of work. When you load the data from a single large file or from files split into uneven sizes, some slices do more work than others. As a result, the process runs only as fast as the slowest, or most heavily loaded, slice. In the example shown below, a single large file is loaded into a two-node cluster, resulting in only one of the nodes, “Compute-0”, performing all the data ingestion:
When splitting your data files, ensure that they are of approximately equal size – between 1 MB and 1 GB after compression. The number of files should be a multiple of the number of slices in your cluster. Also, I strongly recommend that you individually compress the load files using gzip, lzop, or bzip2 to efficiently load large datasets.
When loading multiple files into a single table, use a single COPY command for the table, rather than multiple COPY commands. Amazon Redshift automatically parallelizes the data ingestion. Using a single COPY command to bulk load data into a table ensures optimal use of cluster resources, and quickest possible throughput.
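As an illustrative sketch, a single COPY that loads all the compressed, evenly sized files under one staging prefix might look like the following; the table, bucket, prefix, and role names are placeholders:
-- One COPY per table; Amazon Redshift parallelizes the load across slices
COPY daily_sales
FROM 's3://my-etl-bucket/staging/daily_sales/part'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
GZIP;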
2. Use workload management to improve ETL runtimes
Use Amazon Redshift’s workload management (WLM) to define multiple queues dedicated to different workloads (for example, ETL versus reporting) and to manage the runtimes of queries. As you migrate more workloads into Amazon Redshift, your ETL runtimes can become inconsistent if WLM is not appropriately set up.
I recommend limiting the overall concurrency of WLM across all queues to around 15 or less. This WLM guide helps you organize and monitor the different queues for your Amazon Redshift cluster.
When managing different workloads on your Amazon Redshift cluster, consider the following for the queue setup:
Create a queue dedicated to your ETL processes. Configure this queue with a small number of slots (5 or fewer). Amazon Redshift is designed for analytics queries, rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to the commit queue. Because ETL is a commit-intensive process, having a separate queue with a small number of slots helps mitigate this issue.
Claim extra memory available in a queue. When executing an ETL query, you can take advantage of the wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve COPYing raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), the wlm_query_slot_count can be increased for this step.
Create a separate queue for reporting queries. Configure query monitoring rules on this queue to further manage long-running and expensive queries.
Take advantage of the dynamic memory parameters. They swap the memory from your ETL to your reporting queue after the ETL job has completed.
3. Perform table maintenance regularly
Amazon Redshift is a columnar database, which enables fast transformations for aggregating data. Performing regular table maintenance ensures that transformation ETLs are predictable and performant. To get the best performance from your Amazon Redshift database, you must ensure that database tables are regularly VACUUMed and ANALYZEd. The Analyze & Vacuum schema utility helps you automate the table maintenance task and run VACUUM and ANALYZE on a regular schedule.
Use VACUUM to sort tables and remove deleted blocks
During a typical ETL refresh process, tables receive new incoming records using COPY, and unneeded data (cold data) is removed using DELETE. New rows are added to the unsorted region in a table. Deleted rows are simply marked for deletion.
DELETE does not automatically reclaim the space occupied by the deleted rows. Adding and removing large numbers of rows can therefore cause the unsorted region and the number of deleted blocks to grow. This can degrade the performance of queries executed against these tables.
After an ETL process completes, perform VACUUM to ensure that user queries execute in a consistent manner. The complete list of tables that need VACUUMing can be found using the Amazon Redshift Util’s table_info script.
Use the following approaches to ensure that VACUUM is completed in a timely manner:
Use wlm_query_slot_count to claim all the memory allocated in the ETL WLM queue during the VACUUM process.
DROP or TRUNCATE intermediate or staging tables, thereby eliminating the need to VACUUM them.
If your table has a compound sort key with only one sort column, try to load your data in sort key order. This helps reduce or eliminate the need to VACUUM the table.
Consider using time series tables. This helps reduce the amount of data you need to VACUUM.
Use ANALYZE to update database statistics
Amazon Redshift uses a cost-based query planner and optimizer using statistics about tables to make good decisions about the query plan for the SQL statements. Regular statistics collection after the ETL completion ensures that user queries run fast, and that daily ETL processes are performant. The Amazon Redshift utility table_info script provides insights into the freshness of the statistics. Keeping the statistics off (pct_stats_off) less than 20% ensures effective query plans for the SQL queries.
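For example, a post-ETL maintenance step for a hypothetical table could simply run the following:
-- Re-sort rows and reclaim space left by deleted rows
VACUUM daily_sales;
-- Refresh the statistics used by the query planner
ANALYZE daily_sales;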
4. Perform multiple steps in a single transaction
ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.
To minimize the number of commits in a process, the steps in an ETL script should be surrounded by a BEGIN…END statement so that a single commit is performed only after all the transformation logic has been executed. For example, here is an example multi-step ETL script that performs one commit at the end:
BEGIN;
CREATE TEMPORARY TABLE staging_table (..);
INSERT INTO staging_table SELECT .. FROM source; -- transformation logic
DELETE FROM daily_table WHERE dataset_date = ?;
INSERT INTO daily_table SELECT .. FROM staging_table; -- daily aggregate
DELETE FROM weekly_table WHERE weekending_date = ?;
INSERT INTO weekly_table SELECT .. FROM staging_table; -- weekly aggregate
COMMIT;
5. Load data in bulk
Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon S3 you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:
Use a manifest file to ingest large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. Using a manifest file ensures that Amazon Redshift has a consistent view of the data to be loaded from S3, while also ensuring that duplicate files do not result in the same data being loaded more than one time.
Use temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
Use ALTER TABLE APPEND to move data from the staging tables to the target table (see the sketch after this list). Data in the source table is moved to matching columns in the target table. Column order doesn't matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because data is moved, not duplicated.
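A sketch of the append step, using placeholder table names:
-- Moves rows from the staging table into the target table;
-- ALTER TABLE APPEND cannot run inside a transaction block.
ALTER TABLE daily_sales APPEND FROM staging_daily_sales;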
6. Use UNLOAD to extract large result sets
Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. Here is an example of a large SELECT statement. Notice that the leader node is doing most of the work to stream out the rows:
Use UNLOAD to extract large results sets directly to S3. After it’s in S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into S3.
If you are extracting data for use with Amazon Redshift Spectrum, you should make use of the MAXFILESIZE parameter to keep files around 150 MB. Similar to item 1 above, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel.
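A sketch of such an UNLOAD; the query, bucket, and role are placeholders, and MAXFILESIZE caps each output file at roughly 150 MB:
UNLOAD ('SELECT * FROM weekly_tbl')
TO 's3://my-etl-bucket/spectrum/weekly_'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
MAXFILESIZE 150 MB;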
7. Use Redshift Spectrum for ad hoc ETL processing
Events such as data backfill, promotional activity, and special calendar days can trigger additional data volumes that affect the data refresh times in your Amazon Redshift cluster. To help address these spikes in data volumes and throughput, I recommend staging data in S3. After data is organized in S3, Redshift Spectrum enables you to query it directly using standard SQL. In this way, you gain the benefits of additional capacity without having to resize your cluster.
8. Monitor daily ETL health using diagnostic queries
Monitoring the health of your ETL processes on a regular basis helps identify the early onset of performance issues before they have a significant impact on your cluster. The following checks and recommendations provide insight into the health of your ETL processes:
• Follow the best practices for the COPY command.
• Analyze data growth with the incoming datasets and consider a cluster resize to meet the expected SLA.
• Set up regular VACUUM jobs to address unsorted rows and reclaim the deleted blocks so that transformation SQL executes optimally.
• Consider a table redesign to avoid data skewness.
• If INSERT/UPDATE/COPY/DELETE operations on particular tables do not respond in a timely manner, compared to when they run after the ETL, multiple DML statements may be operating on the same target table at the same moment from different transactions. Set up ETL job dependencies so that they execute serially against the same target table.
• If Amazon Redshift data warehouse space growth is trending upward more than normal, analyze the individual tables that are growing at a higher rate than normal. Consider data archival using UNLOAD to S3 and Redshift Spectrum for later analysis.
• Analyze the top transformation SQL and use EXPLAIN to find opportunities for tuning the query plan.
There are several other useful scripts available in the amazon-redshift-utils repository. The AWS Lambda Utility Runner runs a subset of these scripts on a scheduled basis, allowing you to automate much of the monitoring of your ETL processes.
Example ETL process
The following ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow where data from an RDBMS source system is staged in S3 and then loaded into Amazon Redshift. Amazon Redshift is used to calculate daily, weekly, and monthly aggregations, which are then unloaded to S3, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.
Step 1: Extract from the RDBMS source to a S3 bucket
In this ETL process, the data extract job fetches change data every hour and stages it in multiple hourly files. For example, the staged S3 folder looks like the following:
Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Amazon Redshift cluster. Further, the files are compressed (gzipped) to further reduce COPY times.
Step 2: Stage data to the Amazon Redshift table for cleansing
Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that S3 eventual consistency issues can be eliminated and also provides an opportunity to dedupe any files if needed. A sample manifest20170702.json file looks like the following:
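A minimal sketch of such a manifest; the bucket and file names are placeholders:
{
  "entries": [
    {"url": "s3://my-etl-bucket/batch/2017-07-02/sales_01.csv.gz", "mandatory": true},
    {"url": "s3://my-etl-bucket/batch/2017-07-02/sales_02.csv.gz", "mandatory": true}
  ]
}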
The data can be ingested using the following command:
SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;
COPY stage_tbl FROM 's3://<<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;
Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.
Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables
Data is staged in the “stage_tbl” from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process:
BEGIN;
INSERT into ETL_LOG (..) values (..);
DELETE from weekly_tbl where dataset_week = <<current week>>;
INSERT into weekly_tbl (..)
SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date, SUM(C1) AS C1, SUM(C2) AS C2
FROM stage_tbl
GROUP BY date_trunc('week', dataset_day);
INSERT into AUDIT_LOG values (..);
COMMIT;
End;
As shown above, multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.
Step 4: Unload the daily dataset to populate the S3 data lake bucket
The transformed results are now unloaded into another S3 bucket, where they can be further processed and made available for end-user reporting using a number of different tools, including Redshift Spectrum and Amazon Athena.
unload ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>') TO 's3://<<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';
Summary
Amazon Redshift lets you easily operate petabyte-scale data warehouses on the cloud. This post summarized the best practices for operating scalable ETL natively within Amazon Redshift. I demonstrated efficient ways to ingest and transform data, along with close monitoring. I also demonstrated these best practices in a typical sample ETL workload that transforms data and loads it into Amazon Redshift.
If you have questions or suggestions, please comment below.
About the Author
Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.
Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
Compiling Hail
Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.
There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.
When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.
The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.
Collecting genome data
To get started with Hail, use the 1000 Genome Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.
For Hail to process these files in an efficient manner, they need to be block compressed. In many cases, files that use gzip compression are already compressed in blocks, so you don’t need to recompress—you can just rename the file extension from “.gz” to “.bgz”. Hail can process .gz files, but it’s much slower and not recommended. The simple way to accomplish this is to copy the data files from the public S3 bucket to your own and rename them.
The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.
for i in $(seq 5); do aws s3 cp s3://1000genomes/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz s3://your_bucket/prefix/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz; done
Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.
In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.
Let’s open the Jupyter notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Outputs tab. There you see the JupyterURL. Open this URL in your browser.
Go ahead and download the provided Jupyter notebook to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.
After the notebook is uploaded, choose it from the list on the left to open it.
Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In [1].
Next, run the subsequent two cells, which import the Hail module into PySpark and initiate the Hail context.
The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.
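As a rough sketch, the first few cells look something like the following (this assumes the Hail 0.1 Python API that the CloudFormation template compiles; the S3 path is a placeholder for your own bucket):
from hail import *

# Create the Hail context on top of the SparkContext (sc) that the PySpark shell provides
hc = HailContext(sc)

# Import one block-compressed VCF; drop the file name from the path to import every VCF under the prefix
vds = hc.import_vcf('s3://your_bucket/prefix/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz')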
Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.
The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.
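A sketch of that cell follows. The annotation file name, the key column, and the sa.pheno root are assumptions for illustration, and the exact annotate_samples_table signature can vary with the Hail version you compiled:
# Load the sample annotations from a TSV file and let Hail infer the column types
annotations = hc.import_table('s3://your_bucket/prefix/sample_annotations.tsv', impute=True).key_by('Sample')

# Attach the annotations to each sample in the dataset under the sa.pheno root
vds = vds.annotate_samples_table(annotations, root='sa.pheno')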
At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.
You can also calculate the standard quality control (QC) metrics on your variants and samples.
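Both explorations look something like the following sketch (again written against the Hail 0.1 API; the Population field name assumes the annotation columns loaded above):
# Count samples grouped by their population annotation
print(vds.query_samples('samples.map(s => sa.pheno.Population).counter()'))

# Compute the standard per-sample and per-variant QC metrics, stored under sa.qc and va.qc
vds = vds.sample_qc().variant_qc()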
What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your variant dataset (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.
In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.
Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.
Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.
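Putting those steps together, the export cells might look like the following sketch (method names follow the Hail 0.1 API, and the output paths are placeholders chosen to line up with the crawler paths used in the next section):
# Export the lowercased sample and variant annotations as Spark DataFrames and write them to S3 as Parquet
vds.samples_table().to_dataframe().write.parquet('s3://your_output_bucket/hail_data/sample_annotations/')
vds.variants_table().to_dataframe().write.parquet('s3://your_output_bucket/hail_data/variant_annotations/')

# Save the VDS itself (a path plus a file name) so that later analyses can skip the raw VCF import
vds.write('s3://your_output_bucket/hail_data/genotypes.vds')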
After you run these cells, expect it to take some time as it writes out the data.
Discovering table metadata
Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.
Creating tables in AWS Glue
To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.
Then choose Add crawler to create a new crawler.
Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/
Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.
Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotation data, s3://your_output_bucket/hail_data/variant_annotations/, so that it can build your variant annotation table. Give it an existing database name, or create a new one.
Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:
You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:
Creating tables in Amazon Athena
If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:
In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:
To verify that you have data, choose the three dots on the right, and then choose Preview table.
Indeed, you can see some data.
You can further explore the sample and variant annotations along with the calculated QC metrics that you calculated previously using Hail.
Summary
To summarize, this post demonstrated the ease with which you can install, configure, and use Hail, an open source, highly scalable framework for exploring and analyzing genomics data on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on-demand in a serverless environment using Amazon Athena.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan and enjoys cheering his team on and hanging out with his family.
Now, Amazon Cloud Directory makes it easier for you to apply schema changes across your directories with in-place schema upgrades. Your directory now remains available while Cloud Directory applies backward-compatible schema changes such as the addition of new fields. Without migrating data between directories or applying code changes to your applications, you can upgrade your schemas. You also can view the history of your schema changes in Cloud Directory by using version identifiers, which help you track and audit schema versions across directories. If you have multiple instances of a directory with the same schema, you can view the version history of schema changes to manage your directory fleet and ensure that all directories are running with the same schema version.
In this blog post, I demonstrate how to perform an in-place schema upgrade and use schema versions in Cloud Directory. I add additional attributes to an existing facet and add a new facet to a schema. I then publish the new schema and apply it to running directories, upgrading the schema in place. I also show how to view the version history of a directory schema, which helps me to ensure my directory fleet is running the same version of the schema and has the correct history of schema changes applied to it.
Note: I share Java code examples in this post. I assume that you are familiar with the AWS SDK and can use Java-based code to build a Cloud Directory code example. You can apply the concepts I cover in this post to other programming languages such as Python and Ruby.
Cloud Directory fundamentals
I will start by covering a few Cloud Directory fundamentals. If you are already familiar with the concepts behind Cloud Directory facets, schemas, and schema lifecycles, you can skip to the next section.
Facets: Groups of attributes. You use facets to define object types. For example, you can define a device schema by adding facets such as computers, phones, and tablets. A computer facet can track attributes such as serial number, make, and model. You can then use the facets to create computer objects, phone objects, and tablet objects in the directory to which the schema applies.
Schemas: Collections of facets. Schemas define which types of objects can be created in a directory (such as users, devices, and organizations) and enforce validation of data for each object class. All data within a directory must conform to the applied schema. As a result, the schema definition is essentially a blueprint to construct a directory with an applied schema.
Schema lifecycle: The four distinct states of a schema: Development, Published, Applied, and Deleted. Schemas in the Published and Applied states have version identifiers and cannot be changed. Schemas in the Applied state are used by directories for validation as applications insert or update data. You can change schemas in the Development state as many times as you need to. In-place schema upgrades allow you to apply schema changes to an existing Applied schema in a production directory without the need to export and import the data populated in the directory.
How to add attributes to a computer inventory application schema and perform an in-place schema upgrade
To demonstrate how to set up schema versioning and perform an in-place schema upgrade, I will use an example of a computer inventory application that uses Cloud Directory to store relationship data. Let’s say that at my company, AnyCompany, we use this computer inventory application to track all computers we give to our employees for work use. I previously created a ComputerSchema and assigned its version identifier as 1. This schema contains one facet called ComputerInfo that includes attributes for SerialNumber, Make, and Model, as shown in the following schema details.
AnyCompany has offices in Seattle, Portland, and San Francisco. I have deployed the computer inventory application for each of these three locations. As shown in the lower left part of the following diagram, ComputerSchema is in the Published state with a version of 1. The Published schema is applied to SeattleDirectory, PortlandDirectory, and SanFranciscoDirectory for AnyCompany’s three locations. Implementing separate directories for different geographic locations when you don’t have any queries that cross location boundaries is a good data partitioning strategy and gives your application better response times with lower latency.
The following code example creates the schema in the Development state by using a JSON file, publishes the schema, and then creates directories for the Seattle, Portland, and San Francisco locations. For this example, I assume the schema has been defined in the JSON file. The createSchema API creates a schema Amazon Resource Name (ARN) with the name defined in the variable, SCHEMA_NAME. I can use the putSchemaFromJson API to add specific schema definitions from the JSON file.
// The utility method to get valid Cloud Directory schema JSON
String validJson = getJsonFile("ComputerSchema_version_1.json");
String SCHEMA_NAME = "ComputerSchema";
String developmentSchemaArn = client.createSchema(new CreateSchemaRequest()
.withName(SCHEMA_NAME))
.getSchemaArn();
// Put the schema document in the Development schema
PutSchemaFromJsonResult result = client.putSchemaFromJson(new PutSchemaFromJsonRequest()
.withSchemaArn(developmentSchemaArn)
.withDocument(validJson));
The following code example takes the schema that is currently in the Development state and publishes the schema, changing its state to Published.
String SCHEMA_VERSION = "1";
String publishedSchemaArn = client.publishSchema(
new PublishSchemaRequest()
.withDevelopmentSchemaArn(developmentSchemaArn)
.withVersion(SCHEMA_VERSION))
.getPublishedSchemaArn();
// Our Published schema ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1
The following code example creates a directory named SeattleDirectory and applies the published schema. The createDirectory API call creates a directory by using the published schema provided in the API parameters. Note that Cloud Directory stores a version of the schema in the directory in the Applied state. I will use similar code to create directories for PortlandDirectory and SanFranciscoDirectory.
String DIRECTORY_NAME = "SeattleDirectory";
CreateDirectoryResult directory = client.createDirectory(
new CreateDirectoryRequest()
.withName(DIRECTORY_NAME)
.withSchemaArn(publishedSchemaArn));
String directoryArn = directory.getDirectoryArn();
String appliedSchemaArn = directory.getAppliedSchemaArn();
// This code section can be reused to create directories for Portland and San Francisco locations with the appropriate directory names
// Our directory ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX
// Our applied schema ARN is as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
Revising a schema
Now let’s say my company, AnyCompany, wants to add more information for computers and to track which employees have been assigned a computer for work use. I modify the schema to add two attributes to the ComputerInfo facet: Description and OSVersion (operating system version). I make Description optional because it is not important for me to track this attribute for the computer objects I create. I make OSVersion mandatory because it is critical for me to track it for all computer objects so that I can make changes such as applying security patches or making upgrades. Because I make OSVersion mandatory, I must provide a default value that Cloud Directory will apply to objects that were created before the schema revision, in order to handle backward compatibility. Note that you can replace the value in any object with a different value.
I also add a new facet to track computer assignment information, shown in the following updated schema as the ComputerAssignment facet. This facet tracks these additional attributes: Name (the name of the person to whom the computer is assigned), EMail (the email address of the assignee), Department, and department CostCenter. Note that Cloud Directory refers to the previously available version identifier as the Major Version. Because I can now add a minor version to a schema, I also denote the changed schema as Minor Version A.
The following diagram shows the changes that were made when I added another facet to the schema and attributes to the existing facet. The highlighted area of the diagram (bottom left) shows that the schema changes were published.
The following code example revises the existing Development schema by adding the new attributes to the ComputerInfo facet and by adding the ComputerAssignment facet. I use a new JSON file for the schema revision, and for the purposes of this example, I am assuming the JSON file has the full schema including planned revisions.
// The utility method to get a valid CloudDirectory schema JSON
String schemaJson = getJsonFile("ComputerSchema_version_1_A.json");
// Put the schema document in the Development schema
PutSchemaFromJsonResult result = client.putSchemaFromJson(
new PutSchemaFromJsonRequest()
.withSchemaArn(developmentSchemaArn)
.withDocument(schemaJson));
Upgrading the Published schema
The following code example performs an in-place schema upgrade of the Published schema with schema revisions (it adds new attributes to the existing facet and another facet to the schema). The upgradePublishedSchema API upgrades the Published schema with backward-compatible changes from the Development schema.
// From an earlier code example, I know the publishedSchemaArn has this value: "arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1"
// Upgrade publishedSchemaArn to minorVersion A. The Development schema must be backward compatible with
// the existing publishedSchemaArn.
String minorVersion = "A";
UpgradePublishedSchemaResult upgradePublishedSchemaResult = client.upgradePublishedSchema(new UpgradePublishedSchemaRequest()
.withDevelopmentSchemaArn(developmentSchemaArn)
.withPublishedSchemaArn(publishedSchemaArn)
.withMinorVersion(minorVersion));
String upgradedPublishedSchemaArn = upgradePublishedSchemaResult.getUpgradedSchemaArn();
// The Published schema ARN after the upgrade shows a minor version as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:schema/published/ComputerSchema/1/A
Upgrading the Applied schema
The following diagram shows the in-place schema upgrade for the SeattleDirectory directory. I am performing the schema upgrade so that I can reflect the new schemas in all three directories. As a reminder, I added new attributes to the ComputerInfo facet and also added the ComputerAssignment facet. After the schema and directory upgrade, I can create objects for the ComputerInfo and ComputerAssignment facets in the SeattleDirectory. Any objects that were created with the old facet definition for ComputerInfo will now use the default values for any additional attributes defined in the new schema.
I use the following code example to perform an in-place upgrade of the SeattleDirectory to a Major Version of 1 and a Minor Version of A. Note that you should change a Major Version identifier in a schema to make backward-incompatible changes such as changing the data type of an existing attribute or dropping a mandatory attribute from your schema. Backward-incompatible changes require directory data migration from a previous version to the new version. You should change a Minor Version identifier in a schema to make backward-compatible upgrades such as adding additional attributes or adding facets, which in turn may contain one or more attributes. The upgradeAppliedSchema API lets me upgrade an existing directory with a different version of a schema.
// This upgrades ComputerSchema version 1 of the Applied schema in SeattleDirectory to Major Version 1 and Minor Version A
// The schema must be backward compatible or the API will fail with IncompatibleSchemaException
UpgradeAppliedSchemaResult upgradeAppliedSchemaResult = client.upgradeAppliedSchema(new UpgradeAppliedSchemaRequest()
.withDirectoryArn(directoryArn)
.withPublishedSchemaArn(upgradedPublishedSchemaArn));
String upgradedAppliedSchemaArn = upgradeAppliedSchemaResult.getUpgradedSchemaArn();
// The Applied schema ARN after the in-place schema upgrade will appear as follows
// arn:aws:clouddirectory:us-west-2:XXXXXXXXXXXX:directory/XX_DIRECTORY_GUID_XX/schema/ComputerSchema/1
// This code section can be reused to upgrade directories for the Portland and San Francisco locations with the appropriate directory ARN
Note: Cloud Directory has excluded returning the Minor Version identifier in the Applied schema ARN for backward compatibility and to enable the application to work across older and newer versions of the directory.
The following diagram shows the changes that are made when I perform an in-place schema upgrade in the two remaining directories, PortlandDirectory and SanFranciscoDirectory. I make these calls sequentially, upgrading PortlandDirectory first and then upgrading SanFranciscoDirectory. I use the same code example that I used earlier to upgrade SeattleDirectory. Now, all my directories are running the most current version of the schema. Also, I made these schema changes without having to migrate data and while maintaining my application’s high availability.
Schema revision history
I can now view the schema revision history for any of AnyCompany’s directories by using the listAppliedSchemaArns API. Cloud Directory maintains the five most recent versions of applied schema changes. Similarly, to inspect the current Minor Version that was applied to my schema, I use the getAppliedSchemaVersion API. The listAppliedSchemaArns API returns the schema ARNs based on my schema filter as defined in withSchemaArn.
I use the following code example to query an Applied schema for its version history.
// This returns the five most recent Minor Versions associated with a Major Version
ListAppliedSchemaArnsResult listAppliedSchemaArnsResult = client.listAppliedSchemaArns(new ListAppliedSchemaArnsRequest()
.withDirectoryArn(directoryArn)
.withSchemaArn(upgradedAppliedSchemaArn));
// Note: The listAppliedSchemaArns API without the SchemaArn filter returns all the Major Versions in a directory
The listAppliedSchemaArns API returns the two ARNs as shown in the following output.
The following code example queries an Applied schema for current Minor Version by using the getAppliedSchemaVersion API.
// This returns the current Applied schema's Minor Version ARN
GetAppliedSchemaVersionResult getAppliedSchemaVersionResult = client.getAppliedSchemaVersion(new GetAppliedSchemaVersionRequest()
.withSchemaArn(upgradedAppliedSchemaArn));
The getAppliedSchemaVersion API returns the current Applied schema ARN with a Minor Version, as shown in the following output.
If you have a lot of directories, schema revision API calls can help you audit your directory fleet and ensure that all directories are running the same version of a schema. Such auditing can help you ensure high integrity of directories across your fleet.
Summary
You can use in-place schema upgrades to make changes to your directory schema as you evolve your data set to match the needs of your application. An in-place schema upgrade allows you to maintain high availability for your directory and applications while the upgrade takes place. For more information about in-place schema upgrades, see the in-place schema upgrade documentation.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this post, start a new thread in the Directory Service forum or contact AWS Support.
In this day and age, it is almost impossible to do without our mobile devices and the applications that help make our lives easier. As our dependency on our mobile phones grows, the mobile application market has exploded with millions of apps vying for our attention. For mobile developers, this means that we must ensure that we build applications that provide the quality, real-time experiences that app users desire. Therefore, it has become essential that mobile applications are developed to include features such as multi-user data synchronization, offline network support, and data discovery, just to name a few. According to several articles I read recently about mobile development trends in publications like InfoQ, DZone, and the mobile development blog AlleviateTech, one of the key elements in delivering the aforementioned capabilities is cloud-driven mobile applications. This seems especially true as it relates to mobile data synchronization and data storage.
That being the case, it is a perfect time for me to announce a new service for building innovative mobile applications that are driven by data-intensive services in the cloud: AWS AppSync. AWS AppSync is a fully managed serverless GraphQL service for real-time data queries, synchronization, communications, and offline programming features. For those not familiar, let me briefly share some information about the open GraphQL specification. GraphQL is a responsive data query language and server-side runtime for querying data sources, allowing for real-time data retrieval and dynamic query execution. You can use GraphQL to build a responsive API for use when building client applications. GraphQL works at the application layer and provides a type system for defining schemas. These schemas serve as specifications to define how operations should be performed on the data and how the data should be structured when retrieved. Additionally, GraphQL has a declarative coding model which is supported by many client libraries and frameworks including React, React Native, iOS, and Android.
Now the power of the GraphQL open standard query language is being brought to you in a rich managed service with AWS AppSync. With AppSync, developers can simplify the retrieval and manipulation of data across multiple data sources, allowing them to quickly prototype and build robust, collaborative, multi-user applications. AppSync keeps data updated when devices are connected, but enables developers to build solutions that work offline by caching data locally and synchronizing local data when connections become available.
Let’s discuss some key concepts of AWS AppSync and how the service works.
AppSync Concepts
AWS AppSync Client: a service client that defines operations, wraps authorization details of requests, and manages offline logic.
Data Source: the data storage system or a trigger housing data
Identity: a set of credentials with permissions and identification context provided with requests to GraphQL proxy
GraphQL Proxy: the GraphQL engine component for processing and mapping requests, handling conflict resolution, and managing Fine Grained Access Control
Operation: one of three GraphQL operations supported in AppSync
Query: a read-only fetch call to the data
Mutation: a write of the data followed by a fetch.
Subscription: long-lived connections that receive data in response to events.
Action: a notification to connected subscribers from a GraphQL subscription.
Resolver: a function that uses request and response mapping templates to convert and execute the request payload against the data source
How It Works
A schema is created to define types and capabilities of the desired GraphQL API and tied to a Resolver function. The schema can be created to mirror existing data sources, or AWS AppSync can create tables automatically based on the schema definition. Developers can also use GraphQL features for data discovery without having knowledge of the backend data sources. After a schema definition is established, an AWS AppSync client can be configured with an operation request, like a Query operation. The client submits the operation request to GraphQL Proxy along with an identity context and credentials. The GraphQL Proxy passes this request to the Resolver, which maps and executes the request payload against pre-configured AWS data services like an Amazon DynamoDB table, an AWS Lambda function, or a search capability using Amazon Elasticsearch Service. The Resolver executes calls to one or all of these services within a single network call, minimizing CPU cycles and bandwidth needs, and returns the response to the client. Additionally, the client application can change data requirements in code on demand, and the AppSync GraphQL API will dynamically map requests for data accordingly, allowing faster prototyping and development.
In order to take a quick peek at the service, I’ll go to the AWS AppSync console. I’ll click the Create API button to get started.
When the Create new API screen opens, I’ll give my new API a name, TarasTestApp, and since I am just exploring the new service I will select the Sample schema option. You may notice from the informational dialog box on the screen that in using the sample schema, AWS AppSync will automatically create the DynamoDB tables and the IAM roles for me. It will also deploy the TarasTestApp API on my behalf. After reviewing the sample schema provided by the console, I’ll click the Create button to create my test API.
After the TarasTestApp API has been created and the associated AWS resources provisioned on my behalf, I can make updates to the schema, data source, or connect my data source(s) to a resolver. I can also integrate my GraphQL API into an iOS, Android, Web, or React Native application by cloning the sample repo from GitHub and downloading the accompanying GraphQL schema. These application samples are great to help get you started, and they are pre-configured to function in offline scenarios.
If I select the Schema menu option on the console, I can update and view the TarasTestApp GraphQL API schema.
Additionally, if I select the Data Sources menu option in the console, I can see the existing data sources. Within this screen, I can update, delete, or add data sources if I so desire.
Next, I will select the Query menu option which takes me to the console tool for writing and testing queries. Since I chose the sample schema and the AWS AppSync service did most of the heavy lifting for me, I’ll try a query against my new GraphQL API.
I’ll use a mutation to add data for the event type in my schema. Since this is a mutation and it first writes data and then does a read of the data, I want the query to return values for name and where.
If I go to the DynamoDB table created for the event type in the schema, I will see that the values from my query have been successfully written into the table. Now that was a pretty simple task to write and retrieve data based on a GraphQL API schema from a data source, don’t you think?
Summary
AWS AppSync is currently in Public Preview, and you can sign up today. It supports development for iOS, Android, and JavaScript applications. You can take advantage of this managed GraphQL service by going to the AWS AppSync console, or you can learn more by reading the tutorial in the AWS documentation for the service or checking out our AWS AppSync Developer Guide.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP of Database, Analytics, and AI at AWS, will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real-business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly.
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases.
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment, and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania, with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF, as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora.
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliance on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high-quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness, and we share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and set up filters and drill-downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are overrun with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), API, clickstream, unstructured, and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC Flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
A data lake is an increasingly popular way to store and analyze data that addresses the challenges of dealing with massive volumes of heterogeneous data. A data lake allows organizations to store all their data—structured and unstructured—in one centralized repository. Because data can be stored as-is, there is no need to convert it to a predefined schema.
Many organizations understand the benefits of using AWS as their data lake. For example, Amazon S3 is a highly durable, cost-effective object store that supports open data formats while decoupling storage from compute, and it works with all the AWS analytic services. Although Amazon S3 provides the foundation of a data lake, you can add other services to tailor the data lake to your business needs. For more information about building data lakes on AWS, see What is a Data Lake?
Because one of the main challenges of using a data lake is finding the data and understanding the schema and data format, Amazon recently introduced AWS Glue. AWS Glue significantly reduces the time and effort that it takes to derive business insights quickly from an Amazon S3 data lake by discovering the structure and form of your data. AWS Glue automatically crawls your Amazon S3 data, identifies data formats, and then suggests schemas for use with other AWS analytic services.
This post walks you through the process of using AWS Glue to crawl your data on Amazon S3 and build a metadata store that can be used with other AWS offerings.
AWS Glue features
AWS Glue is a fully managed data catalog and ETL (extract, transform, and load) service that simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.
The AWS Glue Data Catalog is compatible with Apache Hive Metastore and supports popular tools such as Hive, Presto, Apache Spark, and Apache Pig. It also integrates directly with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
In addition, the AWS Glue Data Catalog features extensions for ease-of-use and data-management functionality beyond the core metastore capabilities.
AWS Glue is an essential component of an Amazon S3 data lake, providing the data catalog and transformation services for modern data analytics.
In the preceding figure, data is staged for different analytic use cases. Initially, the data is ingested in its raw format, which is the immutable copy of the data. The data is then transformed and enriched to make it more valuable for each use case. In this example, the raw CSV files are transformed into Apache Parquet for use by Amazon Athena to improve performance and reduce cost.
The data can also be enriched by blending it with other datasets to provide additional insights. An AWS Glue crawler creates a table for each stage of the data based on a job trigger or a predefined schedule. In this example, an AWS Lambda function is used to trigger the ETL process every time a new file is added to the Raw Data S3 bucket. The tables can be used by Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR to query the data at any stage using standard SQL or Apache Hive. This configuration is a popular design pattern that delivers Agile Business Intelligence to derive business value from a variety of data quickly and easily.
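For example, the triggering Lambda function can be quite small. The following is a minimal sketch, assuming a Glue job (here hypothetically named raw-csv-to-parquet) and an S3 ObjectCreated event notification configured on the Raw Data bucket:
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Invoked by an S3 ObjectCreated event on the Raw Data bucket.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Start the (hypothetical) Glue ETL job that transforms raw CSV into Parquet.
        run = glue.start_job_run(
            JobName="raw-csv-to-parquet",  # assumed job name
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        print(f"Started job run {run['JobRunId']} for s3://{bucket}/{key}")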
Walkthrough
In this walkthrough, you define a database, configure a crawler to explore data in an Amazon S3 bucket, create a table, transform the CSV file into Parquet, create a table for the Parquet data, and query the data with Amazon Athena.
Discover the data
Sign in to the AWS Management Console and open the AWS Glue console. You can find AWS Glue in the Analytics section. AWS Glue is currently available in US East (N. Virginia), US East (Ohio), and US West (Oregon). Additional AWS Regions are added frequently.
The first step to discovering the data is to add a database. A database is a collection of tables.
In the console, choose Add database. In Database name, type nycitytaxi, and choose Create.
Choose Tables in the navigation pane. A table consists of the names of columns, data type definitions, and other metadata about a dataset.
Add a table to the database nycitytaxi. You can add a table manually or by using a crawler. A crawler is a program that connects to a data store and progresses through a prioritized list of classifiers to determine the schema for your data. AWS Glue provides classifiers for common file types like CSV, JSON, Avro, and others. You can also write your own classifier using a grok pattern.
To add a crawler, enter the data source: an Amazon S3 bucket named s3://aws-bigdata-blog/artifacts/glue-data-lake/data/. This S3 bucket contains the data file consisting of all the rides for the green taxis for the month of January 2017.
Choose Next.
For IAM role, choose the default role AWSGlueServiceRoleDefault in the drop-down list.
For Frequency, choose Run on demand. The crawler can be run on demand or set to run on a schedule.
For Database, choose nycitytaxi. It is important to understand how AWS Glue deals with schema changes so that you can select the appropriate method. In this example, the table is updated with any change. For more information about schema changes, see Cataloging Tables with a Crawler in the AWS Glue Developer Guide.
Review the steps, and choose Finish. The crawler is ready to run. Choose Run it now. When the crawler has finished, one table has been added.
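If you prefer to script these steps, the same crawler can be created and started with boto3. This is a minimal sketch, assuming the IAM role and database from this walkthrough already exist:
import boto3

glue = boto3.client("glue")

# Create a crawler that scans the sample taxi data and writes table
# definitions into the nycitytaxi database.
glue.create_crawler(
    Name="nycitytaxi-crawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="nycitytaxi",
    Targets={"S3Targets": [{"Path": "s3://aws-bigdata-blog/artifacts/glue-data-lake/data/"}]},
)

# Run the crawler on demand.
glue.start_crawler(Name="nycitytaxi-crawler")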
Choose Tables in the left navigation pane, and then choose data. This screen describes the table, including schema, properties, and other valuable information.
Transform the data from CSV to Parquet format
Now you can configure and run a job to transform the data from CSV to Parquet. Parquet is a columnar format that is well suited for AWS analytics services like Amazon Athena and Amazon Redshift Spectrum.
Under ETL in the left navigation pane, choose Jobs, and then choose Add job.
For the Name, type nytaxi-csv-parquet.
For the IAM role, choose AWSGlueServiceRoleDefault.
For This job runs, choose A proposed script generated by AWS Glue.
Provide a unique Amazon S3 path to store the scripts.
Provide a unique Amazon S3 directory for a temporary directory.
Choose Next.
Choose data as the data source.
Choose Create tables in your data target.
Choose Parquet as the format.
Choose a new location (a new prefix location without any existing objects) to store the results.
Verify the schema mapping, and choose Finish.
View the job. This screen provides a complete view of the job and allows you to edit, save, and run the job. AWS Glue created this script. However, if required, you can create your own.
Choose Save, and then choose Run job.
Add the Parquet table and crawler
When the job has finished, add a new table for the Parquet data using a crawler.
For Crawler name, type nytaxiparquet.
Choose S3 as the Data store.
Include the Amazon S3 path that you chose as the output location in the ETL job.
For the IAM role, choose AWSGlueServiceRoleDefault.
For Database, choose nycitytaxi.
For Frequency, choose Run on demand.
After the crawler has finished, there are two tables in the nycitytaxi database: a table for the raw CSV data and a table for the transformed Parquet data.
Analyze the data with Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is capable of querying CSV data. However, the Parquet file format significantly reduces the time and cost of querying the data. For more information, see the blog post Analyzing Data in Amazon S3 using Amazon Athena.
To use AWS Glue with Amazon Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. For more information about upgrading your Athena data catalog, see this step-by-step guide.
Open the AWS Management Console for Athena. The Query Editor displays both tables in the nycitytaxi database.
You can query the data using standard SQL.
Choose the nytaxigreenparquet table.
Type Select * From "nycitytaxi"."data" limit 10;
Choose Run Query.
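You can also submit the same query programmatically. Here is a minimal boto3 sketch; the query results bucket is a placeholder:
import boto3

athena = boto3.client("athena")

# Submit the query; Athena runs it asynchronously and writes the result
# set to the specified S3 location.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "nycitytaxi"."data" LIMIT 10;',
    QueryExecutionContext={"Database": "nycitytaxi"},
    ResultConfiguration={"OutputLocation": "s3://your-query-results-bucket/"},  # placeholder
)
print(response["QueryExecutionId"])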
Conclusion
This post demonstrates how easy it is to build the foundation of a data lake using AWS Glue and Amazon S3. By using AWS Glue to crawl your data on Amazon S3 and build an Apache Hive-compatible metadata store, you can use the metadata across the AWS analytic services and popular Hadoop ecosystem tools. This combination of AWS services is powerful and easy to use, allowing you to get to business insights faster.
If you have questions or suggestions, please comment below.
Additional reading
See the following blog posts for more information:
Gordon Heinrich is a Solutions Architect working with global systems integrators. He works with our partners and customers to provide them architectural guidance for building data lakes and using AWS analytic services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.
Late last year I told you about our plans to add PostgreSQL compatibility to Amazon Aurora. We launched the private beta shortly after that announcement, and followed it up earlier this year with an open preview. We’ve received lots of great feedback during the beta and the preview and have done our best to make sure that the product meets your needs and exceeds your expectations!
Now Generally Available I am happy to report that Amazon Aurora with PostgreSQL Compatibility is now generally available and that you can use it today in four AWS Regions, with more to follow. It is compatible with PostgreSQL 9.6.3 and scales automatically to support up to 64 TB of storage, with 6-way replication behind the scenes to improve performance and availability.
Just like Amazon Aurora with MySQL compatibility, this edition is fully managed and is very easy to set up and to use. On the performance side, you can expect up to 3x the throughput that you’d get if you ran PostgreSQL on your own (you can read Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases to learn more about how we did this).
You can launch a PostgreSQL-compatible Amazon Aurora instance from the RDS Console by selecting Amazon Aurora as the engine and PostgreSQL-compatible as the edition, and clicking on Next:
Then choose your instance class, single or Multi-AZ deployment (good for dev/test and production, respectively), set the instance name, and the administrator credentials, and click on Next:
You can choose between six instance classes (2 to 64 vCPUs and 15.25 to 488 GiB of memory):
The db.r4 instance class is a new addition to Aurora and to RDS, and gives you an additional size at the top end. The db.r4.16xlarge will give you additional write performance, and may allow you to use a single Aurora database instead of two or more sharded databases.
You can also set many advanced options on the next page, starting with network options such as the VPC and public accessibility:
You can set the cluster name and other database options. Encryption is easy to use and enabled by default; you can use the built-in default master key or choose one of your own:
You can also set failover behavior, the retention period for snapshot backups, and choose to enable collection of detailed (OS-level) metrics via Enhanced Monitoring:
After you have set it up to your liking, click on Launch DB Instance to proceed!
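If you would rather script the launch, here is a minimal boto3 sketch that creates the cluster and its primary instance; the identifiers, instance class, and credentials are placeholders:
import boto3

rds = boto3.client("rds")

# Create the Aurora PostgreSQL cluster.
rds.create_db_cluster(
    DBClusterIdentifier="aurora-pg-demo",
    Engine="aurora-postgresql",
    MasterUsername="master",             # placeholder
    MasterUserPassword="ChangeMe12345",  # placeholder
)

# Add the primary (writer) instance to the cluster.
rds.create_db_instance(
    DBInstanceIdentifier="aurora-pg-demo-instance-1",
    DBInstanceClass="db.r4.large",
    Engine="aurora-postgresql",
    DBClusterIdentifier="aurora-pg-demo",
)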
The new instances (primary and secondary since I specified Multi-AZ) are up and running within minutes:
Each PostgreSQL-compatible instance publishes 44 metrics to CloudWatch automatically:
With enhanced monitoring enabled, each instance collects additional per-instance and per-process metrics. It can be enabled when the instance is launched, or afterward, via Modify Instance. Here are some of the metrics collected when enhanced monitoring is enabled:
Clicking on Manage Graphs lets you choose which metrics are shown:
Per-process metrics are also available:
You can scale your read capacity by creating up to 15 Aurora replicas:
The cluster provides a single reader endpoint that you can access in order to load-balance requests across the replicas:
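For illustration, an application can simply point its read-only connections at that endpoint. Here is a minimal sketch using psycopg2; the endpoint, database name, and credentials are placeholders:
import psycopg2

# Connect to the cluster's reader endpoint; Aurora spreads these
# connections across the available replicas.
conn = psycopg2.connect(
    host="aurora-pg-demo.cluster-ro-abcdefghij.us-east-1.rds.amazonaws.com",  # placeholder
    port=5432,
    dbname="postgres",
    user="master",
    password="ChangeMe12345",
)
with conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    print(cur.fetchone())
conn.close()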
Performance Insights As I noted earlier, Performance Insights is turned on automatically. This Amazon Aurora feature is wired directly into the database engine and allows you to look deep inside of each query, seeing the database resources that it uses and how they contribute to the overall response time. Here’s the initial view:
I can slice the view by SQL query in order to see how many concurrent copies of each query are running:
There are more views and options than I can fit in this post; to learn more take a look at Using Performance Insights.
Migrating to Amazon Aurora with PostgreSQL Compatibility AWS Database Migration Service and the Schema Conversion Tool are ready to help you to move data stored in commercial and open-source databases to Amazon Aurora. The Schema Conversion Tool will perform a quick assessment of your database schemas and your code in order to help you to choose between MySQL and PostgreSQL. Our new, limited-time, Free DMS program allows you to use DMS and SCT to migrate to Aurora at no cost, with access to several types of DMS Instances for up to 6 months.
Available Now You can use Amazon Aurora with PostgreSQL Compatibility today in the US East (Northern Virginia), EU (Ireland), US West (Oregon), and US East (Ohio) Regions, with others to follow as soon as possible.
A robust analytics platform is critical for an organization’s success. However, an analytical system is only as good as the data that is used to power it. Wrong or incomplete data can derail analytical projects completely. Moreover, with varied data types and sources being added to the analytical platform, it’s important that the platform is able to keep up. This means the system has to be flexible enough to adapt to changing data types and volumes.
For a platform based on relational databases, this would mean countless data model updates, code changes, and regression testing, all of which can take a very long time. For a decision support system, it’s imperative to provide the correct answers quickly. Architects must design analytical platforms with the following criteria:
Help ingest data easily from multiple sources
Analyze it for completeness and accuracy
Use it for metrics computations
Store the data assets and scale as they grow rapidly without causing disruptions
Adapt to changes as they happen
Have a relatively short development cycle that is repeatable and easy to implement.
Schema-on-read is a unique approach for storing and querying datasets. It reverses the order of things when compared to schema-on-write in that the data can be stored as is and you apply a schema at the time that you read it. This approach has changed the time to value for analytical projects. It means you can immediately load data and can query it to do exploratory activities.
In this post, I show how to build a schema-on-read analytical pipeline, similar to the one used with relational databases, using Amazon Athena. The approach is completely serverless, which allows the analytical platform to scale as more data is stored and processed via the pipeline.
The data integration challenge
Part of the reason why it's so difficult to build analytical platforms is the set of challenges associated with data integration. This is usually done as a series of Extract, Transform, Load (ETL) jobs that pull data from multiple varied sources and integrate them into a central repository. The multiple steps of ETL can be viewed as a pipeline where raw data is fed in one end and integrated data accumulates at the other.
Two major hurdles complicate the development of ETL pipelines:
The sheer volume of data needing to be integrated makes it very difficult to scale these jobs.
There is an immense variety of data formats from which information needs to be gathered.
Put this in context by looking at the healthcare industry. Building an analytical pipeline for healthcare usually involves working with multiple datasets that cover care quality, provider operations, patient clinical records, and patient claims data. Customers are now looking for even more data sources that may contain valuable information that can be analyzed. Examples include social media data, data from devices and sensors, and biometric and human-generated data such as notes from doctors. These new data sources do not rely on fixed schemas, so integrating them by conventional ETL jobs is much more difficult.
The volume of data available for analysis is growing as well. According to an article published by NCBI, the US healthcare system reached a collective data volume of 150 exabytes in 2011. With the current rate of growth, it is expected to quickly reach zettabyte scale, and soon grow into the yottabytes.
Why schema-on-read?
The traditional data warehouses of the early 2000s, like the one shown in the following diagram, were based on a standard four-layer approach of ingest, stage, store, and report. This involved building and maintaining huge data models that suited certain sources and analytical queries. This type of design is based on a schema-on-write approach, where you write data into a predefined schema and read data by querying it.
As we progressed to more varied datasets and analytical requirements, the predefined schemas were not able to keep up. Moreover, by the time the models were built and put into production, the requirements had changed.
Schema-on-read provides much needed flexibility to an analytical project. Not having to rely on physical schemas improves the overall performance when it comes to high volume data loads. Because there are no physical constraints for the data, it works well with datasets with undefined data structures. It gives customers the power to experiment and try out different options.
How can Athena help?
Athena is a serverless analytical query engine that allows you to start querying data stored in Amazon S3 instantly. It supports standard formats like CSV and Parquet and integrates with Amazon QuickSight, which allows you to build interactive business intelligence reports.
The Amazon Athena User Guide provides multiple best practices that should be considered when using Athena for analytical pipelines at production scale. There are various performance tuning techniques that you can apply to speed up query performance. For more information, see Top 10 Performance Tuning Tips for Amazon Athena.
Solution architecture
To demonstrate this solution, I use the healthcare dataset from the Centers for Disease Control (CDC) Behavioral Risk Factor Surveillance system (BRFSS). It gathers data via telephone surveys for health-related risk behaviors and conditions, and the use of preventive services. The dataset is available as zip files from the CDC FTP portal for general download and analysis. There is also a user guide with comprehensive details about the program and the process of collecting data.
The following diagram shows the schema-on-read pipeline that demonstrates this solution.
In this architecture, S3 is the central data repository for the CSV files, which are divided by behavioral risk factors like smoking, drinking, obesity, and high blood pressure.
Athena is used to collectively query the CSV files in S3. It uses the AWS Glue Data Catalog to maintain the schema details and applies it at the time of querying the data.
The dataset is further filtered and transformed into a subset that is specifically used for reporting with Amazon QuickSight.
As new data files become available, they are incrementally added to the S3 bucket and the subset query automatically appends them to the end of the table. The dashboards in Amazon QuickSight are refreshed with the new values of calculated metrics.
Walkthrough
To use Athena in an analytical pipeline, you have to consider how to design it for initial data ingestion and the subsequent incremental ingestions. Moreover, you also have to decide on the mechanism to trigger a particular stage in the pipeline, which can either be scheduled or event-based.
For the purposes of this post, you build a simple pipeline where the data is:
Ingested into a staging area in S3
Filtered and transformed for reporting into a different S3 location
Transformed into a reporting table in Athena
Included into a dashboard created using Amazon QuickSight
Incrementally ingested for updates
This approach is highly customizable and can be used for building more complex pipelines with many more steps.
Data staging in S3
The data ingestion into S3 is fairly straightforward. The files can be ingested via the S3 command line interface (CLI), API, or the AWS Management Console. The data files are in CSV format and already divided by the behavioral condition. To improve performance, I recommend that you use partitioned data with Athena, especially when dealing with large volumes. You can use pre-partitioned data in S3 or build partitions later in the process. The example you are working with has a total of 247 CSV files storing about 205 MB of data across them, but typical production scale deployment would be much larger.
To automate the pipeline, you can make it either event-based or schedule-based. If you take the event-based approach, you can use S3 event notifications to trigger an action when files are uploaded. The event invokes an AWS Lambda function that runs the next step in the pipeline. Traditional ETL jobs have to rely on mechanisms like database triggers to enable this, which can cause additional performance overhead.
If you choose to go with a schedule-based approach, you can use Lambda with scheduled events. The schedule is managed via a cron expression, and the Lambda function runs the next step of the pipeline. This is suitable for workloads similar to a scheduled batch ETL job.
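In either case, the Lambda function itself can be small. Here is a minimal sketch that submits the next stage's Athena query; the query text, database, and output bucket are placeholders:
import boto3

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Triggered by an S3 upload event or a scheduled CloudWatch Events rule.
    response = athena.start_query_execution(
        QueryString="SELECT count(*) FROM brfsdata;",   # placeholder for the real pipeline step
        QueryExecutionContext={"Database": "default"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://your-query-results-bucket/"},  # placeholder
    )
    return response["QueryExecutionId"]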
Filter and data transformation
To filter and transform the dataset, first look at the overall counts and structure of the data. This allows you to choose the columns that are important for a given report and use the filter clause to extract a subset of the data. Transforming the data as you progress through the pipeline ensures that you are only exposing relevant data to the reporting layer, to optimize performance.
To look at the entire dataset, create a table in Athena to go across the entire data volume. This can be done using the following query:
CREATE EXTERNAL TABLE IF NOT EXISTS brfsdata(
ID STRING,
HIW STRING,
SUSA_NAME STRING,
MATCH_NAME STRING,
CHSI_NAME STRING,
NSUM STRING,
MEAN STRING,
FLAG STRING,
IND STRING,
UP_CI STRING,
LOW_CI STRING,
SEMEAN STRING,
AGE_ADJ STRING,
DATASRC STRING,
FIPS STRING,
FIPNAME STRING,
HRR STRING,
DATA_YR STRING,
UNIT STRING,
AGEGRP STRING,
GENDER STRING,
RACE STRING,
EHN STRING,
EDU STRING,
FAMINC STRING,
DISAB STRING,
METRO STRING,
SEXUAL STRING,
FAMSTRC STRING,
MARITAL STRING,
POP_SPC STRING,
POP_POLICY STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION "s3://<YourBucket/YourPrefix>"
Replace YourBucket and YourPrefix with your corresponding values.
In this case, there are a total of ~1.4 million records, which you get by running a simple COUNT(*) query on the table.
From here, you can run multiple analysis queries on the dataset. For example, you can find the number of records that fall into a certain behavioral risk, or the state that has the highest number of diabetic patients recorded. These metrics provide data points that help to determine the attributes that would be needed from a reporting perspective.
As you can see, this is much simpler compared to a schema-on-write approach where the analysis of the source dataset is much more difficult. This solution allows you to design the reporting platform in accordance with the questions you are looking to answer from your data. The source data analysis is the first step to design a good analytical platform and this approach allows customers to do that much earlier in the project lifecycle.
Reporting table
After you have completed the source data analysis, the next step is to filter out the required data and transform it to create a reporting database. This is synonymous to the data mart in a standard analytical pipeline. Based on the analysis carried out in the previous step, you might notice some mismatches with the data headers. You might also identify the filter clauses to apply to the dataset to get to your reporting data.
Athena automatically saves query results in S3 for every run. The default bucket for this is created in the format aws-athena-query-results-<AccountID>-<Region>.
Athena creates a prefix for each saved query and stores the result set as CSV files organized by dates. You can use this feature to filter out result datasets and store them in an S3 bucket for reporting.
To enable this, create queries that can filter out and transform the subset of data on which to report. For this use case, create three separate queries to filter out unwanted data and fix the column headers:
Query 1:
SELECT ID, up_ci AS source, semean AS state, datasrc
AS year, fips AS unit, fipname AS age, mean, current_date AS dt, current_time AS tm FROM brfsdata
WHERE ID != '' AND hrr IS NULL AND semean NOT LIKE '%29193%'
Query 2:
SELECT ID, up_ci AS source, semean AS state, datasrc
AS year, fips AS unit, fipname AS age, mean, current_date AS dt, current_time AS tm
FROM brfsdata WHERE ID != '' AND hrr IS NOT NULL AND up_ci LIKE '%BRFSS%' AND semean NOT LIKE '"%' AND semean NOT LIKE '%29193%'
Query 3:
SELECT ID, low_ci AS source, age_adj AS state, fips
AS year, fipname AS unit, hrr AS age, mean, current_date AS dt, current_time AS tm
FROM brfsdata WHERE ID != '' AND hrr IS NOT NULL AND up_ci NOT LIKE '%BRFSS%' AND age_adj NOT LIKE '"%' AND semean NOT LIKE '%29193%' AND low_ci LIKE 'BRFSS'
You can save these queries in Athena so that you can get to the query results easily every time they are executed. The following screenshot is an example of the results when query 1 is executed four times.
The next step is to copy these results over to a new bucket for creating your reporting table. This can be done by running an S3 CP command from the CLI or API, as shown below:
Note the prefix structure in which Athena stores query results. It creates a separate prefix for each day in which the query is executed, and stores the corresponding CSV and metadata file for each run. Copy the result set over to a new prefix, “Reporting_Data”. Use the “exclude” and “include” options of S3 CP to copy only the CSV files, and use “recursive” to copy all the files from the run.
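The same copy can also be done with boto3. Here is a minimal sketch that keeps only the CSV files from a saved query's result prefix; the bucket names and prefixes are placeholders:
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "aws-athena-query-results-example"  # placeholder results bucket
SOURCE_PREFIX = "Query1/2017/09/01/"                # placeholder saved-query prefix
TARGET_BUCKET = "healthdataset"                     # placeholder reporting bucket
TARGET_PREFIX = "brfsdata/Reporting_Data/"

# Copy only the CSV result files (skipping the .metadata files),
# mirroring the exclude/include behavior of the CLI copy.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".csv"):
            s3.copy_object(
                Bucket=TARGET_BUCKET,
                Key=TARGET_PREFIX + obj["Key"].split("/")[-1],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )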
You can replace the value of the saved query name from “Query1” to “Query2” or “Query3” to copy all data resulting from those queries to the same target prefix. For pipelines that require more complicated transformations, divide the query transformation into multiple steps and execute them based on events or schedule them, as described in the earlier data staging step.
Amazon QuickSight dashboard
After the filtered results sets are copied as CSV files into the new Reporting_Data prefix, create a new table in Athena that is used specifically for BI reporting. This can be done using a create table statement similar to the one below:
CREATE EXTERNAL TABLE IF NOT EXISTS BRFSS_REPORTING(
ID varchar(100),
source varchar(100),
state varchar(100),
year int,
unit varchar(10),
age varchar(10),
mean float
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION "s3://healthdataset/brfsdata/Reporting_Data"
This table can now act as a source for a dashboard on Amazon QuickSight, which is a straightforward process to enable. When you choose a new data source, Athena shows up as an option and Amazon QuickSight automatically detects the tables in Athena that are exposed for querying. Here are the data sources supported by Athena at the time of this post:
After choosing Athena, give a name to the data source and choose the database. The tables available for querying automatically show up in the list.
If you choose “BRFSS_REPORTING”, you can create custom metrics using the columns in the reporting table, which can then be used in reports and dashboards.
To build a complete pipeline, think about ingesting data incrementally for new records as they become available. To enable this, make sure that the data can be incrementally ingested into the reporting schema and that the reporting metrics are refreshed on each run of the report. To demonstrate this, look at a scenario where a new dataset is ingested into S3 on a periodic basis and has to be included in the reporting schema when calculating the metrics.
Look at the number of records in the reporting table before the incremental ingestion.
SELECT count(*) FROM brfss_reporting;
The results are as follows:
_col0
713123
Use Query1 as an example transformation query that can isolate the incremental load. Here is a view of the query result bucket before Query1 runs.
After the incremental data is ingested in S3, trigger (or pre-schedule) an execution of Query1 in Athena, which produces the CSV result set and the metadata file shown below.
Next, trigger (or schedule) the copy command to copy the incremental records file into the reporting prefix. This is easily automated by using the predefined structure in which Athena saves the query results on S3. On checking the records count in the reporting table after copying, you get an increased count.
_col0
810042
This shows that 96,919 records were added to our reporting table that can be used in metric calculations.
This process can be implemented programmatically to add incremental records into the reporting table every time new records are ingested into the staging area. As a result, you can simulate an end-to-end analytical pipeline that runs based on events or is scheduled to run as a batch workload.
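One detail to handle when automating this is that Athena queries run asynchronously, so the copy step should wait for the query to finish. Here is a minimal polling sketch; the query execution ID is the value returned by start_query_execution:
import time
import boto3

athena = boto3.client("athena")

def wait_for_query(query_execution_id, poll_seconds=2):
    # Poll until the query leaves the QUEUED/RUNNING states, then return the final state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_execution_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)

# Example usage: run the copy step only after the query succeeds.
# if wait_for_query(query_id) == "SUCCEEDED":
#     copy_results_to_reporting_prefix()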
Conclusion
Using Athena, you can benefit from the advantages of a schema-on-read analytical application. You can combine other AWS services like Lambda and Amazon QuickSight to build an end-to-end analytical pipeline.
However, it’s important to note that a schema-on-read analytical pipeline may not be the answer for all use cases. Carefully consider the choices between schema-on-read and schema-on-write. The source systems you are working with play a critical role in making that decision. For example:
Some systems have well-defined data structures, and the chance of variation is very low. These systems can work with a fixed relational target schema.
If the target queries mostly involve joining across normalized tables, they work better with a relational database. For this use case, schema-on-write is a good choice.
The query performance in a schema-on-write application is faster because the data is pre-structured. For fixed dashboards with little to no changes, schema-on-write is a good choice.
You can also choose to go hybrid. Offload a part of the pipeline that deals with unstructured flat datasets into a schema-on-read architecture, and integrate the output into the main schema-on-write pipeline at a later stage. AWS provides multiple options to build analytical pipelines that suit various use cases. For more information, read about the AWS big data and analytics services.
If you have questions or suggestions, please comment below.
Ujjwal Ratan is a healthcare and life sciences Solutions Architect at AWS. He has worked with organizations ranging from large enterprises to smaller startups on problems related to distributed computing, analytics and machine learning. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.
Want to provide users with single sign-on access to AppStream 2.0 using existing enterprise credentials? Active Directory Federation Services (AD FS) 3.0 can be used to provide single sign-on for Amazon AppStream 2.0 using SAML 2.0.
You can use your existing Active Directory or any SAML 2.0–compliant identity service to set up single sign-on access of AppStream 2.0 applications for your users. Identity federation using SAML 2.0 is currently available in all AppStream 2.0 regions.
This post explains how to configure federated identities for AppStream 2.0 using AD FS 3.0.
Walkthrough
After setting up SAML 2.0 federation for AppStream 2.0, users can browse to a specially crafted (AD FS RelayState) URL and be taken directly to their AppStream 2.0 applications.
When users sign in with this URL, they are authenticated against Active Directory. After they are authenticated, the browser receives a SAML assertion as an authentication response from AD FS, which is then posted by the browser to the AWS sign-in SAML endpoint. Temporary security credentials are issued after the assertion and the embedded attributes are validated. The temporary credentials are then used to create the sign-in URL. The user is redirected to the AppStream 2.0 streaming session. The following diagram shows the process.
The user browses to https://applications.exampleco.com. The sign-on page requests authentication for the user.
The federation service requests authentication from the organization’s identity store.
The identity store authenticates the user and returns the authentication response to the federation service.
On successful authentication, the federation service posts the SAML assertion to the user’s browser.
The user’s browser posts the SAML assertion to the AWS Sign-In SAML endpoint (https://signin.aws.amazon.com/saml). AWS Sign-In receives the SAML request, processes the request, authenticates the user, and forwards the authentication token to the AppStream 2.0 service.
Using the authentication token from AWS, AppStream 2.0 authorizes the user and presents applications to the browser.
In this post, use domain.local as the name of the Active Directory domain. Here are the steps in this walkthrough:
Configure AppStream 2.0 identity federation.
Configure the relying trust.
Create claim rules.
Enable RelayState and forms authentication.
Create the AppStream 2.0 RelayState URL and access the stack.
Test the configuration.
Prerequisites
This walkthrough assumes that you have the following prerequisites:
An instance joined to a domain with the “Active Directory Federation Services” role installed and post-deployment configuration completed
Familiarity with AppStream 2.0 resources
Configure AppStream 2.0 identity federation
First, create an AppStream 2.0 stack, as you reference the stack in upcoming steps. Name the stack ExampleStack. For this walkthrough, it doesn’t matter which underlying fleet you associate with the stack. You can create a fleet using one of the example Amazon-AppStream2-Sample-Image images available, or associate an existing fleet to the stack.
Get the AD FS metadata file
The first thing you need is the metadata file from your AD FS server. The metadata file is a signed document that is used later in this guide to establish the relying party trust. Don’t edit or reformat this file.
To download and save this file, navigate to the federation metadata document on your AD FS server (by default, https://<FQDN_ADFS_SERVER>/FederationMetadata/2007-06/FederationMetadata.xml), replacing <FQDN_ADFS_SERVER> with your AD FS server's fully qualified domain name.
In the IAM console, choose Identity providers, Create provider.
On the Configure Provider page, for Provider Type, choose SAML. For Provider Name, type ADFS01 or similar name. Choose Choose File to upload the metadata document previously downloaded. Choose Next Step.
Verify the provider information and choose Create.
You need the Amazon Resource Name (ARN) of the identity provider (IdP) to configure claim rules later in this walkthrough. To get this, select the IdP that you just created. On the summary page, copy the value for Provider ARN. The ARN is in the following format:
arn:aws:iam::<AccountID>:saml-provider/ADFS01
Next, configure a policy with permissions to the AppStream 2.0 stack. This is the level of permissions that federated users have within AWS.
In the IAM console, choose Policies, Create Policy, Create Your Own Policy.
For Policy Name, enter a descriptive name. For Description, enter the level of permissions. For Policy Document, you customize the Region-Code, AccountID (without hyphens), and case-sensitive Stack-Name values.
For Region Codes, use one of the following values, based on the Region in which you are using AppStream 2.0 (the Regions where AppStream 2.0 is available):
us-east-1
us-west-2
eu-west-1
ap-northeast-1
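For illustration, the policy can also be created with boto3. The statement below grants only the streaming action on the example stack; verify the exact actions and condition keys against the current AppStream 2.0 documentation before using it:
import json
import boto3

iam = boto3.client("iam")

# Sketch of a policy scoped to a single AppStream 2.0 stack; Region-Code,
# AccountID, and Stack-Name are the values you customize.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "appstream:Stream",
            "Resource": "arn:aws:appstream:us-east-1:012345678910:stack/ExampleStack",
            "Condition": {"StringEquals": {"appstream:userId": "${saml:sub}"}},  # verify condition keys
        }
    ],
}

iam.create_policy(
    PolicyName="AppStream2_ExampleStack",
    PolicyDocument=json.dumps(policy_document),
)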
Choose Create Policy and you should see the following notification:
Create an IAM role
Here, you create a role that relates to an Active Directory group assigned to your AppStream 2.0 federated users. For this configuration, Active Directory groups and AWS roles are case-sensitive. Here you create an IAM Role named “ExampleStack” and an Active Directory group named in the format AWS-AccountNumber-RoleName, for example AWS-012345678910-ExampleStack.
In the IAM console, choose Roles, Create new role.
On the Select Role type page, choose Role for identity provider access. Choose Select next to Grant Web Single Sign-On (WebSSO) access to SAML providers.
On the Establish Trust page, make sure that the SAML provider that you just created (such as ADFS01) is selected. For Attribute and Value, keep the default values.
On the Verify Role Trust page, the Federated value matches the ARN noted previously for the principal IdP created earlier. The SAML:aud value equals https://signin.aws.amazon.com/saml, as shown below. This is prepopulated and does not require any change. Choose Next Step.
On the Attach policy page, attach the policy that you created earlier granting federated users access only to the AppStream 2.0 stack. In this walkthrough, the policy was named AppStream2_ExampleStack.
After selecting the correct policy, choose Next Step.
On the Set role name and review page, name the role ExampleStack. You can customize this naming convention, as I explain later when I create the claims rules.
You can describe the role as desired. Ensure that the trusted entities match the AD FS IdP ARN, and that the policy attached is the policy created earlier granting access only to this stack.
Choose Create Role.
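The same role can be created programmatically. Here is a minimal sketch whose trust policy mirrors what the console generates; the provider ARN, account ID, and policy ARN are the example values from this walkthrough:
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets federated users from the AD FS IdP assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Federated": "arn:aws:iam::012345678910:saml-provider/ADFS01"},
            "Action": "sts:AssumeRoleWithSAML",
            "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}},
        }
    ],
}

iam.create_role(
    RoleName="ExampleStack",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="ExampleStack",
    PolicyArn="arn:aws:iam::012345678910:policy/AppStream2_ExampleStack",
)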
Important: If you grant more than the stack permissions to federated users, you can give them access to other areas of the console as well. AWS strongly recommends that you attach policies to a role that grants access only to the resources to be shared with federated users.
For example, if you attach the AdministratorAccess policy instead of AppStream2_ExampleStack, any AppStream 2.0 federated user in the ExampleStack Active Directory group has AdministratorAccess in your AWS account. Even though AD FS routes users to the stack, users can still navigate to other areas of the console, using deep links that go directly to specific console locations.
Next, create the Active Directory group in the format AWS-AccountNumber-RoleName using the “ExampleStack” role name that you just created. You reference this Active Directory group in the AD FS claim rules later using regex. For Group scope, choose Global. For Group type, choose Security.
Note: To follow this walkthrough exactly, name your Active Directory group in the format “AWS-AccountNumber-ExampleStack” replacing AccountNumber with your AWS AccountID (without hyphens). For example:
AWS-012345678910-ExampleStack
Configure the relying party trust
In this section, you configure AD FS 3.0 to communicate with the configurations made in AWS.
Open the AD FS console on your AD FS 3.0 server.
Open the context (right-click) menu for AD FS and choose Add Relying Party Trust…
On the Welcome page, choose Start. On the Select Data Source page, keep Import data about the relying party published online or on a local network checked. For Federation metadata address (host name or URL), type https://signin.aws.amazon.com/static/saml-metadata.xml (the SAML metadata document that describes AWS as a relying party), and then choose Next.
On the Specify Display Name page, for Display name, type “AppStream 2.0 – ExampleStack” or similar value. For Notes, provide a description. Choose Next.
On the Configure Multi-factor Authentication Now? page, choose I do not want to configure multi-factor authentication settings for this relying party trust at this time. Choose Next.
Because you are controlling access to the stack using an Active Directory group, and IAM role with an attached policy, on the Choose Issuance Authorization Rules page, check Permit all users to access this relying party. Choose Next.
On the Ready to Add Trust page, there shouldn’t be any changes needed to be made. Choose Next.
On the Finish page, clear Open the edit Claim Rules dialog for this relying party trust when the wizard closes. You open this later.
Next, ensure that the https://signin.aws.amazon.com/saml URL is listed on the Identifiers tab within the properties of the trust. To do this, open the context (right-click) menu for the relying party trust that you just created and choose Properties.
On the Monitoring tab, clear Monitor relying party. Choose Apply. On the Identifiers tab, for Relying party identifier, add https://signin.aws.amazon.com/saml and choose OK.
Create claim rules
In this section, you create four AD FS claim rules, which identify accounts, set LDAP attributes, get the Active Directory groups, and match them to the role created earlier.
In the AD FS console, expand Trust Relationships, choose Relying Party Trusts, and then select the relying party trust that you just created (in this case, the display name is AppStream 2.0 – ExampleStack). Open the context (right-click) menu for the relying party trust and choose Edit Claim Rules. Choose Add Rule.
Rule 1: Name ID
This claim rule tells AD FS the type of expected incoming claim and how to send the claim to AWS. AD FS receives the UPN and tags it as the Name ID when it’s forwarded to AWS. This rule interacts with the third rule, which fetches the user groups.
Claim rule template: Transform an Incoming Claim
Configure Claim Rule values:
Claim Rule Name: Name ID
Incoming Claim Type: UPN
Outgoing Claim Type: Name ID
Outgoing name ID format: Persistent Identifier
Pass through all claim values: selected
Rule 2: RoleSessionName
This rule sets a unique identifier for the user. In this case, use the E-Mail-Addresses values.
Claim rule template: Send LDAP Attributes as Claims
Rule 4: Roles
This rule converts the values of the Active Directory groups that start with the AWS-AccountNumber prefix (retrieved into the http://temp/variable claim by the third rule, Get Active Directory Groups) into the roles known by AWS. For this rule, you need the AWS IdP ARN that you noted earlier. If your IdP in AWS was named ADFS01 and the AccountID was 012345678910, the ARN would look like the following:
arn:aws:iam::012345678910:saml-provider/ADFS01
Claim rule template: Send Claims Using a Custom Rule
Configure Claim Rule values:
Claim Rule Name: Roles
Custom Rule:
c:[Type == "http://temp/variable", Value =~ "(?i)^AWS-"]
=> issue(Type = "https://aws.amazon.com/SAML/Attributes/Role", Value = RegExReplace(c.Value, "AWS-012345678910-", "arn:aws:iam::012345678910:saml-provider/ADFS01,arn:aws:iam::012345678910:role/"));
Change arn:aws:iam::012345678910:saml-provider/ADFS01 to the ARN of your AWS IdP
Change 012345678910 to the ID (without hyphens) of the AWS account.
In this walkthrough, the rule matches Active Directory groups that start with the AWS- prefix, then removes AWS-012345678910-, leaving ExampleStack on the Active Directory group name to match the ExampleStack IAM role. To customize the role naming convention, for example to name the IAM role ADFS-ExampleStack, add “ADFS-” to the role ARN at the end of the rule: arn:aws:iam::012345678910:role/ADFS-.
You should now have four claim rules created:
NameID
RoleSessionName
Get Active Directory Groups
Role
Enable RelayState and forms authentication
By default, AD FS 3.0 doesn’t have RelayState enabled. AppStream 2.0 uses RelayState to direct users to your AppStream 2.0 stack.
On your AD FS server, open the Microsoft.IdentityServer.Servicehost.exe.config file with elevated (administrator) permissions.
In the file, find the section <microsoft.identityServer.web>. Within this section, add the line that enables RelayState:
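The setting shown below is the RelayState line documented by Microsoft for AD FS 3.0 (included here for reference):
<useRelayStateForIdpInitiatedSignOn enabled="true" />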
In the AD FS console, verify that forms authentication is enabled. Choose Authentication Policies. Under Primary Authentication, for Global Settings, choose Edit.
For Extranet, choose Forms Authentication. For Intranet, do the same and choose OK.
On the AD FS server, from an elevated (administrator) command prompt, run the following commands sequentially to stop, then start the AD FS service to register the changes:
net stop adfssrv
net start adfssrv
Create the AppStream 2.0 RelayState URL and access the stack
Now that RelayState is enabled, you can generate the URL.
I have created an Excel spreadsheet for RelayState URL generation, available as RelayGenerator.xlsx. This spreadsheet only requires the fully qualified domain name for your AD FS server, account ID (without hyphens), stack name (case-sensitive), and the AppStream 2.0 region. After all the inputs are entered, the spreadsheet generates a URL in the blue box, as shown in the screenshot below. Copy the entire contents of the blue box to retrieve the generated RelayState URL for AD FS.
Alternatively, if you do not have Excel, there are third-party tools for RelayState URL generation. However, they do require some customization to work with AppStream 2.0. Example customization steps for one such tool are provided below.
CodePlex has an AD FS RelayState generator, which downloads an HTML file locally that you can use to create the RelayState URL. The generator says it’s for AD FS 2.0; however, it also works for AD FS 3.0. You can generate the RelayState URL manually, but if the syntax or case sensitivity is even slightly incorrect, it won’t work. I recommend using the tool to ensure a valid URL.
When you open the URL generator, clear out the default text fields. You see a tool that looks like the following:
To generate the values, you need three pieces of information:
IDP URL String
Relying Party Identifier
Relay State / Target App
IDP URL String
The IDP URL string is the URL you use to hit your AD FS sign-on page. For example:
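(With AD FS, the IdP-initiated sign-on page typically lives at https://<your-adfs-fqdn>/adfs/ls/idpinitiatedsignon.aspx, for example https://adfs.domain.local/adfs/ls/idpinitiatedsignon.aspx.)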
Ultimately, the URL looks like the following example, which is for us-east-1, with a stack name of ExampleStack, and an account ID of 012345678910. The stack name is case-sensitive.
The generated RelayState URL can now be saved and used by users to log in directly from anywhere that can reach the AD FS server, using their existing domain credentials. After they are authenticated, users are directed seamlessly to the AppStream 2.0 stack.
Test the configuration
Create a new AD user in Domain.local named Test User, with a username TUser and an email address. An email address is required based on the claim rules.
Next, add TUser to the AD group you created for the AWS-012345678910-ExampleStack stack.
Next, navigate to the RelayState URL and log in with domain\TUser.
After you log in, you are directed to the streaming session for the ExampleStack stack. As an administrator, you can disassociate and associate different fleets of applications to this stack, without impacting federation, and deliver different applications to this group of federated users.
Because the policy attached to the role only allows access to this AppStream 2.0 stack, if a federated user were to try to access another section of the console, such as Amazon EC2, they would discover that they are not authorized to see (describe) any resources or perform any actions, as shown in the screenshot below. This is why it’s important to grant access only to the AppStream 2.0 stack.
Configurations for AD FS 4.0
If you are using AD FS 4.0, there are a few differences from the procedures discussed earlier.
Do not customize the Microsoft.IdentityServer.Servicehost.exe.config file as described in the Enable RelayState and forms authentication section of the AD FS 3.0 procedure.
Enable the IdP-initiated sign-on page that is used when generating the RelayState URL. To do this, open an elevated PowerShell terminal and run the following command:
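The cmdlet documented for this in AD FS 4.0 is along the following lines (shown here for reference):
Set-AdfsProperties -EnableIdpInitiatedSignonPage $true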
To register these changes with AD FS, restart the AD FS service from an elevated PowerShell terminal (or command prompt):
net stop adfssrv
net start adfssrv
After these changes are made, AD FS 4.0 should now work for AppStream 2.0 identity federation.
Troubleshooting
If you are still encountering errors with your setup, the following are common error messages you may see and the configuration areas that I recommend you check.
Invalid policy
Unable to authorize the session. (Error Code: INVALID_AUTH_POLICY);Status Code:401
This error message can occur when the IAM policy does not permit access to the AppStream 2.0 stack. It can also occur when the stack name in the policy or RelayState URL does not match the case of the actual stack name. For example, if your stack name is “ExampleStack” in AppStream 2.0 and the policy or RelayState URL has “examplestack” or any capitalization pattern other than the exact stack name, you see this error message.
Invalid relay state
Error: Bad Request.(Error Code: INVALID_RELAY_STATE);Status Code:400
If you receive this error message, there is likely another problem in the RelayState URL. It could be related to case sensitivity elsewhere in the URL (other than the stack name), for example in https://relay-state-region-endpoint?stack=stackname&accountId=aws-account-id-without-hyphens.
Unable to authorize the session. Cross account access is not allowed. (Error Code: CROSS_ACCOUNT_ACCESS_NOT_ALLOWED);Status Code:401
If you see this error message, check to make sure that the AccountId number is correct in the Relay State URL.
Summary
In this post, you walked through enabling AD FS for AppStream 2.0 identity federation. You should now be able to configure either AD FS 3.0 or AD FS 4.0 for AppStream 2.0 identity federation. If you have questions or suggestions, please comment below.
Have you ever been faced with many different data sources in different formats that need to be analyzed together to drive value and insights? You need to be able to query, analyze, process, and visualize all your data as one canonical dataset, regardless of the data source or original format.
In this post, I walk through using AWS Glue to create a query optimized, canonical dataset on Amazon S3 from three different datasets and formats. Then, I use Amazon Athena and Amazon QuickSight to query against that data quickly and easily.
AWS Glue overview
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for query and analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. Point AWS Glue to your data stored on AWS, and a crawler discovers your data, classifies it, and stores the associated metadata (such as table definitions) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the ETL code for data transformation, and loads the transformed data into a target data store for analytics.
The AWS Glue ETL engine generates Python code that is entirely customizable, reusable, and portable. You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After your ETL job is ready, you can schedule it to run on the AWS Glue fully managed, scale-out Spark environment, using its flexible scheduler with dependency resolution, job monitoring, and alerting.
AWS Glue is serverless. It automatically provisions the environment needed to complete the job, and customers pay only for the compute resources consumed while running ETL jobs. With AWS Glue, data can be available for analytics in minutes.
Walkthrough
During this post, I step through an example using the New York City Taxi Records dataset. I focus on one month of data, but you could easily do this for the entire eight years of data. At the time of this post, AWS Glue is available in US-East-1 (N. Virginia).
As you crawl the unknown dataset, you discover that the data is in different formats, depending on the type of taxi. You then convert the data to a canonical form, start to analyze it, and build a set of visualizations… all without launching a single server.
Discover the data
To analyze all the taxi rides for January 2016, you start with a set of data in S3. First, create a database for this project within AWS Glue. A database is a set of associated table definitions, organized into a logical group. In Athena, database names are all lowercase, no matter what you type.
Next, add a new crawler and have it infer the schemas, structure, and other properties of the data. (A boto3 equivalent of the following console steps appears after the last step.)
Under Crawlers, choose Add crawler.
For Name, type “NYCityTaxiCrawler”.
Choose an IAM role that AWS Glue can use for the crawler. For more information, see the AWS Glue Developer Guide.
For Data store, choose S3.
For Crawl data in, choose Specified path in another account.
For Include path, type “s3://serverless-analytics/glue-blog”.
For Add another data store, choose No.
For Frequency, choose On demand. This allows you to customize when the crawler runs.
Configure the crawler output database and prefix:
For Database, select the database created earlier, “nycitytaxianalysis”.
For Prefix added to tables (optional), type “blog_”.
Choose Finish and Run it now.
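For reference, here is a rough boto3 equivalent of the console steps above (the IAM role name is a placeholder; the other values match this walkthrough):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a crawler that scans the public dataset and writes tables
# prefixed with "blog_" into the nycitytaxianalysis database.
glue.create_crawler(
    Name="NYCityTaxiCrawler",
    Role="<YOUR_GLUE_CRAWLER_ROLE>",
    DatabaseName="nycitytaxianalysis",
    TablePrefix="blog_",
    Targets={"S3Targets": [{"Path": "s3://serverless-analytics/glue-blog"}]},
)

# Run it on demand, just like choosing "Run it now" in the console.
glue.start_crawler(Name="NYCityTaxiCrawler")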
The crawler runs and indicates that it found three tables.
If you look under Tables, you can see the three new tables that were created under the database “nycitytaxianalysis”.
The crawler used the built-in classifiers and identified the tables as CSV, inferred the columns/data types, and collected a set of properties for each table. If you look in each of those table definitions, you see the number of rows for each dataset found and that the columns don’t match between tables. As an example, bringing up the blog_yellow table, you can see the yellow dataset for January 2016 with 8.7 million rows, the location on S3, and the various columns found.
Optimize queries and get data into a canonical form
Create an ETL job to move this data into a query-optimized form. In a future post, I’ll cover how to partition the query-optimized data. You convert the data into a columnar format, change the storage type to Parquet, and write the data to a bucket that you own.
Create a new ETL job and name it NYCityTaxiYellow.
Select an IAM role that has permissions to write to a new S3 location in a bucket that you own.
Specify a location for the script and tmp space on S3.
Choose the blog_yellow table as a data source.
Specify a new location (a new prefix location without any existing objects) to store the results.
In the transform, rename the pickup and dropoff dates to be generic fields. The raw data actually lists these fields differently and you should use a common name. To do this, choose the target column name and retype the new field name. Also, change the data types to be time stamps rather than strings. The following table outlines the updates.
Old Name              | Target Name  | Target Data Type
tpep_pickup_datetime  | pickup_date  | timestamp
tpep_dropoff_datetime | dropoff_date | timestamp
Choose Next and Finish.
In this case, AWS Glue creates a script. You could also start with a blank script or provide your own.
Because AWS Glue uses Apache Spark behind the scenes, you can easily switch from an AWS Glue DynamicFrame to a Spark DataFrame and do advanced operations within Apache Spark. Just as easily, you can switch back and continue to use the transforms and tables from the catalog.
To show this, convert to a DataFrame and add a new field column indicating the type of taxi. Make the following custom modifications in PySpark.
Add the following header:
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
Find the last call before the lines that start with datasink. This is the dynamic frame that is being used to write out the data.
Now convert that dynamic frame to a DataFrame. Replace <DYNAMIC_FRAME_NAME> with the name generated in the script.
##----------------------------------
#convert to a Spark DataFrame...
customDF = <DYNAMIC_FRAME_NAME>.toDF()
#add a new column for "type"
customDF = customDF.withColumn("type", lit('yellow'))
# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
##----------------------------------
In the last datasink call, change the dynamic frame to point to the new custom dynamic frame created from the Spark DataFrame:
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<YOUR_BUCKET_AND_PREFIX>/"}, format = "parquet", transformation_ctx = "datasink4")
Save and run the ETL.
To use AWS Glue with Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Athena. As of August 14, 2017, the AWS Glue Data Catalog is only available in US-East-1 (N. Virginia). Start the upgrade in the Athena console. For more information, see Integration with AWS Glue in the Amazon Athena User Guide.
Next, we’ll go into Athena and create a table there for the new data format. This shows how you can interact with the Data Catalog through either AWS Glue or Athena.
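A minimal sketch of such a DDL statement, assuming the canonical Parquet location and only a subset of the columns (the table name, bucket, and exact column list here are placeholders, not the original statement):

CREATE EXTERNAL TABLE nycitytaxianalysis.<OUTPUT_TABLE_NAME> (
  pickup_date timestamp,
  dropoff_date timestamp,
  trip_distance double,
  total_amount double,
  type string
)
STORED AS PARQUET
LOCATION 's3://<YOUR_BUCKET_AND_PREFIX>/';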
Within the AWS Glue Data Catalog, select the new table that you created through Athena and, under Actions, choose View data.
If you do a query on the count per type of taxi, you notice that there is only “yellow” taxi data in the canonical location. Convert “green” and “fhv” data:
SELECT type, count(*) FROM <OUTPUT_TABLE_NAME> group by type
In the AWS Glue console, create a new ETL job named “NYCityTaxiGreen”. Choose blog_green as the source and the new canonical location as the target. Choose the new pickup_date and dropoff_date fields for the target field types for the date variables, as shown below.
In the script, insert the following two lines at the end of the top imports:
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
Find the last call before the lines that start with datasink. This is the dynamic frame that is being used to write out the data.
Now convert that dynamic frame to a DataFrame. Replace <DYNAMIC_FRAME_NAME> with the name generated in the script.
Type these lines before the last sink call:
##----------------------------------
#convert to a Spark DataFrame...
customDF = <DYNAMIC_FRAME_NAME>.toDF()
#add a new column for "type"
customDF = customDF.withColumn("type", lit('green'))
# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
In the last datasink call, change the dynamic frame to point to the new custom dynamic frame created from the Spark DataFrame:
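As a sketch, assuming the generated sink in this script is again named datasink4 and writes to the same canonical location as before:
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<YOUR_BUCKET_AND_PREFIX>/"}, format = "parquet", transformation_ctx = "datasink4")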
Notice that it matches the last line in the commands that you pasted earlier. After the ETL job finishes, you can query the count per type through Athena and see the sizes:
SELECT type, count(*) FROM <OUTPUT_TABLE_NAME> group by type
What you have done is taken two datasets that contained taxi data in different formats and converted them into a single, query-optimized format. That format can easily be consumed by various analytical tools, including Athena, EMR, and Redshift Spectrum. You’ve done all of this without launching a single server.
You have one more dataset to convert, so create a new ETL job for that. Name the job “NYCityTaxiFHV”. This dataset is limited as it only has three fields, one of which you are converting. Choose “–” (dash) for the dispatcher and location, and map the pickup date to the canonical field.
In the script, insert the following two lines at the end of the top imports:
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
Type the following lines before the last sink call:
##----------------------------------
#convert to a Spark DataFrame...
customDF = resolvechoice3.toDF()
#add a new column for "type"
customDF = customDF.withColumn("type", lit('fhv'))
# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
In the last datasink call, change the dynamic frame to point to the new custom dynamic frame created from the Spark DataFrame:
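As before, this is a sketch that assumes the generated sink is named datasink4 and writes to the same canonical location:
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<YOUR_BUCKET_AND_PREFIX>/"}, format = "parquet", transformation_ctx = "datasink4")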
After the third ETL job completes, you can query the single dataset, in a query-optimized form on S3:
SELECT type, count(*) FROM <OUTPUT_TABLE_NAME> group by type
Query across the data with Athena
Here’s a query:
SELECT min(trip_distance) as minDistance,
max(trip_distance) as maxDistance,
min(total_amount) as minTotal,
max(total_amount) as maxTotal
FROM <OUTPUT_TABLE_NAME>
Within a second or two, you can quickly see the ranges for these values.
You can see some outliers where the total amount or trip distance is equal to 0. In the next query, exclude those records when calculating values like the average cost per mile:
SELECT type,
avg(trip_distance) AS avgDist,
avg(total_amount/trip_distance) AS avgCostPerMile
FROM <OUTPUT_TABLE_NAME>
WHERE trip_distance > 0
AND total_amount > 0
GROUP BY type
Interesting note: you aren’t seeing FHV broken out yet, because that dataset didn’t have a mapping to these fields. Here’s a query that’s a little more complicated, to calculate a 99th percentile:
SELECT TYPE,
avg(trip_distance) avgDistance,
avg(total_amount/trip_distance) avgCostPerMile,
avg(total_amount) avgCost,
approx_percentile(total_amount, .99) percentile99
FROM <OUTPUT_TABLE_NAME>
WHERE trip_distance > 0 and total_amount > 0
GROUP BY TYPE
Visualizing the data in QuickSight
Amazon QuickSight is a data visualization service you can use to analyze the data that you just combined. For more detailed instructions, see the Amazon QuickSight User Guide.
Create a new Athena data source, if you haven’t created one yet.
Create a new dataset for NY taxi data, and point it to the canonical taxi data.
Choose Edit/Preview data.
Add a new calculated field named HourOfDay (an example expression follows these steps).
Choose Save and Visualize.
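For the HourOfDay calculated field, assuming pickup_date is the canonical timestamp field, an expression along the lines of extract('HH', {pickup_date}) returns the hour of the day. (The exact expression is an assumption on my part, not taken from the original steps.)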
You can now visualize this dataset, querying through Athena the canonical dataset on S3 that AWS Glue converted. Selecting HourOfDay shows how many taxi rides there are based on the hour of the day. This is a calculated field derived from the time stamp in the raw dataset.
You can also deselect that and choose type to see the total counts again by type.
Now choose the pickup_date field in addition to the type field and notice that the plot type automatically changed. Because time is being used, it’s showing a time chart defaulting to the year. Choose a day interval instead.
Expand the x axis to zoom out. You can clearly see a large drop in all rides on January 23, 2016.
Summary
In this post, you went from data investigation to analyzing and visualizing a canonical dataset, without starting a single server. You started by crawling a dataset you didn’t know anything about, and the crawler told you the structure, columns, and counts of records.
From there, you saw that all three datasets were in different formats, but represented the same thing: NY City Taxi rides. You then converted them into a canonical (or normalized) form that is easily queried through Athena and QuickSight, in addition to a wide number of different tools not covered in this post. If you have questions or suggestions, please comment below.
Ben Snively is a Public Sector Specialist Solutions Architect. He works with government, non-profit and education customers on big data and analytical projects, helping them build solutions using AWS. In his spare time he adds IoT sensors throughout his house and runs analytics on it.
Whew – what a week! Tara, Randall, Ana, and I have been working around the clock to create blog posts for the announcements that we made at the AWS Summit in New York. Here’s a summary to help you to get started:
Amazon Macie – This new service helps you to discover, classify, and secure content at scale. Powered by machine learning and making use of Natural Language Processing (NLP), Macie looks for patterns and alerts you to suspicious behavior, and can help you with governance, compliance, and auditing. You can read Tara’s post to see how to put Macie to work; you select the buckets of interest, customize the classification settings, and review the results in the Macie Dashboard.
AWS Glue – Randall’s post (with deluxe animated GIFs) introduces you to this new extract, transform, and load (ETL) service. Glue is serverless and fully managed. As you can see from the post, Glue crawls your data, infers schemas, and generates ETL scripts in Python. You define jobs that move data from place to place, with a wide selection of transforms, each expressed as code and stored in human-readable form. Glue uses Development Endpoints and notebooks to provide you with a testing environment for the scripts you build. We also announced that Amazon Athena now integrates with AWS Glue, as do Apache Spark and Hive on Amazon EMR.
AWS Migration Hub – This new service will help you to migrate your application portfolio to AWS. My post outlines the major steps and shows you how the Migration Hub accelerates, tracks, and simplifies your migration effort. You can begin with a discovery step, or you can jump right in and migrate directly. Migration Hub integrates with tools from our migration partners and builds upon the Server Migration Service and the Database Migration Service.
CloudHSM Update – We made a major upgrade to AWS CloudHSM, making the benefits of hardware-based key management available to a wider audience. The service is offered on a pay-as-you-go basis, and is fully managed. It is open and standards compliant, with support for multiple APIs, programming languages, and cryptography extensions. CloudHSM is an integral part of AWS and can be accessed from the AWS Management Console, AWS Command Line Interface (CLI), and through API calls. Read my post to learn more and to see how to set up a CloudHSM cluster.
Managed Rules to Secure S3 Buckets – We added two new rules to AWS Config that will help you to secure your S3 buckets. The s3-bucket-public-write-prohibited rule identifies buckets that have public write access and the s3-bucket-public-read-prohibited rule identifies buckets that have global read access. As I noted in my post, you can run these rules in response to configuration changes or on a schedule. The rules make use of some leading-edge constraint solving techniques, as part of a larger effort to use automated formal reasoning about AWS.
CloudTrail for All Customers – Tara’s post revealed that AWS CloudTrail is now available and enabled by default for all AWS customers. As a bonus, Tara reviewed the principal benefits of CloudTrail and showed you how to review your event history and to deep-dive on a single event. She also showed you how to create a second trail, for use with CloudWatch Events.
Encryption of Data at Rest for EFS – When you create a new file system, you now have the option to select a key that will be used to encrypt the contents of the files on the file system. The encryption is done using an industry-standard AES-256 algorithm. My post shows you how to select a key and to verify that it is being used.
Watch the Keynote
My colleagues Adrian Cockcroft and Matt Wood talked about these services and others on stage, and also invited some AWS customers to share their stories. Here’s the video:
This is a guest post by Jeremy Winters and Ritu Mishra, Solution Architects at Full 360. In their own words, “Full 360 is a cloud first, cloud native integrator, and true believers in the cloud since inception in 2007, our focus has been on helping customers with their journey into the cloud. Our practice areas – Big Data and Warehousing, Application Modernization, and Cloud Ops/Strategy – represent deep, but focused expertise.”
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. As a company who has been building data warehouse solutions in the cloud for 10 years, we at Full 360 were interested to see how we can leverage AWS Glue in customer solutions. This post details our experience and lessons learned from using AWS Glue for an Amazon Redshift data integration use case.
UI-based ETL Tools
We have been anticipating the release of AWS Glue since it was announced at re:Invent 2016. Many of our customers are looking for easy-to-use, UI-based tooling to manage their data transformation pipeline. However, in our experience, the complexity of any production pipeline tends to be difficult to unwind, regardless of the technology used to create it. At Full 360, we build cloud-native, script-based applications deployed in containers to handle data integration. We think script-based transformation provides the balance of robustness and flexibility necessary to handle any data problem that comes our way.
AWS Glue caters both to developers who prefer writing scripts and those who want UI-based interactions. It is possible to initially create jobs using the UI, by selecting data source and target. Under the hood, AWS Glue auto-generates the Python code for you, which can be edited if needed, though this isn’t necessary for the majority of use cases.
Of course, you don’t have to rely on the UI at all. You can simply write your own Python scripts, store them in Amazon S3 with any dependent libraries, and import them into AWS Glue. AWS Glue also supports IDE integration with tools such as PyCharm, and interestingly enough, Zeppelin notebooks! These integrations are targeted toward developers who prefer writing Python themselves and want a cleaner dev/test cycle.
Developers who are already in the business of scripting ETL will be excited by the ability to easily deploy Python scripts with AWS Glue, using existing workflows for source control and CI/CD, and have them deployed and executed in a fully managed manner by AWS. The UI experience of AWS Glue works well, but it is good to know that the tool accommodates those who prefer traditional coding. You can also dig into complex edge cases for data handling where the UI doesn’t cut it.
Serverless!
When AWS Lambda was released, we were excited about its potential to host our ETL processes. With Lambda, you are limited to a five-minute maximum for function execution, so we resorted to running Docker containers on Amazon ECS to orchestrate many of our customers’ long-running tasks. With this approach, we are still required to manage the underlying infrastructure.
After a closer look at AWS Glue, we realize that it is a full serverless PySpark runtime, accompanied by an Apache Hive metastore compatible catalog-as-a-service. This means that you are not just running a script on a single core, but instead you have the full power of a multi-worker Spark environment available. If you’re not familiar with the Spark framework, it introduces a new paradigm that allows for the processing of distributed, in-memory datasets. Spark has many uses, from data transformation to machine learning.
In AWS Glue, you use PySpark dataframes to transform data before reaching your database. Dataframes manage data in a way similar to relational databases, so the methods are likely to be familiar to most SQL users. Additionally, you can use SQL in your PySpark code to manipulate the data.
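As an illustration, here is a generic PySpark sketch (not tied to any particular AWS Glue job; the column names are hypothetical) showing a DataFrame transformation followed by a SQL query over the same data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("dataframe-example").getOrCreate()

# A tiny, made-up dataset standing in for rows read from S3.
trips = spark.createDataFrame(
    [(1.2, 6.5), (3.4, 14.0)],
    ["trip_distance", "total_amount"],
)

# DataFrame-style transformation: add a constant column.
trips = trips.withColumn("type", lit("yellow"))

# SQL-style manipulation of the same data.
trips.createOrReplaceTempView("trips")
spark.sql(
    "SELECT type, avg(total_amount / trip_distance) AS avg_cost_per_mile "
    "FROM trips GROUP BY type"
).show()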
AWS Glue also simplifies the management of the runtime environment by providing you with a DPU setting, which allows you to dial up or down the amount of compute resources used to run your job. One DPU is equivalent to 4 vCPU, 16 GB Mem.
Common use case
We can see that most customers would leverage AWS Glue to load one or many files from S3 into Amazon Redshift. To accelerate this process, you can use the crawler, an AWS console-based utility, to discover the schema of your data and store it in the AWS Glue Data Catalog, whether your data sits in a file or a database. We were able to discover the schemas of our source file and target table, then have AWS Glue construct and execute the ETL job. It worked! We successfully loaded the beer drinkers’ dataset, JSON files used in our advanced tuning class, into Amazon Redshift!
Our use case
For our use case, we integrated AWS Glue and Amazon Redshift Spectrum with an open-source project that we initiated called SneaQL. SneaQL is an open source, containerized framework for sneaking interactivity into static SQL. SneaQL provides an extension to ANSI SQL with command tags to provide functionality such as loops, variables, conditional execution, and error handling to your scripts. It allows you to manage your scripts in an AWS CodeCommit Git repo, which then gets deployed and executed in a container.
We use SneaQL for complex data integrations, usually with an ELT model, where the data is loaded into the database, then transformed into fact tables using parameterized SQL. SneaQL enables advanced use cases like partial upsert aggregation of data, where multiple data sources can merge into the same fact table.
We think AWS Glue, Redshift Spectrum, and SneaQL offer a compelling way to build a data lake in S3, with all of your metadata accessible through a variety of tools such as Hive, Presto, Spark, and Redshift Spectrum. Build your aggregation table in Amazon Redshift to drive your dashboards or other high-performance analytics.
In the video below, you see a demonstration of using AWS Glue to convert JSON documents into Parquet for partitioned storage in an S3 data lake. We then access the data from S3 into Amazon Redshift by way of Redshift Spectrum. Nearing the end of the AWS Glue job, we then call AWS boto3 to trigger an Amazon ECS SneaQL task to perform an upsert of the data into our fact table. All the sample artifacts needed for this demonstration are available in the Full360/Sneaql Github repository.
In this scenario, AWS Glue manipulates data outside of a data warehouse and loads it to Amazon Redshift, and SneaQL manipulates data inside Amazon Redshift.
We think that they are a good complement to each other when doing complex integrations.
Our pipeline works as follows:
New files arrive in S3 in JSON format.
AWS Glue converts the JSON files to Parquet format, stored in another S3 bucket.
At the end of the AWS Glue script, the AWS SDK for Python (Boto) is used to trigger the Amazon ECS task that runs SneaQL (see the sketch after this list).
The SneaQL container pulls the secrets file from S3 and then decrypts it with biscuit or AWS KMS.
SneaQL clones the appropriate branch from an AWS CodeCommit Git repo.
Data in Amazon Redshift is transformed by SneaQL statements stored in the repo.
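As a sketch of step 3, the boto3 call that kicks off the ECS task might look like the following (the cluster and task definition names here are hypothetical):

import boto3

ecs = boto3.client("ecs")

# Trigger the SneaQL container as an ECS task; names are placeholders.
response = ecs.run_task(
    cluster="sneaql-cluster",
    taskDefinition="sneaql-upsert",
    count=1,
)
print(response.get("tasks", []))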
AWS Glue also supports conditional triggers that would allow us to split the jobs and make one dependent upon the other.
Final thoughts
Although, as a general rule, we believe in managing schema definitions in source control, we definitely think that the crawler feature has some attractive perks. Besides being handy for ad-hoc jobs, it is also partition-aware. This means that you can perform data movements without worrying about which data has been processed already, which is one less thing for you to manage. We’re interested to see the design patterns that emerge around the use of the crawler.
The freedom of PySpark also allows us to implement advanced use cases. While the boto3 library is available to all AWS Glue jobs, it is also possible to include any libraries and additional JAR files along with your Python script.
We think many customers will find AWS Glue valuable. Those who prefer to code their data processing pipeline will be quick to realize how powerful it is, especially for solving complex data cleansing problems. The learning curve for PySpark is real, so we suggest looking into dataframes as a good entry point. In the long run, the fact that AWS Glue is serverless makes the effort more than worth it.
If you have questions or suggestions, please comment below.
Today we’re excited to announce the general availability of AWS Glue. Glue is a fully managed, serverless, and cloud-optimized extract, transform and load (ETL) service. Glue is different from other ETL services and platforms in a few very important ways.
First, Glue is “serverless” – you don’t need to provision or manage any resources and you only pay for resources when Glue is actively running. Second, Glue provides crawlers that can automatically detect and infer schemas from many data sources, data types, and across various types of partitions. It stores these generated schemas in a centralized Data Catalog for editing, versioning, querying, and analysis. Third, Glue can automatically generate ETL scripts (in Python!) to translate your data from your source formats to your target formats. Finally, Glue allows you to create development endpoints that allow your developers to use their favorite toolchains to construct their ETL scripts. Ok, let’s dive deep with an example.
In my job as a Developer Evangelist I spend a lot of time traveling, and I thought it would be cool to play with some flight data. The Bureau of Transportation Statistics is kind enough to share all of this data for anyone to use here. We can easily download this data and put it in an Amazon Simple Storage Service (S3) bucket. This data will be the basis of our work today.
Crawlers
First, we need to create a Crawler for our flights data from S3. We’ll select Crawlers in the Glue console and follow the on-screen prompts from there. I’ll specify s3://crawler-public-us-east-1/flight/2016/csv/ as my first data source (we can add more later if needed). Next, we’ll create a database called flights and give our tables a prefix of flights as well.
The Crawler will go over our dataset, detect partitions through the various folders (in this case, months of the year), detect the schema, and build a table. We could add additional data sources and jobs to our crawler, or create separate crawlers that push data into the same database, but for now let’s look at the autogenerated schema.
I’m going to make a quick schema change to year, moving it from BIGINT to INT. Then I can compare the two versions of the schema if needed.
Now that we know how to correctly parse this data let’s go ahead and do some transforms.
ETL Jobs
Now we’ll navigate to the Jobs subconsole and click Add Job. We’ll follow the prompts from there, giving our job a name, selecting a data source, and choosing an S3 location for temporary files. Next, we add our target by specifying Create tables in your data target, and we specify an S3 location in Parquet format as our target.
After clicking Next, we’re at a screen showing the various mappings proposed by Glue. Now we can make manual column adjustments as needed – in this case, we’re just going to use the X button to remove a few columns that we don’t need.
This brings us to my favorite part. This is what I absolutely love about Glue.
Glue generated a PySpark script to transform our data based on the information we’ve given it so far. On the left hand side we can see a diagram documenting the flow of the ETL job. On the top right we see a series of buttons that we can use to add annotated data sources and targets, transforms, spigots, and other features. This is the interface I get if I click on transform.
If we add any of these transforms or additional data sources, Glue will update the diagram on the left giving us a useful visualization of the flow of our data. We can also just write our own code into the console and have it run. We can add triggers to this job that fire on completion of another job, a schedule, or on demand. That way if we add more flight data we can reload this same data back into S3 in the format we need.
I could spend all day writing about the power and versatility of the jobs console but Glue still has more features I want to cover. So, while I might love the script editing console, I know many people prefer their own development environments, tools, and IDEs. Let’s figure out how we can use those with Glue.
Development Endpoints and Notebooks
A Development Endpoint is an environment used to develop and test our Glue scripts. If we navigate to “Dev endpoints” in the Glue console, we can click “Add endpoint” in the top right to get started. Next, we’ll select a VPC and a security group that references itself, and then we wait for it to provision.
Once it’s provisioned we can create an Apache Zeppelin notebook server by going to actions and clicking create notebook server. We give our instance an IAM role and make sure it has permissions to talk to our data sources. Then we can either SSH into the server or connect to the notebook to interactively develop our script.
Pricing and Documentation
You can see detailed pricing information here. Glue crawlers, ETL jobs, and development endpoints are all billed in Data Processing Unit Hours (DPU-Hours), metered by the minute. Each DPU-Hour costs $0.44 in us-east-1. A single DPU provides 4 vCPUs and 16 GB of memory.
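For example (a rough illustration that ignores any per-job minimum billing duration): a job that runs for 30 minutes on 10 DPUs consumes 10 × 0.5 = 5 DPU-Hours, or about 5 × $0.44 = $2.20 in us-east-1.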
We’ve only covered about half of the features that Glue has, so I want to encourage everyone who made it this far into the post to go read the documentation and service FAQs. Glue also has a rich and powerful API that allows you to do anything the console can do and more.
We’re also releasing two new projects today. The aws-glue-libs repo provides a set of utilities for connecting to and talking with Glue. The aws-glue-samples repo contains a set of example jobs.
I hope you find that using Glue reduces the time it takes to start doing things with your data. Look for another post from me on AWS Glue soon because I can’t stop playing with this new service. – Randall
By Michael Natkovich, Akshai Sarma, Nathan Speidel, Marcus Svedman, and Cat Utah
Big Data is no longer just Apache server logs. Nowadays, the data may be user engagement data, performance metrics, IoT (Internet of Things) data, or something else completely atypical. Regardless of the size of the data, or the type of querying patterns on it (exploratory, ad-hoc, periodic, long-term, etc.), everyone wants queries to be as fast as possible and cheap to run in terms of resources. Data can be broadly split into two kinds: the streaming (generally real-time) kind or the batched-up-over-a-time-interval (e.g., hourly or daily) kind. The batch version is typically easier to query since it is stored somewhere like a data warehouse that has nice SQL-like interfaces or an easy to use UI provided by tools such as Tableau, Looker, or Superset. Running arbitrary queries on streaming data quickly and cheaply though, is generally much harder… until now. Today, we are pleased to share our newly open sourced, forward-looking general purpose query engine, called Bullet, with the community on GitHub.
With Bullet, you can:
Run queries on real-time streaming data with the same convenience and flexibility as batch data. You get an API and a UI.
Execute forward-looking queries on typed data with arbitrary schemas.
Run queries that support
Powerful and nested filtering
Fetching raw data records
Aggregating data using Group Bys (Sum, Count, Average, etc.), Count Distincts, Top Ks
Getting distributions of fields like Percentiles or Frequency histograms
One of the key differences between how Bullet queries data and the standard querying paradigm is that Bullet does not store any data. In most other systems where you have a persistence layer (including in-memory storage), you are doing a look-back when you query the layer. Instead, Bullet operates on data flowing through the system after the query is started – it’s a look-forward system that doesn’t need persistence. On a real-time data stream, this means that Bullet is querying data after the query is submitted. This also means that Bullet does not query any data that has already passed through the stream. The fact that Bullet does not rely on a persistence layer is exactly what makes it extremely lightweight and cheap to run.
To see why this is better for the kinds of use cases Bullet is meant for – such as quickly looking at some metric, checking some assumption, iterating on a query, checking the status of something right now, etc. – consider the following: if you had 1,000 queries in a traditional query system that operated on the same data, that system would most likely scan the data 1,000 times, once per query. By virtue of being forward looking, 1,000 queries in Bullet scan the data only once, because the arrival of a query determines and fixes the data that it will see. Essentially, the data is coming to the queries instead of the queries being farmed out to where the data is. When the conditions of the query are satisfied (usually a time window or a number of events), the query terminates and returns the result.
A Brief Architecture Overview
High Level Bullet Architecture
The Bullet architecture is multi-tenant, can scale linearly for more queries and/or more data, and has been tested to handle 700+ simultaneous queries on a data stream that had up to 1.5 million records per second, or 5-6 GB/s. Bullet is currently implemented on top of Storm and can be extended to support other stream processing engines as well, like Spark Streaming or Flink. Bullet is pluggable, so you can plug in any source of data that can be read in Storm by implementing a simple data container interface to let Bullet work with it.
The UI, web service, and the backend layers constitute your standard three-tier architecture. The Bullet backend can be split into three main subsystems:
Request Processor – receives queries, adds metadata, and sends it to the rest of the system
Data Processor – reads data from an input stream, converts it to a unified data format, and matches it against queries
Combiner – combines results for different queries, performs final aggregations, and returns results
The web service can be deployed on any servlet container, like Jetty. The UI is a Node-based Ember application that runs in the client browser. Our full documentation contains all the details on exactly how we perform computationally intractable queries like Count Distincts on fields with cardinality in the millions (using DataSketches).
Usage at Yahoo
An instance of Bullet is currently running at Yahoo in production against a small subset of Yahoo’s user engagement data stream. This data is roughly 100,000 records per second and is about 130 MB/s compressed. Bullet queries this with about 100 CPU Virtual Cores and 120 GB of RAM. This fits on less than 2 of our (64 Virtual Cores, 256 GB RAM each) test Storm cluster machines.
One of the most popular use cases at Yahoo is to use Bullet to manually validate the instrumentation of an app or web application. Instrumentation produces user engagement data like clicks, views, swipes, etc. Since this data powers everything we do from analytics to personalization to targeting, it is absolutely critical that the data is correct. The usage pattern is generally to:
Submit a Bullet query to obtain data associated with your mobile device or browser (filter on a cookie value or mobile device ID)
Open and use the application to generate the data while the Bullet query is running
Go back to Bullet and inspect the data
In addition, Bullet is also used programmatically in continuous delivery pipelines for functional testing instrumentation on product releases. Product usage is simulated, then data is generated and validated in seconds using Bullet. Bullet is orders of magnitude faster to use for this kind of validation and for general data exploration use cases, as opposed to waiting for the data to be available in Hive or other systems. The Bullet UI supports pivot tables and a multitude of charting options that may speed up analysis further compared to other querying options.
We also use Bullet to do a bunch of other interesting things, including instances where we dynamically compute cardinalities (using a Count Distinct Bullet query) of fields as a check to protect systems that can’t support extremely high cardinalities for fields like Druid.
What you do with Bullet is entirely determined by the data you put it on. If you put it on data that is essentially some set of performance metrics (data center statistics for example), you could be running a lot of queries that find the 95th and 99th percentile of a metric. If you put it on user engagement data, you could be validating instrumentation and mostly looking at raw data.
We hope you will find Bullet interesting and tell us how you use it. If you find something you want to change, improve, or fix, your contributions and ideas are always welcome! You can contact us here.
Earlier this year I told you about Cloud Directory, our cloud-native directory for hierarchical data, and explained how it was designed to store large amounts of strongly typed hierarchical data. Because Cloud Directory can scale to store hundreds of millions of objects, it is a great fit for many kinds of cloud and mobile applications.
In my original post I explained how each directory has one or more schemas, each of which has, in turn, one or more facets. Each facet defines the set of required and allowable attributes for an object.
Introducing Typed Links
Today we are extending the expressive power of the Cloud Directory model by adding support for typed links. You can use these links to create object-to-object relationships across hierarchies. You can define multiple types of links for each of your directories (each link type is a separate facet in one of the schemas associated with the directory). In addition to a type, each link can have a set of attributes. Typed links help to maintain referential data integrity by ensuring that objects with existing relationships to other objects are not deleted inadvertently.
Suppose you have a directory of locations, factories, floor numbers, rooms, machines and sensors. Each of these can be represented as a dimension in Cloud Directory with rich metadata information within the hierarchy. You can define and use typed links to connect the objects together in various ways, creating typed links that lead to maintenance requirements, service records, warranties, and safety information, with attributes on the links to store additional information about the relationship between the source object and the destination object.
Then you can run queries that are based on the type of a link and the values of the attributes within it. For example, you could locate all sensors that have not been cleaned in the past 45 days, or all of the motors that are no longer within their warranty period. You can find all of the sensors that are on a given floor, or you can find all of the floors where sensors of a given type are located.
Just as you can do for objects, you can use Attribute Rules to constrain attribute values. You can constrain the length of strings and byte arrays, restrict strings to a specified set of values, and limit numbers to a specific range.
My colleagues shared some sample code that illustrates how to use typed links. Here are the ARNs and the name of the facet:
The first snippet creates a typed link facet called FloorSensorAssociation with sensor_type and maintenance_date attributes, in that order (attribute names and values are part of the identity of the link, so order matters):
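That sample code isn’t reproduced here; as a rough boto3 illustration only (the schema ARN is hypothetical and the exact parameter shapes are my assumption, not the original snippet), creating such a facet might look like the following:

import boto3

cd = boto3.client("clouddirectory")

# Hypothetical development schema ARN; use the ARNs from your own directory.
schema_arn = "arn:aws:clouddirectory:us-east-1:123456789012:schema/development/FactorySchema"

cd.create_typed_link_facet(
    SchemaArn=schema_arn,
    Facet={
        "Name": "FloorSensorAssociation",
        "Attributes": [
            {"Name": "sensor_type", "Type": "STRING", "RequiredBehavior": "REQUIRED_ALWAYS"},
            {"Name": "maintenance_date", "Type": "DATETIME", "RequiredBehavior": "REQUIRED_ALWAYS"},
        ],
        # Order matters: attribute names and values are part of the link's identity.
        "IdentityAttributeOrder": ["sensor_type", "maintenance_date"],
    },
)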
As part of the AWS Monthly Online Tech Talks series, AWS will present Deep Dive on Amazon Cloud Directory on Thursday, April 27. This tech talk will start at noon and end at 1:00 P.M. Pacific Time.
AWS Cloud Directory Expert Quint Van Deman will show you how Amazon Cloud Directory enables you to build flexible cloud-native directories for organizing hierarchies of data along multiple dimensions. Using Cloud Directory, you can easily build organizational charts, device registries, course catalogs, and network configurations with multiple hierarchies. For example, you can build an organizational chart with one hierarchy based on reporting structure, a second hierarchy based on physical location, and a third based on cost center.
You also will learn:
About the benefits and features of Cloud Directory.
The advantages of using Cloud Directory over traditional directory solutions.
How to efficiently organize hierarchies of data across multiple dimensions.
How to create and extend Cloud Directory schemas.
How to search your directory using strongly consistent and eventually consistent search APIs.