Tag Archives: AWS Lake Formation

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.
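As a hedged sketch of this pattern, the following AWS CLI commands show how a hub account can share a transit gateway with a new Processing Account and how that account attaches its VPC. All IDs and account numbers are hypothetical placeholders; this is not necessarily how Intuit's automation issues the calls.

    # In Central Data Lake: share the transit gateway with a new Processing Account via RAM.
    aws ram create-resource-share \
      --name datalake-tgw-share \
      --resource-arns arn:aws:ec2:us-west-2:111111111111:transit-gateway/tgw-0123456789abcdef0 \
      --principals 222222222222

    # In the Processing Account: attach its VPC to the shared transit gateway.
    aws ec2 create-transit-gateway-vpc-attachment \
      --transit-gateway-id tgw-0123456789abcdef0 \
      --vpc-id vpc-0abc1234def567890 \
      --subnet-ids subnet-0aaa1111bbb22222c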

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

Ingestion, Sanitization, and Storage

A key component to Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative for running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

                           Processing Account structure upon delivery to the customer

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed CMKs (customer master keys) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root",
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.
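To make the "administered entirely via code" idea concrete, a deployment job can apply the generated, version-controlled document with a single call such as the following sketch; the policy file name is hypothetical.

    # Apply the generated bucket policy to the read-only bucket.
    aws s3api put-bucket-policy \
      --bucket intuit-data-lake-example \
      --policy file://generated/intuit-data-lake-example-policy.json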

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing)
  2. Data can be encrypted with application-level (client-side) encryption
  3. KMS keys can be used for at-rest encryption of S3, EBS, and RDS (see the sketch after this list)
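As a hedged illustration of the third item, the following CLI sketch enables default KMS encryption on an S3 bucket and account-level default EBS encryption. The bucket name and key ARN are hypothetical; the post does not describe Intuit's exact encryption configuration.

    # Default server-side encryption with a customer managed CMK (hypothetical bucket and key).
    aws s3api put-bucket-encryption \
      --bucket intuit-data-lake-example \
      --server-side-encryption-configuration '{
        "Rules": [{
          "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "aws:kms",
            "KMSMasterKeyID": "arn:aws:kms:us-west-2:111111111111:key/EXAMPLE-KEY-ID"
          }
        }]
      }'

    # Default EBS encryption for the account and Region.
    aws ec2 enable-ebs-encryption-by-default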

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth mentioning the standard patterns we’ve found important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes automated VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.
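For illustration, provisioning such a Service Catalog product can also be scripted as below. The product and artifact IDs and the parameter names (such as SubnetSize and EnableVPN) are hypothetical placeholders, not the actual Intuit product definition.

    # Launch the VPC product from Service Catalog with user-supplied parameters (hypothetical IDs).
    aws servicecatalog provision-product \
      --product-id prod-examp1eid123 \
      --provisioning-artifact-id pa-examp1eid456 \
      --provisioned-product-name processing-account-vpc \
      --provisioning-parameters Key=SubnetSize,Value=/24 Key=EnableVPN,Value=false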

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption (see the sketch after this list)
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager
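As a hedged sketch of the KMS bootstrapping item above, one key per service could be created and aliased like this; the description and alias are hypothetical.

    # Create a customer managed key for S3 encryption in the new Processing Account and give it an alias.
    KEY_ID=$(aws kms create-key \
      --description "Processing Account S3 at-rest encryption" \
      --query KeyMetadata.KeyId --output text)

    aws kms create-alias \
      --alias-name alias/processing-account-s3 \
      --target-key-id "$KEY_ID"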

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently-updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.
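A minimal sketch of what "deploying a series of layers" can look like from a parameterized job, assuming one directory and one variable file per layer (the directory and file names are hypothetical, not Intuit's actual layout):

    # Apply each layer in order; every layer keeps its own (smaller) Terraform state.
    for layer in iam kms s3 security-groups bastion; do
      ( cd "layers/${layer}" \
        && terraform init -input=false \
        && terraform apply -input=false -auto-approve \
             -var-file="../../accounts/processing-account-123.tfvars" )
    done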

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP
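Behind a Service Catalog product like this, a launch is conceptually similar to the following EMR CLI sketch. The release label, security configuration, bootstrap script, instance profile, and subnet are hypothetical names and not Intuit's actual product definition.

    # Launch an EMR cluster with an instance profile, security configuration, and bootstrap action.
    aws emr create-cluster \
      --name data-lake-cluster \
      --release-label emr-5.29.0 \
      --applications Name=Spark Name=Hive \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --service-role EMR_DefaultRole \
      --ec2-attributes InstanceProfile=ProcessingAccountEMRProfile,SubnetId=subnet-0aaa1111bbb22222c \
      --security-configuration data-lake-security-config \
      --bootstrap-actions Path=s3://processing-account-bootstrap/connect-central-data-lake.sh,Name=central-lake-connectivity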

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with this best-practice configuration takes only a few clicks in the Service Catalog console.
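As a rough illustration of what that Service Catalog product automates, the following CLI sketch launches a notebook instance with a lifecycle configuration attached; the role ARN, lifecycle configuration name, and network IDs are hypothetical.

    # Launch a SageMaker notebook instance that runs a custom lifecycle configuration on start.
    aws sagemaker create-notebook-instance \
      --notebook-instance-name data-lake-notebook \
      --instance-type ml.t3.medium \
      --role-arn arn:aws:iam::222222222222:role/ProcessingAccountSageMakerRole \
      --lifecycle-config-name data-lake-notebook-config \
      --subnet-id subnet-0aaa1111bbb22222c \
      --security-group-ids sg-0123456789abcdef0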

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn't wholly unique, but is composed of common-sense approaches we've gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.

Access and manage data from multiple accounts from a central AWS Lake Formation account

Post Syndicated from Shilpa Mehta original https://aws.amazon.com/blogs/big-data/access-and-manage-data-from-multiple-accounts-from-a-central-aws-lake-formation-account/

This post shows how to access and manage data in multiple accounts from a central AWS Lake Formation account. The walkthrough demonstrates a centralized catalog residing in the master Lake Formation account, with data residing in the different accounts. The post shows how to grant permissions from the Lake Formation service to read, write, and update the catalog and to access data in the different accounts.

The post uses two datasets to determine whether there is a correlation between the news generated around the world (gdelt) and the number of reviews that Amazon's products received (amazonreviews).

Prerequisites

This walkthrough requires three AWS accounts, each with its own S3 bucket, and their account numbers.

Setting up the Environment

The three accounts are as follows:

  • Account Products (AP) – This is the account in which the Amazon product reviews are stored. This post deploys the configuration using AWS CloudFormation.
  • Account External (AE) – This account stores the gdelt dataset, which monitors broadcast, print, and web news from around the world in over 100 languages. It identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images, and events driving global society every second of every day. This post deploys the configuration using AWS CloudFormation.
  • Main Account (MA) – This is the main account, which gathers data from the other two accounts. This post configures Lake Formation in this account. This account has access to the product data and world news accounts.

The following diagram shows the account architecture.

Account Products

Deploy the following AWS CloudFormation template in the AP.

This creates an S3 bucket called productsaccountcf-bucketname-1pcfoxar1pxp (templateName-bucket-name-random_string) and a cross-account bucket policy in the AP. This policy gives the main account root ID access to this bucket. The template also uses a Lambda function to download the amazonreviews dataset into the new bucket.

Make sure you insert the account number of your main account in the datalake-AccountId field. The following screenshot shows that for this post, the MA account number is 1111111111111.

Account External

Deploy the following AWS CloudFormation template in the AE.

This creates an S3 bucket called externalaccountcf-resultbucket-12ecq638afqiq (templateName-bucket-name-random_string) and a cross-account bucket policy in the AE. This policy gives the main account access to this bucket. Due to the nature of the dataset, this template also creates tables in the data catalog. Instead of using AWS Glue crawlers, the template runs Athena queries that define the structure of the tables.

The template executes queries in Amazon Athena to download the gdelt dataset, and to create the metadata of the tables that Lake Formation uses.

Make sure you insert the account number of your main account in the datalakeAccountid field. For this post, the MA account number is 1111111111111. Though the CloudFormation console shows CREATE_COMPLETE, a query is still executing, which you can observe in the Athena console. You can access the Athena console from the AWS Management Console. The query that continues to run creates a new table in the AE with the data in Parquet format so that queries perform better.

The following screenshot shows your query history and status.

You are now ready to go to Lake Formation in your main account and start configuring.

Registering data stores in Lake Formation

Log in to the Lake Formation console in your main account. If it's the first time you are accessing Lake Formation, you need to add administrators to the account. Add the IAM user you are currently logged in with as an administrator.

To add your data lakes, complete the following steps:

  1. On the Lake Formation console, under Register and ingest, choose Data lake locations. The page displays a list of S3 buckets that are marked as data lake storage resources for Lake Formation. A single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources. This post registers the S3 buckets in the other accounts and creates a master catalog in Lake Formation.
  2. Choose Register location. The following screenshot shows the Data lake locations pane. You can now register both buckets you created in your AP and AE.
  3. In Amazon S3 location, for Amazon S3 path, enter s3://productsaccountcf-bucketname-1pcfoxar1pxp.
  4. For IAM role, keep the default role. Lake Formation needs an IAM role with the necessary permissions (GetObject, PutObject, DeleteObject, and ListBucket) to use your S3 bucket as a data lake, and the default role has these permissions. Alternatively, select a pre-existing IAM role that has the required permissions and is configured with lakeformation.amazonaws.com as a trusted entity.
  5. Choose Register location. The following screenshot shows the Amazon S3 location pane. You now have a storage resource and are ready to register the second bucket.
  6. Repeat the previous steps, but in step 3, register the bucket externalaccountcf-resultbucket-12ecq638afqiq from your AE (888888888888) as s3://externalaccountcf-resultbucket-12ecq638afqiq. The same registration can also be scripted; see the CLI sketch after this list.
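A hedged CLI equivalent of registering a location, using the service-linked role (the bucket ARN mirrors the example above):

    # Register the Products bucket as a data lake location using the service-linked role.
    aws lakeformation register-resource \
      --resource-arn arn:aws:s3:::productsaccountcf-bucketname-1pcfoxar1pxp \
      --use-service-linked-role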

Setting up your IAM role

You need an IAM role that allows Lake Formation to create catalog tables of the datasets in the storage locations. Complete the following steps:

  1. On the AWS console, access IAM and create an IAM role.
  2. Attach AWS Glue and AWS Lambda policies as described here.
  3. Edit the trust relationship for the role with the following policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "",
          "Effect": "Allow",
          "Principal": {
            "Service": ["glue.amazonaws.com", "lambda.amazonaws.com"]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

Add an inline policy to give the role permissions on S3, to execute Athena queries, to use AWS Glue, and to publish Amazon CloudWatch Logs, as shown in the following screenshot.
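If you prefer to script this role, the following sketch creates it with the trust policy above and attaches AWS managed Glue and Lambda policies. The role name is hypothetical, and the inline S3/Athena/CloudWatch policy from the screenshot is not reproduced here.

    # Create the role with the Glue/Lambda trust policy shown above (saved locally as trust.json).
    aws iam create-role \
      --role-name LakeFormationWorkflowRole \
      --assume-role-policy-document file://trust.json

    # Attach AWS managed policies for Glue and Lambda.
    aws iam attach-role-policy \
      --role-name LakeFormationWorkflowRole \
      --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
    aws iam attach-role-policy \
      --role-name LakeFormationWorkflowRole \
      --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole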

Creating a database

Lake Formation maintains a Hive-compatible data catalog within your data lake. Before you can catalog data within your S3 storage backend or use Lake Formation data importers to push data to S3 (which this post discusses later), you must first create a database within your Lake Formation catalog.

A Lake Formation database is a logical construct to which you can later add tables. Each table contains a mapping to one or more objects in S3 that, collectively, represent that table. Tables also contain basic column metadata such as file format, S3 location, and column definitions. Optionally, you can also define arbitrary key-value pairs for tables and columns to better describe the data and act as queryable attributes for data discovery.

You can create one or more databases and populate their tables either manually in the console, programmatically via the AWS SDKs or AWS CLI, or automatically by defining AWS Glue crawlers.

This post defines two logical databases, amazonreviews and gdelt.

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database. The following screenshot shows the Databases pane.
  3. For Name, enter amazonreviews.
  4. For Location, enter s3://productsaccountcf-bucketname-1pcfoxar1pxp/amazonreviews.
  5. For Description, enter a brief, meaningful description.
  6. Clear Grant All to Everyone for new tables in this database.
  7. Choose Create database. The following screenshot shows the database details.
  8. Repeat the previous steps to create the gdelt database: set Name to gdelt, set Location to s3://externalaccountcf-resultbucket-12ecq638afqiq/gdelt, set Description to a brief, meaningful description, and clear Grant All to Everyone for new tables in this database. Both databases can also be created from the CLI; see the sketch after this list.
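A hedged CLI sketch of creating the same two databases (the descriptions are placeholders):

    # Create the amazonreviews and gdelt databases in the central Data Catalog.
    aws glue create-database --database-input '{
      "Name": "amazonreviews",
      "LocationUri": "s3://productsaccountcf-bucketname-1pcfoxar1pxp/amazonreviews",
      "Description": "Amazon product reviews dataset"
    }'

    aws glue create-database --database-input '{
      "Name": "gdelt",
      "LocationUri": "s3://externalaccountcf-resultbucket-12ecq638afqiq/gdelt",
      "Description": "GDELT world news events dataset"
    }'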

Granting permissions

You now have your databases, and you need to grant Lake Formation permissions to the role you created. You also need to configure your IAM users and roles as administrators.

  1. On the Lake Formation console, under Permissions, choose Admins and database creators. The following screenshot shows the Admins and database creators pane.
  2. Under Permissions, choose Data permissions.
  3. From the Actions menu, choose Grant.
  4. Select your new IAM role.
  5. For Database permissions, choose Create table and Grant all.
  6. Choose Grant. The following screenshot shows the Grant permissions pane.
  7. Repeat the previous steps for both the amazonreviews and gdelt databases. The next step is granting your role permissions to the data lake locations you registered.
  8. Under Permissions, choose Data locations.
  9. Choose Grant.
  10. For IAM users and roles, select your new role. The following screenshot shows the Data locations pane.
  11. For Storage locations, enter s3://productsaccountcf-bucketname-1pcfoxar1pxp.
  12. Choose Grant. The following screenshot shows the Grant permissions pane.
  13. Repeat the steps for the data lake location s3://externalaccountcf-resultbucket-12ecq638afqiq/. The same grants can be scripted; see the CLI sketch after this list.
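For reference, here is a hedged CLI sketch of the database-level grant in steps 3 through 6; the account ID is a placeholder and the role name matches the hypothetical one used earlier.

    # Grant the workflow role CREATE_TABLE and ALL on the amazonreviews database (repeat for gdelt).
    aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=arn:aws:iam::111111111111:role/LakeFormationWorkflowRole \
      --permissions CREATE_TABLE ALL \
      --resource '{"Database": {"Name": "amazonreviews"}}'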

Setting up your main account

Deploy the following CloudFormation stack in the MA.

This creates tables in the amazonreviews and gdelt databases.

From the Actions menu, choose Grant. Select the role or users to grant access to, select the two check boxes, and choose Grant.

Querying the data

Now that you have the data in the catalog, you can use Athena in the master account to query the datasets that live in the different accounts.

Grant table permissions by completing the following steps:

  1. On the Lake Formation console, under Data catalog, choose Tables.
  2. Choose the table to query.
  3. From the Actions menu, choose Grant, and enter your role or user name. The following screenshot shows the Tables pane.
  4. For Table permissions, select Select.
  5. For Grantable permissions, select Alter, Insert, Drop, Delete, Select, and Grant all. The following screenshot shows the Grant permissions pane.
  6. Repeat the previous steps for the events table in the gdelt database. You are now ready to query the data.
  7. In Tables, select the events table.
  8. From the Actions menu, choose View data. You are taken to the Athena console. The following screenshot shows the Athena console.
  9. On the Query Editor tab, enter SQL queries for the reviews and events tables.

To query the information by date, standardize the date columns and do aggregations by creating views. In sequential order, run the following queries:

CREATE VIEW amznrevw.aggreviews AS
SELECT count(*) as reviewcount, star_rating, verified_purchase, from_iso8601_date(review_date) as reviewdate FROM "amznrevw"."reviews2" group by star_rating, verified_purchase, review_date;

CREATE VIEW gdelt.eventsformatted AS
SELECT from_iso8601_date(substr(cast(day as varchar),1,4) || '-' || substr(cast(day as varchar),5,2) || '-' || substr(cast(day as varchar),7,2)) as eventdate, actor1code, actor1name, actor1countrycode, actor2code, actor2name, actor2countrycode FROM "gdelt"."events";

CREATE VIEW gdelt.eventsagregated AS
SELECT count(*) as numevetns, eventdate, (count(distinct actor1code) + count(distinct actor2code)) as numactors FROM gdelt.eventsformatted group by eventdate;

You are now ready to query. You can determine how many gdelt events occurred on the five days with the most reviews by running the following query:

select eventdate, sum(reviewcount) as totalreviews, sum(numevetns) as totalnumevetns from gdelt.eventsagregated as event, amznrevw.aggreviews as review where event.eventdate = review.reviewdate group by eventdate order by totalreviews desc, totalnumevetns desc limit 5;

The following screenshot shows the query results.

January 3, 2015, had the most reviews but not the most gdelt events (833,890).

You can also discover how many reviews were written on the five days with the most gdelt events by running the following query:

select eventdate, sum(reviewcount) as totalreviews, sum(numevetns) as totalnumevetns from gdelt.eventsagregated as event, amznrevw.aggreviews as review where event.eventdate = review.reviewdate group by eventdate order by totalnumevetns desc, totalreviews desc limit 5;

The following screenshot shows the query results.

January 25, 2012, had 2 million events but only 378 reviews.

Finally, you can check the correlation between the two with the following query:

SELECT corr(reviewcount,
numevetns) AS review_event_correlation
FROM gdelt.eventsagregated AS event, amznrevw.aggreviews AS review
WHERE event.eventdate = review.reviewdate

The following screenshot shows the query results.

From the result, you can see that there is no meaningful correlation between the two columns.

Conclusion

This post demonstrated how to set up cross-account access to data stores through a central Lake Formation catalog. The solution walked through creating two S3 buckets in external accounts, downloading datasets into these buckets, and giving Lake Formation permission to access the data. You also learned how to govern the data in the data lakes from Lake Formation, and how to query the data in the two data lakes using Athena.

About the Authors


Shilpa Mehta is a Data Lab solutions architect at AWS.
Shilpa helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.

Laura Caicedo Camacho is a solutions architect at AWS. She works with customers to help them embrace and adopt the cloud.

Luis Caro Perez is a solutions architect at AWS. He works with our customers to provide guidance and technical assistance on their applications, helping them improve the value of their solutions when using AWS.

Discover metadata with AWS Lake Formation: Part 2

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/discover-metadata-with-aws-lake-formation-part-2/

Data lakes are an increasingly popular way to aggregate, store, and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage your data lakes.

In Part 1 of this post series, you learned how to create and explore a data lake using Lake Formation. This post walks you through data discovery using the metadata search capabilities of Lake Formation in the console, and metadata search results restricted by column permissions.

Prerequisites

For this post, you need to have completed the walkthrough in Part 1 of this series, including the data lake, catalog databases, and tables created there.

Metadata search in the console

In this post, we demonstrate the catalog search capabilities offered by the Lake Formation console:

  • Search by classification
  • Search by keyword
  • Search by tag: attribute
  • Multiple filter searches

Search by classification

Using the metadata catalog search capabilities, search across all three tables within your data lake. Two share the name amazon_reviews but belong to your simulated “prod” and “test” databases, respectively, and the third is trip_data.

  1. In the Lake Formation console, under Data catalog, choose Tables.
  2. In the search bar, under Resource Attributes, choose Classification, type CSV, and press Enter. You should see only the trip_data table, which you formatted as CSV in your data lake. The amazon_reviews tables do not appear because they are in Parquet format.
  3. In the Name column, choose trip_data. Under Table details, you can see that the classification CSV is correctly identified by the metadata search filter.

Search by keyword

Next, search across your entire data lake filtering metadata by keyword.

  1. To refresh the list of tables, under Data catalog, choose Tables again.
  2. From the search bar, type star_rating, and press Enter. Now that you have applied the filter, you should see only the amazon_reviews tables because they both contain a column named star_rating.
  3. By choosing either of the two tables, you can scroll down to the Schema section, and confirm that they contain a star_rating column.
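The same keyword search can also be approximated programmatically with the AWS Glue SearchTables API, which offers similar keyword search over the Data Catalog; a hedged CLI sketch:

    # Search the Data Catalog for tables whose metadata matches the keyword.
    aws glue search-tables --search-text star_rating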

Search by tag: attribute

Next, search across your data lake and filter results by metadata tags and their attribute value.

  1. To refresh the list of tables, under Data catalog, choose Tables.
  2. From the search bar, type department: research, and press Enter. Now that you have applied the filter, you should see only the trip_data table because this is the only table containing the value of ‘research’ in the table property of ‘department’.
  3. Select the trip_data table. Under Table details, you can see the tag: attribute of department | research listed under Table properties.

Multiple filter searches

Finally, try searching across your entire data lake using multiple filters at one time.

  1. To refresh the list of tables, under Data catalog, choose Tables.
  2. In the search bar, choose Location, type S3, and press Enter. For this post, all of your catalog tables are in S3, so all three tables display.
  3. In the search bar, choose Classification, type parquet, and press Enter. You should see only the amazon_reviews tables because they are the only tables stored in S3 in Parquet format.
  4. Choose either of the displayed amazon_reviews tables. Under Table details, you can see that the following is true.
  • Location: S3
  • Classification: parquet

Metadata search results restricted by column permissions

The metadata search capabilities return results based on the permissions specified within Lake Formation. If a user or a role does not have permission to a particular database, table, or column, that element doesn’t appear in that user’s search results.

To demonstrate this, first create an IAM user, dataResearcher, with AWS Management Console access. Make sure to store the password somewhere safe.

To simplify this post, attach the AdministratorAccess policy to the user. This policy grants full access to your AWS account, which is overly permissive. I recommend that you either remove this user after completing the post, or remove this policy, and enable multi-factor authentication (MFA). For more information, see Creating an IAM user in the console.

In Part 1 of this series, you allowed Everyone to view the tables that the AWS Glue crawlers created. Now, revoke those permissions for the ny-taxi database.

  1. In the Lake Formation console, under Permissions, choose Data permissions.
  2. Scroll down or search until you see the Everyone record for the trip_data table.
  3. Select the record and choose Revoke, Revoke.

Now, your dataResearcher IAM user cannot see the ny-taxi database or the trip_data table. Resolve this issue by setting up Lake Formation permissions.

  1. Under Permissions, choose Data permissions, and then choose Grant.
  2. Select the dataResearcher user, the ny-taxi database, and the trip_data table.
  3. Under Table permissions, check Select and choose Grant.
  4. Log out of the console and sign back in using the dataResearcher IAM user that you created earlier.
  5. In the Lake Formation console, choose Tables, select the trip_data table, and look at its properties:

The dataResearcher user currently has visibility across all of these columns. However, you don’t want to allow this user to see the pickup or drop-off locations, as those are potential privacy risks. Remove these columns from the dataResearcher user’s permissions.

  1. Log out of the dataResearcher user and log back in with your administrative account.
  2. In the Lake Formation console, under Permissions, choose Data Permissions.
  3. Select the dataResearcher record and choose Revoke.
  4. On the Revoke page, under Columns, choose All columns except the excluded columns, and then choose the vendor_id, passenger_count, trip_distance, and total_amount columns.
  5. Under Table permissions, check Select. These settings revoke all of the dataResearcher user’s permissions to the trip_data table except those selected in the window. In other words, the dataResearcher user can only Select (view) the four chosen columns.
  6. Choose Revoke.
  7. Log back in as the dataResearcher user.
  8. In the Lake Formation console, choose Data catalog, Tables. Search for vendor_id and press Enter. The trip_data table appears in the search, as shown in the following screenshot.
  9. Search for pu_location_id. This returns no results because you revoked permissions to this column, as shown in the following screenshot.
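The same column-level restriction can be expressed directly against the Lake Formation API. As a hedged sketch, granting Select on only the four retained columns looks like this (the account ID and user ARN are placeholders):

    # Grant SELECT on only the four non-sensitive columns of trip_data to the dataResearcher user.
    aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=arn:aws:iam::111111111111:user/dataResearcher \
      --permissions SELECT \
      --resource '{"TableWithColumns": {"DatabaseName": "ny-taxi", "Name": "trip_data",
                   "ColumnNames": ["vendor_id", "passenger_count", "trip_distance", "total_amount"]}}'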

Conclusion

Congratulations: You have learned how to use the metadata search capabilities of Lake Formation. By defining specific user permissions, Lake Formation allowed you to grant and revoke access to metadata in the Data Catalog as well as the underlying data stored in S3. Therefore, you can discover your data sources across your entire AWS environment using a single pane of glass. To learn more, see AWS Lake Formation.

About the Authors

Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.

Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.

Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years experience in Analytics and Data Management.

Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.

Discovering metadata with AWS Lake Formation: Part 1

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/discovering-metadata-with-aws-lake-formation-part-1/

Data lakes are an increasingly popular way to create a single repository to store and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage data lakes. This post walks you through the creation and exploration of a data lake using Lake Formation:

  • Creating the data lake
    ◦ Adding data to your data lake
    ◦ Creating catalog databases
    ◦ Adding tables from Amazon S3 to catalog databases
  • Editing and adding metadata within the catalog
    ◦ Editing standard metadata
    ◦ Adding custom metadata

Prerequisites

For this post, you need the following:

  • An AWS account
  • An IAM user with access to AWS Lake Formation, AWS Glue, and Amazon S3
  • The AWS CLI, installed and configured with that user’s access keys

Create the data lake

In the AWS Lake Formation console, in the left navigation pane, choose Register and ingest, Data lake locations. Select a single S3 bucket to house several independent data sources in your data lake. For more information, see What is AWS Lake Formation?

Add data to your data lake

Now that you have an S3 bucket configured as a storage resource for Lake Formation, you must add data to your data lake. You can add data to your data lake’s S3 bucket storage resource using AWS SDKs, AWS CLI, the S3 console, or a Lake Formation blueprint.

With Lake Formation, you can discover and set up the ingestion of your source data. When you add a workflow that loads or updates the data lake, you can choose a blueprint or template of the type of importer to add. Lake Formation provides several blueprints on the Lake Formation console for common source data types to simplify the creation of workflows. Workflows point to your data source and target and specify the frequency that they run.

For this post, use the AWS CLI to download sample data and then upload it to your S3 storage backend. Other import methods, such as Lake Formation data importers, are outside the scope of this post.

Sample from the following two datasets provided on the Registry of Open Data on AWS:

  • Amazon customer reviews
  • New York City Taxi and Limousine Commission (TLC) trip record data

Make two copies of the Amazon customer reviews dataset in your data lake. You can use these to simulate “production” and “test” datasets and learn how to target one or both when searching your metadata catalog.

To demonstrate the flexibility of an AWS data lake, add both CSV and Parquet datasets to your data lake. In both cases, use the following naming convention for your S3 objects:

s3://BUCKET_NAME/DATABASE_NAME/TABLE_NAME/<data files>

Add Amazon customer reviews to your data lake

AWS hosts a registry to help people share and discover a variety of datasets. For this post, copy a subset of the Amazon customer reviews dataset into your data lake. You don’t have to copy the complete reviews dataset, only the smaller 226-MB portion of watch reviews. You need two copies of this data in your data lake to simulate separate “production” and “test” databases.

  1. If you have not already, install and configure the AWS CLI with IAM user access keys that include permission to read from S3 and write to your Lake Formation S3 bucket.
  2. Copy the source to your data lake:
    # Replace with your bucket name:
    YOUR_BUCKET=lf-metadata-demo
     
    aws s3 cp \
       s3://amazon-reviews-pds/parquet/product_category=Watches/ \
       s3://$YOUR_BUCKET/amazon-reviews-prod/amazon-reviews/ --recursive
     
    aws s3 cp \
       s3://amazon-reviews-pds/parquet/product_category=Watches/ \
       s3://$YOUR_BUCKET/amazon-reviews-test/amazon-reviews/ --recursive

  3. In the S3 console, confirm that your S3 bucket now contains your two Amazon reviews datasets.
  4. Inspect the contents of the folders. The datasets are in Parquet format.

Add New York taxi ride history to your data lake

Much as you did with the Amazon customer reviews dataset, copy a small subset of New York taxi ride history from the Registry of Open Data on AWS into your data lake:

  1. Copy the source data to your data lake:
    # Replace with your bucket name:
    YOUR_BUCKET=lf-metadata-demo
     
    aws s3 cp \
       "s3://nyc-tlc/trip data/green_tripdata_2018-02.csv" \
       "s3://$YOUR_BUCKET/ny-taxi/trip-data/green_tripdata_2018-02.csv"

  2. In the S3 console, validate that your S3 bucket contains CSV data for NY taxi trips.

Create catalog databases

You have created an S3 bucket to act as your data lake storage backend and added data to the bucket. However, this data is not readily available in Lake Formation until you catalog the data.

Lake Formation maintains a Hive-compatible data catalog of data within your data lake. Before you can catalog data within your S3 storage backend or use Lake Formation data importers (discussed later) to push data to S3, you must first create a database.

A Lake Formation database is a logical construct to which you later add tables. Each table contains a mapping to one or more objects in S3 that, collectively, represent that table. Tables also contain basic metadata including but not limited to file format, S3 location, column headings, and column types. Lake Formation users can also optionally define arbitrary key-value pairs for tables and columns to better describe the data and act as query-able attributes for data discovery.

You can create one or more databases and populate their tables either manually in the console, programmatically using the AWS SDKs or AWS CLI, or automatically by defining AWS Glue crawlers.

For this post, you must define three logical databases:

  • amazon-reviews-prod
  • amazon-reviews-test
  • ny-taxi

Then, use the cataloging process to map to the two datasets that you previously uploaded to your S3 storage backend. Remember, you intentionally created two copies of the Amazon reviews dataset to simulate both a production and test database in your data lake.

Now, create your databases. First, configure IAM users and roles as administrators within Lake Formation.

Catalog permissions are permissions that the selected IAM principal can use directly. Grantable permissions are those that the IAM principal can grant to other IAM principals later.

For example, you might want to give your database administrator (DBA) the ability to create databases, by granting permissions to the catalog. However, you can prevent the DBA from accidentally giving this access to your developers by not enabling the grantable permission.

Now that you’ve granted necessary permissions, you can proceed to create your database within the catalog.

  • For Name, enter amazon-reviews-prod.
  • For Location, enter s3://<YOUR_BUCKET>/amazon-reviews-prod.
  • For Description, enter a brief, meaningful description.
  • Check Grant All to Everyone for new tables in this database. In production, assess whether this is appropriate.

Repeat the process for the other two databases:

  • Name: amazon-reviews-test
    Location: s3://<YOUR_BUCKET>/amazon-reviews-test
  • Name: ny-taxi
    Location: s3://<YOUR_BUCKET>/ny-taxi

After completing these steps, you should have three databases in your catalog: amazon-reviews-prod, amazon-reviews-test, and ny-taxi.

Add tables from S3 to your catalog databases

In the previous section, you created three databases in your Lake Formation catalog. However, these catalog databases are empty and do not yet provide information about the specific tables, schema, file formats, or object paths in S3. To add this information, use one of the following two methods:

  • Manually define your tables in the catalog using the console, SDKs, or AWS CLI.
  • Use an AWS Glue crawler to search S3 and automatically add discovered tables to your catalog.

For this post, create and manually run one AWS Glue crawler for each of your three datasets in S3 and databases in the Lake Formation data catalog. A detailed walkthrough is outside the scope of this post. For guidance, see Working with Crawlers on the AWS Glue Console.

As you proceed, please bear the following in mind:

  • Create one crawler for each of your three datasets. You should be able to accept most of the default crawler settings. However, the S3 path for your crawlers should read:

s3://YOUR_BUCKET/amazon-reviews-prod/amazon-reviews

s3://YOUR_BUCKET/amazon-reviews-test/amazon-reviews

s3://YOUR_BUCKET/ny-taxi/trip-data

Before you run the crawlers to populate your catalog, you must assign them an IAM role. The role grants them permission to read from your data lake’s S3 bucket, write crawler logs to Amazon CloudWatch, and update your data catalog. Regardless of whether you create a new role, or use an existing role, make a note of the IAM role name. You need this information for the next step.
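If you prefer to script this step, here is a hedged sketch of creating and running one of the three crawlers with the AWS CLI. The crawler and role names are hypothetical, and the crawler should only be started after the Lake Formation permissions described below are granted.

    # Create a crawler for the "prod" copy of the Amazon reviews dataset (repeat for the other two paths).
    aws glue create-crawler \
      --name amazon-reviews-prod-crawler \
      --role LakeFormationCrawlerRole \
      --database-name amazon-reviews-prod \
      --targets '{"S3Targets": [{"Path": "s3://YOUR_BUCKET/amazon-reviews-prod/amazon-reviews"}]}'

    # Run the crawler (only after granting it Lake Formation permissions on the database).
    aws glue start-crawler --name amazon-reviews-prod-crawler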

In addition to permissions defined within IAM, you must also explicitly grant IAM principals (roles or users) the ability to modify your Data Catalog from within Lake Formation itself. Conceptually, this is similar to the concept of bucket policies in S3 used with IAM. In the Lake Formation console, under Permissions, choose Data permissions.

Grant your AWS Glue crawlers the ability to modify your Data Catalog. Configure the following fields:

  • For IAM users and roles, select the IAM roles that you previously used for your AWS Glue crawlers.
  • For Database, select the amazon-reviews-prod, amazon-reviews-test, and ny-taxi databases.
  • For Database permissions, select all permissions.
  • Leave all Grantable permissions unselected.

After your AWS Glue crawlers have permission to modify your Lake Formation data catalog, return to the AWS Glue console and manually run your three crawlers. After a few minutes, the crawlers should complete their runs. Each should add one table to your data catalog:

  • amazon-reviews (in the amazon-reviews-prod database)
  • amazon-reviews (in the amazon-reviews-test database)
  • trip-data (in the ny-taxi database)

Verify that your catalog was updated. In the Lake Formation console, under Data catalog, choose Tables, and view the three new tables added to the corresponding data lake databases, as shown in the following screenshot.

Edit and add metadata within the catalog

The AWS Glue crawlers populate standard metadata about the tables they discover in S3, including (but not limited to) attributes such as object location, file format, column headings, and column types.

However, you can manually edit standard metadata or add additional custom metadata to the catalog to make it easier to search and improve the overall value that it provides. In the following section, I walk through several examples of editing and adding to metadata.

Edit standard metadata

The AWS Glue crawlers infer the names of columns from the first line of the CSV file. To view the auto-populated column names for the trip_data table, look at the table properties:

  1. Under Data catalog, choose Tables.
  2. Select trip_data and scroll down to the Schema section.
  3. Choose Edit Schema. Your data columns’ names should consistently use snake casing, which means using the ‘_’ character between words. If you look at the first row of the raw data, you notice that the id columns are named inconsistently. Rather than changing those files, you can manually change the metadata: change all of the id columns to match the rest of the columns.
  4. Select the vendorid row and choose Edit. Change the name to use snake casing (vendor_id) and choose Save.
  5. Repeat the previous step for dolocationid, ratecodeid, and pulocationid. After you make these changes, choose Save as new version.
  6. Under Data catalog, choose Tables. If you search for pulocationid, no results should return.
  7. Search for the new column name, pu_location_id. This search should return the expected result, the trip_data table from the ny-taxi database.

Add custom metadata

Now, try adding a couple of custom table properties to help organize your tables. The first table property to add is an environment variable to help you to determine whether a table is for development, testing, or production. The second table property to add is a department variable, which allows you to group tables by a department.

  1. In the Lake Formation console, under Data catalog, choose Databases.
  2. Select the ny-taxi database and choose View tables.
  3. Select the trip_data table and choose Edit Table.
  4. Under Table properties, choose Add. Set the value of environment to dev and the value of department to research1. Choose Save.
  5. Under Data catalog, choose Tables. In the search bar, type “research,” and press Enter. No results return because there isn’t a table with the table property value of research. However, searching for research1 should return the trip_data table.
  6. Go back to your table properties for trip_data and update the department property from research1 to research. After you’ve made the edit, the trip_data table appears when entering “research” as a keyword in the table search.

Conclusion

Congratulations: You have successfully created and edited your first data lake using Lake Formation. You used the service to secure and ingest data into an S3 data lake, catalog the data, and customize the metadata of the data sources. In Part 2 of this series, I show you how to discover your data by using the metadata search capabilities of Lake Formation.

About the Authors

Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.

Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.

Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years experience in Analytics and Data Management.

Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.

Getting started with AWS Lake Formation

Post Syndicated from Gordon Heinrich original https://aws.amazon.com/blogs/big-data/getting-started-with-aws-lake-formation/

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without having first to structure it. And you can run different types of analytics to better guide decision-making—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation allows you to control data lake access and audit those who access data. The AWS Glue Data Catalog integrates data access policies, making sure of compliance regardless of the data’s origin.

Walkthrough

In this walkthrough, I show you how to build and use a data lake:

  • Create a data lake administrator.
  • Register an Amazon S3 path.
  • Create a database.
  • Grant permissions.
  • Crawl the data with AWS Glue to create the metadata and table.
  • Grant access to the table data.
  • Query the data using Amazon Athena.
  • Add a new user with restricted access and verify the results.

Prerequisites

You need the following resources for this walkthrough:

  • An AWS account
  • An IAM user with the AWSLakeFormationDataAdmin policy. For more information, see IAM Access Policies.
  • An S3 bucket named datalake-yourname-region, in the US East (N. Virginia) Region.
  • A folder named zipcode within your new S3 bucket.

You also must download the sample dataset. For this walkthrough, I use a table of City of New York statistics. The data is available on the DATA.GOV site, in the City of New York Demographics Statistics by Zip table. Upload the file to your S3 bucket in the /zipcode folder.

You have set up the S3 bucket and put the dataset in place. Now, set up your data lake with Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself a data lake administrator to allow access to any Lake Formation resource.

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the zipcode table definitions.

  • For Database, enter zipcode-db.
  • For Location, enter your S3 bucket/zipcode.
  • For New tables in this database, do not select Grant All to Everyone.

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the zipcode-db database. For IAM role, select your user and AWSGlueServiceRoleDefault.

Grant your user and AWSServiceRoleForLakeFormationDataAccess permissions to use your data lake using a data location:

  • For IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-yourname-region (see the CLI sketch after this list).
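A hedged CLI sketch of the same data location grant; the account ID is a placeholder, and the user name matches the datalakeadmin user used later in this walkthrough.

    # Grant the user permission to use the registered S3 location as a data location.
    aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=arn:aws:iam::111111111111:user/datalakeadmin \
      --permissions DATA_LOCATION_ACCESS \
      --resource '{"DataLocation": {"ResourceArn": "arn:aws:s3:::datalake-yourname-region"}}'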

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings:

  • Crawler name: zipcodecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-yourname-region/zipcode.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select zipcode-db.

Choose Run it now? and wait for the crawler to stop before moving to the next step.

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM role, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.

Step 7: Query the data with Athena

Next, query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the zipcode-db database.
  • Choose Tables and select the zipcode table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query:

SELECT * FROM "zipcode-db"."zipcode" limit 10;

As you can see from the following screenshot, the datalakeadmin user can see all of the data.
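If you prefer the CLI, a hedged sketch of running the same query is shown below; the query results location is a hypothetical placeholder.

    # Run the preview query from the CLI; results are written to the specified S3 location.
    aws athena start-query-execution \
      --query-string 'SELECT * FROM "zipcode-db"."zipcode" limit 10;' \
      --query-execution-context Database=zipcode-db \
      --result-configuration OutputLocation=s3://datalake-yourname-region/athena-results/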

Step 8: Add a new user with restricted access and verify the results

This step shows how you, as the data lake administrator, can set up a user with restricted access to specific columns.

In the IAM console, create an IAM user with administrative rights, called user1, and attach the AWSLakeFormationDataAdmin policy. For more information, see Adding and Removing IAM Identity Permissions.

In the Lake Formation console, grant permissions to user1 and supply the following configuration settings:

  • Database: Select zipcode-db.
  • Table: Select zipcode.
  • Columns: Choose Include columns.
  • Include columns: Choose Jurisdiction name and Count participants.
  • Table permissions: Choose Select.
  • Grantable permissions: Choose Select.
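
The equivalent API call uses a TableWithColumns resource. The column names below are assumptions based on how the crawler typically normalizes the CSV headers, and the account ID is a placeholder; adjust both to match your catalog.

import boto3

lf = boto3.client("lakeformation")

# Grant user1 SELECT on only two columns of the zipcode table, with the grant option.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/user1"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "zipcode-db",
            "Name": "zipcode",
            "ColumnNames": ["jurisdiction_name", "count_participants"],
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)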

To verify the results of the restricted permissions, repeat step 7 when you are logged in as user1. As the following screenshot shows, user1 can only see columns that the datalakeadmin user granted them permissions to view.

Conclusion

This post showed you how to build a secure data lake using Lake Formation. Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls, making your data more usable for analytics and machine learning.

About the Author

Gordon Heinrich is a solutions architect working with global system integrators. He works with AWS partners and customers to provide architectural guidance on building data lakes and using AWS machine learning services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.

Build, secure, and manage data lakes with AWS Lake Formation

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/building-securing-and-managing-data-lakes-with-aws-lake-formation/

A data lake is a centralized store of a variety of data types for analysis by multiple analytics approaches and groups. Many organizations are moving their data into a data lake. In this post, I explore how you can use AWS Lake Formation to build, secure, and manage data lakes.

Traditionally, organizations have kept data in a rigid, single-purpose system, such as an on-premises data warehouse appliance. Similarly, they have analyzed data using a single method, such as predefined BI reports. Moving data between databases or for use with different approaches, like machine learning (ML) or ad hoc SQL querying, required “extract, transform, load” (ETL) processing before analysis. At best, these traditional methods have created inefficiencies and delays. At worst, they have complicated security.

By contrast, cloud-based data lakes open structured and unstructured data for more flexible analysis. Any amount of data can be aggregated, organized, prepared, and secured by IT staff in advance. Analysts and data scientists can then access it in place with the analytics tools of their choice, in compliance with appropriate usage policies.

Data lakes let you combine analytics methods, offering valuable insights unavailable through traditional data storage and analysis. In one retail scenario, ML methods discovered detailed customer profiles and cohorts from non-personally identifiable data gathered from web browsing behavior, purchase history, support records, and even social media. The exercise demonstrated deploying ML models against real-time, streaming, interactive customer data.

Such models could analyze shopping baskets and serve up “next best offers” in the moment, or deliver instant promotional incentives. Marketing and support staff could explore customer profitability and satisfaction in real time and define new tactics to improve sales. Around a data lake, combined analytics techniques like these can unify diverse data streams, providing insights unobtainable from siloed data.

The challenges of building data lakes

Unfortunately, the complex and time-consuming process for building, securing, and starting to manage a data lake often takes months. Even building a data lake in the cloud requires many manual and time-consuming steps:

  • Setting up storage.
  • Moving, cleaning, preparing, and cataloging data.
  • Configuring and enforcing security policies for each service.
  • Manually granting access to users.

You want data lakes to centralize data for processing and analysis with multiple services. But organizing and securing the environment requires patience.

Currently, IT staff and architects spend too much time creating the data lake, configuring security, and responding to data requests. They could spend this time acting as curators of data resources, or as advisors to analysts and data scientists. Analysts and data scientists must wait for access to needed data throughout the setup.

The following diagram shows the data lake setup process:

Setting up storage

Data lakes hold massive amounts of data. Before doing anything else, you must set up storage to hold all that data. If you are using AWS, configure Amazon S3 buckets and partitions. If you are building the data lake on premises, acquire hardware and set up large disk arrays to store all the data.

Moving data

Connect to different data sources, both on-premises and in the cloud, then collect data from IoT devices. Next, collect and organize the relevant datasets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. You can use a collection of file transfer and ETL tools for this work.

Cleaning and preparing data

Next, collected data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. You must clean, de-duplicate, and match related records.

Today, organizations accomplish these tasks using rigid and complex SQL statements that perform unreliably and are difficult to maintain. This complex process of collecting, cleaning, and transforming the incoming data requires manual monitoring to avoid errors. Many customers use AWS Glue for this task.

Configuring and enforcing policies

Customers and regulators require that organizations secure sensitive data. Compliance involves creating and applying data access, protection, and compliance policies. For example, you restrict access to personally identifiable information (PII) at the table, column, or row level, encrypt all data, and keep audit logs of who is accessing the data.

Today, you can secure data using access control lists on S3 buckets or third-party encryption and access control software. You create and maintain data access, protection, and compliance policies for each analytics service requiring access to the data. For example, if you are running analysis against your data lake using Amazon Redshift and Amazon Athena, you must set up access control rules for each of these services.

Many customers use AWS Glue Data Catalog resource policies to configure and control metadata access to their data. Some choose to use Apache Ranger. But these approaches can be painful and limiting. S3 policies provide at best table-level access. And you must maintain data and metadata policies separately. With Apache Ranger, you can configure metadata access to only one cluster at a time. Also, policies can become wordy as the number of users and teams accessing the data lake grows within an organization.

Making it easy to find data

Users with different needs, like analysts and data scientists, may struggle to find and trust relevant datasets in the data lake. To make it easy for users to find relevant and trusted data, you must clearly label the data in a data lake catalog. Provide users with the ability to access and analyze this data without making requests to IT.

Today, each of these steps involves a lot of manual work. Customer labor includes building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and ML. With all these steps, a fully productive data lake can take months to implement.

The wide range of AWS services provides all the building blocks of a data lake, including many choices for storage, computing, analytics, and security. In the nearly 13 years that AWS has been operating Amazon S3 with exabytes of data, it has also become the clear first choice for data lakes. AWS Glue adds a data catalog and serverless transformation capabilities. Amazon EMR brings managed big data processing frameworks like Apache Spark and Apache Hadoop. Amazon Redshift Spectrum offers data warehouse functions directly on data in Amazon S3. Athena brings serverless SQL querying.

With all these services available, customers have been building data lakes on AWS for years. Customers run over 10,000 data lakes on top of S3, many using AWS Glue for the shared AWS Glue Data Catalog and data processing with Apache Spark.

AWS has learned from the thousands of customers running analytics on AWS that most customers who want to do analytics also want to build a data lake. But many of you want this process to be easier and faster than it is today.

AWS Lake Formation (now generally available)

At AWS re:Invent 2018, AWS introduced Lake Formation: a new managed service to help you build a secure data lake in days. If you missed it, watch Andy Jassy’s keynote announcement. Lake Formation has several advantages:

  • Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster.
  • Enforce security policies across multiple services: After your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications.
  • Gain and manage new insights: With Lake Formation, you build a data catalog that describes available datasets and their appropriate business uses. This catalog makes your users more productive by helping them find the right dataset to analyze.

The following screenshot illustrates Lake Formation and its capabilities.

How to create a data lake

S3 forms the storage layer for Lake Formation. If you already use S3, you typically begin by registering existing S3 buckets that contain your data. Lake Formation can also create new buckets for the data lake and import data into them. AWS always stores this data in your account, and only you have direct access to it.

There is no lock-in to Lake Formation for your data. Because AWS stores data in standard formats like CSV, ORC, or Parquet, it can be used with a wide variety of AWS or third-party analytics tools.

Lake Formation also optimizes the partitioning of data in S3 to improve performance and reduce costs. The raw data you load may reside in partitions that are too small (requiring extra reads) or too large (reading more data than needed). Lake Formation organizes your data by size, time, or relevant keys to allow fast scans and parallel, distributed reads for the most commonly used queries.

How to load data and catalog metadata

Lake Formation uses the concept of blueprints for loading and cataloging data. You can run blueprints one time for an initial load or set them up to be incremental, adding new data and making it available.

With Lake Formation, you can import data from MySQL, Postgres, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. You can also import from on-premises databases by connecting with Java Database Connectivity (JDBC).
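
Under the hood, a JDBC import typically references an AWS Glue connection that the blueprint workflow uses to reach the source database. The following is a minimal sketch of creating one; the connection name, endpoint, database name, and credentials are placeholders, and in practice you would keep the password in AWS Secrets Manager rather than inline.

import boto3

glue = boto3.client("glue")

# Define a JDBC connection that a blueprint workflow can reference as its data source.
glue.create_connection(
    ConnectionInput={
        "Name": "rds-mysql-source",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://mydb.example.us-east-1.rds.amazonaws.com:3306/sales",
            "USERNAME": "admin",
            "PASSWORD": "replace-me",
        },
    }
)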

Point Lake Formation to the data source, identify the location to load it into the data lake, and specify how often to load it. The blueprint discovers the source table schema, automatically converts data to the target data format, partitions the data based on the partitioning schema, and tracks data that was already processed. All these actions can be customized.

Blueprints rely on AWS Glue as a support service. AWS Glue crawlers connect to and discover the raw data to be ingested. AWS Glue code generation and jobs generate the ingest code to bring that data into the data lake. Lake Formation uses the same data catalog for organizing the metadata. AWS Glue stitches together crawlers and jobs and allows monitoring of individual workflows. In these ways, Lake Formation is a natural extension of AWS Glue capabilities.

The following graphics show the Blueprint Workflow and Import screens:

How to transform and prepare data for analysis

In addition to supporting all the same ETL capabilities as AWS Glue, Lake Formation introduces new Amazon ML Transforms. This feature includes a fuzzy logic blocking algorithm that can de-duplicate 400M+ records in less than 2.5 hours, which is orders of magnitude better than earlier approaches.

To match and de-duplicate your data using Amazon ML Transforms, first merge related datasets. Amazon ML Transforms divides these sets into training and testing samples, then scans for exact and fuzzy matches. You can provide more data and examples for greater accuracy, putting these into production to process new data as it arrives in your data lake. The partitioning algorithm requires minimal tuning. The confidence level reflects the quality of the grouping, improving on earlier, more ad hoc approaches. The following diagram shows this matching and de-duplicating workflow.

Amazon.com is currently using and vetting Amazon ML Transforms internally, at scale, for retail workloads. Lake Formation now makes these algorithms available to customers, so you can avoid the frustration of creating complex and fragile SQL statements to handle record matching and de-duplication. Amazon ML Transforms help improve data quality before analysis. For more information, see Fuzzy Matching and Deduplicating Data with Amazon ML Transforms for AWS Lake Formation.
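
These transforms are surfaced through the AWS Glue FindMatches API. The following sketch shows how such a transform might be defined; the transform name, table, primary key column, role, and tuning values are illustrative assumptions rather than values from this post.

import boto3

glue = boto3.client("glue")

# Create a FindMatches ML transform for de-duplicating a catalog table.
glue.create_ml_transform(
    Name="dedupe-customer-records",
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
    InputRecordTables=[{"DatabaseName": "sales-db", "TableName": "customers"}],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            "PrecisionRecallTradeoff": 0.9,  # favor precision over recall
            "AccuracyCostTradeoff": 0.5,     # balance accuracy against cost
        },
    },
    GlueVersion="1.0",
    MaxCapacity=10.0,
)

After the transform is trained with labeled examples, a Glue ETL job can apply it to flag or merge matching records before they land in the curated part of the data lake.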

How to set access control permissions

Lake Formation lets you define policies and control data access with simple “grant and revoke permissions to data” sets at granular levels. You can assign permissions to IAM users, roles, groups, and Active Directory users using federation. You specify permissions on catalog objects (like tables and columns) rather than on buckets and objects.

You can easily view and audit all the data policies granted to a user—in one place. Search and view the permissions granted to a user, role, or group through the dashboard; verify permissions granted; and when necessary, easily revoke policies for a user. The following screenshots show the Grant permissions console:
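
Beyond the console, the same audit-and-revoke flow can be scripted against the Lake Formation API. A minimal sketch; the principal ARN and table names are placeholders.

import boto3

lf = boto3.client("lakeformation")
user1_arn = "arn:aws:iam::111122223333:user/user1"

# List granted permissions visible to the caller and filter for a single principal.
for grant in lf.list_permissions()["PrincipalResourcePermissions"]:
    if grant["Principal"]["DataLakePrincipalIdentifier"] == user1_arn:
        print(grant["Resource"], grant["Permissions"])

# Revoke user1's SELECT access on a specific table when it is no longer appropriate.
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": user1_arn},
    Resource={"Table": {"DatabaseName": "zipcode-db", "Name": "zipcode"}},
    Permissions=["SELECT"],
)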

How to make data available for analytics

Lake Formation offers unified, text-based, faceted search across all metadata, giving users self-serve access to the catalog of datasets available for analysis. This catalog includes discovered schemas (as discussed previously) and lets you add attributes like data owners, stewards, and other business-specific attributes as table properties.

At a more granular level, you can also add data sensitivity level, column definitions, and other attributes as column properties. You can explore data by any of these properties. But access is subject to user permissions. See the following screenshot of the AWS Glue tables tab:

How to monitor activity

With Lake Formation, you can also see detailed alerts in the dashboard, and then download audit logs for further analytics.

Lake Formation publishes all data ingestion events and catalog notifications to Amazon CloudWatch. In this way, you can identify suspicious behavior or demonstrate compliance with rules.

To monitor and control access using Lake Formation, first define the access policies, as described previously. Users who want to conduct analysis access data directly through an AWS analytics service, such as Amazon EMR for Spark, Amazon Redshift, or Athena. Or, they access data indirectly with Amazon QuickSight or Amazon SageMaker.

A service forwards the user credentials to Lake Formation for the validation of access permissions. Then Lake Formation returns temporary credentials granting access to the data in S3, as shown in the following diagrams. After a user gains access, actual reads and writes of data operate directly between the analytics service and S3. This approach removes the need for an intermediary in the critical data-processing path.

The following screenshot and diagram show how to monitor and control access using Lake Formation.

Conclusion

With just a few steps, you can set up your data lake on S3 and start ingesting data that is readily queryable. To get started, go to the Lake Formation console and add your data sources. Lake Formation crawls those sources and moves the data into your new S3 data lake.

Lake Formation can automatically lay out the data in S3 partitions; change it into formats for faster analytics, like Apache Parquet and ORC; and increase data quality through machine-learned record matching and de-duplication.

From a single dashboard, you can set up all the permissions for your data lake. Those permissions are implemented for every service accessing this data – including analytics and ML services (Amazon Redshift, Athena, and Amazon EMR for Apache Spark workloads). Lake Formation saves you the hassle of redefining policies across multiple services and provides consistent enforcement of and compliance with those policies.

Learn how to start using AWS Lake Formation.


About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

Prajakta Damle is a Principal Product Manager at Amazon Web Services.