Tag Archives: AWS Lake Formation

Anonymize and manage data in your data lake with Amazon Athena and AWS Lake Formation

Post Syndicated from Manos Samatas original https://aws.amazon.com/blogs/big-data/anonymize-and-manage-data-in-your-data-lake-with-amazon-athena-and-aws-lake-formation/

Organizations collect and analyze more data than ever before, and they are moving quickly to become more data driven by acting on the insights in that data.

Different roles use data for different purposes. For example, data engineers transform the data before further processing, data analysts access the data and produce reports, and data scientists with domain and technical expertise can train machine learning algorithms. Those roles require access to the data, and access has never been easier to grant.

At the same time, most organizations have to comply with regulations when dealing with their customer data. For that reason, datasets that contain personally identifiable information (PII) are often anonymized. Common examples of PII are columns that contain personal information about an individual (such as first name and last name) or columns that, if joined with another table, can be traced back to an individual.

You can use AWS Analytics services to anonymize your datasets. In this post, I describe how to use Amazon Athena to anonymize a dataset.  You can then use AWS Lake Formation to provide the right access to the right personas.

Use case

To better understand the concept, we use a straightforward use case: analysts in your organization need access to a dataset with sales data, some of which contains PII. As the data lake admin, you’re not comfortable with all personnel having access to customers’ PII. To address this, you can use an anonymized dataset.

This use case has two users:

  • datalake_admin – Responsible for data anonymization and making sure the right permissions are enforced. They classify the data, generate anonymized datasets, and configure the required permissions.
  • datalake_analyst – Only has access to the anonymized dataset. They can extract patterns from the data without being able to trace it back to an individual customer.

The following AWS CloudFormation template generates the AWS Glue tables that you use later in this post:

However, the template doesn’t create the datalake_admin and datalake_analyst users. For more information about personas in Lake Formation, see Lake Formation Personas and IAM Permissions Reference.

Solution architecture

For this solution, you use the following services:

  • Lake Formation – Lake Formation makes it easy to set up a secure data lake—a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The data lake admin can easily label the data and give users permission to access authorized datasets.
  • Athena – Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. For this use case, the data lake admin uses Athena to anonymize the data, after which the data analyst can use Athena for interactive analytics over anonymized datasets.
  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. For this use case, you use Amazon S3 as storage for the data lake.

The following diagram illustrates the architecture for this solution.

In this architecture, there are no servers to manage, and you pay only for what you use. You can use the same solution for small or large datasets; scaling happens behind the scenes, transparently.

The following sections look in more detail at how to do the following:

  • Label sensitive data with AWS Lake Formation
  • Anonymize data with Athena
  • Apply permissions with Lake Formation
  • Analyze the anonymized datasets

Labeling the sensitive data with Lake Formation

As a data lake admin, the first task is to label the personal information. Tags don’t enforce any security controls, but applying a good tagging strategy is a great way to describe the data. Tags are key-value pairs that you can apply to your AWS resources, including tables and columns in your data lake. For this use case, you apply a very simple tagging strategy: for each column that contains PII, you add the key Classification with the value PII.

You interact with the following tables from the tcp-ds dataset, which both have their data stored in Amazon S3 in CSV format:

  • store_sales – Stores sales data and references other tables that you can join together for more sophisticated business queries. The table has a foreign key to the customer table on the ss_customer_sk column. This key, when joined with the customer table, can uniquely identify a user. For that reason, treat this column as personal information.
  • customer – Stores customer data, a lot of which is PII. In addition to c_customer_sk, you could use data such as the customer ID (c_customer_id), first name (c_first_name), last name (c_last_name), login (c_login), and email address (c_email_address) to uniquely identify a customer.

To start tagging your columns (starting with the store_sales table), complete the following steps:

  1. As the data lake admin user, log in to the Lake Formation console.
  2. Under Data Catalog, choose Tables.
  3. Select store_sales.
  4. Choose Edit schema.
  5. Select the column you want to edit (ss_customer_sk).
  6. Choose Edit.
  7. For Key, enter Classification.
  8. For Value, enter PII.
  9. Choose Save.

To verify that the column properties were applied, view the table properties (the same description is also available through the Lake Formation API).

  1. On the Tables page, select store_sales.
  2. Choose View properties.

The table properties look like the following JSON object:

{
  "Name": "store_sales",
  "DatabaseName": "tcp-ds-1tb",
  "Owner": "owner",
  "CreateTime": "2019-09-13T10:15:04.000Z",
  "UpdateTime": "2020-03-18T16:10:34.000Z",
  "LastAccessTime": "2019-09-13T10:15:03.000Z",
  "Retention": 0,
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "ss_sold_date_sk",
        "Type": "bigint",
        "Parameters": {}
      },
      ...
      {
        "Name": "ss_customer_sk",
        "Type": "bigint",
        "Parameters": {
          "Classification": "PII"
        }
      },
      ...
}

The additional column properties are now in the table metadata.
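If you prefer to check this programmatically, the following is a minimal boto3 sketch that reads the same table description through the AWS Glue get_table API (the database name matches the example output above):

import boto3

glue = boto3.client("glue")

# Retrieve the table definition and print any column-level parameters
table = glue.get_table(DatabaseName="tcp-ds-1tb", Name="store_sales")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    if column.get("Parameters"):
        print(column["Name"], column["Parameters"])
# Expected output includes: ss_customer_sk {'Classification': 'PII'}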

  1. Repeat the preceding steps for the customer table and label the following columns:
    • c_customer_sk
    • c_customer_id
    • c_first_name
    • c_last_name
    • c_login
    • c_email_address

Adding a tag also allows you to perform metadata searches by tag attributes. For more information, see Discovering metadata with AWS Lake Formation: Part 1 and Discover metadata with AWS Lake Formation: Part 2.

Anonymizing data with Athena

The data lake admin now needs to provide the data analyst anonymized datasets for analytics. For this use case, you want to extract patterns on the customer table and the store_sales table separately, but you also want to join the two tables so you can perform more sophisticated queries.

The first step is to create a database in Lake Formation to organize tables in AWS Glue.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Choose Create database.
  3. For Name, enter a name, such as anonymised_tcp_ds_1tb.
  4. Optionally, enter an Amazon S3 path for the database and a description.
  5. Choose Create database.

The next step is to create the tables that contain the anonymized data. Before you do so, consider the significance of each column from an analytics point of view. For columns that have little or no analytical value, omitting the column altogether might be the right approach. Other columns are used as keys to join with other tables; to make sure the tables remain joinable after anonymization, apply a hash function to those foreign keys.

A common approach to anonymize sensitive information is hashing. A hash function is any function that you can use to map data of arbitrary size to fixed-size values. For more information, see Hash function.

The following table summarizes your strategy for each column.

Table        Column            Strategy
customer     c_first_name      hash
customer     c_last_name       hash
customer     c_login           omit
customer     c_customer_id     hash
customer     c_email_address   omit
customer     c_customer_sk     hash
store_sales  ss_customer_sk    hash

A hash function always returns the same result for the same input value. In addition, unlike encryption, hashing can’t be reversed.
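To make the determinism property concrete, here is a small Python sketch using hashlib; the sample value is made up:

import hashlib

def anonymize(value: str) -> str:
    # SHA-256 always maps the same input to the same fixed-size digest
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

print(anonymize("12345") == anonymize("12345"))  # True, so joins on hashed keys still work
print(anonymize("12345"))                        # a 64-character hex digest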

  1. Use Athena string functions to hash individual columns and generate anonymized datasets.
  2. After you create those datasets, you can use Lake Formation to apply security controls. See the following code:
CREATE table "tcp-ds-anonymized".customer
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/2/customer_parquet/')
AS SELECT       
         sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
         sha256(to_utf8(cast(c_customer_id AS varchar))) AS c_customer_id_anonym,
         sha256(to_utf8(cast(c_first_name AS varchar))) AS c_first_name_anonym,
         sha256(to_utf8(cast(c_last_name AS varchar))) AS c_last_name_anonym,
         c_current_cdemo_sk,
         c_current_hdemo_sk,
         c_first_shipto_date_sk,
         c_first_sales_date_sk,
         c_salutation,
         c_preferred_cust_flag,
         c_current_addr_sk,
         c_birth_day,
         c_birth_month,
         c_birth_year,
         c_birth_country,
         c_last_review_date_sk
FROM customer
  1. To preview the data, enter the following code:
SELECT c_first_name_anonym, c_last_name_anonym FROM "tcp-ds-anonymized"."customer" limit 10;

The following screenshot shows the output of your query.

  1. To repeat these steps for the store_sales table, enter the following code:
CREATE table "tcp-ds-anonymized".store_sales
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/1/store_sales/')
AS SELECT sha256(to_utf8(cast(ss_customer_sk AS varchar))) AS ss_customer_sk_anonym,
         ss_sold_date_sk,
         ss_sales_price,
         ss_sold_time_sk,
         ss_item_sk,
         ss_hdemo_sk,
         ss_addr_sk,
         ss_store_sk,
         ss_promo_sk,
         ss_ticket_number,
         ss_quantity,
         ss_wholesale_cost,
         ss_list_price,
         ss_ext_discount_amt,
         ss_external_sales_price,
         ss_ext_wholesale_cost,
         ss_ext_list_price,
         ss_ext_tax,
         ss_coupon_amt,
         ss_net_paid,
         ss_net_paid_inc_tax,
         ss_net_profit
FROM store_sales;

One of the challenges you need to overcome when working with CTAS queries is that the query’s Amazon S3 location must be unique for the table you’re creating. You can add an incremental value or timestamp to the table’s path, for example s3://<bucket>/<table_name>/<version>, and use a different version number every time.

You can delete older data programmatically using Amazon S3 APIs or SDK. You can also use Amazon S3 lifecycle configuration to tell Amazon S3 to transition objects to another Amazon S3 storage class. For more information, see Object lifecycle management.
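For example, the following is a hedged boto3 sketch that expires objects under an older table version prefix after 30 days. The bucket name matches the CTAS examples in this post; the prefix and retention period are assumptions.

import boto3

s3 = boto3.client("s3")

# Expire objects under an old anonymized table version after 30 days (assumed retention).
# Note: this call replaces any existing lifecycle configuration on the bucket.
s3.put_bucket_lifecycle_configuration(
    Bucket="tcp-ds-eu-west-1-1tb-anonymised",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-anonymized-versions",
                "Filter": {"Prefix": "1/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)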

You can automate running the anonymization CTAS queries with AWS Glue jobs. AWS Glue provides a lightweight Python shell job option that can call the Amazon Athena API programmatically.
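The following is a minimal sketch of such a Python shell job. The source database, results bucket, and column list are assumptions, and because a CTAS query can’t overwrite an existing table, the sketch versions both the table name and its external location:

import time
import boto3

athena = boto3.client("athena")

# Version the output path and table name so each run writes to a unique location
version = time.strftime("%Y%m%d%H%M%S")
location = f"s3://tcp-ds-eu-west-1-1tb-anonymised/{version}/customer_parquet/"

ctas = f"""
CREATE TABLE "tcp-ds-anonymized".customer_{version}
WITH (format = 'parquet', external_location = '{location}')
AS SELECT sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
          c_birth_country
          -- remaining columns as in the CTAS statement above
FROM customer
"""

execution = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "tcp-ds-1tb"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-query-results/"},
)

# Poll until the CTAS query finishes
while True:
    state = athena.get_query_execution(
        QueryExecutionId=execution["QueryExecutionId"]
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
print(state)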

Applying permissions with Lake Formation

Now that you have the table structures and anonymized datasets, you can apply the required permissions using Lake Formation.

  1. On the Lake Formation console, under Data Catalog, choose Tables.
  2. Select the tables that contain the anonymized data.
  3. From the Actions drop-down menu, under Permissions, choose Grant.
  4. For IAM users and roles, choose the IAM user for the data analyst.
  5. For Table permissions, select Select.
  6. Choose Grant.

You can now view all table permissions and verify the permissions granted to a particular principal.

Analyzing the anonymized datasets

To verify that the role can access the right tables and query the anonymized datasets, complete the following steps:

  1. Sign in to the AWS Management Console as the data analyst.
  2. Under Analytics, choose Amazon Athena.

You should see a query field, similar to the following screenshot.

You can now test your access with queries. To see the top anonymized last names by total revenue, enter the following code:

SELECT c_last_name_anonym,
sum(ss_sales_price) AS total_sales
FROM store_sales
JOIN customer
ON store_sales.ss_customer_sk_anonym = customer.c_customer_sk_anonym
GROUP BY c_last_name_anonym
ORDER BY total_sales DESC limit 10;

The following screenshot shows the query output.

You can also try to query a table that you don’t have access to. You should receive an error message.

Conclusion

Anonymizing a dataset is often a prerequisite before users can start analyzing it. In this post, we discussed how data lake admins can use Athena and Lake Formation to label and anonymize data stored in Amazon S3. You can then use Lake Formation to apply permissions to the dataset and allow other users to access the data.

The services we discussed in this post are serverless. Building serverless applications means that your developers can focus on their core product instead of worrying about managing and operating servers or runtimes, either in the cloud or on-premises. This reduced overhead lets developers reclaim time and energy that they can spend on developing great products that scale and that are reliable.

 


About the Author

Manos Samatas is a Specialist Solutions Architect in Big Data and Analytics with Amazon Web Services. Manos lives and works in London. He specialises in architecting big data and analytics solutions for public sector customers in the EMEA region.

Enable fine-grained data access in Zeppelin Notebook with AWS Lake Formation

Post Syndicated from Behram Irani original https://aws.amazon.com/blogs/big-data/enable-fine-grained-data-access-in-zeppelin-notebook-with-aws-lake-formation/

This post explores how you can use AWS Lake Formation integration with Amazon EMR (still in beta) to implement fine-grained column-level access controls while using Spark in a Zeppelin Notebook.

My previous post Extract Salesforce.com data using AWS Glue and analyzing with Amazon Athena showed you a simple use case for extracting any Salesforce object data using AWS Glue and Apache Spark, saving it to Amazon Simple Storage Service (Amazon S3), cataloging the data using the Data Catalog in Glue, and querying it using Amazon Athena.

Preparing your data

For simplicity of setup and to build on the concept of fine-grained access control of data, you use the same data that you extracted from the Salesforce account object in the post Extract Salesforce.com data using AWS Glue and analyzing with Amazon Athena. Follow all the steps from the preceding post to create a table called sfdc_output, which you can query in Athena and see all the fields of the account object.

In the following sections, you see how to restrict access to only a select set of columns in this table for a user who queries this data using Spark SQL in Zeppelin Notebook.

Setting up Lake Formation

Lake Formation aims to simplify and accelerate the creation of data lakes. Amazon EMR integrates with Lake Formation and its security model to allow fine-grained access control on databases, tables, and columns defined in the Data Catalog for data stored in Amazon S3. Users authenticate against third-party identity providers (IdPs) through SAML, and the principal is used to determine if the user has the appropriate access to the columns within a table and partitions in the Data Catalog.

Lake Formation provides its own permissions model that augments the AWS Identity and Access Management (IAM) permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant and revoke mechanism.

The following diagram illustrates the workflow.

In the preceding flow, you still authenticate the principal at the IdP, and use the IAM policy to authorize access to AWS resources. Additionally, you use Lake Formation to authorize data access. When a principal attempts to run a query in Amazon EMR against a table set up with Lake Formation, Amazon EMR requests temporary credentials for data access from Lake Formation. Lake Formation returns temporary credentials and allows data access.

For more information about setting up Lake Formation, see Setting Up AWS Lake Formation. For this use case, you want to enable the integration of Amazon EMR with Lake Formation so you can use Zeppelin Notebook to see the fine-grained data access controls in action. For this post, I’ve configured the authentication module using the third-party SAML provider Auth0. You can also use Okta or Active Directory Federation Services (ADFS) to set up authentication with the IdP of your choice. For more information about setting up IdP and to launch Amazon EMR with an AWS CloudFormation stack, see Integration with Amazon EMR.

Granting fine-grained access with Lake Formation

For the purpose of demonstrating fine-grained data access, I created a user called developer in Auth0. Suppose that for your table sfdc_output, you don’t want to give this user access to certain billing-related fields. Complete the following steps:

  1. On the Lake Formation console, choose Data permissions.
  2. For Active Directory users and groups, enter the ARN for the user developer.
  3. For Database, choose default.
  4. For Table, choose sfdc_output.
  5. For Column, choose Include columns.
  6. For Include columns, choose sic, name, accountnumber, and type.

These are the specific columns from this table that you want the developer user to have access to.

  1. For Table permissions, select Select.
  2. Choose Grant.

After you grant specific permissions to the developer user, you have to remove the pass-through access by revoking the Super permission given to the IAMAllowedPrincipals group. For backward compatibility, Lake Formation passes through IAM permissions on existing Data Catalog tables via this group; revoking its Super permission lets Lake Formation enforce the specific grants you just defined alongside IAM permissions.

  1. On the Lake Formation console, choose Permissions.
  2. Select IAMAllowedPrincipals for sfdc_output.
  3. Choose Revoke.
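If you prefer to script this step, the following is a hedged boto3 sketch. IAM_ALLOWED_PRINCIPALS is the identifier the API uses for this group, Super access corresponds to the ALL permission, and the default database name is an assumption:

import boto3

lf = boto3.client("lakeformation")

# Revoke the pass-through Super (ALL) access from the IAMAllowedPrincipals group
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
    Resource={"Table": {"DatabaseName": "default", "Name": "sfdc_output"}},
    Permissions=["ALL"],
)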

You also need to register the Amazon S3 location in Lake Formation where the table data resides.

  1. On the Lake Formation console, choose Data lake locations.
  2. Select the Amazon S3 path for your table.
  3. Choose Register location.

Running a query

You can now test the restrictions by running a query.

Log in to the Zeppelin console using its URL. To access Zeppelin Notebook, you must first ensure that your cluster’s master security group allows access to the proxy agent (port 8442) from your desktop; do not open your EMR master node to the public (0.0.0.0/0 or ::0). The Zeppelin URL redirects you to the IdP for login and authentication with the developer user credentials. After authentication is complete, create a new notebook and run a Spark SQL query against the sfdc_output table.
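For example, a notebook paragraph like the following sketch (assuming the table is in the default database) returns only the permitted columns:

%pyspark
# Even a SELECT * returns only the columns granted in Lake Formation
df = spark.sql("SELECT * FROM default.sfdc_output")
df.printSchema()   # only sic, name, accountnumber, and type appear
df.show(10)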

The following screenshot shows that even though the developer user queried for the full table, only the columns that you granted them access to in Lake Formation are visible.

As an additional exercise, you can create another user in IdP and give a different set of column access in Lake Formation to that user. Query the table again by logging in as that user and observe the corresponding security mechanism being applied.

Querying with Jupyter Notebook

If you want to use Jupyter Notebook, you can spin up an Amazon EMR notebook, attach it to the running cluster, and run the same query in it. The following screenshot shows that the results are the same.

Conclusion

This post showed how Lake Formation provides fine-grained, column-level access to tables in the Data Catalog using Spark. It enables federated single sign-on to Apache Zeppelin or Amazon EMR notebooks from your enterprise identity system that is compatible with SAML 2.0.

You can try this solution for your own use cases. If you have comments or feedback, please leave them below.

 


About the Authors

Behram Irani is a Senior Data Architect at Amazon Web Services.

 

 

 

 

Rahul Sonawane is a Senior Consultant, Big Data at Amazon Web Services.

 

Control data access and permissions with AWS Lake Formation and Amazon EMR

Post Syndicated from Nabil Ezzarhouni original https://aws.amazon.com/blogs/big-data/control-data-access-and-permissions-with-aws-lake-formation-and-amazon-emr/

What if you could control the access to your data lake, centrally? Would it be more convenient to share specific data securely with internal and external customers?

With AWS Lake Formation and its integration with Amazon EMR, you can easily perform these administrative tasks.

This post goes through a use case and reviews the steps to control the data access and permissions of your existing data lake. Before you get started, review the following:

Use case

Let’s assume your company has augmented their on-premises infrastructure with AWS. Because your data center has a fixed capacity for the company’s analytics and machine learning needs, you’re using the cloud for additional compute and storage. AWS Direct Connect links your data center to the closest AWS Region. Because your Active Directory server is still on premises, you use an Active Directory Connector to federate user authentication. For cost optimization and agility, you built a data lake to centralize business data to Amazon Simple Storage Service (Amazon S3) using Lake Formation.

Your organization is looking to improve their data analytics function and has hired external data analysis consultants. Based on the least privilege best practice, you want to only share the relevant data with the external consultants, without any personally identifiable information (PII) such as name, date of birth, and Social Security number.

You’re concerned about third-party data access in the cloud. You want a solution where the data is safe, controlled, audited, encrypted, and secure. You want to restrict access at the column level, so the PII isn’t available to the external consultants.

In addition, you want to restrict the consultants’ access to your cloud resource. They should only access the EMR cluster with a specific AWS Identity and Access Management (IAM) role.

The following diagram illustrates the architecture for this use case.

The external consultants authenticate to AWS resources against the on-premises, SAML-compatible directory service, which is federated with IAM. You control the access to the cloud resources from your on-premises identity provider (IdP). For more information, see About SAML 2.0-based Federation.

Lake Formation manages data access. The data lake administrator defines the data access at the column level for any principal defined in Lake Formation. The principal can be a user that you federate with the on-premises directory service. For this use case, the principal is a specific role for the external consultants that allows them to only access the EMR cluster.

Because you defined fine-grained Lake Formation permissions on your data, the external consultants can’t access employee first names, last names, or Social Security numbers. They only have access to the non-PII columns. This measure is called pseudonymization: the remaining data can’t be attributed to an individual without additional information. Pseudonymization has the following benefits:

  • Centralized authentication and user and data access governance
  • Less management overhead and security improvement because there is a canonical source of authentication
  • The consultant uses an IAM role allowing only Lake Formation data access and the instance profile role associated with the EMR cluster

You don’t need to manage access to Amazon S3; access is centralized from Lake Formation. If you want to share the data lake data with more users, you only need to define it one time in Lake Formation.

In the next section, you see how to implement this solution.

Creating a data lake

Before you get started, create a data lake. You can control the access to the data lake with policies and permissions, and define permissions at the database, table, or column level.

When creating the database, complete the following steps to enable fine-grained access control with Lake Formation permissions.

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database.

  1. For Name, enter a name for your database.
  2. Deselect Use only IAM access control for new tables in this database.

 This step enables fine-grained access control with Lake Formation permissions.

  1. Choose Create database.

Adjusting the permissions

  1. Under Permissions, choose Data permissions.
  2. Choose Grant.

  1. For IAM users and roles, choose which specific IAM users and roles can access the data lake.

These accounts can be federated with AWS through your SAML 2.0-compatible IdP, so you can control access from your on-premises Active Directory.

You can also define Active Directory users and groups directly, but only in the context of Amazon EMR integration with Lake Formation. For more information, see Amazon EMR Integration With AWS Lake Formation Is Now In Beta, Supporting Database, Table, and Column-level access controls for Apache Spark and Amazon EMR Components.

  1. For Database, choose your database.
  2. For Table, choose your table.
  3. For Columns, choose Exclude columns.
  4. For Exclude columns, choose which columns to exclude (for this use case, first name, last name, and ssn).
  5. For Table permissions, select Select.

This feature allows you to control column-level access for an IAM user or a role.

  1. Choose Grant.
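The same column-exclusion grant can also be scripted. The following is a hedged boto3 sketch in which the role ARN, database, table, and excluded column names are placeholders for this use case:

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on all columns except the PII columns (names are placeholders)
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/external-consultant-role"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "your_database",
            "Name": "your_table",
            "ColumnWildcard": {"ExcludedColumnNames": ["first_name", "last_name", "ssn"]},
        }
    },
    Permissions=["SELECT"],
)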

Integrating Lake Formation with Amazon EMR

From Amazon EMR version 5.26 onwards, you can launch an EMR cluster that integrates with Lake Formation. Amazon EMR can only access specific columns or data based on the permissions defined in Lake Formation. For more information, see Architecture of SAML-Enabled Single Sign-On and Fine-Grained Access Control.

A key requirement is to have an external IdP, such as Microsoft Active Directory, Okta, or Auth0, defined for the EMR cluster. The benefit is that you can use your existing enterprise directory, compliance, and audits to govern the data access to Lake Formation. For instructions, see Integrating Amazon EMR with AWS Lake Formation (Beta).

When integration is complete, the consultants can consume the data from Amazon EMR via Zeppelin or Apache Spark, without accessing the PII.

Additional security measures

As with most AWS services, Amazon EMR and Lake Formation use IAM features. With IAM, you can define IAM users or roles to grant access to other AWS services and data.

On top of this security model, AWS CloudTrail tracks all AWS API requests. You can use this audit trail for governance and compliance purposes because you can trace all usage of AWS resources.

To protect the data, you can use in-transit and at-rest encryption. You can also define specific security configurations to apply to your EMR cluster. For more information, see Encryption Options.

For additional security services, you can use Amazon GuardDuty (a threat detection service) and Amazon Macie (data discovery and protection at scale). For more information, see Security, Identity, and Compliance on AWS.

Conclusion

Data usage has grown rapidly in terms of formats and sizes. Managing different technologies (relational databases, NoSQL, graph, flat files, and more) adds management overhead and costs. As data sizes soar along with compute and storage needs, constant competition pushes organizations to move with more agility and speed.

In addition, you may be sharing business data with more internal and external customers, which increases the complexity of data governance and adds to the burden of permission and access management.

In this post, we explained how you can control the access and permission of your data lake. AWS allows you to support your organization’s growth and security because data access is controlled, encrypted, and audited. This is possible in the context of a hybrid infrastructure with an on-premises IdP controlling access to AWS and other resources.

You can start your data lake project and share data with different roles in your organization in a scalable and secure way. It takes minutes instead of months, without mobilizing a large engineering effort.

As a next step, you can use the curated data as the foundation for machine learning projects with AWS Machine Learning tools.

 


About the Authors

Nabil Ezzarhouni is a Partner Solutions Architect with AWS. His interests are DevOps, Machine Learning and his dog Bandit (he’s a born-again pet lover).

 

 

 

 

 

Pawan Matta is a Solutions Architect with AWS. Pawan enjoys working with customers and helping them in the areas of storage and migration. During his free time, Pawan loves watching cricket and playing video games with friends.

 

 

 

Enable fine-grained permissions for Amazon QuickSight authors in AWS Lake Formation

Post Syndicated from Adnan Hasan original https://aws.amazon.com/blogs/big-data/enable-fine-grained-permissions-for-amazon-quicksight-authors-in-aws-lake-formation/

We’re excited to announce the integration of Amazon QuickSight with the AWS Lake Formation security model, which provides fine-grained access control for QuickSight authors. Data lake administrators can now use the Lake Formation console to grant QuickSight users and groups permissions to AWS Glue Data Catalog databases, tables, and Amazon Simple Storage Service (Amazon S3) buckets that are registered and managed via Lake Formation.

This new feature enhances the fine-grained access control capability previously introduced in QuickSight, which allows admins to use AWS Identity and Access Management (IAM) policies to scope down QuickSight author access to Amazon S3, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift. The scope-down access is enforced by attaching IAM policies to the QuickSight user or a group in the QuickSight portal. For more information, see Introducing Amazon QuickSight fine-grained access control over Amazon S3 and Amazon Athena.

For Athena-based datasets, you’re no longer required to use IAM policies to scope down QuickSight author access to Amazon S3, or Data Catalog databases and tables. You can grant permissions directly in the Lake Formation console. An added benefit is that you can also grant column-level permissions to the QuickSight users and groups. Lake Formation handles all this for you centrally.

This feature is currently available in the QuickSight Enterprise edition in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)

It will soon be available in all Regions where Lake Formation exists as of this post. For more information, see Region Table.

This post compares the new fine-grained permissions model in Lake Formation to the IAM policy-based access control in QuickSight. It also provides guidance on how to migrate fine-grained permissions for QuickSight users and groups to Lake Formation.

QuickSight fine-grained permissions vs. Lake Formation permissions

In QuickSight, you can limit user or group access to AWS resources by attaching a scope-down IAM policy. If no such policies exist for a user or a group (that the user is a member of), QuickSight service role permissions determine access to the AWS resources. The following diagram illustrates how permissions work for a QuickSight user trying to create an Athena dataset.

With the Lake Formation integration, the permissions model changes slightly. The two important differences while creating an Athena dataset are:

  • Users can view the Data Catalog resources (databases and tables) that have one of the following:
    1. The IAMAllowedPrincipal security group is granted Super permission to the resource in Lake Formation.
    2. An ARN for the QuickSight user or group (that the user is a member of) is explicitly granted permissions to the resource in Lake Formation.
  • If the S3 source bucket for the Data Catalog resource is registered in Lake Formation, Amazon S3 access settings in QuickSight are ignored, including scope-down IAM policies for users and groups.

The following diagram shows the change in permission model when a QuickSight user tries to create an Athena dataset.

The following sections dive into how fine-grained permissions work in QuickSight and how you can migrate the existing permissions to the Lake Formation security model.

Existing fine-grained access control in QuickSight

For this use case, a business analyst in the marketing team, lf-qs-author, created an Athena dataset Monthly Sales in QuickSight. It was built using the month_b2bsalesdata table in AWS Glue and the data in the S3 bucket b2bsalesdata.

The following screenshot shows the table details.

The following screenshot shows the dataset details.

The dataset is also shared with a QuickSight group analystgroup. See the following screenshot of the group details.

A fine-grained IAM policy enforces access to the S3 bucket b2bsalesdata for lf-qs-author and analystgroup. The following code is an example of an Amazon S3 access policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::"
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::b2bsalesdata"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::b2bsalesdata/"
            ]
        }
    ]
}

Enabling QuickSight permissions in Lake Formation

To migrate QuickSight permissions to Lake Formation, follow these steps in order:

1.) Capturing the ARN for the QuickSight user and group

First, capture the QuickSight ARN for the business analyst user and marketing team group. You can use the describe-user API and the describe-group API to retrieve the user ARN and the group ARN, respectively. For example, to retrieve the ARN for the QuickSight group analystgroup, enter the following code in the AWS Command Line Interface (AWS CLI):

aws quicksight describe-group --group-name 'analystgroup' --aws-account-id 253914981264 --namespace default

Record the group ARN from the response, similar to the following code:

{
 "Status": 200,
 "Group": {
 "Arn": "arn:aws:quicksight:us-east-1:253914981264:group/default/analystgroup",
 "GroupName": "analystgroup",
 "PrincipalId": "group/d-906706bd27/3095e3ab-e901-479b-88da-92f7629b202d"
 },
 "RequestId": "504ec460-2ceb-46ca-844b-a33a46bc7080"
}

Repeat the same step to retrieve the ARN for the business analyst lf-qs-author.

2.) Granting permissions in the data lake

To grant permissions to the month_b2bsalesdata table in salesdb, complete the following steps:

  1. Sign in to the Lake Formation console as the data lake administrator.

A data lake administrator can grant any principal (IAM, QuickSight, or Active Directory) permissions to Data Catalog resources (databases and tables) or data lake locations in Amazon S3. For more information about creating a data lake administrator and the data lake security model, see AWS Lake Formation: How It Works.

  1. Choose Tables.
  2. Select month_b2bsalesdata.
  3. From the Actions drop-down menu, choose View permissions.

You see a list of principals with associated permissions for each resource type.

  1. Choose Grant.
  2. For Active Directory and Amazon QuickSight users and groups, enter the QuickSight user ARN.
  3. For Table permissions, select Select.
  4. Optionally, under Column permissions, you can grant column-level permissions to the user. This is a benefit of using Lake Formation permissions over QuickSight policies.
  5. Choose Grant.

  1. Repeat the preceding steps to grant select table permissions to analystgroup, using the ARN you recorded earlier.
  2. Select month_b2bsalesdata.
  3. From the Actions drop-down menu, choose View permissions.

The following screenshot shows the added permissions for the QuickSight user and group.

3.) Removing IAMAllowedPrincipal group permissions

For Lake Formation permissions to take effect, you must remove the IAMAllowedPrincipal group from the month_b2bsalesdata table.

  1. Select month_b2bsalesdata.
  2. From the Actions drop-down menu, choose View permissions.
  3. Select IAMAllowedPrincipals.
  4. Choose Revoke.

  1. In the confirmation dialog, choose Revoke again.

4.) Registering your S3 bucket in Lake Formation

You can now register the S3 source bucket (b2bsalesdata) in Lake Formation. Registering the S3 bucket switches Amazon S3 authorization from QuickSight scope-down policies to Lake Formation security.

  1. Choose Data lake locations.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path for your source bucket (s3://b2bsalesdata).
  4. For IAM role, choose the role with permissions to that bucket.
  5. Choose Register location.
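If you script the registration instead, a minimal boto3 sketch follows. The role ARN is a placeholder; alternatively, you can pass UseServiceLinkedRole=True to let Lake Formation use its service-linked role:

import boto3

lf = boto3.client("lakeformation")

# Register the source bucket so Lake Formation can vend credentials for it
lf.register_resource(
    ResourceArn="arn:aws:s3:::b2bsalesdata",
    RoleArn="arn:aws:iam::253914981264:role/LakeFormationS3AccessRole",  # placeholder role name
)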

5.) Cleaning up the scope-down policies in QuickSight

You can now remove the scope-down policies for the user and group in QuickSight. To find these policies, under Security and Permissions, choose IAM policy assignments.

6.) Creating a dataset in QuickSight

To create a dataset, complete the following steps:

  1. Log in to QuickSight as a user who is a member of analystgroup (someone besides lf-qs-author).
  2. Choose Manage data.
  3. Choose New data set.
  4. Choose Athena.
  5. For the data source name, enter Marketing Data.
  6. Choose Create data source.
  7. In the list of databases, choose salesdb.
  8. Choose month_b2bsalesdata.
  9. Choose Edit/Preview data.

The following screenshot shows the details of month_b2bsalesdata table.

You can also use custom SQL to query the data.

Conclusion

This post demonstrates how to extend the Lake Formation security model to QuickSight users and groups, which allows data lake administrators to manage data catalog resource permissions centrally from one console. As organizations embark on the journey to secure their data lakes with Lake Formation, having the ability to centrally manage fine-grained permissions for QuickSight authors can extend the data governance and enforcement of security controls at the data consumption (business intelligence) layer. You can enable these fine-grained permissions for QuickSight users and groups at the database, table, or column level, and they’re reflected in the Athena dataset in QuickSight.

Start migrating your fine-grained permissions to Lake Formation today, and leave your thoughts and questions in the comments.

 


About the Author

Adnan Hasan is a Solutions Architect with Amazon QuickSight at Amazon Web Services.

 

Enforce column-level authorization with Amazon QuickSight and AWS Lake Formation

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/enforce-column-level-authorization-with-amazon-quicksight-and-aws-lake-formation/

Amazon QuickSight is a fast, cloud-powered, business intelligence service that makes it easy to deliver insights and integrates seamlessly with your data lake built on Amazon Simple Storage Service (Amazon S3). QuickSight users in your organization often need access to only a subset of columns for compliance and security reasons. Without having a proper solution to enforce column-level security, you have to develop additional solutions, such as views, data masking, or encryption, to enforce security.

QuickSight accounts can now take advantage of AWS Lake Formation column-level authorization to enforce granular-level access control for their users.

Overview of solution

In this solution, you build an end-to-end data pipeline using Lake Formation to ingest data from an Amazon Aurora MySQL database to an Amazon S3 data lake and use Lake Formation to enforce column-level access control for QuickSight users.

The following diagram illustrates the architecture of this solution.

Walkthrough overview

The detailed steps in this solution include building a data lake using Lake Formation, which uses an Aurora MySQL database as the source and Amazon S3 as the target data lake storage. You create a workflow in Lake Formation that imports a single table from the source database to the data lake. You then use Lake Formation security features to enforce column-level security for the QuickSight service on the imported table. Finally, you use QuickSight to connect to this data lake and visualize only the columns that Lake Formation has given the QuickSight user access to.

To implement the solution, you complete the following steps:

  1. Prerequisites
  2. Creating a source database
  3. Importing a single table from the source database
    • Creating a connection to the data source
    • Creating and registering your S3 bucket
    • Creating a database in the Data Catalog and granting permissions
    • Creating and running the workflow
    • Granting Data Catalog permissions
  4. Enforcing column-level security in Lake Formation
  5. Creating visualizations in QuickSight

Prerequisites

For this walkthrough, you should have the following prerequisites:

Creating a source database

In this step, create an Aurora MySQL database cluster and use the DDLs in the following GitHub repo to create an HR schema with associated tables and sample data.

You should then see the schema you created using the MySQL monitor or your preferred SQL client. For this post, I used SQL Workbench. See the following screenshot.

Record the Aurora database JDBC endpoint information; you need it in subsequent steps.

Importing a single table from the source database

Before you complete the following steps, make sure you have set up Lake Formation and met the JDBC prerequisites.

The Lake Formation setup creates a datalake_user IAM user. You need to add the same user as a QuickSight user. For instructions, see Managing User Access Inside Amazon QuickSight. For Role, choose AUTHOR.
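If you prefer the API, the following is a hedged boto3 sketch that registers the IAM user as a QuickSight author; the account ID and email address are placeholders:

import boto3

qs = boto3.client("quicksight")

# Register the datalake_user IAM user as a QuickSight AUTHOR (placeholder values)
qs.register_user(
    AwsAccountId="111122223333",
    Namespace="default",
    IdentityType="IAM",
    IamArn="arn:aws:iam::111122223333:user/datalake_user",
    Email="datalake_user@example.com",
    UserRole="AUTHOR",
)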

Creating a connection to the data source

After you complete the Lake Formation prerequisites, which include creating IAM users datalake_admin and datalake_user, create a connection in your Aurora database. For instructions, see Create a Connection in AWS Glue. Provide the following information:

  • Connection name – <yourPrefix>-blog-datasource
  • Connection type – JDBC
  • Database connection parameters – JDBC URL, user name, password, VPC, subnet, and security group
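The following is a hedged boto3 sketch of the same connection; the JDBC URL, credentials, and network settings are placeholders:

import boto3

glue = boto3.client("glue")

# Create the JDBC connection used later by the Lake Formation blueprint (placeholder values)
glue.create_connection(
    ConnectionInput={
        "Name": "yourPrefix-blog-datasource",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://your-aurora-endpoint:3306/HR",
            "USERNAME": "admin",
            "PASSWORD": "your-password",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)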

Creating and registering your S3 bucket

In this step, you create an S3 bucket named <yourPrefix>-blog-datalake, which you use as the root location of your data lake. After you create the bucket, you need to register the Amazon S3 path. Lastly, grant data location permissions.

Creating a database in the Data Catalog and granting permissions

Create a database in the Lake Formation Data Catalog named <yourPrefix>-blog-database, which stores the metadata tables. For instructions, see Create a Database in the Data Catalog.

After you create the database, you grant data permissions to the metadata tables to the LakeFormationWorkflowRole role, which you use to run the workflows.

Creating and running the workflow

In this step, you copy the EMPLOYEES table from the source database using a Lake Formation blueprint. Provide the following information:

  • Blueprint type – Database snapshot
  • Database connection – <yourPrefix>-blog-datasource
  • Source data path – HR/EMPLOYEES
  • Target database – <yourPrefix>-blog-database
  • Target storage location – <yourPrefix>-blog-datalake
  • Workflow name – <yourPrefix>-datalake-quicksight
  • IAM role – LakeFormationWorkflowRole
  • Table prefix – blog

For instructions, see Use a Blueprint to Create a Workflow.

When the workflow is ready, you can start the workflow and check its status by choosing View graph. When the workflow is complete, you can see the employee table available in your Data Catalog under <yourPrefix>-blog-database. See the following screenshot.

You can also view the imported data using Athena, which is integrated with Lake Formation. To do so, choose View data from the Actions drop-down menu. See the following screenshot.

Granting Data Catalog permissions

In this step, you provide the Lake Formation Data Catalog access to the IAM user datalake_user. This is the same user that you added in QuickSight to create the dashboard. For Database permissions, select Create table and Alter for this use case, but you can change the permission level based on your specific requirements. For instructions, see Granting Data Catalog Permissions.

When this step is complete, you see the permissions for your database <yourPrefix>-blog-database.

Enforcing column-level security in Lake Formation

Now that your table is imported into the data lake, enforce column-level security to the dataset. For this use case, you want to hide the Salary and Phone_Number columns from business intelligence QuickSight users.

  1. In the Lake Formation Data Catalog, choose Databases.
  2. From the list of databases, choose <yourPrefix>-blog-database.
  3. Choose View tables.
  4. Select blog_hr_employees.
  5. From the Actions drop-down menu, choose Grant.

  1. For Active Directory and Amazon QuickSight users and groups, provide the QuickSight user ARN.

You can find the ARN by running aws quicksight list-users --aws-account-id <your AWS account id> --namespace default --region us-east-1 in the AWS Command Line Interface (AWS CLI).

  1. For Database, choose <yourPrefix>-blog-database.
  2. For Table, choose blog_hr_employees.
  3. For Columns, choose Exclude columns.
  4. For Exclude columns, choose salary and phone_number.
  5. For Table permissions, select Select.

You should receive a confirmation on the console that says Permission granted for: datalake_user to Exclude: <yourPrefix>-blog-database.blog_hr_employees.[salary, phone_number].

You can also verify that appropriate permission is reflected for the QuickSight user on the Lake Formation console by navigating to the Permissions tab and filtering for your database and table.

You can also specify column-level permissions in the AWS CLI with the following code:

aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=<QuickSight User ARN> --permissions "SELECT" --resource '{ "TableWithColumns": {"DatabaseName":"<yourPrefix>-blog-database", "Name":"blog_hr_employees", "ColumnWildcard": {"ExcludedColumnNames": ["salary", "phone_number"]}}}'  --region us-west-2 --profile datalake_admin

Creating visualizations in QuickSight

In this step, you use QuickSight to access the blog_hr_employees table in your data lake. While accessing this dataset from QuickSight, you can see that QuickSight doesn’t show the salary and phone_number columns, which you excluded from the source table in the previous step.

  1. Log in to QuickSight using the datalake_user IAM user.
  2. Choose New analysis.
  3. Choose New dataset.
  4. For the data source, choose Athena.

  1. For your data source name, enter Athena-HRDB.
  2. For Database, choose <yourPrefix>-blog-database.
  3. For Tables, select blog_hr_employees.
  4. Choose Select.

  1. Choose Import to SPICE for quicker analysis or Directly query your data.

For this use case, choose Import to SPICE. This provides faster visualization in a production setup, and you can run a scheduled refresh to make sure your dashboards are referring to the current data. For more information, see Scheduled Refresh for SPICE Data Sets on Amazon QuickSight.

When you complete the previous steps, your data is imported into SPICE and you arrive at the QuickSight visualization dashboard. You can see that SPICE has excluded the salary and phone_number fields from the table. In the following screenshot, we created a pie chart visualization to show how many employees are present in each department.

Cleaning up

To avoid incurring future charges, delete the resources you created in this walkthrough, including your S3 bucket, Aurora cluster, and other associated resources.

Conclusion

Restricting access to sensitive data to various users in a data lake is a very common challenge. In this post, we demonstrated how to use Lake Formation to enforce column-level access to QuickSight dashboard users. You can enhance security further with Athena workgroups. For more information, see Creating a Data Set Using Amazon Athena Data and Benefits of Using Workgroups.

 


About the Author

Avijit Goswami is a Sr. Startups Solutions Architect at AWS, helping startup customers become tomorrow’s enterprises. When not at work, Avijit likes to cook, travel, watch sports, and listen to music.

 

 

Build an end to end, automated inventory forecasting capability with AWS Lake Formation and Amazon Forecast

Post Syndicated from Syed Jaffry original https://aws.amazon.com/blogs/big-data/build-an-end-to-end-automated-inventory-forecasting-capability-with-aws-lake-formation-and-amazon-forecast/

Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating product demand, inventory planning, and workforce planning.

With Forecast, there are no servers to provision or ML models to build manually. Additionally, you only pay for what you use, and there is no minimum fee or upfront commitment. To use Forecast, you only need to provide historical data for what you want to forecast, and any additional related data that may influence your forecasts. The latter may include both time-varying data, such as price, events, and weather, and categorical data, such as color, genre, or region. The service automatically trains and deploys ML models based on your data and provides you with a custom API to retrieve forecasts.
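To make that flow concrete, the following is a hedged boto3 sketch of the core Forecast calls. Every name, ARN, horizon, and frequency below is an assumption, and dataset creation and data import are omitted for brevity:

import boto3

forecast = boto3.client("forecast")
forecast_query = boto3.client("forecastquery")

# Train a predictor on an existing dataset group (AutoML picks the algorithm)
predictor = forecast.create_predictor(
    PredictorName="sales_predictor",
    ForecastHorizon=30,  # predict 30 periods ahead (assumption)
    PerformAutoML=True,
    InputDataConfig={"DatasetGroupArn": "arn:aws:forecast:us-west-2:111122223333:dataset-group/sales"},
    FeaturizationConfig={"ForecastFrequency": "D"},
)

# Generate forecasts (in practice, wait until the predictor status is ACTIVE first)
result = forecast.create_forecast(
    ForecastName="sales_forecast",
    PredictorArn=predictor["PredictorArn"],
)

# Retrieve the forecast for a single item through the query endpoint
print(forecast_query.query_forecast(
    ForecastArn=result["ForecastArn"],
    Filters={"item_id": "SKU123"},
))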

This post demonstrates how you can automate the data extraction, transformation, and use of Forecast for the use case of a retailer that requires recurring replenishment of inventory. You achieve this by using AWS Lake Formation to build a secure data lake and ingest data into it, orchestrate the data transformation using an AWS Glue workflow, and visualize the forecast results in Amazon QuickSight.

Use case background

Retailers have a recurring need to replenish inventory. For example, consider a clothing retailer that typically sells through its e-commerce and physical store channels. You need to maintain optimum levels of inventory on hand to meet demand while minimizing warehouse costs. Some of the common questions you need to answer for effective inventory management are:

  • What is the optimal quantity of inventory to reorder from my supplier for my next sales cycle?
  • What should the composition of product SKUs be in the purchase order to the supplier?
  • How do I most effectively determine the right mix and quantity of products to stock at individual retail store locations?

You can use Forecast to answer these questions. You extract data from the source systems, apply transformations to make the data ready for use in Forecast, and use Forecast to load, train, and forecast.

The following diagram shows the end-to-end system architecture of the proposed solution using Lake Formation, AWS Glue, and Amazon QuickSight.

You use Lake Formation to manage governance and access control on the data lake. Additionally, you use the following resources:

  1. Lake Formation blueprint to ingest sales data into a data lake
  2. AWS Lambda and Amazon S3 event notification to trigger an AWS Glue workflow
  3. AWS Glue workflow to trigger the execution of the data transform AWS Glue job
  4. AWS Glue workflow to orchestrate the three steps within Forecast (load, train, forecast)
  5. Forecast to export the forecast results into the data lake
  6. AWS Glue to trigger a crawler on the exported forecast results
  7. Amazon Athena and Amazon QuickSight to visualize the exported forecast results

Setting up the required IAM policies

Before you get started, you need to set up the required IAM policies. Complete the following steps:

  1. Sign in to the IAM console as a user with the AdministratorAccess AWS managed policy.
  2. Create an IAM user named report_builder to use when building your Amazon QuickSight analysis report and dashboard for visualization.
  3. Grant the AmazonAthenaFullAccess policy to the user. In the following steps, you create an IAM role for the AWS Glue jobs, crawler, and workflow to assume during their execution.
  4. On the IAM console, choose Roles.
  5. Choose Create role.
  6. On the Create role page, choose AWS service, and then choose Glue.
  7. Choose Next: Permissions.
  8. From the list of available policies, search for the AWSGlueServiceRole policy and select it.
  9. Name the role GLUE_WORKFLOW_ROLE.
  10. Choose Create role.
  11. On the Roles page, search for and choose GLUE_WORKFLOW_ROLE.
  12. On the Trust relationships tab, choose Edit trust relationship.
  13. Add the following assume role statement for Forecast:
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "forecast.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }

  14. Choose Update Trust Policy.
  15. On the role Summary page, on the Permissions tab, choose Add inline policy.
  16. Add the following inline policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "lakeformation:GetDataAccess",
                    "lakeformation:GrantPermissions"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::forecast-blog-processed/*",
                    "arn:aws:s3:::forecast-blog-published/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
      "forecast:CreateDataset",
      "forecast:CreateDatasetGroup",
      "forecast:CreateDatasetImportJob",
      "forecast:CreateForecast",
      "forecast:CreateForecastExportJob",
      "forecast:CreatePredictor",
      "forecast:DescribeDataset",
      "forecast:DescribeDatasetGroup",
      "forecast:DescribeDatasetImportJob",
      "forecast:DescribeForecast",
      "forecast:DescribeForecastExportJob",
      "forecast:DescribePredictor",
      "forecast:ListDatasetGroups",
      "forecast:ListDatasetImportJobs",
      "forecast:ListDatasets",
      "forecast:ListForecastExportJobs",
      "forecast:ListForecasts",
      "forecast:ListPredictors",
      "forecast:UpdateDatasetGroup"
                ],
                "Resource": "*"
            },
            {
                "Effect": "Allow",
                "Action": [
                    "iam:PassRole",
                    "iam:GetRole"
                ],
                "Resource": [
                    "arn:aws:iam::*:role/AWSGlueServiceRole*",
                    "arn:aws:iam::[your aws account id]:role/GLUE_WORKFLOW_ROLE"
                ],
                "Condition": {
                    "StringLike": {
                        "iam:PassedToService": [
                            "glue.amazonaws.com"
                        ]
                    }
                }
            },
            {
                "Effect": "Allow",
                "Action": [
                    "iam:PassRole",
                    "iam:GetRole"
                ],
                "Resource": "arn:aws:iam::[your aws account id]:role/GLUE_WORKFLOW_ROLE",
                "Condition": {
                    "StringEquals": {
                        "iam:PassedToService": "forecast.amazonaws.com"
                    }
                }
            }
        ]
    }

Setting up your data lake storage

Next, you need a data lake to use in this automation. You create S3 buckets for data storage and apply appropriate security and governance. If you already have a data lake on Amazon S3, you can continue to use those S3 buckets with Lake Formation.

On the Amazon S3 console, create the following buckets:

  • forecast-blog-landing (for raw data ingest)
  • forecast-blog-processed (for the transformed data)
  • forecast-blog-published (for end consumers to access results)

This post includes walkthrough instructions for the landing bucket, which you can repeat for the other two buckets.
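If you script the bucket creation, a minimal Boto3 sketch follows. Bucket names must be globally unique, so in practice you may need to add a suffix such as your account ID; the us-west-2 Region is assumed to match the sample RDS template used later in this post.

import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # Region assumed

for bucket in ["forecast-blog-landing", "forecast-blog-processed", "forecast-blog-published"]:
    # Outside us-east-1, a LocationConstraint is required
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
    )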

Enabling centralized access control on your data lake

You use Lake Formation’s centralized access control to enable access to the underlying S3 buckets for users and roles. With this model, you don’t need to create any additional IAM access policies or S3 bucket policies for your users and roles.

  1. Sign in to the AWS Lake Formation console as a data lake administrator IAM user. For instructions on setting up a data lake administrator user, see Create a Data Lake Administrator.

To manage access control from Lake Formation, you register the three S3 buckets you created earlier as data lake locations with Lake Formation.

  2. On the Lake Formation dashboard, choose Register location.
  3. For Amazon S3 path, enter your S3 bucket location (s3://forecast-blog-landing).
  4. Choose Register location.
  5. Repeat these steps for the processed and published buckets.
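The same registration can be scripted; a minimal Boto3 sketch, assuming you let Lake Formation use its service-linked role for data access:

import boto3

lf = boto3.client("lakeformation", region_name="us-west-2")  # Region assumed

for bucket in ["forecast-blog-landing", "forecast-blog-processed", "forecast-blog-published"]:
    # Register each bucket as a data lake location managed by Lake Formation
    lf.register_resource(
        ResourceArn=f"arn:aws:s3:::{bucket}",
        UseServiceLinkedRole=True,
    )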

Setting up the Lake Formation data catalog for your data lake

You create three databases in the Lake Formation data catalog, one for each S3 bucket you created earlier. All transformations done via AWS Glue operate on the databases in this catalog.

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database.
  3. In the Database details section, for Name, enter the name for the database (forecast-blog-landing-db).
  4. For Location, enter the location to the corresponding S3 bucket.
  5. For Description, add an optional description.
  6. Deselect Use only IAM access control for new tables in this database.
  7. Choose Create database.
  8. Repeat the above steps for the processed and published databases.
  9. On the Databases page, select the database you just created.
  10. From the Actions drop-down menu, choose Grant.
  11. For IAM users and roles, choose the GLUE_WORKFLOW_ROLE you created earlier.
  12. For Database permissions, select Create table.
  13. Choose Grant.

At this stage, you have set up your data lake with Lake Formation. The following diagram illustrates your resources so far.
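If you want to script the database creation and the grant, the following Boto3 sketch covers the landing database; repeat it for the processed and published databases. The account ID is a placeholder.

import boto3

glue = boto3.client("glue", region_name="us-west-2")          # Region assumed
lf = boto3.client("lakeformation", region_name="us-west-2")

# Create the landing database pointing at the landing bucket
glue.create_database(
    DatabaseInput={
        "Name": "forecast-blog-landing-db",
        "LocationUri": "s3://forecast-blog-landing",
    }
)

# Grant the workflow role permission to create tables in the database
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::<your-account-id>:role/GLUE_WORKFLOW_ROLE"},
    Resource={"Database": {"Name": "forecast-blog-landing-db"}},
    Permissions=["CREATE_TABLE"],
)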

You’re now ready to ingest sales data into your data lake.

Ingesting data

This post uses an example MySQL table called sales, which contains 2 years of sales history of a single product. The schema of the table is as follows:

Column Name | Column Type
InvoiceNo | int
StockCode | int
Description | varchar(200)
Quantity | int
InvoiceDate | varchar(50)
UnitPrice | float
CustomerID | int
StoreLocation | varchar(100)
CustomerName | varchar(200)
  1. To begin the ingestion, create the source database. The following CloudFormation template creates a free-tier RDS MySQL instance with a database named sales_schema within a new VPC, along with the required sample data for this post. The template deploys in the us-west-2 Region.

  2. Create a new AWS Glue connection for the source database.
  3. On the Lake Formation console, choose Blueprints.
  4. Choose Use a blueprint.
  5. Select Incremental database as the blueprint type for a regular ingest of sales data from the source relational database.
  6. Follow the prompts to complete the incremental blueprint setup.
  7. For Database connection, choose forecast-blog-db.
  8. For Source data path, enter sales_schema/sales.
  9. For Target database, choose forecast-blog-landing-db.
  10. For Target storage location, enter s3://forecast-blog-landing.
  11. For Data format, choose Parquet.
  12. For Workflow name, enter forecast-blog-wf.
  13. For IAM role, choose GLUE_WORKFLOW_ROLE.
  14. For Table prefix, enter blog.

The target location must be the landing S3 bucket that you created earlier, and table prefix must be set to blog. This is the raw data that subsequent AWS Glue jobs transform and process. For information about using an incremental database blueprint, see Importing Data Using Workflows.

Start the blueprint after you create it.
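The blueprint generates an AWS Glue workflow with the name you entered, so you can also start and monitor the ingest from code; a minimal Boto3 sketch:

import boto3

glue = boto3.client("glue", region_name="us-west-2")  # Region assumed

# Start the ingest workflow that the blueprint created
run_id = glue.start_workflow_run(Name="forecast-blog-wf")["RunId"]

# Check the run status (RUNNING, COMPLETED, and so on)
status = glue.get_workflow_run(Name="forecast-blog-wf", RunId=run_id)["Run"]["Status"]
print(run_id, status)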

Orchestrating data transformation and forecast generation

When you have the raw sales data ingested into the data lake landing bucket, you run a custom AWS Glue workflow to orchestrate the data transformation, the Amazon Forecast load/train/forecast steps, and making the forecasts available for business dashboards. Complete the following steps:

  1. Create an AWS Glue crawler to crawl the published S3 bucket that you created earlier.
    1. Use the GLUE_WORKFLOW_ROLE that you created earlier.
  2. Create the following AWS Glue jobs to use in the forecasting automation.
    1. Create the data transformation job as a Spark job, and create the remaining jobs as Python shell (Python 3) jobs.
    2. For IAM role, use GLUE_WORKFLOW_ROLE.
  3. On the Connections page, choose Save job and edit script without selecting any connections. The following jobs are available on GitHub:

Next, you create a new AWS Glue workflow to orchestrate the entire automation. The workflow lets you build and orchestrate a sequence of AWS Glue jobs and crawlers via triggers to complete a complex process.

  1. On the AWS Glue console, choose Workflows.
  2. Choose Add workflow.
  3. For Workflow name, enter AmazonForecastWorkflow.
  4. For Description, add an optional description.
  5. For Default run properties, enter the keys and values in the following table.
Key | Value
landingDB | forecast-blog-landing-db
landingDBTable | blog_sales_schema_sales
processedBucket | forecast-blog-processed
publishedBucket | forecast-blog-published
  6. Choose Add workflow.

After you create the workflow, you add triggers, jobs, and a crawler to your workflow.
  1. Choose the workflow you created earlier.
  2. Choose Add trigger.
  3. For Name, enter StartWorkflow.
  4. For Trigger type, select On demand.
  5. Choose Add.
  6. On the Jobs tab, choose the job you created earlier.
  7. Choose Add.
  8. Choose Add node.
  9. Create a new trigger to watch the end of the transform job and start the data import job.
  10. For Name, enter StartDataImport.
  11. For Trigger type, select Event.
  12. For Trigger logic, leave as Start after ANY watched event.
  13. Choose Add.
  14. Choose Add node to the left of the trigger and choose the data transformation job you created earlier.
  15. Choose Add node to the right of the trigger and choose the import data job you created earlier.
  16. Repeat these steps to create the following triggers:
Trigger | Glue job to complete | Glue job to start
CheckImportTrigger | Importing data into Forecast | Checking the load data job status
StartPredictorTrigger | Checking the load data job status | Training the Forecast predictor
CheckPredictorTrigger | Training the Forecast predictor | Checking the train predictor job status
GenerateForecastTrigger | Checking the train predictor job status | Generating forecast
CheckForecastTrigger | Generating forecast | Checking the forecast job status
ExportForecastTrigger | Checking the forecast job status | Exporting forecast to data lake published bucket
CheckExportTrigger | Exporting forecast to data lake published bucket | Checking the export forecast job status
StartCrawlerTrigger | Checking the export forecast job status | Crawler you created to crawl the published S3 bucket
  17. From the Actions drop-down menu, choose Run.

This starts the end-to-end forecasting process.
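Each trigger in the preceding table can also be created programmatically. The following Boto3 sketch shows one conditional trigger; the job names are placeholders for the jobs you created earlier.

import boto3

glue = boto3.client("glue", region_name="us-west-2")  # Region assumed

# Start the "check load data status" job when the "import data into Forecast" job succeeds
glue.create_trigger(
    Name="CheckImportTrigger",
    WorkflowName="AmazonForecastWorkflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "ANY",
        "Conditions": [{
            "LogicalOperator": "EQUALS",
            "JobName": "import-data-into-forecast",      # placeholder job name
            "State": "SUCCEEDED",
        }],
    },
    Actions=[{"JobName": "check-load-data-job-status"}],  # placeholder job name
)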

Visualizing your forecasts

To provide your users with a dashboard that refreshes regularly with new forecasts, set up an Amazon QuickSight report and a dashboard and connection to the data lake via Athena. You can use Amazon QuickSight and the Athena data source to access the forecast data and make visualizations.

First, you need to grant access to the forecast data to QuickSight. Identify the Amazon QuickSight service role in your account. Amazon QuickSight assumes the service role (aws-quicksight-service-role-v0) to interact with other AWS services. The service role is automatically created when you start using Amazon QuickSight.

  1. As the data lake administrator user, on the Lake Formation console, locate the table created by the AWS Glue workflow under your published database (forecast-blog-published-db). This is the table that contains the exported forecast data to visualize.
  2. Grant Select access on this table to the Amazon QuickSight service role using the Grant action within Lake Formation. For more information, see Granting Data Catalog Permissions. (A scripted version of this grant follows these steps.) You now create a dashboard to visualize the forecast data in Amazon QuickSight.
  3. On the Amazon QuickSight console, choose Manage data.
  4. Create a data source for Athena.
  5. For Data source name, enter forecast-blog-published-db.
  6. Choose Create data source.
  7. For Database, choose forecast-blog-published-db.
  8. Select the table that your crawler created for the exported forecast (forecast_blog_published).
  9. Choose Select.
  10. Choose Visualize.
  11. Create a new analysis on the dataset.
  12. Publish a visualization dashboard.

The following screenshot is a QuickSight dashboard displaying your exported forecast. The dashboard was created from a line chart analysis with the p10, p50, and p90 columns selected for the y-axis and date selected for the x-axis.
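The grant in step 2 can also be scripted; a minimal Boto3 sketch, assuming the default QuickSight service role and the table created by your crawler (the account ID is a placeholder):

import boto3

lf = boto3.client("lakeformation", region_name="us-west-2")  # Region assumed

# Grant SELECT on the exported forecast table to the QuickSight service role
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::<your-account-id>:role/service-role/aws-quicksight-service-role-v0"
    },
    Resource={"Table": {"DatabaseName": "forecast-blog-published-db", "Name": "forecast_blog_published"}},
    Permissions=["SELECT"],
)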

Forecast generates probabilistic forecasts so you can generate forecasts at different percentiles depending on your specific use case (for example, if under-stocking or over-stocking is critical to the business). The preceding graph represents the upper and lower bands of forecasted inventory values for the product in your sample data and a selected location (applied as a filter) for a 10-day period. Using this, you can decide on the optimum levels of stock to hold or order for that week.

The p10 is the lower boundary, meaning that there’s only a 10% chance that the actual value is below this line. The p90 is the upper boundary, meaning that there’s a 90% chance that the actual value is below this line. As your training data becomes more comprehensive, the p10 and p90 start to converge. You can also generate forecasts on custom quantiles of your choosing.

For conservative planning, choose a value closer to the p90, which means you’re willing to purchase more inventory than what you actually sell. For aggressive planning, choose a value closer to the p10, which means that you’re willing to accept the risk of running out of inventory.

Conclusion

In this post, you learned how to build an automated inventory forecasting capability for your business on AWS using AI through Forecast and Lake Formation. You learned how to set up a data lake on AWS with the required security governance using Lake Formation. You also learned how to automate the end-to-end process of ingesting sales data into your data lake and automating the data transformation; loading, training, and generating forecasts with Forecast; and making the forecasts accessible to your end-users via Amazon QuickSight visualizations.

 


About the Author

Syed Jaffry is a solutions architect with Amazon Web Services. He works with Financial Services customers to help them deploy secure, resilient, scalable and high performance applications in the cloud.

 

 

Build an AWS Well-Architected environment with the Analytics Lens

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/build-an-aws-well-architected-environment-with-the-analytics-lens/

Building a modern data platform on AWS enables you to collect data of all types, store it in a central, secure repository, and analyze it with purpose-built tools. Yet you may be unsure of how to get started and of the impact of certain design decisions. To address the need to provide advice tailored to specific technology and application domains, AWS added the concept of well-architected lenses in 2017. AWS is now happy to announce the Analytics Lens for the AWS Well-Architected Framework. This post provides an introduction to its purpose, topics covered, common scenarios, and services included.

The new Analytics Lens offers comprehensive guidance to make sure that your analytics applications are designed in accordance with AWS best practices. The goal is to give you a consistent way to design and evaluate cloud architectures, based on the following five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

The tool can help you assess the analytics workloads you have deployed in AWS by identifying potential risks and offering suggestions for improvements.

Using the Analytics Lens to address common requirements

The Analytics Lens models both the data architecture at the core of the analytics applications and the application behavior itself. These models are organized into the following six areas, which encompass the vast majority of analytics workloads deployed on AWS:

  1. Data ingestion
  2. Security and governance
  3. Catalog and search
  4. Central storage
  5. Processing and analytics
  6. User access

The following diagram illustrates these areas and their related AWS services.

There are a number of common scenarios where the Analytics Lens applies, such as the following:

  • Building a data lake as the foundation for your data and analytics initiatives
  • Efficient batch data processing at scale
  • Building a platform for streaming ingest and real-time event processing
  • Handling big data processing and streaming
  • Data-preparation operations

Whichever of these scenarios fits your needs, building to the principles of the Analytics Lens in the AWS Well-Architected Framework can help you implement best practices for success.

The Analytics Lens explains when and how to use the core services in the AWS analytics portfolio. These include Amazon Kinesis, Amazon Redshift, Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. It also explains how Amazon Simple Storage Service (Amazon S3) can serve as the storage for your data lake and how to integrate with relevant AWS security services. With reference architectures, best practices advice, and answers to common questions, the Analytics Lens can help you make the right design decisions.

Conclusion

Applying the lens to your existing architectures can validate the stability and efficiency of your design (or provide recommendations to address the gaps that are identified). AWS is committed to the Analytics Lens as a living tool; as the analytics landscape evolves and new AWS services come on line, we’ll update the Analytics Lens appropriately. Our mission will always be to help you design and deploy well-architected applications.

For more information about building your own Well-Architected environment using the Analytics Lens, see the Analytics Lens whitepaper.

Special thanks to the following individuals who contributed to building this resource, among many others who helped with review and implementation: Radhika Ravirala, Laith Al-Saadoon, Wallace Printz, Ujjwal Ratan, and Neil Mukerje.

Are there questions you’d like to see answered in the tool? Share your thoughts and questions in the comments.

 


About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at Amazon Web Services. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

 

 


Radhika Ravirala is a specialist solutions architect at Amazon Web Services, where she helps customers craft distributed analytics applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley.

Integrating AWS Lake Formation with Amazon RDS for SQL Server

Post Syndicated from Marcio Morales original https://aws.amazon.com/blogs/big-data/integrating-aws-lake-formation-with-amazon-rds-for-sql-server/

To grow and develop your business, you must collect data from a myriad of sources (such as relational and NoSQL databases, clickstream, and IoT events) and combine it using analytics to extract insight. AWS Lake Formation enables you to ingest data from many different sources into a data lake based in Amazon S3. Furthermore, you can use Lake Formation to control access to this data from a single place.

This post shows how to ingest data from Amazon RDS into a data lake on Amazon S3 using Lake Formation blueprints and how to have column-level access controls for running SQL queries on the extracted data from Amazon Athena.

A blueprint is a data management template that enables you to ingest data into a data lake easily. Lake Formation provides several blueprints, each for a predefined source type, such as a relational database or AWS CloudTrail logs. From a blueprint, you can create a workflow. Workflows consist of AWS Glue crawlers, jobs, and triggers that orchestrate the loading and update of data. Blueprints take the data source, data target, and schedule as input to configure the workflow.

For more information about Lake Formation, see AWS Lake Formation: How It Works.

Walkthrough overview

A data lake is a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. A data lake enables you to break down data silos and combine different types of analytics to gain insights and guide better business decisions.

Lake Formation makes it easy to set up a secure data lake quickly. The walkthrough contains the following steps:

  1. Register an S3 bucket as a data lake storage
  2. Create a logical database to hold the tables that represent your data
  3. Create a workflow for ingesting data from Amazon RDS for SQL Server to the data lake
  4. Grant permissions to access the tables in the data lake

Prerequisites

To complete this walkthrough, you must have the following:

  • An Amazon RDS for SQL Server instance up and running
  • An IAM user with access to Lake Formation, Amazon S3, AWS Glue, and Athena

Registering your Amazon S3 storage

To register your S3 bucket, complete the following steps:

  1. On the AWS Lake Formation console, choose Get Started.
  2. Choose Register location.
  3. For Amazon S3 path, choose Browse.
  4. Choose the S3 bucket you designated for your data lake.
  5. Make sure that the S3 bucket you selected has the correct IAM permissions to work with Lake Formation.
  6. Choose Review location permissions.
  7. For IAM role, choose an IAM role designated to allow Lake Formation to read and write data in the S3 bucket. You have the option to let Lake Formation create an IAM role with the necessary permissions, or you can select an existing IAM role.
  8. Choose Register location.
  9. On the AWS Lake Formation console, under Register and ingest, choose Data lake locations. You can see your S3 bucket registered.

Creating a database

You are now ready to create a database to hold your data lake tables. Databases are logical and can be treated as namespaces. They are containers for the metadata tables that the AWS Glue Data Catalog stores. Databases can have an optional location property. This location needs to be within an Amazon S3 location that is registered with Lake Formation.

To create your database, complete the following steps:

  1. Choose Create database.
  2. For Name, enter a name for your database.
  3. For Location, enter the location of your S3 bucket.
  4. Choose Create database.

Creating a workflow to ingest data

Now that your data lake is set up, you are ready to ingest data. Lake Formation includes a feature called blueprints, which allows you to ingest data from MySQL, PostgreSQL, Oracle, and SQL Server databases into your data lake. Blueprints allow you to ingest data by copying a table or incrementally loading newly inserted data. When you use blueprints, they create an AWS Glue workflow and all relevant jobs and crawlers to enable the discovery and movement of data.

To create your workflow to ingest data, complete the following steps:

  1. On the Lake Formation console, under Register and ingest, choose Blueprints.
  2. In the Blueprint overview section, choose Use blueprints.
  3. For Blueprint type, select what type of data you want to ingest to your data lake. For this walkthrough, choose Database snapshot, because the source table is not very large. For large tables, an incremental database may be a better option. To ingest data to your data lake, you must define a database connection with parameters to allow AWS Glue to access the source database.
  4. In the Import source section, for Database connection, choose Create a connection in AWS Glue. This step redirects you to the AWS Glue console. (A scripted version of this connection setup appears after these steps.)
  5. On the Connections page, for Connection name, enter the name of your connection.
  6. For Connection type, choose your driver. You can choose between Amazon RDS, Amazon Redshift, and JDBC as the driver. For this walkthrough, choose JDBC, so your job can talk to both Amazon RDS and SQL Server on Amazon EC2 using the same connection configuration.
  7. Select if the connection requires SSL.
  8. Choose Next.
  9. For JDBC URL, follow the path structure mentioned in the AWS Glue documentation. Replace the value MYRDSENDPOINT in the JDBC URL with the real Amazon RDS endpoint from your environment.
  10. Enter a username and password that has read permission to your database.
  11. Choose the VPC and Subnet where your database instance is running.
  12. Choose a Security Group that allows AWS Glue to connect to your database instance.
  13. Choose Next.
  14. On the Review all steps page, choose Finish. You return to the Lake Formation console.
  15. In the Import source section, choose the refresh button next to the Database connection drop-down menu.
  16. Choose the connection you just created.
  17. For Source data path, enter the database name. If you choose to read only a particular table, you can update this field to reflect the exact database, schema, and table you want. Wildcards are allowed.
  18. As an option, enter any exclude patterns to exclude any data you may not want to ingest into your data lake. For more information, see Include and Exclude Patterns. You are now ready to configure the target of the workflow.
  19. In the Import target section, for Target database, choose the database you just created.
  20. For Target storage location, enter the data lake location you created.
  21. For Data format, choose Parquet. For more information about Parquet, see the Apache Parquet website.
  22. For Import frequency, choose your import frequency. For this walkthrough, choose Run on demand.
  23. In the Import options section, for Workflow name, enter a name for the workflow.
  24. For IAM Role, choose an IAM role that allows AWS Glue to call AWS services on your behalf. For instructions on setting up this IAM role, see the documentation.
  25. For Table prefix, enter the table prefix the job uses when creating the final output table in the data lake. This post enters sqlserver as a way to quickly identify the tables the workflow ingests.
  26. For Maximum capacity, enter the number of DPUs to allocate when the job runs. This post enters 5.
  27. For Concurrency, enter the maximum number of concurrent runs that are allowed for this job. This post enters 5.
  28. Choose Create.
  29. On the Lake Formation console, choose Blueprints. You can see the workflow you created under Workflows.
  30. Choose the workflow.
  31. From the Actions drop-down menu, choose Start. The process of ingestion and transformation varies based on the database size.
  32. Choose the workflow.
  33. Choose the run ID of the current run. This step redirects you to the AWS Glue console, where you can dig deeper into each task of the workflow.
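If you script this setup, the JDBC connection from steps 4–14 can be created with Boto3. The following is a minimal sketch; the connection name, endpoint, database, credentials, and VPC details are placeholders.

import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "sqlserver-source-connection",  # placeholder name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:sqlserver://MYRDSENDPOINT:1433;databaseName=AppDatabase01",
            "USERNAME": "<db-user>",
            "PASSWORD": "<db-password>",
            "JDBC_ENFORCE_SSL": "false",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "<subnet-id>",
            "SecurityGroupIdList": ["<security-group-id>"],
            "AvailabilityZone": "<availability-zone>",
        },
    }
)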

Granting permissions

Now that you have data in your data lake on Amazon S3 and metadata tables in your database on the Data Catalog, you can grant users permissions to access the data. Complete the following steps:

  1. Choose Grant permissions.
  2. For IAM users and roles, choose the IAM users or roles that are allowed to access the data. You have the option to grant access to a database, a set of tables within a database, or a specific set of columns within a table.

    For this post, the login is federated through Active Directory Federation Services (ADFS). Active Directory Domain Services (ADDS) groups manage access privileges, and users can select which IAM role to assume.

    This post assumes the IAM role ADFS-Marketing. It has access to the database Lakeformationdatabase and table sqlserver_appdatabase01_dbo_customer. Furthermore, the role does not have permission to access the phone column. (See the sketch following these steps.) You can now verify that permissions are working as expected.

  3. Log in with a federated user assuming the ADFS-Marketing role. The following screenshot shows that the phone column isn't available.
    However, if you assume a role that has access to the entire table, you can view the phone column. See the following screenshot.
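The column-level restriction above can also be expressed with Boto3. A minimal sketch, using the database, table, and role names from this example (the account ID is a placeholder):

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on every column except "phone" to the ADFS-Marketing role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::<your-account-id>:role/ADFS-Marketing"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "lakeformationdatabase",
            "Name": "sqlserver_appdatabase01_dbo_customer",
            "ColumnWildcard": {"ExcludedColumnNames": ["phone"]},
        }
    },
    Permissions=["SELECT"],
)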

Conclusion

This post demonstrated how to quickly register an Amazon S3 data lake in Lake Formation, ingest data from an operational SQL Server database using blueprints, and grant permissions at the column level. It also shows that you can reduce third-party tool licensing costs by using AWS services.

 


About the Author

Marcio Morales is a Senior Microsoft Specialist SA at AWS.

 

 

 

A public data lake for analysis of COVID-19 data

Post Syndicated from AWS Data Lake Team original https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/

As the COVID-19 pandemic continues to threaten and take lives around the world, we must work together across organizations and scientific disciplines to fight this disease. Innumerable healthcare workers, medical researchers, scientists, and public health officials are already on the front lines caring for patients, searching for therapies, educating the public, and helping to set policy. At AWS, we believe that one way we can help is to provide these experts with the data and tools needed to better understand, track, plan for, and eventually contain and neutralize the virus that causes COVID-19.

Today, we are making a public AWS COVID-19 data lake available – a centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and related coronaviruses from the Allen Institute for AI. We will regularly add to this data lake as other reliable sources make their data publicly available.

The breakthroughs that can win the battle against this disease arrive faster when it’s easy for everyone to access and experiment with this vital information. The AWS COVID-19 data lake allows experimenters to quickly run analyses on the data in place without wasting time extracting and wrangling data from all the available data sources. They can use AWS or third-party tools to perform trend analysis, do keyword search, perform question/answer analysis, build and run machine learning models, or run custom analyses to meet their specific needs. Since every stakeholder in this battle brings their own perspective, users can choose to work with the public data lake, combine it with their own data, or subscribe to the source datasets directly through AWS Data Exchange.

We imagine local health authorities could build dashboards to track infections and collaborate to efficiently deploy vital resources like hospital beds and ventilators. Or epidemiologists could complement their own models and datasets to generate better forecasts of hotspots and trends.

For example, at Chan Zuckerberg Biohub, a nonprofit where leaders in science and technology collaborate to cure, prevent, or manage disease, scientists are using the AWS COVID-19 data lake for new epidemiological insights. “Our team of researchers is now analyzing trends in disease spread, its geography, and time evolution by leveraging datasets from the AWS COVID-19 data lake, combined with our own data, in order to better predict COVID epidemiology,” said Jim Karkanias, Vice President of Data Science and Information Technology at Chan Zuckerberg Biohub.

This post walks you through examples of how to use the AWS COVID-19 data lake for analysis. This data lake is comprised of data in a publicly readable Amazon S3 bucket (s3://covid19-lake). The post shows how to set up the definitions for that data in an AWS Glue Data Catalog to expose it to analytics engines. You can then query the AWS COVID-19 data lake with Amazon Athena, a serverless SQL query engine.

Prerequisites

This post assumes you have the following:

  • Access to an AWS account
  • Permissions to create an AWS CloudFormation stack
  • Permissions to create AWS Glue resources (catalog databases and tables)

Configuring access to the data using a CloudFormation template

To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create a CloudFormation stack using the following template. If you are signed in to your AWS account, the following link fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get Started in the AWS CloudFormation documentation.

This template creates a covid-19 database in your Data Catalog and tables that point to the public AWS COVID-19 data lake. You do not need to host the data in your account, and you can rely on AWS to refresh the data as datasets are updated through AWS Data Exchange.

Exploring the data through the Data Catalog in your AWS account

When the CloudFormation stack shows a status of CREATE_COMPLETE, access the AWS Glue Data Catalog to see the tables that the template created. You should see the following tables:

  • Global Coronavirus (COVID-19) Data – Tracks confirmed COVID-19 cases in provinces, states, and countries across the world with a breakdown to the county level in the US.

 

Table Name | Description | Source | Provider
enigma_jhu | Confirmed COVID-19 cases | Johns Hopkins | Enigma
nytimes_states | Data on COVID-19 cases at US state level | NY Times | Rearc
nytimes_counties | Data on COVID-19 cases at US county level | NY Times | Rearc
covid_testing_states_daily | USA total test daily trend by state | COVID Tracking Project | Rearc
covid_testing_us_daily | USA total test daily trend | COVID Tracking Project | Rearc
covid_testing_us_total | USA total tests | COVID Tracking Project | Rearc
hospital_beds | Hospital beds and their utilization in the US | Definitive Healthcare | Rearc

Table Name | Description | Source/Provider
alleninstitute_metadata | Metadata on papers pulled from the CORD-19 dataset. The sha column indicates the paper ID, which is the file name of the paper in the data lake. | Allen Institute for AI
alleninstitute_comprehend_medical | Results from Amazon Comprehend Medical run against the CORD-19 dataset. | Allen Institute for AI

  • Lookup tables to support visualizations.

 

Table Name | Description
country_codes | Lookup table for country codes
county_populations | Lookup table for the population for each county based on recent census data
us_state_abbreviations | Lookup table for US state abbreviations

In addition, you can see descriptions of the columns in these tables. For example, the following screenshot shows the metadata of the table containing COVID-19 cases from Johns Hopkins.

Querying data via Amazon Athena

This section demonstrates how to query these tables using Athena. Athena is a serverless interactive query service that makes it easy to analyze the data in the AWS COVID-19 data lake. Athena supports SQL, a common language that data analysts use for analyzing structured data. To query the data, complete the following steps:

  1. Sign in to the Athena console.

If this is the first time you are using Athena, you must specify a query result location on Amazon S3.

  2. From the drop-down menu, choose the covid-19 database.
  3. Enter your query.

The following query returns the growth of confirmed cases for the past 7 days joined side-by-side with hospital bed availability, broken down by US county:

SELECT 
  cases.fips, 
  admin2 as county, 
  province_state, 
  confirmed,
  growth_count, 
  sum(num_licensed_beds) as num_licensed_beds, 
  sum(num_staffed_beds) as num_staffed_beds, 
  sum(num_icu_beds) as num_icu_beds
FROM 
  "covid-19"."hospital_beds" beds, 
  ( SELECT 
      fips, 
      admin2, 
      province_state, 
      confirmed, 
      last_value(confirmed) over (partition by fips order by last_update) - first_value(confirmed) over (partition by fips order by last_update) as growth_count,
      first_value(last_update) over (partition by fips order by last_update desc) as most_recent,
      last_update
    FROM  
      "covid-19"."enigma_jhu" 
    WHERE 
      from_iso8601_timestamp(last_update) > now() - interval '7' day AND country_region = 'US') cases
WHERE 
  beds.fips = cases.fips AND last_update = most_recent
GROUP BY cases.fips, confirmed, growth_count, admin2, province_state
ORDER BY growth_count desc

The following screenshot shows the results of this query.

Athena also allows you to run these queries through REST APIs, for example, for building your own visualizations. Moreover, Athena is just one of the many engines that you can use on the data lake. For example, you can use Amazon Redshift Spectrum to join lake data with other datasets in your Redshift data warehouse, or use Amazon QuickSight to visualize your datasets.
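For example, a minimal Boto3 sketch that submits a query against the covid-19 database through the Athena API and waits for the result (the query results bucket is a placeholder):

import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString='SELECT * FROM "covid-19".hospital_beds LIMIT 10',
    QueryExecutionContext={"Database": "covid-19"},
    ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/"},
)["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(len(rows))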

We have also created a public Amazon QuickSight dashboard from the COVID-19 case tracking data, testing data, and hospital bed data. You can track daily updates with this dashboard. You can also drill-down to see breakdowns by country, province, and county without having to write a line of SQL. The following is a recent screenshot of the dashboard.

CORD-19 research articles

The CORD-19 dataset is a collection of metadata and full-text of research articles about COVID-19, SARS-CoV-2, and related coronaviruses. You can index this data with Amazon Kendra for question/answer exploration, or enrich the data with Amazon Comprehend Medical. We have already done the latter and put it in the table called alleninstitute_comprehend_medical.

The alleninstitute_metadata table provides detailed fields for each paper, such as the title, authors, journal, and URL. The alleninstitute_comprehend_medical table contains key medical concepts such as medical condition, medication, dosage, strength, and frequency. With this metadata, you can quickly query over concepts, analyze or aggregate over authors and journals, and locate papers.

Aggregating over journals

Using IL-6 inhibitors is a possible therapy for COVID-19, and clinical trials are underway. To demonstrate how to use these tables, this post presents a use case in which you want to understand which journals discuss IL-6 the most by counting the papers they published. You can do this by running the following query:

SELECT m.journal,
       count(distinct(cm.paper_id)) as paper_count
FROM "covid-19".alleninstitute_metadata m
JOIN "covid-19".alleninstitute_comprehend_medical cm
    ON (contains(split(m.sha, '; '), cm.paper_id))
WHERE contains(generic_name, 'IL-6')
GROUP BY  m.journal
ORDER BY paper_count desc

The following screenshot shows an example of the results. The data provider updates this dataset over time, so your results may look different (here, we notice that the second highest count has no journal information).

Drilling down into papers

To see the URLs and the titles of the papers in one of these journals, you simply query both these tables again. For example, to drill into IL-6 related papers in the Crit Care journal, enter the following query:

SELECT distinct m.url, m.title
FROM "covid-19".alleninstitute_metadata m
JOIN "covid-19".alleninstitute_comprehend_medical cm
    ON (contains(split(m.sha, '; '), cm.paper_id))
WHERE contains(generic_name, 'IL-6')
      AND m.journal = 'Crit Care'

The following screenshot shows an example of the results.

These examples are a few of the innumerable analyses you can run on the public data lake. You incur no additional cost for accessing the AWS COVID-19 data lake beyond the standard charges for the AWS services that you use. For example, if you use Athena, you will incur the costs for running queries and the data storage in the S3 query result location, but incur no costs for accessing the data lake. In addition, if you want this data in raw form, you can subscribe to, download, and stay up-to-date through AWS Data Exchange. We encourage you to try using the public AWS COVID-19 data lake yourself.

Conclusion

Combining our efforts across organizations and scientific disciplines can help us win the fight against the COVID-19 pandemic. With the AWS COVID-19 data lake, anyone can experiment with and analyze curated data related to the disease, as well as share their own data and results. We believe that through an open and collaborative effort that combines data, technology, and science, we can inspire insights and foster breakthroughs necessary to contain, curtail, and ultimately cure COVID-19.

For daily updates on how AWS is addressing the crisis, see Amazon’s COVID-19 blog.

 


About the Authors

The AWS Data Lake Team members are Roy Ben-Alta, Jason Berkowitz, Chris Casey, Patrick Combes, Lucy Friedmann, Fred Lee, Megan Maxwell, Rourke McNamara, Herain Oberoi, Stephen Orban, Brian Ross, Nikki Rouda, Noah Schwartz, Noritaka Sekiyama, Mehul A. Shah, Ben Snively, and Ying Wang.

Matching patient records with the AWS Lake Formation FindMatches transform

Post Syndicated from Dhawalkumar Patel original https://aws.amazon.com/blogs/big-data/matching-patient-records-with-the-aws-lake-formation-findmatches-transform/

Patient matching is a major obstacle in achieving healthcare interoperability. Mismatched patient records and inability to retrieve patient history can cause significant barriers to informed clinical decision-making and result in missed diagnoses or delayed treatments. Additionally, healthcare providers often invest in patient data deduplication, especially when the number of patient records is growing rapidly in their databases. Electronic Health Records (EHRs) have significantly improved patient safety and care coordination in recent years; however, accurate patient matching remains a challenge for many healthcare organizations.

Duplicate patient records emerge for a variety of reasons, including human-generated number insertion, deletion, substitution, or transposition errors. Optical Character Recognition (OCR) software, which digitizes patient records, may also introduce errors.

Multiple types of record-matching algorithms exist to solve this problem. These include basic deterministic methods such as grouping and comparing relevant fields (such as SSN, name, or date of birth), phonetic encoding systems, and more advanced algorithms using machine learning (ML).

AWS Lake Formation is a HIPAA-eligible service that helps you build a secure data lake in a few simple steps. Lake Formation also contains FindMatches, an ML transform that enables you to match records across different datasets and identify and remove duplicate records, with little to no human intervention.

This post shows you how to use the FindMatches ML transform to identify matching patient records in a synthetically generated dataset. To use FindMatches, you don’t have to write code or know how ML works. This is useful for finding matches in your data when it doesn’t include a reliable, unique personal identifier, even if the fields don’t match exactly.

Patient dataset

Various regulations in different countries govern patient data due to its sensitive nature. As such, the availability of patient data on which to train matching algorithms is often scarce, which complicates model development. A common method to get around these challenges is synthetic data. This post generates patient data based on the Open Source Freely Extensible Biomedical Record Linkage Program (FEBRL). FEBRL uses Hidden Markov Models (HMMs) to prepare the name and address data for patient record matching. It also allows for mimicking real-life patient datasets that lead to duplicates, which may have the following mismatches:

  • Blank fields.
  • Typographical errors such as misspelling, transposing the characters, or swapping the fields.
  • Shortened middle names versus records with complete middle names.
  • Various formats of mailing address and its components.
  • OCR-related errors.
  • Phonetical errors.
  • No globally unique patient or person identifier. Every healthcare provider may have a patient identifier assigned to the same person but may not have a person identifier like SSN, so they have datasets without keys.

FEBRL can generate these types of datasets based on configurable parameters to change the probabilities of each type of error, and thus incorporate a variety of scenarios leading to duplicates. The generation of the synthetic dataset is beyond the scope of this post; this post provides a pre-generated dataset for you to explore. In brief, below are the steps to generate the synthetic dataset which will be used to run FindMatches:

  1. Download and install FEBRL.
  2. Modify the parameters to create a dataset mimicking your expectations. For more information, see the dataset generation instructions of FEBRL.
  3. Cleanse the dataset (this confirms the same schema for each record, and removes single quotes and family role).

The dataset for this post uses AWS Region US East (N. Virginia).

FRBRL patient data structure

The following table shows the structure of FEBRL patient data. The data includes 40,000 records.

The original record and duplicate records are grouped together. The patient_id values are generated by the FEBRL data generator tool in a specific format: rec-<record number>-org for original records and rec-<record number>-dup-<duplicate record number> for duplicates.

The following table is a preview of what you’re working to achieve with the FindMatches ML transform. Once you match datasets, the resulting table mirrors the input table’s structure and data and adds a match_id column. Matched records display the same match_id value. It is still possible to have false positives and false negatives, but the transform is still extremely beneficial.

Prerequisites

The sample synthetic patient dataset for this post uses the US East (N. Virginia) Region, so all the steps in this post must be performed in the same Region (us-east-1). The steps are easily modifiable if your data is in a different Region.

Solution architecture

The following diagram shows the solution’s architecture.

 

 

Solution overview

At a high level, the matching process includes the following steps:

  1. Upload the raw patient dataset in CSV format to an Amazon S3 bucket.
  2. Crawl the uploaded patient dataset using an AWS Glue crawler.
  3. Catalog your patient data with the AWS Glue Data Catalog and create a FindMatches ML transform.
  4. Create a label set, either using the ML transform or manually, and teach FindMatches by providing labeled examples of matching and non-matching records. Upload your labels and estimate the quality of the prediction. Add more labelsets and repeat this step as required to reach the required precision, accuracy, and recall rates.
  5. Create and run an AWS Glue ETL job that uses your FindMatches transform.
  6. Store the results of the FindMatches transform in an Amazon S3 bucket.
  7. Create an AWS Glue Data Catalog table for the FindMatches ML transform results.
  8. Review the transform results with Amazon Athena.

Cataloging your data and creating a FindMatches ML transform

FindMatches operates on tables defined in the AWS Glue Data Catalog. Use AWS Glue crawlers to discover and catalog the patient data. You can use the FEBRL patient data generated for this post.

The CloudFormation stack provided below creates the resources in the us-east-1 (US East (N. Virginia)) Region.

To create the Catalog and FindMatches ML transform in AWS Glue, launch the following stack:


This stack creates the following resources:

  • An Amazon S3 bucket that stores the ML transform results (configurable as part of the launch). You can find the name of the bucket on the AWS CloudFormation console, on the stack's Outputs tab. This post uses the name S3BucketName.
  • An IAM role to allow AWS Glue to access other services, including S3.
  • An AWS Glue database (configurable as part of the launch).
  • An AWS Glue table that represents the Public Original Synthetic Patients Dataset (configurable as part of the launch).
  • An AWS Glue ML transform with the source as your AWS Glue table, with Accuracy set to 1 and Precision set to 0.9.

For more information, see Integrate and deduplicate datasets using AWS Lake Formation FindMatches.

Tuning the ML transform

The safety risks of false positive matches, in which clinicians believe incorrect information about the patient to be accurate, may be greater than the safety risks of false negative matches, in which clinicians lack access to the existing information about the patient. (For more information, see the related study on the NCBI website.) Therefore, moving the Recall vs. Precision slider toward Precision gives a higher level of confidence that records identified as matches belong to the same patient and minimizes the safety risks of false positive matches.

A higher Accuracy setting helps achieve higher recall, at the expense of a longer runtime (and thus cost) necessary to compare more records.

To achieve the comparatively better results for this particular dataset, the launch stack already created the transform for you with the Recall vs. Precision slider set to 0.9 toward Precision and the Lower Cost vs. Accuracy slider set to Accuracy. If needed, you can adjust these values later by selecting the transform and using the Tune menu.

Teaching FindMatches using labeled data

After running the launch stack successfully, you can train the transform by supplying it with matching and non-matching records using labelsets.

Creating a labeling set

You can create a labeling set yourself or allow AWS Glue to generate the labeling set based on heuristics.

AWS Glue extracts records from your source data and suggests potential matching records. The generated labelset file contains approximately 100 data samples for you to work on.

This post provides you with an AWS Glue generated labeled data file that you can use, with a fully populated label column. This file is fully ready for consumption.

If you choose to use the pre-generated labeled data file provided in this post, skip the following labeling file generation steps.

To create the training set, complete the following steps:

  1. On the AWS Glue console, under ETL, choose ML transforms. You can find the ML transform named cfn-findmatches-ml-transform-demo, which the provided stack created for you.
  2. Select the ML transform cfn-findmatches-ml-transform-demo, choose Action, and then choose Teach transform.
  3. For Teach the transform using labels, choose I do not have labels.
  4. Choose Generate labeling file.
  5. Provide the S3 path to store the generated label file.
  6. Choose Next.

The following table shows the generated labeled data file with an empty label column.

You need to populate the label column by marking the records that are a real match with the same value. Each labelset should contain positive and negative match examples.

This post provides you with a labeled data file that you can use, with a fully populated label column. This file is fully ready for consumption.

The following table shows the table with a completed label column.

The labeling file has the same schema as the input data, plus two additional columns: labeling_set_id and label.

The training dataset is divided into labeling sets. Each labeling set displays a labeling_set_id value. This identification simplifies the labeling process, enabling you to focus on the match relationship of records within the same labeling set, rather than having to scan the entire file. For the preceding dataset, you can extract the label values from patient_id by removing the suffixes -org and -dup using a regular expression, as in the sketch below. But in general, you would assign these labels according to which records should match based on the attribute values.
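A minimal sketch of that extraction in Python, assuming the FEBRL ID format described earlier:

import re

def label_from_patient_id(patient_id: str) -> str:
    # "rec-1234-org" and "rec-1234-dup-0" both map to the label "rec-1234"
    return re.sub(r"-(org|dup-\d+)$", "", patient_id)

print(label_from_patient_id("rec-1234-org"))    # rec-1234
print(label_from_patient_id("rec-1234-dup-0"))  # rec-1234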

If you specify the same label value for two or more records within a labeling set, you teach the FindMatches transform to consider these records a match. On the other hand, when two or more records have different labels within the same labeling set, FindMatches learns that these records aren’t considered a match. The transform evaluates record relationships between records within the same labeling set, not across labeling sets.

You should label a few hundred records to achieve a modest match quality, and a few thousand records to achieve a high match quality.

Uploading your labels and reviewing match quality

After you create the labeled dataset (which needs to be in .csv format), teach FindMatches where to find it. Complete the following steps:

  1. On the AWS Glue console, select the transform that you created earlier.
  2. Choose Action.
  3. Choose Teach transform.
  4. For Upload labels, choose I have labels.
  5. Choose Upload labeling file from S3.
  6. Choose Next.
  7. If you want to use the labelset provided in this blog post, download the labelset here.
  8. Create a folder named trainingset in the S3 bucket created by the CloudFormation template you launched previously.
  9. Upload the labelset to the trainingset folder in that S3 bucket.
  10. Choose Overwrite my existing labels. You are only using one set of labels. If adding labels iteratively, choose the Append to my existing labels option.
  11. Choose Upload. With the labels uploaded, your transform is now ready to use. Though not strictly required, check the transform match quality by reviewing the metrics of matching and non-matching records.
  12. Choose Estimate transform quality. The transform quality estimate learns using 70% of your labels. After it's trained, the quality estimate tests how well the transform learned to identify matching records against the remaining 30%. Finally, the transform generates quality metrics by comparing the matches and non-matches predicted by the algorithm vs. your actual labels. This process may take up to several minutes.

Your results should be similar to those in the following screenshot.

Consider these metrics approximate, because the test uses only a small subset of data for estimating quality. If you’re satisfied with the metrics, proceed with creating and running a record-matching job. Or, to improve matching quality further, upload more labeled records.
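You can also retrieve these estimates programmatically once the quality estimate has run; a minimal Boto3 sketch, using the transform name created by the stack:

import boto3

glue = boto3.client("glue", region_name="us-east-1")  # dataset Region used in this post

transforms = glue.get_ml_transforms()["Transforms"]
demo = next(t for t in transforms if t["Name"] == "cfn-findmatches-ml-transform-demo")

metrics = demo.get("EvaluationMetrics", {}).get("FindMatchesMetrics", {})
print("Precision:", metrics.get("Precision"))
print("Recall:   ", metrics.get("Recall"))
print("F1:       ", metrics.get("F1"))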

Creating and running an AWS Glue ETL job

After you create a FindMatches transform and verify that it learned to identify matching records in your data, you’re ready to identify matches in your complete dataset. To create and run a record-matching job, complete the following steps:

  1. Create a transformresults folder inside the S3 bucket that the AWS CloudFormation template created when you launched the stack. This folder stores the ML transform results from your AWS Glue job.
  2. On the AWS Glue console, under Jobs, choose Add job.
  3. Under Configure the job properties, for Name, enter a name for the job.
  4. For IAM role, choose your role from the drop-down menu. Choose the IAM role that the AWS CloudFormation stack created, called AWSGlueServiceRoleMLTransform. For more information, see Create an IAM Role for AWS Glue.
  5. For Type, select Spark, with Glue version set to Spark 2.2, Python 2 (Glue version 0.9).
  6. For the job run option, select A proposed script generated by AWS Glue.
  7. For Choose a data source, choose the transform data source. This post uses the data source cfn_table_patient.
  8. Under Choose a transform type, choose Find matching records.
  9. For Worker type, choose G.2X.
  10. For Number of workers, enter 10. You can add more workers based on the size of the datasets by increasing this number.
  11. To review records identified as duplicate, do not select Remove duplicate records.
  12. Choose Next.
  13. Choose the transform that you created.
  14. Choose Next.
  15. For Choose a data target, choose Create tables in your data target.
  16. For Data store, choose Amazon S3.
  17. For Format, choose CSV.
  18. For Compression type, choose None.
  19. For Target path, choose a path for the job's output. The target path is the S3 bucket that AWS CloudFormation created, along with the folder named transformresults that you created previously.
  20. Choose Save job and edit script. The script is now generated for your job and ready to use. Alternatively, you can customize the script further to suit your specific ETL needs.
  21. To start identifying matches in this dataset, choose Run job, as shown in the following screenshot. For now, leave the job parameters with the default settings, and close this page after starting the job. The following screenshot shows the proposed Python Spark script generated by AWS Glue using the ML transform. If the execution is successful, FindMatches shows the run status as Succeeded. The execution might take a few minutes to finish.

FindMatches saves your output data as multi-part .csv files in the target path that you specified during the job creation. The resulting .csv files mirror the input table’s structure and data, but have a match_id column. Matched records display the same match_id value.

Creating a Data Catalog of your transform results

To view the output, you can either download the multi-part .csv files from the S3 bucket directly and review them in your preferred editor, or you can run SQL-like queries against the data stored on S3 using Athena. To view the data using Athena, you need to crawl the folder with the multi-part .csv files that you created as part of the output of your FindMatches ETL job.

Go to AWS Glue and create a new table using AWS Glue crawlers in the existing database for patient matching that holds the records from the output of your FindMatches ETL job with the source data as the folder of your S3 bucket containing multi-part .csv files. For this post, the source data is the folder transformresults in the bucket created by the AWS CloudFormation stack.

To create a crawler, complete the following steps:

  1. On the AWS Glue console, under Data Catalog, choose Crawlers.
  2. Click Add crawler to create a new crawler to crawl the transform results.
  3. Provide the name of the crawler and click Next.
  4. Choose Data stores as the Crawler source type and click Next.
  5. In the Add a data store section, for Choose a data store, choose S3.
  6. For Crawl data in, choose Specified path in my account.
  7. For Include path, enter the name of your path. This should be the same S3 bucket that CloudFormation created previously, along with the folder named transformresults that you created. Verify that the folder contains the multi-part .csv files.
  8. Choose Next.
  9. In the Choose an IAM role section, choose Choose the IAM role.
  10. For IAM role, enter the name of the crawler.
  11. Choose Next.
  12. Select Run on demand for Frequency.
  13. Configure the crawler’s output with Database set to cfn-database-patient.
  14. Set the Prefix added to tables value to be table_results_. This will help identify the table containing the transform results.
  15. Click on Finish.
  16. Select the same crawler and click on Run the crawler. After the crawler execution is successful, you should see a new table created corresponding to the crawler settings in the respective database you selected during crawler configuration.
  17. From the AWS Glue console, under Databases, choose Tables.
  18. Choose Action.
  19. Choose Edit table details.
  20. For Serde Serialization lib, enter org.apache.hadoop.hive.serde2.OpenCSVSerde.
  21. Under Serde parameters, add key escapeChar, with value as \\.
  22. Add key quoteChar with value as " (double quotes).
  23. Set key field.delim with value as , (comma).
  24. Add key separatorChar with value as , (comma). You can set the Serde parameters as per your requirements based on the type of datasets you have.
  25. Edit the schema of the table by setting the data types of all columns to String. To edit the schema of the table, click on the table and click on the Edit schema button.

You can also choose to retain the data types inferred by the crawler. For simplicity, this post sets all columns to the String data type, except for the match_id column, which is set to bigint.
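If you prefer to script this step, the following AWS CLI sketch creates and runs a crawler like the one described above. The crawler name, IAM role name, and bucket are placeholders; the database name and table prefix match the values used in this post.

# Sketch: create the crawler over the FindMatches output folder (names and paths are placeholders)
aws glue create-crawler \
  --name transformresults-crawler \
  --role AWSGlueServiceRole-PatientMatching \
  --database-name cfn-database-patient \
  --table-prefix table_results_ \
  --targets '{"S3Targets": [{"Path": "s3://YOUR_BUCKET/transformresults/"}]}'

# Run it once, on demand
aws glue start-crawler --name transformresults-crawler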

Reviewing the output with Amazon Athena

To review the output with Amazon Athena, complete the following steps:

  1. From the Data Catalog, choose Tables.
  2. Choose the table name created by your crawler for the results.
  3. Choose Action.
  4. Choose View data.

    The Athena console opens. If you are running Amazon Athena for the first time, you might have to choose Get Started. Before you run your first query, you also need to set up a query result location in Amazon S3: choose set up a query result location in Amazon S3 on the Athena console and set the location of the query results. You can create an additional folder in the S3 bucket created previously by the CloudFormation stack. Make sure the S3 path ends with a /.
  5. Choose the appropriate database. For this post, choose cfn-database-patient. You might need to refresh the data source if you do not see the database in the drop-down list.
  6. Choose the results table that contains the FindMatches output with the patient records and the match_id column. In this case, it is table_results_transformresults. If you chose a different name for the results table, change the following query to reflect the correct table name.
  7. Run the following query by choosing Run query.
    SELECT * FROM "cfn-database-patient"."table_results_transformresults" order by match_id;

The following screenshot shows your output.
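Beyond browsing the raw output, you can aggregate on match_id to see which records FindMatches grouped together. The following AWS CLI sketch is an illustrative follow-up (not part of the original walkthrough); replace the output location with the query result path you configured for Athena.

# Illustrative follow-up query: list match groups that contain more than one record.
aws athena start-query-execution \
  --query-string 'SELECT match_id, COUNT(*) AS records_in_group
                  FROM "cfn-database-patient"."table_results_transformresults"
                  GROUP BY match_id
                  HAVING COUNT(*) > 1
                  ORDER BY records_in_group DESC' \
  --query-execution-context Database=cfn-database-patient \
  --result-configuration OutputLocation=s3://YOUR_BUCKET/athena-query-results/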

Security Considerations

AWS Lake Formation helps protect your data by giving you a central location in which you can configure granular data access policies, regardless of which services you use to access it.

To centralize data access policy controls using Lake Formation, first shut down direct access to your buckets in S3 so Lake Formation manages all data access. Configure data protection and access policies using Lake Formation, which enforces those policies across all the AWS services accessing data in your lake. You can configure users and roles and define the data these roles can access, down to the table and column level.

AWS Lake Formation provides a permissions model that is based on a simple grant/revoke mechanism. Lake Formation permissions combine with IAM permissions to control access to data stored in data lakes and to the metadata that describes that data. For more information, see Security and Access Control to Metadata and Data in Lake Formation.
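As an illustration of that grant/revoke model, the following AWS CLI sketch grants a role SELECT on a catalog table and later revokes it. The account ID and role name are placeholders; the database and table are the ones used in this post.

# Sketch: grant SELECT on a catalog table to an analyst role (placeholder principal)
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/AnalystRole \
  --permissions SELECT \
  --resource '{"Table": {"DatabaseName": "cfn-database-patient", "Name": "table_results_transformresults"}}'

# Revoking the same permission later uses the mirror-image call
aws lakeformation revoke-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/AnalystRole \
  --permissions SELECT \
  --resource '{"Table": {"DatabaseName": "cfn-database-patient", "Name": "table_results_transformresults"}}'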

Lake Formation currently supports Server-Side Encryption on S3 (SSE-S3, AES-256). Lake Formation also supports private endpoints in your VPC and records all activity in AWS CloudTrail, so you have network isolation and auditability.
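For example, you might enable default SSE-S3 encryption on a data lake bucket with a call like the following; the bucket name is a placeholder.

# Sketch: enable default SSE-S3 (AES-256) encryption on a data lake bucket
aws s3api put-bucket-encryption \
  --bucket YOUR_DATA_LAKE_BUCKET \
  --server-side-encryption-configuration \
    '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'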

AWS Lake Formation is a HIPAA eligible service.

Summary

This post demonstrated how to find matching records in a patient database using the Lake Formation FindMatches ML transform. It allows you to find matches when the records in two datasets don’t share a common identifier or include duplicates. This method helps you find matches between dataset rows when fields don’t match exactly or attributes are missing or corrupted.

You are now ready to start building with Lake Formation and try FindMatches on your data. Please share your feedback and questions in the comments.

 


About the Authors

Dhawalkumar Patel is a Senior Solutions Architect at Amazon Web Services. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and Artificial Intelligence. He is currently focused on Machine Learning and Serverless technologies.

 

 

 

Ujjwal Ratan is a principal machine learning specialist solution architect in the Global Healthcare and Lifesciences team at Amazon Web Services. He works on the application of machine learning and deep learning to real world industry problems like medical imaging, unstructured clinical text, genomics, precision medicine, clinical trials and quality of care improvement. He has expertise in scaling machine learning/deep learning algorithms on the AWS cloud for accelerated training and inference. In his free time, he enjoys listening to (and playing) music and taking unplanned road trips with his family.

 

 

 

 

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

 

 

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.

 

 

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

 

 

Ingestion, Sanitization, and Storage

A key component to Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative for running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

 

 

                           Processing Account structure upon delivery to the customer

 

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

 

 

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed customer master keys (CMKs) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root",
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.
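A minimal sketch of that flow, assuming the pipeline has already rendered the policy document from account metadata into a local JSON file, might look like the following; the file path and commit workflow are illustrative, not Intuit's actual tooling.

# Apply the generated policy to the read-only bucket
aws s3api put-bucket-policy \
  --bucket intuit-data-lake-example \
  --policy file://generated/intuit-data-lake-example-policy.json

# Commit the generated document back to the repository for auditability
git add generated/intuit-data-lake-example-policy.json
git commit -m "Update read-only bucket policy from account metadata"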

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing).
  2. Data can be encrypted with application-level (client-side) encryption.
  3. KMS keys can be used for at-rest encryption of S3, EBS, and RDS.

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth mentioning the standard patterns we’ve found important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

 

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes automated VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently-updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.
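Intuit's pipeline tooling is not public, but the layer idea can be sketched as follows: each layer lives in its own directory with its own remote state key, so redeploying one layer leaves the others untouched. The directory names, account ID, and backend key naming below are illustrative assumptions, not Intuit's actual configuration.

# Illustrative only: deploy each layer with its own state file
for layer in iam-roles s3-buckets kms-key-policies security-groups; do
  terraform -chdir="layers/${layer}" init \
    -backend-config="key=processing-accounts/123456789012/${layer}.tfstate"
  terraform -chdir="layers/${layer}" apply -auto-approve \
    -var-file="../../accounts/123456789012.tfvars"
done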

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with best-practice configuration is as easy as follows:

 

 

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn't wholly unique, but is composed of common-sense approaches we've gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake, and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

 

 

 

 

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.

 

 

 

Access and manage data from multiple accounts from a central AWS Lake Formation account

Post Syndicated from Shilpa Mehta original https://aws.amazon.com/blogs/big-data/access-and-manage-data-from-multiple-accounts-from-a-central-aws-lake-formation-account/

This post shows how to access and manage data in multiple accounts from a central AWS Lake Formation account. The walkthrough demonstrates a centralized catalog residing in the master Lake Formation account, with data residing in the different accounts. The post shows how to grant permissions from the Lake Formation service to read, write, and update the catalog and to access data in the different accounts.

The post uses two datasets to determine whether there is a correlation between the news generated around the world (gdelt) and the number of reviews that Amazon's products received (amazonreviews).

Prerequisites

This walkthrough requires three AWS accounts, each with an S3 bucket, and their account numbers.

Setting up the Environment

The three accounts are as follows:

  • Account Products (AP) – This is the account in which the Amazon product reviews are stored. This post deploys the configuration using AWS CloudFormation.
  • Account External (AE) – This account stores the GDELT dataset, which monitors broadcast, print, and web news from around the world in over 100 languages. It identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images, and events driving global society every second of every day. This post deploys the configuration using AWS CloudFormation.
  • Main Account (MA) – This is the main account, which gathers data from the other two accounts. This post configures Lake Formation in this account. This account has access to the product data and world news account.

The following diagram shows the account architecture.

Account Products

Deploy the following AWS CloudFormation template in the AP.

This creates an S3 bucket called productsaccountcf-bucketname-1pcfoxar1pxp (templateName-bucket-name-random_string) and a cross-account bucket policy in the AP. This policy gives the main account root ID access to this bucket. The template also uses a Lambda function to download the amazonreviews dataset into the new bucket.

Make sure you insert the account number of your main account in the datalake-AccountId field. The following screenshot shows that for this post, the MA account number is 1111111111111.

Account External

Deploy the following AWS CloudFormation template in the AE.

This creates an S3 bucket called externalaccountcf-resultbucket-12ecq638afqiq (templateName-bucket-name-random_string) and a cross-account bucket policy in the AE. This policy gives the main account access to this bucket. Due to the nature of the dataset, this template also creates tables in the data catalog. Instead of using AWS Glue crawlers, the template defines the table structures through Athena queries.

The template executes queries in Amazon Athena to download the gdelt dataset, and to create the metadata of the tables that Lake Formation uses.

Make sure you insert the account number of your main account in the datalakeAccountid field. For this post, the MA account number is 1111111111111. Though the CloudFormation console shows CREATE_COMPLETE, a query is still executing, which you can observe in the Athena console (accessible from the AWS Management Console). The query that continues to run creates a new table in the AE with the data in Parquet format so that queries perform better.

The following screenshot shows your query history and status.

You are now ready to go to Lake Formation in your main account and start configuring.

Registering data stores in Lake Formation

Log in to the Lake Formation console in your main account. If this is the first time you are accessing Lake Formation, you need to add administrators to the account. Add the user you are currently logged in as.

To add your data lakes, complete the following steps:

  1. On the Lake Formation console, under Register and ingest, choose Data lake locations. The page displays a list of S3 buckets that are marked as data lake storage resources for Lake Formation. Here, a single S3 bucket may act as the repository for many datasets, or you could use separate buckets for separate data sources. This post registers the S3 buckets in the other accounts and creates a master catalog in Lake Formation.
  2. Choose Register location. The following screenshot shows the Data lake locations pane. You can now register both buckets you created in your AP and AE.
  3. In Amazon S3 location, for Amazon S3 path, enter s3://productsaccountcf-bucketname-1pcfoxar1pxp.
  4. For IAM role, keep the default role. Lake Formation needs an IAM role with the necessary permissions (GetObject, PutObject, DeleteObject, and ListBucket) to properly use your S3 bucket as a data lake, and the default role has these permissions. Alternatively, select a pre-existing IAM role that has the required permissions and is configured with lakeformation.amazonaws.com as a trusted entity.
  5. Choose Register location. The following screenshot shows the Amazon S3 location pane. You now have a storage resource and are ready to register the second bucket.
  6. Repeat the previous steps, but in step 3, enter s3://externalaccountcf-resultbucket-12ecq638afqiq to register the bucket from your AE (888888888888). A CLI alternative follows this procedure.
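The same registration can be done from the AWS CLI. The following sketch registers both buckets using the service-linked role; the bucket names are the ones created by the CloudFormation templates in this post.

# Sketch: register both data lake locations from the CLI
aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::productsaccountcf-bucketname-1pcfoxar1pxp \
  --use-service-linked-role

aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::externalaccountcf-resultbucket-12ecq638afqiq \
  --use-service-linked-role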

Setting up your IAM role

You need an IAM role that allows Lake Formation to create catalog tables of the datasets in the storage locations. Complete the following steps:

  1. On the AWS console, open IAM and create an IAM role.
  2. Attach AWS Glue and AWS Lambda policies as described here.
  3. Edit the trust relationship for the role with the following policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "",
          "Effect": "Allow",
          "Principal": {
            "Service": ["glue.amazonaws.com", "lambda.amazonaws.com"]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

Add an inline policy that gives the role permissions to access S3, execute Athena queries, use AWS Glue, and publish Amazon CloudWatch Logs, as shown in the following screenshot.
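The screenshot is not reproduced here, but a minimal version of such an inline policy could be attached from the CLI as follows. The role name is a placeholder, and the action list is an assumption kept deliberately broad for this walkthrough; scope it down for production use.

# Sketch: attach a broad inline policy to the role (placeholder role name)
aws iam put-role-policy \
  --role-name LakeFormationWorkflowRole \
  --policy-name datalake-inline-policy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "s3:GetObject", "s3:PutObject", "s3:ListBucket",
        "athena:*", "glue:*",
        "logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"
      ],
      "Resource": "*"
    }]
  }'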

Creating a database

Lake Formation maintains a Hive-compatible data catalog within your data lake. Before you can catalog data within your S3 storage backend or use Lake Formation data importers to push data to S3 (which this post discusses later), you must first create a database within your Lake Formation catalog.

A Lake Formation database is a logical construct to which you can later add tables. Each table contains a mapping to one or more objects in S3 that, collectively, represent that table. Tables also contain basic column metadata such as file format, S3 location, and column definitions. Optionally, you can also define arbitrary key-value pairs for tables and columns to better describe the data and act as queryable attributes for data discovery.

You can create one or more databases and populate their tables either manually in the console, programmatically via the AWS SDKs or AWS CLI, or automatically by defining AWS Glue crawlers.

This post defines two logical databases, amazonreviews and gdelt.

  1. On the Lake Formation console, under Data catalog, choose Databases.
  2. Choose Create database. The following screenshot shows the Databases pane.
  3. For Name, enter amazonreviews.
  4. For Location, enter s3://productsaccountcf-bucketname-1pcfoxar1pxp/amazonreviews.
  5. For Description, enter a brief, meaningful description.
  6. Clear Grant All to Everyone for new tables in this database.
  7. Choose Create database. The following screenshot shows the database details.
    1. Repeat the previous steps to create the gdelt database: for Name, enter gdelt.
    2. For Location, enter s3://externalaccountcf-resultbucket-12ecq638afqiq/gdelt.
    3. For Description, enter a brief, meaningful description.
    4. Clear Grant All to Everyone for new tables in this database.
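Because Lake Formation databases are stored in the AWS Glue Data Catalog, you can also create them programmatically. The following AWS CLI sketch creates the same two databases; run it as a data lake administrator or another principal that is allowed to create databases.

# Sketch: create the two catalog databases from the CLI
aws glue create-database --database-input \
  '{"Name": "amazonreviews",
    "LocationUri": "s3://productsaccountcf-bucketname-1pcfoxar1pxp/amazonreviews",
    "Description": "Amazon product reviews dataset"}'

aws glue create-database --database-input \
  '{"Name": "gdelt",
    "LocationUri": "s3://externalaccountcf-resultbucket-12ecq638afqiq/gdelt",
    "Description": "GDELT world news events dataset"}'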

Granting permissions

You now have your databases and need to grant permissions to the role you created in Lake Formation. You need to configure your IAM users and roles as administrators.

  1. On the Lake Formation console, under Permissions, choose Admins and database creators. The following screenshot shows the Admins and database creators page.
  2. Under Permissions, choose Data permissions.
  3. From the Actions menu, choose Grant.
  4. Select your new IAM role.
  5. For Database permissions, choose Create table and Grant all.
  6. Choose Grant. The following screenshot shows the Grant permissions pane.
  7. Repeat the previous steps for the amazonreviews and gdelt databases. The next step is granting your role permissions to the data lake locations you created.
  8. Under Permissions, choose Data locations.
  9. Choose Grant.
  10. For IAM users and roles, select your new role. The following screenshot shows the Data locations pane.
  11. For Storage locations, enter s3://productsaccountcf-bucketname-1pcfoxar1pxp.
  12. Choose Grant. The following screenshot shows the Grant permissions pane.
  13. Repeat the steps for the data lake location s3://externalaccountcf-resultbucket-12ecq638afqiq/.

Setting up your main account

Deploy the following CloudFormation stack in the MA.

This creates tables in the amazonreviews and gdelt databases.

From the Actions menu, choose Grant. Select the role or users to grant access to, select the two check boxes, and choose Grant.

Querying the data

Now that you have the data in the catalog, you can run Athena queries from the master account that combine the datasets from the different accounts.

Grant table permissions by completing the following steps:

  1. On the Lake Formation console, under Data catalog, choose Tables.
  2. Choose the table to query.
  3. From the Actions menu, choose Grant, and enter your role or user name. The following screenshot shows the Tables pane.
  4. For Table permissions, select Select.
  5. For Grantable permissions, select Alter, Insert, Drop, Delete, Select, and Grant all. The following screenshot shows the Grant permissions pane.
  6. Repeat the previous steps for the events table in the gdelt database. You are now ready to query the data.
  7. In Tables, select the events table.
  8. From the Actions menu, choose View data. You are taken to the Athena console. The following screenshot shows the Athena console.
  9. On the Query Editor tab, enter SQL queries for the reviews and events tables.

To query the information by date, standardize the date columns and do aggregations by creating views. In sequential order, run the following queries:

CREATE VIEW amznrevw.aggreviews AS
SELECT count(*) as reviewcount, star_rating, verified_purchase, from_iso8601_date(review_date) as reviewdate FROM "amznrevw"."reviews2" group by star_rating, verified_purchase, review_date;

CREATE VIEW gdelt.eventsformatted AS
SELECT from_iso8601_date(substr(cast(day as varchar),1,4) || '-' || substr(cast(day as varchar),5,2)||'-' || substr(cast(day as varchar),7,2)) as eventdate, actor1code,actor1name,actor1countrycode, actor2code,actor2name,actor2countrycode FROM "gdelt"."events" ;

CREATE VIEW gdelt.eventsagregated AS
select count(*) as numevetns, eventdate, (count(distinct actor1code) + count(distinct actor2code)) as numactors from gdelt.eventsformatted group by eventdate;

You are now ready to query. You can determine how many gdelt events occurred on the five days with the most reviews by running the following query:

select eventdate, sum(reviewcount) as totalreviews, sum(numevetns) as totalnumevetns from gdelt.eventsagregated as event, amznrevw.aggreviews as review where event.eventdate = review.reviewdate group by eventdate order by totalreviews desc, totalnumevetns desc limit 5;

The following screenshot shows the query results.

January 3, 2015, had the most reviews but not the most gdelt events (833,890).

You can also discover how many reviews were written on the five days with the most gdelt events by running the following query:

select eventdate, sum(reviewcount) as totalreviews, sum(numevetns) as totalnumevetns from gdelt.eventsagregated as event, amznrevw.aggreviews as review where event.eventdate = review.reviewdate group by eventdate order by totalnumevetns desc, totalreviews desc limit 5;

The following screenshot shows the query results.

January 25, 2012, had 2 million events but only 378 reviews.

You can also perform a final query to check the correlation between the two with the following query:

SELECT corr(reviewcount,
numevetns) AS review_event_correlation
FROM gdelt.eventsagregated AS event, amznrevw.aggreviews AS review
WHERE event.eventdate = review.reviewdate

The following screenshot shows the query results.

From the result, you can see that there is no meaningful correlation between the two columns.

Conclusion

This post demonstrated how to set up cross-account access of data stores through a central Lake Formation catalog. The solution walked through creating two S3 buckets in external accounts, downloading datasets into those buckets, and giving Lake Formation permission to access the data. You also learned how to govern the data in the data lakes from Lake Formation, and how to query the data in the two data lakes using Athena.

 


About the Authors


Shilpa Mehta is a Data Lab solutions architect at AWS. Shilpa helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.

 

 

 

Laura Caicedo Camacho is a solutions architect at AWS. She works with customers to help them embrace and adopt the cloud.

 

 

 

 

Luis Caro Perez is a solutions architect at AWS. He works with our customers to provide guidance and technical assistance on their applications, helping them improve the value of their solutions when using AWS.

 

 

 

Discover metadata with AWS Lake Formation: Part 2

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/discover-metadata-with-aws-lake-formation-part-2/

Data lakes are an increasingly popular way to aggregate, store, and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage your data lakes.

In Part 1 of this post series, you learned how to create and explore a data lake using Lake Formation. This post walks you through data discovery using the metadata search capabilities of Lake Formation in the console, and metadata search results restricted by column permissions.

Prerequisites

For this post, you need the following:

Metadata search in the console

In this post, we demonstrate the catalog search capabilities offered by the Lake Formation console:

  • Search by classification
  • Search by keyword
  • Search by tag: attribute
  • Multiple filter searches

Search by classification

Using the metadata catalog search capabilities, search across all tables within your data lake. Your data lake contains three tables: two share the name amazon_reviews but separately belong to your simulated “prod” and “test” databases, and the third is trip_data.

  1. In the Lake Formation console, under Data catalog, choose Tables.
  2. In the search bar, under Resource Attributes, choose Classification, type CSV, and press Enter. You should see only the trip_data table, which you formatted as CSV in your data lake. The amazon_reviews tables do not appear because they are in Parquet format.
  3. In the Name column, choose trip_data. Under Table details, you can see that the classification CSV is correctly identified by the metadata search filter.

Search by keyword

Next, search across your entire data lake filtering metadata by keyword.

  1. To refresh the list of tables, under Data catalog, choose Tables again.
  2. From the search bar, type star_rating, and press Enter. Now that you have applied the filter, you should see only the amazon_reviews tables because they both contain a column named star_rating.
  3. By choosing either of the two tables, you can scroll down to the Schema section, and confirm that they contain a star_rating column.

Search by tag: attribute

Next, search across your data lake and filter results by metadata tags and their attribute value.

  1. To refresh the list of tables, under Data catalog, choose Tables.
  2. From the search bar, type department: research, and press Enter. Now that you have applied the filter, you should see only the trip_data table because this is the only table containing the value of ‘research’ in the table property of ‘department’.
  3. Select the trip_data table. Under Table details, you can see the tag: attribute of department | research listed under Table properties.

Multiple filter searches

Finally, try searching across your entire data lake using multiple filters at one time.

  1. To refresh the list of tables, under Data catalog, choose Tables.
  2. In the search bar, choose Location, type S3, and press Enter. For this post, all of your catalog tables are in S3, so all three tables display.
  3. In the search bar, choose Classification, type parquet, and press Enter. You should see only the amazon_reviews tables because they are the only tables stored in S3 in Parquet format.
  4. Choose either of the displayed amazon_reviews tables. Under Table details, you can see that the following is true.
  • Location: S3
  • Classification: parquet

Metadata search results restricted by column permissions

The metadata search capabilities return results based on the permissions specified within Lake Formation. If a user or a role does not have permission to a particular database, table, or column, that element doesn’t appear in that user’s search results.

To demonstrate this, first create an IAM user, dataResearcher, with AWS Management Console access. Make sure to store the password somewhere safe.

To simplify this post, attach the AdministratorAccess policy to the user. This policy grants full access to your AWS account, which is overly permissive. I recommend that you either remove this user after completing the post, or remove this policy and enable multi-factor authentication (MFA). For more information, see Creating an IAM user in the console.
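If you prefer the CLI, the following sketch creates the same user with console access and the AdministratorAccess policy; the password value is a placeholder.

# Sketch: create the dataResearcher user with console access (placeholder password)
aws iam create-user --user-name dataResearcher

aws iam create-login-profile \
  --user-name dataResearcher \
  --password 'REPLACE_WITH_A_STRONG_PASSWORD' \
  --password-reset-required

aws iam attach-user-policy \
  --user-name dataResearcher \
  --policy-arn arn:aws:iam::aws:policy/AdministratorAccess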

In Part 1 of this series, you allowed Everyone to view the tables that the AWS Glue crawlers created. Now, revoke those permissions for the ny-taxi database.

  1. In the Lake Formation console, under Permissions, choose Data permissions.
  2. Scroll down or search until you see the Everyone record for the trip_data table.
  3. Select the record and choose Revoke. In the confirmation dialog, choose Revoke again.

Now, your dataResearcher IAM user cannot see the ny-taxi database or the trip_data table. Resolve this issue by setting up Lake Formation permissions.

  1. Under Permissions, choose Data permissions, and then choose Grant.
  2. Select the dataResearcher user, the ny-taxi database, and the trip_data table.
  3. Under Table permissions, check Select and choose Grant.
  4. Log out of the console and sign back in using the dataResearcher IAM user that you created earlier.
  5. In the Lake Formation console, choose Tables, select the trip_data table, and look at its properties:

The dataResearcher user currently has visibility across all of these columns. However, you don’t want to allow this user to see the pickup or drop off locations, as those are potential privacy risks. Remove these columns from the dataResearcher user’s permissions.

  1. Log out of the dataResearcher user and log back in with your administrative account.
  2. In the Lake Formation console, under Permissions, choose Data Permissions.
  3. Select the dataResearcher record and choose Revoke.
  4. On the Revoke page, under Column, choose All columns except the excluded columns, and then choose the vendor_id, passenger_count, trip_distance, and total_amount columns.
  5. Under Table permissions, check Select. These settings revoke all of the dataResearcher user's permissions on the trip_data table except those selected in the window. In other words, the dataResearcher user can only Select (view) the four selected columns. A CLI equivalent of this column-level restriction appears after this procedure.
  6. Choose Revoke.
  7. Log back in as the dataResearcher user.
  8. In the Lake Formation console, choose Data catalog, Tables. Search for vendor_id and press Enter. The trip_data table appears in the search, as shown in the following screenshot.
  9. Search for pu_location_id. This returns no results because you revoked permissions to this column, as shown in the following screenshot.
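Column-level restrictions like this can also be expressed from the AWS CLI. The following sketch grants SELECT limited to the same four columns, which has the same effect as the console steps above; the account ID is a placeholder.

# Sketch: column-restricted SELECT grant for the dataResearcher user
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/dataResearcher \
  --permissions SELECT \
  --resource '{"TableWithColumns": {
      "DatabaseName": "ny-taxi",
      "Name": "trip_data",
      "ColumnNames": ["vendor_id", "passenger_count", "trip_distance", "total_amount"]}}'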

Conclusion

Congratulations: You have learned how to use the metadata search capabilities of Lake Formation. By defining specific user permissions, Lake Formation allowed you to grant and revoke access to metadata in the Data Catalog as well as the underlying data stored in S3. Therefore, you can discover your data sources across your entire AWS environment using a single pane of glass. To learn more, see AWS Lake Formation.

 


About the Authors

Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.

 

 

 

Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.

 

 

 

 

Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years experience in Analytics and Data Management.

 

 

 

 

Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.

 

 

 

 

Discovering metadata with AWS Lake Formation: Part 1

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/discovering-metadata-with-aws-lake-formation-part-1/

Data lakes are an increasingly popular way to create a single repository to store and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage data lakes. This post walks you through the creation and exploration of a data lake using Lake Formation:

  • Creating the data lake
    • Adding data to your data lake
    • Creating catalog databases
    • Adding tables from Amazon S3 to catalog databases
  • Editing and adding metadata within the catalog
    • Editing standard metadata
    • Adding custom metadata

Prerequisites

For this post, you need the following:

Create the data lake

In the AWS Lake Formation console, in the left navigation pane, choose Register and ingest, Data lake locations. Select a single S3 bucket to house several independent data sources in your data lake. For more information, see What is AWS Lake Formation?

Add data to your data lake

Now that you have an S3 bucket configured as a storage resource for Lake Formation, you must add data to your data lake. You can add data to your data lake’s S3 bucket storage resource using AWS SDKs, AWS CLI, the S3 console, or a Lake Formation blueprint.

With Lake Formation, you can discover and set up the ingestion of your source data. When you add a workflow that loads or updates the data lake, you can choose a blueprint or template of the type of importer to add. Lake Formation provides several blueprints on the Lake Formation console for common source data types to simplify the creation of workflows. Workflows point to your data source and target and specify the frequency that they run.

For this post, use the AWS CLI to download sample data and then upload it to your S3 storage backend. Other import methods, such as Lake Formation data importers, are outside the scope of this post.

Sample from the following two datasets provided on the Registry of Open Data on AWS: the Amazon customer reviews dataset and the New York City Taxi and Limousine Commission (TLC) trip record data.

Make two copies of the Amazon customer reviews dataset in your data lake. You can use these to simulate “production” and “test” datasets and learn how to target one or both when searching your metadata catalog.

To demonstrate the flexibility of an AWS data lake, add both CSV and Parquet datasets to your data lake. In both cases, use the following naming convention for your S3 objects:

s3://BUCKET_NAME/DATABASE_NAME/TABLE_NAME/<data files>

Add Amazon customer reviews to your data lake

AWS hosts a registry to help people share and discover a variety of datasets. For this post, copy a subset of the Amazon customer reviews dataset into your data lake. You don’t have to copy the complete reviews dataset, only the smaller 226-MB portion of watch reviews. You need two copies of this data in your data lake to simulate separate “production” and “test” databases.

  1. If you have not already, install and configure the AWS CLI with IAM user access keys that include permission to read from S3 and write to your Lake Formation S3 bucket.
  2. Copy the source to your data lake:
    # Replace with your bucket name:
    YOUR_BUCKET=lf-metadata-demo
     
    aws s3 cp \
       s3://amazon-reviews-pds/parquet/product_category=Watches/ \
       s3://$YOUR_BUCKET/amazon-reviews-prod/amazon-reviews/ --recursive
     
    aws s3 cp \
       s3://amazon-reviews-pds/parquet/product_category=Watches/ \
       s3://$YOUR_BUCKET/amazon-reviews-test/amazon-reviews/ --recursive

  3. In the S3 console, confirm that your S3 bucket now contains your two Amazon reviews datasets.
  4. Inspect the contents of the folders. The datasets are in Parquet format.

Add New York taxi ride history to your data lake

Much as you did with the Amazon customer reviews dataset, copy a small subset of New York taxi ride history from the Registry of Open Data on AWS into your data lake:

  1. Copy the source data to your data lake:
    # Replace with your bucket name:
    YOUR_BUCKET=lf-metadata-demo
     
    aws s3 cp \
       "s3://nyc-tlc/trip data/green_tripdata_2018-02.csv" \
       "s3://$YOUR_BUCKET/ny-taxi/trip-data/green_tripdata_2018-02.csv"

  2. In the S3 console, validate that your S3 bucket contains CSV data for NY taxi trips.

Create catalog databases

You have created an S3 bucket to act as your data lake storage backend and added data to the bucket. However, this data is not readily available in Lake Formation until you catalog the data.

Lake Formation maintains a Hive-compatible data catalog of data within your data lake. Before you can catalog data within your S3 storage backend or use Lake Formation data importers (discussed later) to push data to S3, you must first create a database.

A Lake Formation database is a logical construct to which you later add tables. Each table contains a mapping to one or more objects in S3 that, collectively, represent that table. Tables also contain basic metadata including but not limited to file format, S3 location, column headings, and column types. Lake Formation users can also optionally define arbitrary key-value pairs for tables and columns to better describe the data and act as query-able attributes for data discovery.

You can create one or more databases and populate their tables either manually in the console, programmatically using the AWS SDKs or AWS CLI, or automatically by defining AWS Glue crawlers.

For this post, you must define three logical databases:

  • amazon-reviews-prod
  • amazon-reviews-test
  • ny-taxi

Then, use the cataloging process to map to the two datasets that you previously uploaded to your S3 storage backend. Remember, you intentionally created two copies of the Amazon reviews dataset to simulate both a production and test database in your data lake.

Now, create your databases. First, configure IAM users and roles as administrators within Lake Formation.

Catalog permissions are permissions that the selected IAM principal can use directly. Grantable permissions are those that the IAM principal can grant to other IAM principals later.

For example, you might want to give your database administrator (DBA) the ability to create databases, by granting permissions to the catalog. However, you can prevent the DBA from accidentally giving this access to your developers by not enabling the grantable permission.

Now that you’ve granted necessary permissions, you can proceed to create your database within the catalog.

  • For Name, enter amazon-reviews-prod.
  • For Location, enter s3://<YOUR_BUCKET>/amazon-reviews-prod.
  • For Description, enter a brief, meaningful description.
  • Check Grant All to Everyone for new tables in this database. In production, assess whether this is appropriate.

Repeat the process for the other two databases:

  • Name: amazon-reviews-test
    Location: s3://<YOUR_BUCKET>/amazon-reviews-test
  • Name: ny-taxi
    Location: s3://<YOUR_BUCKET>/ny-taxi

After completing these steps, you should have three databases in your catalog: amazon-reviews-prod, amazon-reviews-test, and ny-taxi.

Add tables from S3 to your catalog databases

In the previous section, you created three databases in your Lake Formation catalog. However, these catalog databases are empty and do not yet provide information about the specific tables, schema, file formats, or object paths in S3. To add this information, use one of the following two methods:

  • Manually define your tables in the catalog using the console, SDKs, or AWS CLI.
  • Use an AWS Glue crawler to search S3 and automatically add discovered tables to your catalog.

For this post, create and manually run one AWS Glue crawler for each of your three datasets in S3 and databases in the Lake Formation data catalog. A detailed walkthrough is outside the scope of this post. For guidance, see Working with Crawlers on the AWS Glue Console.

As you proceed, please bear the following in mind:

  • Create one crawler for each of your three datasets. You should be able to accept most of the default crawler settings. However, the S3 path for your crawlers should read:

s3://YOUR_BUCKET/amazon-reviews-prod/amazon-reviews

s3://YOUR_BUCKET/amazon-reviews-test/amazon-reviews

s3://YOUR_BUCKET/ny-taxi/trip-data

Before you run the crawlers to populate your catalog, you must assign them an IAM role. The role grants them permission to read from your data lake’s S3 bucket, write crawler logs to Amazon CloudWatch, and update your data catalog. Regardless of whether you create a new role, or use an existing role, make a note of the IAM role name. You need this information for the next step.

In addition to permissions defined within IAM, you must also explicitly grant IAM principals (roles or users) the ability to modify your Data Catalog from within Lake Formation itself. Conceptually, this is similar to the concept of bucket policies in S3 used with IAM. In the Lake Formation console, under Permissions, choose Data permissions.

Grant your AWS Glue crawlers the ability to modify your Data Catalog. Configure the following fields:

  • For IAM users and roles, select the IAM roles that you previously used for your AWS Glue crawlers.
  • For Database, select the amazon-reviews-prod, amazon-reviews-test, and ny-taxi databases.
  • For Database permissions, select all permissions.
  • Leave all Grantable permissions unselected.

After your AWS Glue crawlers have permission to modify your Lake Formation data catalog, return to the AWS Glue console and manually run your three crawlers. After a few minutes, the crawlers should complete their runs. Each should add one table to your data catalog:

  • amazon-reviews
  • amazon-reviews
  • trip-data

Verify that your catalog was updated. In the Lake Formation console, under Data catalog, choose Tables, and view the three new tables added to the corresponding data lake databases, as shown in the following screenshot.

Edit and add metadata within the catalog

The AWS Glue crawlers populate standard metadata about the tables they discover in S3, including (but not limited to) attributes such as object location, file format, column headings, and column types.

However, you can manually edit standard metadata or add additional custom metadata to the catalog to make it easier to search and improve the overall value that it provides. In the following section, I walk through several examples of editing and adding to metadata.

Edit standard metadata

The AWS Glue crawlers infer the names of columns from the first line of the CSV file. To view the auto-populated column names for the ny_taxi table, look at the table properties:

  1. Under Data catalog, choose Tables.
  2. Select ny_taxi and scroll down to the Schema section.
  3. Choose Edit Schema. Your data columns’ names must consistently use snake casing, which means using the ‘_’ character between words. Change all of the id columns to match the rest of the columns. If you look at the first row of the raw data, you notice that there is inconsistent naming used. Rather than changing those files, you can manually change the metadata.
  4. Select the vendorid row and choose Edit. Make your changes to include the snake casing and choose Save.
  5. Repeat the previous step for dolocationid, ratecodeid, and pulocationid. After you make these changes, choose Save as new version.
  6. Under Data catalog, choose Tables. If you search for pulocationid, no results are returned.
  7. Search for the new column name, pu_location_id. This search should return the expected result, the trip_data table from the ny-taxi database.

Add custom metadata

Now, try adding a couple of custom table properties to help organize your tables. The first table property to add is an environment variable to help you to determine whether a table is for development, testing, or production. The second table property to add is a department variable, which allows you to group tables by a department.

  1. In the Lake Formation console, under Data catalog, choose Databases.
  2. Select the ny-taxi database and choose View tables.
  3. Select the trip_data table and choose Edit Table.
  4. Under Table properties, choose Add. Set the value of environment to dev and the value of department to research1. Choose Save.
  5. Under Data catalog, choose Tables. In the search bar, type “research,” and press Enter. No results return because there isn’t a table with the table property value of research. However, searching for research1 should return the trip_data table.
  6. Go back to your table properties for trip_data and update the department property from research1 to research. After you've made the edit, the trip_data table appears when you enter “research” as a keyword in the table search.

Conclusion

Congratulations: You have successfully created and edited your first data lake using the Lake Formation. You used the service to secure and ingest data into an S3 data lake, catalog the data, and customize the metadata of the data sources. In part 2 of this series, I show you how to discover your data by using the metadata search capabilities of Lake Formation.

 


About the Authors

Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.

 

 

 

Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.

 

 

 

 

Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years experience in Analytics and Data Management.

 

 

 

 

Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.

 

 

 

 

Getting started with AWS Lake Formation

Post Syndicated from Gordon Heinrich original https://aws.amazon.com/blogs/big-data/getting-started-with-aws-lake-formation/

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without having first to structure it. And you can run different types of analytics to better guide decision-making—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation allows you to control data lake access and audit those who access data. The AWS Glue Data Catalog integrates data access policies, making sure of compliance regardless of the data’s origin.

Walkthrough

In this walkthrough, I show you how to build and use a data lake:

  • Create a data lake administrator.
  • Register an Amazon S3 path.
  • Create a database.
  • Grant permissions.
  • Crawl the data with AWS Glue to create the metadata and table.
  • Grant access to the table data.
  • Query the data using Amazon Athena.
  • Add a new user with restricted access and verify the results.

Prerequisites

You need the following resources for this walkthrough:

  • An AWS account
  • An IAM user with the AWSLakeFormationDataAdmin policy attached. For more information, see IAM Access Policies.
  • An S3 bucket named datalake-yourname-region, created in the US East (N. Virginia) Region.
  • A folder named zipcode within your new S3 bucket.

You must also download the sample dataset. For this walkthrough, I use a table of City of New York statistics. The data is available on the DATA.GOV site, in the City of New York Demographics Statistics by Zip table. Upload the file to your S3 bucket in the /zipcode folder.

You have set up the S3 bucket and put the dataset in place. Now, set up your data lake with Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself as a data lake administrator to allow access to any Lake Formation resource.
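
If you prefer to script this step, the equivalent API is PutDataLakeSettings. The following boto3 sketch adds a principal to the list of data lake administrators; the IAM user ARN is a placeholder for your own.

import boto3

lf = boto3.client("lakeformation")

# Placeholder ARN: replace with the IAM user or role you want as data lake admin.
admin_arn = "arn:aws:iam::111122223333:user/datalake-admin"

# put_data_lake_settings replaces the whole settings document, so read the
# current settings first and append to the existing admin list.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
admins = settings.get("DataLakeAdmins", [])
if not any(a["DataLakePrincipalIdentifier"] == admin_arn for a in admins):
    admins.append({"DataLakePrincipalIdentifier": admin_arn})
settings["DataLakeAdmins"] = admins
lf.put_data_lake_settings(DataLakeSettings=settings)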

Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.
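
The scripted equivalent is the RegisterResource API. In this sketch, the bucket name follows the naming convention from the prerequisites, and the service-linked role handles access to the location.

import boto3

lf = boto3.client("lakeformation")

# Register the S3 location so Lake Formation can vend credentials for it.
# UseServiceLinkedRole=True lets the AWSServiceRoleForLakeFormationDataAccess
# role read and write the location on your behalf.
lf.register_resource(
    ResourceArn="arn:aws:s3:::datalake-yourname-region",
    UseServiceLinkedRole=True,
)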

Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the zipcode table definitions; a scripted equivalent is sketched after the list of settings.

  • For Database, enter zipcode-db.
  • For Location, enter the S3 path to your zipcode folder (for example, s3://datalake-yourname-region/zipcode).
  • For New tables in this database, do not select Grant All to Everyone.
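
The database is a regular AWS Glue Data Catalog database, so you can also create it from code. A minimal boto3 sketch, using the bucket name from the prerequisites:

import boto3

glue = boto3.client("glue")

# Create the Data Catalog database that will hold the zipcode table.
glue.create_database(
    DatabaseInput={
        "Name": "zipcode-db",
        "LocationUri": "s3://datalake-yourname-region/zipcode",
    }
)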

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the zipcode-db database. For IAM role, select your user and AWSGlueServiceRoleDefault.

Then grant your user and the AWSServiceRoleForLakeFormationDataAccess service-linked role permissions to use your data lake through a data location (a boto3 equivalent for both grants follows the list):

  • For IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-yourname-region.
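
Both grants map to the GrantPermissions API, once with a Database resource and once with a DataLocation resource. The following sketch uses placeholder principal ARNs, and the exact permission list is illustrative rather than prescriptive.

import boto3

lf = boto3.client("lakeformation")

# Placeholder principals: substitute the ARNs from your own account.
user_arn = "arn:aws:iam::111122223333:user/datalake-admin"
glue_role_arn = "arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault"

# Let both principals create and manage tables in zipcode-db.
for principal in (user_arn, glue_role_arn):
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal},
        Resource={"Database": {"Name": "zipcode-db"}},
        Permissions=["CREATE_TABLE", "ALTER", "DROP"],
    )

# Grant data location access on the registered S3 path. The console step also
# selects the AWSServiceRoleForLakeFormationDataAccess service-linked role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": user_arn},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::datalake-yourname-region"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)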

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings:

  • Crawler name: zipcodecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-yourname-region/zipcode.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select zipcode-db.

When prompted, choose Run it now? and wait for the crawler to finish before moving to the next step.
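
If you prefer to create and run the crawler from code, the following boto3 sketch uses the same settings; the bucket name again follows the prerequisites.

import time

import boto3

glue = boto3.client("glue")

# Create a crawler that writes its findings into the zipcode-db database.
glue.create_crawler(
    Name="zipcodecrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="zipcode-db",
    Targets={"S3Targets": [{"Path": "s3://datalake-yourname-region/zipcode"}]},
)
glue.start_crawler(Name="zipcodecrawler")

# Poll until the crawler returns to the READY state before querying the new table.
while glue.get_crawler(Name="zipcodecrawler")["Crawler"]["State"] != "READY":
    time.sleep(15)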

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database; a scripted equivalent follows these steps.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM role, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.
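
The scripted equivalent of this grant uses a Table resource; the principal ARNs below are placeholders.

import boto3

lf = boto3.client("lakeformation")

# Grant full table permissions on the crawled table to your user and the Glue role.
for principal in (
    "arn:aws:iam::111122223333:user/datalake-admin",
    "arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
):
    lf.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal},
        Resource={"Table": {"DatabaseName": "zipcode-db", "Name": "zipcode"}},
        Permissions=["ALL"],
    )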

Step 7: Query the data with Athena

Next, query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the zipcode-db database.
  • Choose Tables and select the zipcode table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query:

SELECT * FROM "zipcode-db"."zipcode" limit 10;

As you can see from the following screenshot, the datalakeadmin user can see all of the data.
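
You can run the same query programmatically with the Athena API. The sketch below assumes a results prefix (athena-results/) in your bucket for Athena query output; adjust it to a location you own.

import time

import boto3

athena = boto3.client("athena")

# Start the preview query; Athena needs an S3 location for its query results.
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "zipcode-db"."zipcode" limit 10;',
    QueryExecutionContext={"Database": "zipcode-db"},
    ResultConfiguration={"OutputLocation": "s3://datalake-yourname-region/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Wait for the query to finish, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])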

Step 8: Add a new user with restricted access and verify the results

This step shows how you, as the data lake administrator, can set up a user with restricted access to specific columns.

In the IAM console, create an IAM user with administrative rights called user1, and attach the AWSLakeFormationDataAdmin policy. For more information, see Adding and Removing IAM Identity Permissions.

In the Lake Formation console, grant permissions to user1 and supply the following configuration settings (a scripted equivalent is sketched after the list):

  • Database: Select zipcode-db.
  • Table: Select zipcode.
  • Columns: Choose Include columns.
  • Include columns: Choose Jurisdiction name and Count participants.
  • Table permissions: Choose Select.
  • Grantable permissions: Choose Select.
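
The column-restricted grant corresponds to a TableWithColumns resource in the GrantPermissions API. A sketch with a placeholder ARN for user1; the column names are written as the crawler typically normalizes them, so verify the exact names in your catalog.

import boto3

lf = boto3.client("lakeformation")

# Grant user1 SELECT on two columns only; every other column stays hidden.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/user1"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "zipcode-db",
            "Name": "zipcode",
            # Column names as normalized by the crawler; check your own table schema.
            "ColumnNames": ["jurisdiction_name", "count_participants"],
        }
    },
    Permissions=["SELECT"],
)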

To verify the results of the restricted permissions, repeat step 7 when you are logged in as user1. As the following screenshot shows, user1 can only see columns that the datalakeadmin user granted them permissions to view.

Conclusion

This post showed you how to build a secure data lake using Lake Formation. It provides the mechanisms to implement governance, semantic consistency, and access controls, making your data more usable for analytics and machine learning.

About the Author

Gordon Heinrich is a solutions architect working with global system integrators. He works with AWS partners and customers to provide architectural guidance on building data lakes and using AWS machine learning services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.

Build, secure, and manage data lakes with AWS Lake Formation

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/building-securing-and-managing-data-lakes-with-aws-lake-formation/

A data lake is a centralized store of a variety of data types for analysis by multiple analytics approaches and groups. Many organizations are moving their data into a data lake. In this post, I explore how you can use AWS Lake Formation to build, secure, and manage data lakes.

Traditionally, organizations have kept data in a rigid, single-purpose system, such as an on-premises data warehouse appliance. Similarly, they have analyzed data using a single method, such as predefined BI reports. Moving data between databases or for use with different approaches, like machine learning (ML) or ad hoc SQL querying, required “extract, transform, load” (ETL) processing before analysis. At best, these traditional methods have created inefficiencies and delays. At worst, they have complicated security.

By contrast, cloud-based data lakes open structured and unstructured data for more flexible analysis. Any amount of data can be aggregated, organized, prepared, and secured by IT staff in advance. Analysts and data scientists can then access it in place with the analytics tools of their choice, in compliance with appropriate usage policies.

Data lakes let you combine analytics methods, offering valuable insights unavailable through traditional data storage and analysis. In one retail scenario, ML methods discovered detailed customer profiles and cohorts from non-personally identifiable data gathered from web browsing behavior, purchase history, support records, and even social media. The exercise demonstrated deploying ML models on real-time, streaming, interactive customer data.

Such models could analyze shopping baskets and serve up “next best offers” in the moment, or deliver instant promotional incentives. Marketing and support staff could explore customer profitability and satisfaction in real time and define new tactics to improve sales. Around a data lake, combined analytics techniques like these can unify diverse data streams, providing insights unobtainable from siloed data.

The challenges of building data lakes

Unfortunately, the complex and time-consuming process for building, securing, and starting to manage a data lake often takes months. Even building a data lake in the cloud requires many manual and time-consuming steps:

  • Setting up storage.
  • Moving, cleaning, preparing, and cataloging data.
  • Configuring and enforcing security policies for each service.
  • Manually granting access to users.

You want data lakes to centralize data for processing and analysis with multiple services. But organizing and securing the environment requires patience.

Currently, IT staff and architects spend too much time creating the data lake, configuring security, and responding to data requests. They could spend this time acting as curators of data resources, or as advisors to analysts and data scientists. Analysts and data scientists must wait for access to needed data throughout the setup.

The following diagram shows the data lake setup process:

Setting up storage

Data lakes hold massive amounts of data. Before doing anything else, you must set up storage to hold all that data. If you are using AWS, configure Amazon S3 buckets and partitions. If you are building the data lake on premises, acquire hardware and set up large disk arrays to store all the data.

Moving data

Connect to your different data sources, on premises and in the cloud, and collect data from IoT devices. Next, collect and organize the relevant datasets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. You can use a variety of file transfer and ETL tools for this work.

Cleaning and preparing data

Next, collected data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. You must clean, de-duplicate, and match related records.

Today, organizations accomplish these tasks using rigid and complex SQL statements that perform unreliably and are difficult to maintain. This complex process of collecting, cleaning, and transforming the incoming data requires manual monitoring to avoid errors. Many customers use AWS Glue for this task.

Configuring and enforcing policies

Customers and regulators require that organizations secure sensitive data. Compliance involves creating and applying data access, protection, and compliance policies. For example, you restrict access to personally identifiable information (PII) at the table, column, or row level, encrypt all data, and keep audit logs of who is accessing the data.

Today, you can secure data using access control lists on S3 buckets or third-party encryption and access control software. You create and maintain data access, protection, and compliance policies for each analytics service requiring access to the data. For example, if you are running analysis against your data lake using Amazon Redshift and Amazon Athena, you must set up access control rules for each of these services.

Many customers use AWS Glue Data Catalog resource policies to configure and control metadata access to their data. Some choose to use Apache Ranger. But these approaches can be painful and limiting. S3 policies provide at best table-level access. And you must maintain data and metadata policies separately. With Apache Ranger, you can configure metadata access to only one cluster at a time. Also, policies can become wordy as the number of users and teams accessing the data lake grows within an organization.

Making it easy to find data

Users with different needs, like analysts and data scientists, may struggle to find and trust relevant datasets in the data lake. To make it easy for users to find relevant and trusted data, you must clearly label the data in a data lake catalog. Provide users with the ability to access and analyze this data without making requests to IT.

Today, each of these steps involves a lot of manual work. Customer labor includes building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and ML. With all these steps, a fully productive data lake can take months to implement.

The wide range of AWS services provides all the building blocks of a data lake, including many choices for storage, computing, analytics, and security. In the nearly 13 years that AWS has been operating Amazon S3 with exabytes of data, it has also become the clear first choice for data lakes. AWS Glue adds a data catalog and serverless transformation capabilities. Amazon EMR brings managed big data processing frameworks like Apache Spark and Apache Hadoop. Amazon Redshift Spectrum offers data warehouse functions directly on data in Amazon S3. Athena brings serverless SQL querying.

With all these services available, customers have been building data lakes on AWS for years. AWS runs over 10,000 data lakes on top of S3, many using AWS Glue for the shared AWS Glue Data Catalog and data processing with Apache Spark.

AWS has learned from the thousands of customers running analytics on AWS that most customers who want to do analytics also want to build a data lake. But many of you want this process to be easier and faster than it is today.

AWS Lake Formation (now generally available)

At AWS re:Invent 2018, AWS introduced Lake Formation: a new managed service to help you build a secure data lake in days. If you missed it, watch Andy Jassy’s keynote announcement. Lake Formation has several advantages:

  • Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster.
  • Enforce security policies across multiple services: After your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications.
  • Gain and manage new insights: With Lake Formation, you build a data catalog that describes available datasets and their appropriate business uses. This catalog makes your users more productive by helping them find the right dataset to analyze.

The following screenshot illustrates Lake Formation and its capabilities.

How to create a data lake

S3 forms the storage layer for Lake Formation. If you already use S3, you typically begin by registering existing S3 buckets that contain your data. Lake Formation can also create new buckets for the data lake and import data into them. AWS always stores this data in your account, and only you have direct access to it.

There is no lock-in to Lake Formation for your data. Because your data is stored in standard formats like CSV, ORC, or Parquet, you can use it with a wide variety of AWS or third-party analytics tools.

Lake Formation also optimizes the partitioning of data in S3 to improve performance and reduce costs. The raw data you load may reside in partitions that are too small (requiring extra reads) or too large (reading more data than needed). Lake Formation organizes your data by size, time, or relevant keys to allow fast scans and parallel, distributed reads for the most commonly used queries.

How to load data and catalog metadata

Lake Formation uses the concept of blueprints for loading and cataloging data. You can run blueprints one time for an initial load or set them up to be incremental, adding new data and making it available.

With Lake Formation, you can import data from MySQL, Postgres, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. You can also import from on-premises databases by connecting with Java Database Connectivity (JDBC).

Point Lake Formation to the data source, identify the location to load it into the data lake, and specify how often to load it. The blueprint discovers the source table schema, automatically converts data to the target data format, partitions the data based on the partitioning schema, and tracks data that has already been processed. You can customize all of these actions.

Blueprints rely on AWS Glue as a supporting service. AWS Glue crawlers connect to and discover the raw data to be ingested. AWS Glue code generation and jobs generate the ingest code to bring that data into the data lake. Lake Formation uses the same Data Catalog for organizing the metadata. AWS Glue stitches together crawlers and jobs and allows for monitoring of individual workflows. In these ways, Lake Formation is a natural extension of AWS Glue capabilities.
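
Because a blueprint materializes as an AWS Glue workflow, you can inspect or rerun the underlying workflow with the Glue APIs. The following is a sketch with a placeholder workflow name; the real name is generated when you create the blueprint.

import boto3

glue = boto3.client("glue")

# Placeholder: use the workflow name shown for your blueprint in the console.
workflow_name = "lakeformation-blueprint-workflow"

# Inspect the crawlers and jobs the blueprint stitched together.
workflow = glue.get_workflow(Name=workflow_name, IncludeGraph=True)["Workflow"]
for node in workflow.get("Graph", {}).get("Nodes", []):
    print(node["Type"], node["Name"])

# Trigger another run of the ingest workflow on demand.
run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]
print("Started workflow run:", run_id)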

The following graphics show the Blueprint Workflow and Import screens:

How to transform and prepare data for analysis

In addition to supporting all the same ETL capabilities as AWS Glue, Lake Formation introduces new Amazon ML Transforms. This feature includes a fuzzy logic blocking algorithm that can de-duplicate 400M+ records in less than 2.5 hours, orders of magnitude better than earlier approaches.

To match and de-duplicate your data using Amazon ML Transforms, first merge related datasets. Amazon ML Transforms divides these sets into training and testing samples, then scans for exact and fuzzy matches. You can provide more data and examples for greater accuracy, putting these into production to process new data as it arrives in your data lake. The partitioning algorithm requires minimal tuning. The confidence level reflects the quality of the grouping, improving on earlier, more ad hoc approaches. The following diagram shows this matching and de-duplicating workflow.

Amazon.com is currently using and vetting Amazon ML Transforms internally, at scale, for retail workloads. Lake Formation now makes these algorithms available to customers, so you can avoid the frustration of creating complex and fragile SQL statements to handle record matching and de-duplication. Amazon ML Transforms help improve data quality before analysis. For more information, see Fuzzy Matching and Deduplicating Data with Amazon ML Transforms for AWS Lake Formation.
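
Programmatically, these transforms surface as AWS Glue ML transforms with the FIND_MATCHES transform type. The following is only a sketch, not the exact workflow described in this post; the database, table, role, and primary key column are placeholders.

import boto3

glue = boto3.client("glue")

# Placeholder inputs: substitute your own catalog table, IAM role, and key column.
glue.create_ml_transform(
    Name="dedupe-customers",
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
    InputRecordTables=[
        {"DatabaseName": "sales-db", "TableName": "customers"},
    ],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            # Bias the model toward precision (fewer false matches) over recall.
            "PrecisionRecallTradeoff": 0.9,
        },
    },
    GlueVersion="1.0",
    MaxCapacity=10.0,
)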

How to set access control permissions

Lake Formation lets you define policies and control data access with simple grant and revoke operations on data at granular levels. You can assign permissions to IAM users, roles, groups, and Active Directory users using federation. You specify permissions on Data Catalog objects (like tables and columns) rather than on buckets and objects.

You can easily view and audit all the data policies granted to a user—in one place. Search and view the permissions granted to a user, role, or group through the dashboard; verify permissions granted; and when necessary, easily revoke policies for a user. The following screenshots show the Grant permissions console:
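
One way to script that audit view is to page through the ListPermissions API for a principal. A minimal sketch; the user ARN is a placeholder.

import boto3

lf = boto3.client("lakeformation")

principal = {"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/analyst1"}

# Page through every Lake Formation permission granted to this principal.
kwargs = {"Principal": principal}
while True:
    page = lf.list_permissions(**kwargs)
    for grant in page["PrincipalResourcePermissions"]:
        print(grant["Permissions"], grant["Resource"])
    token = page.get("NextToken")
    if not token:
        break
    kwargs["NextToken"] = token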

How to make data available for analytics

Lake Formation offers unified, text-based, faceted search across all metadata, giving users self-serve access to the catalog of datasets available for analysis. This catalog includes discovered schemas (as discussed previously) and lets you add attributes like data owners, stewards, and other business-specific attributes as table properties.

At a more granular level, you can also add data sensitivity level, column definitions, and other attributes as column properties. You can explore data by any of these properties. But access is subject to user permissions. See the following screenshot of the AWS Glue tables tab:

How to monitor activity

With Lake Formation, you can also see detailed alerts in the dashboard, and then download audit logs for further analytics.

Amazon CloudWatch publishes all data ingestion events and catalog notifications. In this way, you can identify suspicious behavior or demonstrate compliance with rules.
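
Lake Formation API activity is also recorded in AWS CloudTrail, so you can pull recent events from code in addition to reviewing them in the console. A sketch using the CloudTrail LookupEvents API, filtered on the Lake Formation event source:

import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up recent management events emitted by the Lake Formation service.
response = cloudtrail.lookup_events(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "lakeformation.amazonaws.com"}
    ],
    MaxResults=20,
)
for event in response["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username"))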

To monitor and control access using Lake Formation, first define the access policies, as described previously. Users who want to conduct analysis access data directly through an AWS analytics service, such as Amazon EMR for Spark, Amazon Redshift, or Athena. Or, they access data indirectly with Amazon QuickSight or Amazon SageMaker.

A service forwards the user credentials to Lake Formation for the validation of access permissions. Then Lake Formation returns temporary credentials granting access to the data in S3, as shown in the following diagrams. After a user gains access, actual reads and writes of data operate directly between the analytics service and S3. This approach removes the need for an intermediary in the critical data-processing path.

The following screenshot and diagram show how to monitor and control access using Lake Formation.

Conclusion

With just a few steps, you can set up your data lake on S3 and start ingesting data that is readily queryable. To get started, go to the Lake Formation console and add your data sources. Lake Formation crawls those sources and moves the data into your new S3 data lake.

Lake Formation can automatically lay out the data in S3 partitions; change it into formats for faster analytics, like Apache Parquet and ORC; and increase data quality through machine-learned record matching and de-duplication.

From a single dashboard, you can set up all the permissions for your data lake. Those permissions are implemented for every service accessing this data – including analytics and ML services (Amazon Redshift, Athena, and Amazon EMR for Apache Spark workloads). Lake Formation saves you the hassle of redefining policies across multiple services and provides consistent enforcement of and compliance with those policies.

Learn how to start using AWS Lake Formation.


About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

Prajakta Damle is a Principal Product Manager at Amazon Web Services.