All posts by Julia Soscia

Discover metadata with AWS Lake Formation: Part 2

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/discover-metadata-with-aws-lake-formation-part-2/

Data lakes are an increasingly popular way to aggregate, store, and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage your data lakes.

In Part 1 of this post series, you learned how to create and explore a data lake using Lake Formation. This post walks you through data discovery using the metadata search capabilities of Lake Formation in the console, and metadata search results restricted by column permissions.

Prerequisites

For this post, you need the following:

Metadata search in the console

In this post, we demonstrate the catalog search capabilities offered by the Lake Formation console:

  • Search by classification
  • Search by keyword
  • Search by tag: attribute
  • Multiple filter searches

Search by classification

Using the metadata catalog search capabilities, search across all tables within your data lake. Your catalog contains three tables: two share the name amazon_reviews but belong to your simulated “prod” and “test” databases, and the third is trip_data.

  1. In the Lake Formation console, under Data catalog, choose Tables.
  2. In the search bar, under Resource Attributes, choose Classification, type CSV, and press Enter. You should see only the trip_data table, which you formatted as CSV in your data lake. The amazon_reviews tables do not appear because they are in Parquet format.
  3. In the Name column, choose trip_data. Under Table details, you can see that the classification CSV is correctly identified by the metadata search filter. (A CLI equivalent of this search follows these steps.)
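Because the Lake Formation catalog is shared with AWS Glue, you can run an equivalent search from the AWS CLI. A minimal sketch; the filter values and --query expression are illustrative:

    # Find tables whose classification property equals csv.
    aws glue search-tables \
        --filters '[{"Key": "classification", "Value": "csv", "Comparator": "EQUALS"}]' \
        --query 'TableList[].Name'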

Search by keyword

Next, search across your entire data lake filtering metadata by keyword.

  1. To refresh the list of tables, under Data catalog, choose Tables again.
  2. From the search bar, type star_rating, and press Enter. Now that you have applied the filter, you should see only the amazon_reviews tables because they both contain a column named star_rating.
  3. Choose either of the two tables and scroll down to the Schema section to confirm that it contains a star_rating column. (A CLI equivalent of this keyword search follows these steps.)
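The same keyword search is available through the SearchTables API. A minimal sketch; the --query expression is illustrative:

    # Search all table metadata for the keyword star_rating.
    aws glue search-tables \
        --search-text star_rating \
        --query 'TableList[].{Database: DatabaseName, Table: Name}'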

Search by tag: attribute

Next, search across your data lake and filter results by metadata tags and their attribute value.

  1. To refresh the list of tables, under Data catalog, choose Tables.
  2. From the search bar, type department: research, and press Enter. Now that you have applied the filter, you should see only the trip_data table, because it is the only table with the value research in its department table property.
  3. Select the trip_data table. Under Table details, you can see the tag: attribute of department | research listed under Table properties. (You can also inspect these properties from the CLI, as sketched below.)
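Table properties are stored as key-value parameters on the catalog table, so you can read them directly. A minimal sketch, assuming the crawler named the table trip_data:

    # Print the table properties (key-value parameters) for trip_data.
    aws glue get-table \
        --database-name ny-taxi \
        --name trip_data \
        --query 'Table.Parameters'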

Multiple filter searches

Finally, try searching across your entire data lake using multiple filters at one time.

  1. To refresh the list of tables, under Data catalog, choose Tables.
  2. In the search bar, choose Location, type S3, and press Enter. For this post, all of your catalog tables are in S3, so all three tables display.
  3. In the search bar, choose Classification, type parquet, and press Enter. You should see only the amazon_reviews tables because they are the only tables stored in S3 in Parquet format.
  4. Choose either of the displayed amazon_reviews tables. Under Table details, you can confirm that both of the following are true:
  • Location: S3
  • Classification: parquet

Metadata search results restricted by column permissions

The metadata search capabilities return results based on the permissions specified within Lake Formation. If a user or a role does not have permission to a particular database, table, or column, that element doesn’t appear in that user’s search results.

To demonstrate this, first create an IAM user, dataResearcher, with AWS Management Console access. Make sure to store the password somewhere safe.

To simplify this post, attach the AdministratorAccess policy to the user. This policy grants full access to your AWS account, which is overly permissive. I recommend that you either remove this user after completing the post or remove this policy and enable multi-factor authentication (MFA). For more information, see Creating an IAM user in the console.
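You can also create the user from the AWS CLI. A minimal sketch; the password is a placeholder, and as noted above, AdministratorAccess is far too broad for anything beyond a demo:

    # Create the demo user and give it console access (placeholder password).
    aws iam create-user --user-name dataResearcher
    aws iam create-login-profile --user-name dataResearcher \
        --password '<CHOOSE_A_STRONG_PASSWORD>' --password-reset-required

    # Overly permissive; for this walkthrough only.
    aws iam attach-user-policy --user-name dataResearcher \
        --policy-arn arn:aws:iam::aws:policy/AdministratorAccess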

In Part 1 of this series, you allowed Everyone to view the tables that the AWS Glue crawlers created. Now, revoke those permissions for the ny-taxi database.

  1. In the Lake Formation console, under Permissions, choose Data permissions.
  2. Scroll down or search until you see the Everyone record for the trip_data table.
  3. Select the record and choose Revoke, then choose Revoke again to confirm. (A CLI sketch of this revocation follows these steps.)
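The same revocation can be issued from the Lake Formation CLI. A minimal sketch, assuming that the console’s Everyone group corresponds to the IAM_ALLOWED_PRINCIPALS virtual principal and that the original grant was ALL:

    # Revoke the Everyone grant on trip_data (see assumptions above).
    aws lakeformation revoke-permissions \
        --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
        --permissions ALL \
        --resource '{"Table": {"DatabaseName": "ny-taxi", "Name": "trip_data"}}'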

Now, your dataResearcher IAM user cannot see the ny-taxi database or the trip_data table. Resolve this issue by setting up Lake Formation permissions.

  1. Under Permissions, choose Data permissions, Grant.
  2. Select the dataResearcher user, the ny-taxi database, and the trip_data table.
  3. Under Table permissions, check Select and choose Grant. (A CLI sketch of this grant follows these steps.)
  4. Log out of the console and sign back in using the dataResearcher IAM user that you created earlier.
  5. In the Lake Formation console, choose Tables, select the trip_data table, and look at its properties:
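A minimal CLI sketch of the grant in steps 1-3; the account ID is a placeholder:

    # Grant the dataResearcher user SELECT on the trip_data table.
    aws lakeformation grant-permissions \
        --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/dataResearcher \
        --permissions SELECT \
        --resource '{"Table": {"DatabaseName": "ny-taxi", "Name": "trip_data"}}'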

The dataResearcher user currently has visibility across all of these columns. However, you don’t want to allow this user to see the pickup or drop-off locations, as those are potential privacy risks. Remove these columns from the dataResearcher user’s permissions.

  1. Log out of the dataResearcher user and log back in with your administrative account.
  2. In the Lake Formation console, under Permissions, choose Data permissions.
  3. Select the dataResearcher record and choose Revoke.
  4. On the Revoke page, under Column, choose All columns except the exclude columns, and then choose vendor_id, passenger_count, trip_distance, and total_amount as the columns to exclude from the revocation.
  5. Under Table permissions, check Select. These settings revoke all of the dataResearcher user’s permissions on the trip_data table except for the four excluded columns. In other words, the dataResearcher user can only Select (view) those four columns. (A CLI sketch of this column-level revoke follows these steps.)
  6. Choose Revoke.
  7. Log back in as the dataResearcher user.
  8. In the Lake Formation console, choose Data catalog, Tables. Search for vendor_id and press Enter. The trip_data table appears in the search, as shown in the following screenshot.
  9. Search for pu_location_id. This returns no results because you revoked permissions to this column, as shown in the following screenshot.
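Column-level security like this can also be expressed in the CLI with a column wildcard that lists the columns excluded from the revocation. A minimal sketch; the account ID is a placeholder:

    # Revoke SELECT on every column of trip_data except the four listed.
    aws lakeformation revoke-permissions \
        --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/dataResearcher \
        --permissions SELECT \
        --resource '{"TableWithColumns": {"DatabaseName": "ny-taxi", "Name": "trip_data",
            "ColumnWildcard": {"ExcludedColumnNames": ["vendor_id", "passenger_count",
            "trip_distance", "total_amount"]}}}'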

Conclusion

Congratulations: You have learned how to use the metadata search capabilities of Lake Formation. By defining specific user permissions, Lake Formation allowed you to grant and revoke access to metadata in the Data Catalog as well as the underlying data stored in S3. Therefore, you can discover your data sources across your entire AWS environment using a single pane of glass. To learn more, see AWS Lake Formation.


About the Authors

Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.

Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.

Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years of experience in Analytics and Data Management.

Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.

Discovering metadata with AWS Lake Formation: Part 1

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/discovering-metadata-with-aws-lake-formation-part-1/

Data lakes are an increasingly popular way to create a single repository to store and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage data lakes. This post walks you through the creation and exploration of a data lake using Lake Formation:

  • Creating the data lake
    o  Adding data to your data lake
    o  Creating catalog databases
    o  Adding tables from Amazon S3 to catalog databases
  • Editing and adding metadata within the catalog
    o  Editing standard metadata
    o  Adding custom metadata

Prerequisites

For this post, you need the following:

Create the data lake

In the AWS Lake Formation console, in the left navigation pane, choose Register and ingest, Data lake locations. Select a single S3 bucket to house several independent data sources in your data lake. For more information, see What is AWS Lake Formation?

Add data to your data lake

Now that you have an S3 bucket configured as a storage resource for Lake Formation, you must add data to your data lake. You can add data to your data lake’s S3 bucket storage resource using AWS SDKs, AWS CLI, the S3 console, or a Lake Formation blueprint.

With Lake Formation, you can discover and set up the ingestion of your source data. When you add a workflow that loads or updates the data lake, you can choose a blueprint or template of the type of importer to add. Lake Formation provides several blueprints on the Lake Formation console for common source data types to simplify the creation of workflows. Workflows point to your data source and target and specify the frequency that they run.

For this post, use the AWS CLI to download sample data and then upload it to your S3 storage backend. Other import methods, such as Lake Formation data importers, are outside the scope of this post.

Sample from the following two datasets provided on the Registry of Open Data on AWS:

  • Amazon customer reviews
  • New York City Taxi and Limousine Commission (TLC) trip record data

Make two copies of the Amazon customer reviews dataset in your data lake. You can use these to simulate “production” and “test” datasets and learn how to target one or both when searching your metadata catalog.

To demonstrate the flexibility of an AWS data lake, add both CSV and Parquet datasets to your data lake. In both cases, use the following naming convention for your S3 objects:

s3://BUCKET_NAME/DATABASE_NAME/TABLE_NAME/<data files>

Add Amazon customer reviews to your data lake

AWS hosts a registry to help people share and discover a variety of datasets. For this post, copy a subset of the Amazon customer reviews dataset into your data lake. You don’t have to copy the complete reviews dataset, only the smaller 226-MB portion of watch reviews. You need two copies of this data in your data lake to simulate separate “production” and “test” databases.

  1. If you have not already, install and configure the AWS CLI with IAM user access keys that include permission to read from S3 and write to your Lake Formation S3 bucket.
  2. Copy the source to your data lake:
    # Replace with your bucket name:
    YOUR_BUCKET=lf-metadata-demo
     
    aws s3 cp \
       s3://amazon-reviews-pds/parquet/product_category=Watches/ \
       s3://$YOUR_BUCKET/amazon-reviews-prod/amazon-reviews/ --recursive
     
    aws s3 cp \
       s3://amazon-reviews-pds/parquet/product_category=Watches/ \
       s3://$YOUR_BUCKET/amazon-reviews-test/amazon-reviews/ --recursive

  3. In the S3 console, confirm that your S3 bucket now contains your two Amazon reviews datasets.
  4. Inspect the contents of the folders. The datasets are in Parquet format. (A CLI listing is sketched after these steps.)
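You can also verify the uploads from the CLI. A minimal sketch, reusing the bucket variable from the previous step:

    # List everything in the bucket and summarize object count and total size.
    aws s3 ls s3://$YOUR_BUCKET/ --recursive --human-readable --summarize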

Add New York taxi ride history to your data lake

Much as you did with the Amazon customer reviews dataset, copy a small subset of New York taxi ride history from the Registry of Open Data on AWS into your data lake:

  1. Copy the source data to your data lake:
    # Replace with your bucket name:
    YOUR_BUCKET=lf-metadata-demo
     
    aws s3 cp \
       "s3://nyc-tlc/trip data/green_tripdata_2018-02.csv" \
       "s3://$YOUR_BUCKET/ny-taxi/trip-data/green_tripdata_2018-02.csv"

  2. In the S3 console, validate that your S3 bucket contains CSV data for NY taxi trips.

Create catalog databases

You have created an S3 bucket to act as your data lake storage backend and added data to the bucket. However, this data is not readily available in Lake Formation until you catalog the data.

Lake Formation maintains a Hive-compatible data catalog of data within your data lake. Before you can catalog data within your S3 storage backend or use Lake Formation data importers (discussed later) to push data to S3, you must first create a database.

A Lake Formation database is a logical construct to which you later add tables. Each table contains a mapping to one or more objects in S3 that, collectively, represent that table. Tables also contain basic metadata including but not limited to file format, S3 location, column headings, and column types. Lake Formation users can also optionally define arbitrary key-value pairs for tables and columns to better describe the data and act as query-able attributes for data discovery.

You can create one or more databases and populate their tables either manually in the console, programmatically using the AWS SDKs or AWS CLI, or automatically by defining AWS Glue crawlers.

For this post, you must define three logical databases:

  • amazon-reviews-prod
  • amazon-reviews-test
  • ny-taxi

Then, use the cataloging process to map these databases to the datasets that you previously uploaded to your S3 storage backend. Remember, you intentionally created two copies of the Amazon reviews dataset to simulate both a production and test database in your data lake.

Now, create your databases. First, configure IAM users and roles as administrators within Lake Formation.

Catalog permissions are permissions that the selected IAM principal can use directly. Grantable permissions are those that the IAM principal can grant to other IAM principals later.

For example, you might want to give your database administrator (DBA) the ability to create databases, by granting permissions to the catalog. However, you can prevent the DBA from accidentally giving this access to your developers by not enabling the grantable permission.

Now that you’ve granted necessary permissions, you can proceed to create your database within the catalog.

  • For Name, enter amazon-reviews-prod.
  • For Location, enter s3://<YOUR_BUCKET>/amazon-reviews-prod.
  • For Description, enter a brief, meaningful description.
  • Check Grant All to Everyone for new tables in this database. In production, assess whether this is appropriate.

Repeat the process for the other two databases:

  • Name: amazon-reviews-test
    Location: s3://<YOUR_BUCKET>/amazon-reviews-test
  • Name: ny-taxi
    Location: s3://<YOUR_BUCKET>/ny-taxi

After completing these steps, you should have three databases in your catalog: amazon-reviews-prod, amazon-reviews-test, and ny-taxi.
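Because the catalog is shared with AWS Glue, you can also create the databases from the CLI. A minimal sketch for the first database; repeat it with the other two names and locations:

    # Create one catalog database; repeat for amazon-reviews-test and ny-taxi.
    aws glue create-database --database-input '{
        "Name": "amazon-reviews-prod",
        "LocationUri": "s3://YOUR_BUCKET/amazon-reviews-prod",
        "Description": "Simulated production copy of the Amazon reviews dataset"
    }'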

Add tables from S3 to your catalog databases

In the previous section, you created three databases in your Lake Formation catalog. However, these catalog databases are empty and do not yet provide information about the specific tables, schema, file formats, or object paths in S3. To add this information, use one of the following two methods:

  • Manually define your tables in the catalog using the console, SDKs, or AWS CLI.
  • Use an AWS Glue crawler to search S3 and automatically add discovered tables to your catalog.

For this post, create and manually run one AWS Glue crawler for each of your three datasets in S3 and databases in the Lake Formation data catalog. A detailed walkthrough is outside the scope of this post. For guidance, see Working with Crawlers on the AWS Glue Console.

As you proceed, please bear the following in mind:

  • Create one crawler for each of your three datasets. You should be able to accept most of the default crawler settings. However, the S3 path for your crawlers should read:

s3://YOUR_BUCKET/amazon-reviews-prod/amazon-reviews

s3://YOUR_BUCKET/amazon-reviews-test/amazon-reviews

s3://YOUR_BUCKET/ny-taxi/trip-data

Before you run the crawlers to populate your catalog, you must assign them an IAM role. The role grants them permission to read from your data lake’s S3 bucket, write crawler logs to Amazon CloudWatch, and update your data catalog. Regardless of whether you create a new role or use an existing one, make a note of the IAM role name. You need this information for the next step. (A CLI sketch of crawler creation follows.)
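If you prefer the CLI, crawler creation looks roughly like the following. A minimal sketch for the taxi dataset only; the crawler and role names are placeholders, and you should start the crawler only after granting it the Lake Formation permissions described next:

    # Define a crawler for the taxi data (placeholder names).
    aws glue create-crawler \
        --name ny-taxi-crawler \
        --role YOUR_CRAWLER_ROLE_NAME \
        --database-name ny-taxi \
        --targets '{"S3Targets": [{"Path": "s3://YOUR_BUCKET/ny-taxi/trip-data"}]}'

    # Run it once the Lake Formation grant below is in place.
    aws glue start-crawler --name ny-taxi-crawler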

In addition to permissions defined within IAM, you must also explicitly grant IAM principals (roles or users) the ability to modify your Data Catalog from within Lake Formation itself. This is similar to the way that S3 bucket policies work alongside IAM. In the Lake Formation console, under Permissions, choose Data permissions.

Grant your AWS Glue crawlers the ability to modify your Data Catalog. Configure the following fields:

  • For IAM users and roles, select the IAM roles that you previously used for your AWS Glue crawlers.
  • For Database, select the amazon-reviews-prod, amazon-reviews-test, and ny-taxi databases.
  • For Database permissions, select all permissions.
  • Leave all Grantable permissions unselected.

After your AWS Glue crawlers have permission to modify your Lake Formation data catalog, return to the AWS Glue console and manually run your three crawlers. After a few minutes, the crawlers should complete their runs. Each should add one table to your data catalog:

  • amazon_reviews (in the amazon-reviews-prod database)
  • amazon_reviews (in the amazon-reviews-test database)
  • trip_data (in the ny-taxi database)

Verify that your catalog was updated. In the Lake Formation console, under Data catalog, choose Tables, and view the three new tables added to the corresponding data lake databases, as shown in the following screenshot.
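You can run the same verification from the CLI. A minimal sketch for the ny-taxi database; repeat it for the other two:

    # List the tables that the crawler added to the ny-taxi database.
    aws glue get-tables --database-name ny-taxi --query 'TableList[].Name'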

Edit and add metadata within the catalog

The AWS Glue crawlers populate standard metadata about the tables they discover in S3, including (but not limited to) attributes such as object location, file format, column headings, and column types.

However, you can manually edit standard metadata or add additional custom metadata to the catalog to make it easier to search and improve the overall value that it provides. In the following section, I walk through several examples of editing and adding to metadata.

Edit standard metadata

The AWS Glue crawlers infer column names from the first line of each CSV file. To view the auto-populated column names for the trip_data table in the ny-taxi database, look at the table properties:

  1. Under Data catalog, choose Tables.
  2. Select trip_data and scroll down to the Schema section.
  3. Choose Edit Schema. Your column names should consistently use snake case, which means separating words with the ‘_’ character. If you look at the first row of the raw data, you notice that the id columns are named inconsistently. Rather than changing the source files, you can manually change the metadata so that all of the id columns match the rest of the columns.
  4. Select the vendorid row and choose Edit. Rename the column to vendor_id to match the snake casing, and choose Save.
  5. Repeat the previous step for dolocationid, ratecodeid, and pulocationid. After you make these changes, choose Save as new version.
  6. Under Data catalog, choose Tables. If you search for pulocationid, no results should return.
  7. Search for the new column name, pu_location_id. This search should return the expected result, the trip_data table from the ny-taxi database. (You can also confirm the rename from the CLI, as sketched below.)
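A minimal CLI sketch to confirm the renamed columns in the catalog:

    # Print the column names recorded in the catalog for trip_data.
    aws glue get-table --database-name ny-taxi --name trip_data \
        --query 'Table.StorageDescriptor.Columns[].Name'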

Add custom metadata

Now, try adding a couple of custom table properties to help organize your tables. The first table property to add is an environment variable to help you determine whether a table is for development, testing, or production. The second is a department variable, which allows you to group tables by department.

  1. In the Lake Formation console, under Data catalog, choose Databases.
  2. Select the ny-taxi database and choose View tables.
  3. Select the trip_data table and choose Edit Table.
  4. Under Table properties, choose Add. Set the value of environment to dev and the value of department to research1. Choose Save.
  5. Under Data catalog, choose Tables. In the search bar, type “research,” and press Enter. No results return because there isn’t a table with the table property value of research. However, searching for research1 should return the trip_data table.
  6. Go back to your table properties for trip_data and update the department property from research1 to research. After you’ve made the edit, the trip_data table appears when entering “research” as a keyword in the table search:

Conclusion

Congratulations: You have successfully created and edited your first data lake using Lake Formation. You used the service to secure and ingest data into an S3 data lake, catalog the data, and customize the metadata of the data sources. In Part 2 of this series, I show you how to discover your data by using the metadata search capabilities of Lake Formation.


About the Authors

Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.

Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.

Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years of experience in Analytics and Data Management.

Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.

Introducing the “Preparing for the California Consumer Privacy Act” whitepaper

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/security/introducing-the-preparing-for-the-california-consumer-privacy-act-whitepaper/

AWS has published a whitepaper, Preparing for the California Consumer Privacy Act, to provide guidance on designing and updating your cloud architecture to follow the requirements of the California Consumer Privacy Act (CCPA), which goes into effect on January 1, 2020.

The whitepaper is intended for engineers and solution builders, but it also serves as a guide for qualified security assessors (QSAs) and internal security assessors (ISAs) so that you can better understand the range of AWS products and services that are available for you to use.

The CCPA was enacted into law on June 28, 2018, and grants California consumers certain privacy rights. The CCPA grants consumers the right to request that a business disclose the categories and specific pieces of personal information collected about the consumer, the categories of sources from which that information is collected, the “business purposes” for collecting or selling the information, and the categories of third parties with whom the information is shared. The whitepaper addresses the three main subsections of the CCPA: data collection, data retrieval and deletion, and data awareness.

To read the text of the CCPA, please visit the website for California Legislative Information.

If you have questions or want to learn more, contact your account executive or leave a comment below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Julia Soscia

Julia is a Solutions Architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.


Anthony Pasquarielo

Anthony is a Solutions Architect at Amazon Web Services. He’s based in New York City. His main focus is providing customers technical guidance and consultation during their cloud journey. Anthony enjoys delighting customers by designing well-architected solutions that drive value and provide growth opportunity for their business.


Justin De Castri

Justin is a Manager of Solutions Architecture at Amazon Web Services based in New York City. His primary focus is helping customers build secure, scalable, and cost-optimized solutions that are aligned with their business objectives.