Tag Archives: Analytics

Orchestrate Amazon Redshift-Based ETL workflows with AWS Step Functions and AWS Glue

Post Syndicated from Ben Romano original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-redshift-based-etl-workflows-with-aws-step-functions-and-aws-glue/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. Many customers also like to use Amazon Redshift as an extract, transform, and load (ETL) engine to use existing SQL developer skillsets, to quickly migrate pre-existing SQL-based ETL scripts, and—because Amazon Redshift is fully ACID-compliant—as an efficient mechanism to merge change data from source data systems.

In this post, I show how to use AWS Step Functions and AWS Glue Python Shell to orchestrate tasks for those Amazon Redshift-based ETL workflows in a completely serverless fashion. AWS Glue Python Shell is a Python runtime environment for running small to medium-sized ETL tasks, such as submitting SQL queries and waiting for a response. Step Functions lets you coordinate multiple AWS services into workflows so you can easily run and monitor a series of ETL tasks. Both AWS Glue Python Shell and Step Functions are serverless, allowing you to automatically run and scale them in response to events you define, rather than requiring you to provision, scale, and manage servers.

While many traditional SQL-based workflows use internal database constructs like triggers and stored procedures, separating workflow orchestration, task, and compute engine components into standalone services allows you to develop, optimize, and even reuse each component independently. So, while this post uses Amazon Redshift as an example, my aim is to more generally show you how to orchestrate any SQL-based ETL.

Prerequisites

If you want to follow along with the examples in this post using your own AWS account, you need a Virtual Private Cloud (VPC) with at least two private subnets that have routes to an S3 VPC endpoint.

If you don’t have a VPC, or are unsure if yours meets these requirements, I provide an AWS CloudFormation template stack you can launch by selecting the following button. Provide a stack name on the first page and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

Scenario

For the examples in this post, I use the Amazon Customer Reviews Dataset to build an ETL workflow that completes the following two tasks, which represent a simple ETL process:

  • Task 1: Move a copy of the dataset containing reviews from the year 2015 and later from S3 to an Amazon Redshift table.
  • Task 2: Generate a set of output files in another Amazon S3 location that identify the “most helpful” reviews by marketplace and product category, allowing an analytics team to glean information about high-quality reviews.

This dataset is publicly available via an Amazon Simple Storage Service (Amazon S3) bucket. Complete the following tasks to get set up.

Solution overview

The following diagram highlights the solution architecture from end to end:

The steps in this process are as follows:

  1. The state machine launches a series of runs of an AWS Glue Python Shell job (more on how and why I use a single job later in this post!) with parameters for retrieving database connection information from AWS Secrets Manager and an .sql file from S3.
  2. Each run of the AWS Glue Python Shell job uses the database connection information to connect to the Amazon Redshift cluster and submit the queries contained in the .sql file.
    1. For Task 1: The cluster uses Amazon Redshift Spectrum to read data from S3 and load it into an Amazon Redshift table. Amazon Redshift Spectrum is commonly used as a means for loading data to Amazon Redshift. (See Step 7 of Twelve Best Practices for Amazon Redshift Spectrum for more information.)
    2. For Task 2: The cluster executes an aggregation query and exports the results to another Amazon S3 location via UNLOAD.
  3. The state machine may send a notification to an Amazon Simple Notification Service (SNS) topic in the case of pipeline failure.
  4. Users can query the data from the cluster and/or retrieve report output files directly from S3.

I include an AWS CloudFormation template to jumpstart the ETL environment so that I can focus this post on the steps dedicated to building the task and orchestration components. The template launches the following resources:

  • Amazon Redshift Cluster
  • Secrets Manager secret for storing Amazon Redshift cluster information and credentials
  • S3 Bucket preloaded with Python scripts and .sql files
  • Identity and Access Management (IAM) Role for AWS Glue Python Shell jobs

See the following resources for how to complete these steps manually:

Be sure to select at least two private subnets and the corresponding VPC, as shown in the following screenshot. If you are using the VPC template from above, the VPC appears as 10.71.0.0/16 and the subnet names are A private and B private.

The stack should take 10-15 minutes to launch. Once it displays Create Complete, you can move on to the next section. Be sure to take note of the Resources tab in the AWS CloudFormation console, shown in the following screenshot, as I refer to these resources throughout the post.

Building with AWS Glue Python Shell

Begin by navigating to AWS Glue in the AWS Management Console.

Making a connection

Your Amazon Redshift cluster resides in a VPC, so you first need to create a connection using AWS Glue. Connections contain properties, including VPC networking information, needed to access your data stores. You eventually attach this connection to your AWS Glue Python Shell job so that it can reach your Amazon Redshift cluster.

Select Connections from the menu bar, and then select Add connection. Give your connection a name like blog_rs_connection,  select Amazon Redshift as the Connection type, and then select Next, as shown in the following screenshot.

Under Cluster, enter the name of the cluster that the AWS CloudFormation template launched, i.e., blogstack-redshiftcluster-####. Because the Python code I provide for this blog already handles credential retrieval, the rest of the database values you enter here are largely placeholders. The key information you are associating with the connection is networking-related.

Please note that you cannot test the connection without the correct cluster information. If you are interested in doing so, note that Database name and Username are auto-populated after you select the correct cluster, as shown in the following screenshot. Follow the instructions here to retrieve the password from Secrets Manager and copy it into the Password field.

ETL code review

Take a look at the two main Python scripts used in this example:

Pygresql_redshift_common.py is a set of functions that can retrieve cluster connection information and credentials from Secrets Manager, make a connection to the cluster, and submit queries. By retrieving cluster information at runtime via a passed parameter, these functions allow the job to connect to any cluster to which it has access. You can package these functions into a library by following the instructions to create a Python .egg file (already completed as part of the AWS CloudFormation template launch). Note that AWS Glue Python Shell supports several Python libraries natively.

import pg
import boto3
import base64
from botocore.exceptions import ClientError
import json

#uses the Secrets Manager secret name to return connection and credential information
def connection_info(db):

	session = boto3.session.Session()
	client = session.client(
		service_name='secretsmanager'
	)

	get_secret_value_response = client.get_secret_value(SecretId=db)

	if 'SecretString' in get_secret_value_response:
		secret = json.loads(get_secret_value_response['SecretString'])
	else:
		secret = json.loads(base64.b64decode(get_secret_value_response['SecretBinary']))
		
	return secret


#creates a connection to the cluster
def get_connection(db,db_creds):

	con_params = connection_info(db_creds)
	
	rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (con_params['host'], con_params['port'], db, con_params['username'], con_params['password'])
	rs_conn = pg.connect(dbname=rs_conn_string)
	rs_conn.query("set statement_timeout = 1200000")
	
	return rs_conn


#submits a query to the cluster
def query(con,statement):
    res = con.query(statement)
    return res
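
If you want to build the .egg yourself rather than rely on the CloudFormation template, packaging generally only requires a minimal setup.py and a bdist_egg build. The following is a sketch, assuming the module above lives in a redshift_module package directory; the layout and names are illustrative:

# setup.py -- minimal packaging sketch; the CloudFormation template has
# already produced redshift_module-0.1-py3.6.egg for you.
# Assumes a layout of redshift_module/__init__.py and
# redshift_module/pygresql_redshift_common.py.
from setuptools import setup, find_packages

setup(
    name='redshift_module',
    version='0.1',
    packages=find_packages()
)

Running python setup.py bdist_egg in that directory produces the .egg file under dist/, which you can then upload to the script bucket.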

The AWS Glue Python Shell job runs rs_query.py when called. It starts by parsing job arguments that are passed at invocation. It uses some of those arguments to retrieve a .sql file from S3, then connects and submits the statements within the file to the cluster using the functions from pygresql_redshift_common.py. So, in addition to connecting to any cluster using the Python library you just packaged, it can also retrieve and run any SQL statement. This means you can manage a single AWS Glue Python Shell job for all of your Amazon Redshift-based ETL by simply passing in parameters on where it should connect and what it should submit to complete each task in your pipeline.

from redshift_module import pygresql_redshift_common as rs_common
import sys
from awsglue.utils import getResolvedOptions
import boto3

#get job args
args = getResolvedOptions(sys.argv,['db','db_creds','bucket','file'])
db = args['db']
db_creds = args['db_creds']
bucket = args['bucket']
file = args['file']

#get sql statements
s3 = boto3.client('s3') 
sqls = s3.get_object(Bucket=bucket, Key=file)['Body'].read().decode('utf-8')
sqls = sqls.split(';')

#get database connection
print('connecting...')
con = rs_common.get_connection(db,db_creds)

#run each sql statement
print("connected...running query...")
results = []
for sql in sqls[:-1]:
    sql = sql + ';'
    result = rs_common.query(con, sql)
    print(result)
    results.append(result)

print(results)

Creating the Glue Python Shell Job

Next, put that code into action:

  1. Navigate to Jobs on the left menu of the AWS Glue console page and from there, select Add job.
  2. Give the job a name like blog_rs_query.
  3. For the IAM role, select the same GlueExecutionRole you previously noted from the Resources section of the AWS CloudFormation console.
  4. For Type, select Python shell, leave Python version as the default of Python 3, and for This job runs select An existing script that you provide.
  5. For S3 path where the script is stored, navigate to the script bucket created by the AWS CloudFormation template (look for ScriptBucket in the Resources), then select the rs_query.py file under the python/ prefix.
  6. Expand the Security configuration, script libraries, and job parameters section to add the Python .egg file with the Amazon Redshift connection library to the Python library path. It is also located in the script bucket under python/redshift_module-0.1-py3.6.egg.

When all is said and done everything should look as it does in the following screenshot:

Choose Next. Add the connection you created by choosing Select to move it under Required connections. (Recall from the Making a connection section that this gives the job the ability to interact with your VPC.) Choose Save job and edit script to finish, as shown in the following screenshot.

Test driving the Python Shell job

After creating the job, you are taken to the AWS Glue Python Shell IDE. If everything went well, you should see the rs_query.py code. Right now, the Amazon Redshift cluster is sitting there empty, so use the Python code to run the following SQL statements and populate it with tables.

  1. Create an external database (amzreviews).
  2. Create an external table (reviews) through which Amazon Redshift Spectrum can read the source data in S3 (the public reviews dataset). The table is partitioned by product_category because the source files are organized by category, but in general you should partition on frequently filtered columns (see #4).
  3. Add partitions to the external table.
  4. Create an internal table (reviews) local to the Amazon Redshift cluster. product_id works well as a DISTKEY because it has high cardinality and even distribution, and it is most likely (although not explicitly part of this blog’s scenario) a column that will be used to join with other tables. I choose review_date as a SORTKEY to efficiently filter out review data that is not part of my target query (after 2015). Learn more about how to best choose DISTKEY/SORTKEY, as well as additional table design parameters for optimizing performance, by reading the Designing Tables documentation.
    CREATE EXTERNAL SCHEMA amzreviews 
    from data catalog
    database 'amzreviews'
    iam_role 'rolearn'
    CREATE EXTERNAL database IF NOT EXISTS;
    
    
    
    CREATE EXTERNAL TABLE amzreviews.reviews(
      marketplace varchar(10), 
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25), 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_headline varchar(25), 
      review_body varchar(1024), 
      review_date date, 
      year int)
    PARTITIONED BY ( 
      product_category varchar(25))
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/';
      
      
      
    ALTER TABLE amzreviews.reviews ADD
    partition(product_category='Apparel') 
    location 's3://amazon-reviews-pds/parquet/product_category=Apparel/'
    partition(product_category='Automotive') 
    location 's3://amazon-reviews-pds/parquet/product_category=Automotive'
    partition(product_category='Baby') 
    location 's3://amazon-reviews-pds/parquet/product_category=Baby'
    partition(product_category='Beauty') 
    location 's3://amazon-reviews-pds/parquet/product_category=Beauty'
    partition(product_category='Books') 
    location 's3://amazon-reviews-pds/parquet/product_category=Books'
    partition(product_category='Camera') 
    location 's3://amazon-reviews-pds/parquet/product_category=Camera'
    partition(product_category='Grocery') 
    location 's3://amazon-reviews-pds/parquet/product_category=Grocery'
    partition(product_category='Furniture') 
    location 's3://amazon-reviews-pds/parquet/product_category=Furniture'
    partition(product_category='Watches') 
    location 's3://amazon-reviews-pds/parquet/product_category=Watches'
    partition(product_category='Lawn_and_Garden') 
    location 's3://amazon-reviews-pds/parquet/product_category=Lawn_and_Garden';
    
    
    CREATE TABLE reviews(
      marketplace varchar(10),
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25) DISTKEY, 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_date date, 
      year int,
      product_category varchar(25))
      
      SORTKEY (
         review_date
        );

Do this first job run manually so you can see where all of the elements I’ve discussed come into play. Select Run Job at the top of the IDE screen. Expand the Security configuration, script libraries, and job parameters section. This is where you add in the parameters as key-value pairs, as shown in the following screenshot.

Key           Value
--db          reviews
--db_creds    reviewssecret
--bucket      <name of s3 script bucket>
--file        sql/reviewsschema.sql

Select Run job to start it. The job should take a few seconds to complete. You can look for log outputs below the code in the IDE to watch job progress.
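
You can also start the same run programmatically instead of through the console. The following boto3 sketch passes identical arguments; the bucket name is a placeholder for your script bucket:

import boto3

glue = boto3.client('glue')

# Start the same Python Shell job run with the arguments from the table above.
# The bucket value is a placeholder for your script bucket name.
response = glue.start_job_run(
    JobName='blog_rs_query',
    Arguments={
        '--db': 'reviews',
        '--db_creds': 'reviewssecret',
        '--bucket': '<name of s3 script bucket>',
        '--file': 'sql/reviewsschema.sql'
    }
)
print(response['JobRunId'])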

Once the job completes, navigate to Databases in the AWS Glue console and look for the amzreviews database and reviews table, as shown in the following screenshot. If they are there, then everything worked as planned! You can also connect to your Amazon Redshift cluster using the Redshift Query Editor or with your own SQL client tool and look for the local reviews table.
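
If you prefer to check from Python rather than a SQL client, a quick sketch that reuses the connection module from earlier in this post (with the database name and secret used throughout) looks like this:

from redshift_module import pygresql_redshift_common as rs_common

# Connect using the same database name and Secrets Manager secret as the job,
# then confirm the (still empty) local reviews table exists.
con = rs_common.get_connection('reviews', 'reviewssecret')
print(rs_common.query(con, 'SELECT COUNT(*) FROM reviews;'))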

Step Functions Orchestration

Now that you’ve had a chance to run a job manually, it’s time to move onto something more programmatic that is orchestrated by Step Functions.

Launch Template

I provide a third AWS CloudFormation template for kickstarting this process as well. It creates a Step Functions state machine that calls two instances of the AWS Glue Python Shell job you just created to complete the two tasks I outlined at the beginning of this post.

For BucketName, paste the name of the script bucket created in the second AWS CloudFormation stack. For GlueJobName, type in the name of the job you just created. Leave the other information as default, as shown in the following screenshot. Launch the stack and wait for it to display Create Complete—this should take only a couple of minutes—before moving on to the next section.

Working with the Step Functions State Machine

State Machines are made up of a series of steps, allowing you to stitch together services into robust ETL workflows. You can monitor each step of execution as it happens, which means you can identify and fix problems in your ETL workflow quickly, and even automatically.

Take a look at the state machine you just launched to get a better idea. Navigate to Step Functions in the AWS Console and look for a state machine with a name like GlueJobStateMachine-######. Choose Edit to view the state machine configuration, as shown in the following screenshot.

It should look as it does in the following screenshot:

As you can see, state machines are created using JSON templates made up of task definitions and workflow logic. You can run parallel tasks, catch errors, and even pause workflows and wait for manual callback to continue. The example I provide contains two tasks for running the SQL statements that complete the goals I outlined at the beginning of the post:

  1. Load data from S3 using Redshift Spectrum
  2. Transform and write data back to S3

Each task contains basic error handling which, if caught, routes the workflow to an error notification task. This example is a simple one to show you how to build a basic workflow, but you can refer to the Step Functions documentation for an example of more complex workflows to help build a robust ETL pipeline. Step Functions also supports reusing modular components with Nested Workflows.
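
The exact definition deployed by the template is best viewed in the Step Functions console, but as a rough sketch (not necessarily what the template deploys), a two-task workflow with basic error handling expressed in the Amazon States Language, built here as a Python dictionary, might look like the following; the .sql file names, topic ARN, and account details are placeholders:

import json

# Sketch of a two-task state machine with error handling, serialized to
# Amazon States Language JSON. Job arguments, .sql file names, the topic ARN,
# and account details are placeholders, not the template's exact definition.
definition = {
    "StartAt": "ReadFilterJob",
    "States": {
        "ReadFilterJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "blog_rs_query",
                "Arguments": {
                    "--db": "reviews",
                    "--db_creds": "reviewssecret",
                    "--bucket": "<name of s3 script bucket>",
                    "--file": "<filter .sql file>"
                }
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "AggregateUnloadJob"
        },
        "AggregateUnloadJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {
                "JobName": "blog_rs_query",
                "Arguments": {
                    "--db": "reviews",
                    "--db_creds": "reviewssecret",
                    "--bucket": "<name of s3 script bucket>",
                    "--file": "<aggregate/unload .sql file>"
                }
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "<error notification topic arn>",
                "Message": "ETL pipeline failure"
            },
            "End": True
        }
    }
}

print(json.dumps(definition, indent=2))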

SQL Review

The state machine will retrieve and run the following SQL statements:

INSERT INTO reviews
SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_date, year, product_category
FROM amzreviews.reviews
WHERE year > 2015;

As I mentioned previously, Amazon Redshift Spectrum is a great way to run ETL using an INSERT INTO statement. This example is a simple load of the data as it is in S3, but keep in mind you can add more complex SQL statements to transform your data prior to loading.

UNLOAD ('SELECT marketplace, product_category, product_title, review_id, helpful_votes, AVG(star_rating) as average_stars FROM reviews GROUP BY marketplace, product_category, product_title, review_id, helpful_votes ORDER BY helpful_votes DESC, average_stars DESC')
TO 's3://bucket/testunload/'
iam_role 'rolearn';

This statement groups reviews by product, ordered by number of helpful votes, and writes to Amazon S3 using UNLOAD.

State Machine execution

Now that everything is in order, start an execution. From the state machine main page select Start an Execution.

Leave the defaults as they are and select Start to begin execution. Once execution begins you are taken to a visual workflow interface where you can follow the execution progress, as shown in the following screenshot.
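
If you want to trigger the pipeline from code (for example, on a schedule or from another service), a minimal boto3 sketch looks like the following; the state machine ARN is a placeholder:

import boto3

sfn = boto3.client('stepfunctions')

# Start a new execution of the state machine; the ARN is a placeholder.
execution = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:123456789012:stateMachine:GlueJobStateMachine-XXXXXX'
)
print(execution['executionArn'])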

Each of the queries takes a few minutes to run. In the meantime, you can watch the Amazon Redshift query logs to track the query progress in real time. These can be found by navigating to Amazon Redshift in the AWS Console, selecting your Amazon Redshift cluster, and then selecting the Queries tab, as shown in the following screenshot.

Once you see COMPLETED for both queries, navigate back to the state machine execution. You should see success for each of the states, as shown in the following screenshot.

Next, navigate to the data bucket in the S3 AWS Console page (refer to the DataBucket in the CloudFormation Resources tab). If all went as planned, you should see a folder named testunload in the bucket with the unloaded data, as shown in the following screenshot.

Inject Failure into Step Functions State Machine

Next, test the error handling component of the state machine by intentionally causing an error. An easy way to do this is to edit the state machine and misspell the name of the Secrets Manager secret in the ReadFilterJob task, as shown in the following screenshot.

If you want the error output sent to you, optionally subscribe to the error notification SNS Topic. Start another state machine execution as you did previously. This time the workflow should take the path toward the NotifyFailure task, as shown in the following screenshot. If you subscribed to the SNS Topic associated with it, you should receive a message shortly thereafter.
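
If you would rather create that subscription from code than from the SNS console, a boto3 call like the following works; the topic ARN and email address are placeholders:

import boto3

sns = boto3.client('sns')

# Subscribe an email address to the failure notification topic.
# The topic ARN and address are placeholders; confirm the subscription
# from the email you receive.
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:NotifyFailureTopic',
    Protocol='email',
    Endpoint='you@example.com'
)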

The state machine logs will show the error in more detail, as shown in the following screenshot.

Conclusion

In this post I demonstrated how you can orchestrate Amazon Redshift-based ETL using serverless AWS Step Functions and AWS Glue Python Shell jobs. As I mentioned in the introduction, the concepts can also be applied more generally to other SQL-based ETL, so use them to start building your own SQL-based ETL pipelines today!

 


About the Author

Ben Romano is a Data Lab solution architect at AWS. Ben helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.

 

 

 

 

Protect and Audit PII data in Amazon Redshift with DataSunrise Security

Post Syndicated from Saunak Chandra original https://aws.amazon.com/blogs/big-data/protect-and-audit-pii-data-in-amazon-redshift-with-datasunrise-security/

DataSunrise, in their own words: DataSunrise is a database security software company that offers a breadth of security solutions, including data masking (dynamic and static masking), activity monitoring, database firewalls, and sensitive data discovery for various databases. The goal is to protect databases against external and internal threats and vulnerabilities. Customers often choose DataSunrise Database Security because it gives them unified control and a single-user experience when protecting different database engines that run on AWS, including Amazon Redshift, Amazon Aurora, all Amazon RDS database engines, Amazon DynamoDB, and Amazon Athena, among others. DataSunrise Security Suite is a set of tools that can protect and audit PII data in Amazon Redshift.

DataSunrise offers passive security with data auditing in addition to active data and database security. Active security is based on predefined security policies, such as preventing unauthorized access to sensitive data, blocking suspicious SQL queries, preventing SQL-injection attacks, or dynamically masking and obfuscating data in real time. DataSunrise comes with high availability, failover, and automatic scaling.

This post focuses on active security for Amazon Redshift, in particular DataSunrise’s capabilities for masking and access control of personally identifiable information (PII), which you can back with DataSunrise’s passive security offerings such as auditing access of sensitive information. This post discusses DataSunrise security for Amazon Redshift, how it works, and how to get started.

Why you need active security for Amazon Redshift

Amazon Redshift is a massively parallel processing (MPP), fully managed petabyte-scale data warehouse (DW) solution with over 15,000 deployments worldwide. Amazon Redshift provides a database encryption mechanism to protect sensitive data, such as payment information and health insurance. For more information, see Amazon Redshift Database Encryption.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI). You may need solutions to manage access control to such sensitive information, and want to manage it efficiently and flexibly, preferably using a management tool. DataSunrise is a centralized management tool that masks that data. It resolves the PII and SPI data access control requirement by allowing you to enforce masking policies against all queries against your Amazon Redshift data warehouse.

What DataSunrise masking does

DataSunrise enables masking queries against Amazon Redshift by acting as a proxy layer between your applications and the backend stores of Amazon Redshift, enabling transparent data flow, bindings, and so on, while your end-users receive masked or obfuscated data that allows them to do their job but prevents any risk of revealing PII data unintentionally.

DataSunrise can exempt users who are authorized to access this information by composing policies and choosing from predefined policy templates that would allow those users to bypass masking constraints when needed.

How it works

DataSunrise operates as a proxy between users or applications that connect to the database and the database server. DataSunrise intercepts the traffic for in-depth analysis and filtering. It applies data masking and access control policies to enforce active security policies against your PII data. When the database firewall is enabled and a security policy violation is detected, DataSunrise can block the malicious SQL query and notify administrators via SMTP or SNMP. With real-time alerts, you can maintain continuous database security and streamline compliance.

DataSunrise operates as a proxy

Getting started with DataSunrise

You can deploy DataSunrise on a Windows or Linux instance in Amazon EC2. You can download a fully prepared DataSunrise AMI from AWS Marketplace to protect your Amazon Redshift cluster. DataSunrise Database and Data Security are available for both Windows and Linux platforms.

After deploying DataSunrise, you can configure security policies for Amazon Redshift and create data masking and access control security rules. After you configure and activate the security policies, DataSunrise enacts those policies against the user and application traffic that would connect to the database through DataSunrise’s proxy.

DataSunrise customers need to configure the Amazon Redshift cluster security group inbound rule to allow the DataSunrise IP address. For more information, see Amazon Redshift Cluster Security Group. Additionally, you can include the DataSunrise security group in the cluster security group when it runs in the same AWS VPC. Users can execute queries only by connecting to the DataSunrise endpoint, not to the Amazon Redshift cluster endpoint. All DB users and groups are imported from Amazon Redshift into DataSunrise for authentication and authorization to Amazon Redshift objects.
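
As an illustration, adding that inbound rule with boto3 could look like the following, assuming the cluster uses a VPC security group; the security group ID, DataSunrise proxy IP, and cluster port are placeholders for your environment:

import boto3

ec2 = boto3.client('ec2')

# Allow the DataSunrise proxy to reach the Amazon Redshift port.
# The security group ID, proxy IP, and port below are placeholders.
ec2.authorize_security_group_ingress(
    GroupId='sg-0123456789abcdef0',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 5439,
        'ToPort': 5439,
        'IpRanges': [{
            'CidrIp': '203.0.113.10/32',
            'Description': 'DataSunrise proxy'
        }]
    }]
)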

Creating a dynamic data masking rule

Masking obfuscates part of or an entire column value. When a column is masked, the column values are replaced by fake values, either by replacing some of the original characters with fake ones or by applying a masking function. DataSunrise has many built-in masking functions for credit card numbers, emails, and so on. Masking protects sensitive or personally identifiable data such as credit card numbers. It is not the same as encryption or hashing, which applies a sophisticated algorithm to a scalar value to convert it into another value.
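
As a rough conceptual illustration of what a masking function does (DataSunrise’s built-in functions run transparently in the proxy layer, not in your application code), consider this small Python example:

def mask_credit_card(value: str) -> str:
    """Illustrative only: replace all but the last four digits with asterisks.

    DataSunrise applies its built-in masking functions inside the proxy;
    this sketch just shows the idea of masking versus encryption or hashing.
    """
    digits = [c for c in value if c.isdigit()]
    return '*' * (len(digits) - 4) + ''.join(digits[-4:])

print(mask_credit_card('4111-1111-1111-1111'))  # prints ************1111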

You can create dynamic masking rules using object-based filters in DataSunrise’s console. DataSunrise identifies the protected objects during application calls and enforces those security rules against targeted operations, schemas, or objects in general within your Amazon Redshift cluster. Security administrators can enable those rules granularly based on the object level and caller identity. They can allow exemptions when needed by identifying authorized callers.

To perform dynamic masking in DataSunrise, you need to create data masking rules as part of defining such security policies.

Complete the following steps to create those masking policies:

  1. In the DataSunrise console, choose Masking > Dynamic Masking Rules.
  2. Choose Add Rule. Add required information.

    Create Dynamic Data Masking Rule

  3. In the Masking Settings section, click Select, navigate to a table in a schema, and check the columns you want to mask. See the following screenshot of the Check Columns page:

    Redshift columns to enable dynamic masking

Click Done after you decide which columns to protect, then choose the masking method and any relevant settings so that the masked output still supports business needs.

In Add Rule, Filter Sessions, you can choose which users, applications, hosts, and more are affected by this rule.

Creating a static data masking rule

You can mask data permanently with static masking, as opposed to dynamic data masking. During static masking, DataSunrise copies each selected table into a separate schema or database and stores the objects there permanently, so static masking may require additional storage space. Some of these tables have columns whose masked content is stored on disk. The replicated schema is fully functional, and you can run user queries against it. The source tables remain untouched, so unmasked data can still be viewed there. If the original data changes, you must repeat the static masking procedure; in that case, first truncate the tables that contain the previously masked data.

  1. From the menu, choose Masking > Static Masking.
  2. In New Static Masking Task, in Source and Target Instances, choose the source database, schema, and corresponding target destination. See the following screenshot of the New Static Masking Task page:
  3. In Select Source Tables to Transfer and Columns to Mask, choose the objects to which you want to apply masking. The following screenshot shows the list of available tables:

DataSunrise also enables you to reschedule recurring static masking jobs so you can refresh your masked records based on your source or production data.

Static data masking under DataSunrise applies to Amazon Redshift local tables. In addition to local tables, Amazon Redshift allows querying external tables in Amazon S3; however, DataSunrise does not support static masking on data stored in Amazon S3 accessed in Amazon Redshift via external tables. For more information, see Using Amazon Redshift Spectrum to Query External Data.

Creating a security/access control rule

While data masking can help in many cases to give your Amazon Redshift users the appropriate access, you may need further enforcement of access control to filter out any operations that might violate your security strategy. DataSunrise can import database user and group metadata from Amazon Redshift, which the DataSunrise administrator can use to configure the security profile. If you already have a set of users defined in your existing Amazon Redshift database, you don’t need to recreate them for DataSunrise; DataSunrise uses this list of users only to configure rules. DataSunrise does not modify any traffic related to the authentication process of database users by default.

  1. In the DataSunrise console, from the menu, choose Security > Rules. The following screenshot shows the Security Rules page:
  2. Choose Add Rule. The following screenshot shows the details you can enter to compose this new rule:

DataSunrise also allows you to compose your rule to restrict or allow certain users, applications, hosts, and so on from performing activities that you consider prohibited against particular objects or areas within your Amazon Redshift cluster.

The following screenshot shows the Filter Sessions page:

DataSunrise enables you to create rules for specific objects, query groups, query types, and SQL injection activities, and trigger actions when authorization errors occur.

Static masking doesn’t have a direct impact on performance, but if a customer uses the DataSunrise custom function, it could impact performance because custom functions execute on the DataSunrise server.

Using DataSunrise audit and compliance policies

From the console, choose Compliance > Add Compliance.

In the Compliance orchestrator page, you can initiate a scan of your Amazon Redshift cluster to identify all PII data or sensitive information in general, per your compliance standards. DataSunrise comes bundled with internal scans for HIPAA, GDPR, and other compliance standards, but you can create or amend any of those libraries to accommodate any special requirements that your security strategy mandates. The following screenshot shows the Add Compliance page:

After completing the scan, DataSunrise guides you through the process of composing rules for sensitive information within your Amazon Redshift cluster.

You can also create your audit rules manually. The following screenshot shows the New Audit Rule page:

You can set audit rules for any restriction to make sure that transactional trails are only collected when necessary. You can target objects starting from the entire database down to a single column in your Amazon Redshift cluster. See the following screenshot:

Conclusion

DataSunrise’s masking feature allows for descriptive specifications of access control to sensitive columns, in addition to built-in encryption provided by the Amazon Redshift cluster. Its proxy enables more fine-grained access control, auditing, and masking capabilities to better monitor, protect, and comply with regulatory standards that address the ever-increasing needs of securing and protecting data. DataSunrise’s integration with Amazon Redshift addresses those concerns by simplifying and automating the security rules and its applications. Keep your data safe and protected at all times!

To get started with DataSunrise with Amazon Redshift, visit DataSunrise in AWS Marketplace.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors


Saunak Chandra is a senior partner solutions architect for Redshift at AWS.
Saunak likes to experiment with new products in the technology space alongside his day-to-day work. He loves exploring nature in the Pacific Northwest. A short hike or bike ride on the trails is his favorite weekend morning routine. He also likes to do yoga when he gets time away from his kid.

 

 

 

Radik Chumaren is an engineering leader at DataSunrise. Radik specializes in heterogeneous database environments, with a focus on building database security software in the cloud. He enjoys reading and playing soccer.

 

 

 

 

Install Python libraries on a running cluster with EMR Notebooks

Post Syndicated from Parag Chaudhari original https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/

Last year, AWS introduced EMR Notebooks, a managed notebook environment based on the open-source Jupyter notebook application.

This post discusses installing notebook-scoped libraries on a running cluster directly via an EMR Notebook. Before this feature, you had to rely on bootstrap actions or use a custom AMI to install additional libraries that are not pre-packaged with the EMR AMI when you provision the cluster. This post also discusses how to use the pre-installed Python libraries available locally within EMR Notebooks to analyze and plot your results. This capability is useful in scenarios in which you don’t have access to a PyPI repository but need to analyze and visualize a dataset.

Benefits of using notebook-scoped libraries with EMR Notebooks

Notebook-scoped libraries provide you the following benefits:

  • Runtime installation – You can import your favorite Python libraries from PyPI repositories and install them on your remote cluster on the fly when you need them. These libraries are instantly available to your Spark runtime environment. There is no need to restart the notebook session or recreate your cluster.
  • Dependency isolation – The libraries you install using EMR Notebooks are isolated to your notebook session and don’t interfere with bootstrapped cluster libraries or libraries installed from other notebook sessions. These notebook-scoped libraries take precedence over bootstrapped libraries. Multiple notebook users can import their preferred version of the library and use it without dependency clashes on the same cluster.
  • Portable library environment – The library package installation happens from your notebook file. This allows you to recreate the library environment when you switch the notebook to a different cluster by re-executing the notebook code. At the end of the notebook session, the libraries you install through EMR Notebooks are automatically removed from the hosting EMR cluster.

Prerequisites

To use this feature in EMR Notebooks, you need a notebook attached to a cluster running EMR release 5.26.0 or later. The cluster should have access to the public or private PyPI repository from which you want to import the libraries. For more information, see Creating a Notebook.

There are different ways to configure your VPC networking to allow clusters inside the VPC to connect to an external repository. For more information, see Scenarios and Examples in the Amazon VPC User Guide.

Using notebook-scoped libraries

This post demonstrates the notebook-scoped libraries feature of EMR Notebooks by analyzing the publicly available Amazon customer reviews dataset for books. For more information, see Amazon Customer Reviews Dataset on the Registry of Open Data for AWS.

Open your notebook and make sure the kernel is set to PySpark. Run the following command from the notebook cell:

print("Welcome to my EMR Notebook!")

You get the following output:

Output shows newly created spark session.

You can examine the current notebook session configuration by running the following command:

%%info

You get the following output:

Output shows spark session properties which include Python version and properties to enable this new feature.

The notebook session is configured for Python 3 by default (through spark.pyspark.python). If you prefer to use Python 2, reconfigure your notebook session by running the following command from your notebook cell:

%%configure -f { "conf":{ "spark.pyspark.python": "python", 
                "spark.pyspark.virtualenv.enabled": "true", 
                "spark.pyspark.virtualenv.type":"native", 
                "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv" }}

You can also verify the Python version used in your current notebook session by running the following code:

import sys
sys.version

You get the following output:

Output shows Python version 3.6.8

Before starting your analysis, check the libraries that are already available on the cluster. You can do this using the list_packages() PySpark API, which lists all the Python libraries on the cluster. Run the following code:

sc.list_packages()

You get an output similar to the following code, which shows all the available Python 3-compatible packages on your cluster:

Output shows list of Python packages along with their versions.

Load the Amazon customer reviews data for books into a Spark DataFrame with the following code:

df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

You are now ready to explore the data. Determine the schema and number of available columns in your dataset with the following code:

# Total columns
print(f'Total Columns: {len(df.dtypes)}')
df.printSchema()

The following code is the output:

Output shows that Spark DataFrame has 15 columns. It also shows the schema of the Spark DataFrame.

This dataset has a total of 15 columns. You can also check the total rows in your dataset by running the following code:

# Total row
print(f'Total Rows: {df.count():,}')

You get the following output:

Output shows that Spark DataFrame has more than 20 million rows.

Check the total number of books with the following code:

# Total number of books
num_of_books = df.select('product_id').distinct().count()
print(f'Number of Books: {num_of_books:,}')

You get the following output:

Output shows that there are more than 3 million books.

You can also analyze the number of book reviews by year and find the distribution of customer ratings. To do this, import the Pandas library version 0.25.1 and the latest Matplotlib library from the public PyPI repository. Install them on the cluster attached to your notebook using the install_pypi_package API. See the following code:

sc.install_pypi_package("pandas==0.25.1") #Install pandas version 0.25.1 
sc.install_pypi_package("matplotlib", "https://pypi.org/simple") #Install matplotlib from given PyPI repository

You get the following output:

Output shows that “pandas” and “matplotlib” packages are successfully installed.

The install_pypi_package PySpark API installs your libraries along with any associated dependencies. By default, it installs the latest version of the library that is compatible with the Python version you are using. You can also install a specific version of the library by specifying the library version, as in the previous Pandas example.

Verify that your imported packages successfully installed by running the following code:

sc.list_packages()

You get the following output:

Output shows list of Python packages along with their versions.

You can also analyze the trend in the number of reviews provided across multiple years. Use toPandas() to convert the Spark DataFrame to a Pandas DataFrame, which you can visualize with Matplotlib. See the following code:

# Number of reviews across years
num_of_reviews_by_year = df.groupBy('year').count().orderBy('year').toPandas()

import matplotlib.pyplot as plt
plt.clf()
num_of_reviews_by_year.plot(kind='area', x='year',y='count', rot=70, color='#bc5090', legend=None, figsize=(8,6))
plt.xticks(num_of_reviews_by_year.year)
plt.xlim(1995, 2015)
plt.title('Number of reviews across years')
plt.xlabel('Year')
plt.ylabel('Number of Reviews')

The preceding commands render the plot on the attached EMR cluster. To visualize the plot within your notebook, use %matplot magic. See the following code:

%matplot plt

The following graph shows that the number of reviews provided by customers increased exponentially from 1995 to 2015. Interestingly, 2001, 2002, and 2015 are outliers, when the number of reviews dropped from the previous years.

A line chart showing number of reviews across years.

You can analyze the distribution of star ratings and visualize it using a pie chart. See the following code:

# Distribution of overall star ratings
product_ratings_dist = df.groupBy('star_rating').count().orderBy('count').toPandas()

plt.clf()
labels = [f"Star Rating: {rating}" for rating in product_ratings_dist['star_rating']]
reviews = [num_reviews for num_reviews in product_ratings_dist['count']]
colors = ['#00876c', '#89c079', '#fff392', '#fc9e5a', '#de425b']
fig, ax = plt.subplots(figsize=(8,5))
w,a,b = ax.pie(reviews, autopct='%1.1f%%', colors=colors)
plt.title('Distribution of star ratings for books')
ax.legend(w, labels, title="Star Ratings", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))

Print the pie chart using %matplot magic and visualize it from your notebook with the following code:

%matplot plt

The following pie chart shows that 80% of users gave a rating of 4 or higher. Approximately 10% of users rated their books 2 or lower. In general, customers are happy about their book purchases from Amazon.

Output shows pie chart depicting distribution of star ratings.

Lastly, use the uninstall_package PySpark API to uninstall the Pandas library that you installed using the install_pypi_package API. This is useful in scenarios in which you want to use a different version of a library than the one you previously installed using EMR Notebooks. See the following code:

sc.uninstall_package('pandas')

You get the following output:

Output shows that “pandas” package is successfully uninstalled.

Next, run the following code:

sc.list_packages()

You get the following output:

Output shows list of Python packages along with their versions.

After you close your notebook, the Pandas and Matplotlib libraries that you installed on the cluster using the install_pypi_package API are garbage collected and removed from the cluster.

Using local Python libraries in EMR Notebooks

The notebook-scoped libraries discussed previously require your EMR cluster to have access to a PyPI repository. If you cannot connect your EMR cluster to a repository, use the Python libraries pre-packaged with EMR Notebooks to analyze and visualize your results locally within the notebook. Unlike the notebook-scoped libraries, these local libraries are only available to the Python kernel and are not available to the Spark environment on the cluster. To use these local libraries, export your results from your Spark driver on the cluster to your notebook and use the notebook magic to plot your results locally. Because you are using the notebook and not the cluster to analyze and render your plots, the dataset that you export to the notebook has to be small (recommend less than 100 MB).

To see the list of local libraries, run the following command from the notebook cell:

%%local
conda list

You get a list of all the libraries available in the notebook. Because the list is rather long, this post doesn’t include them.

For this analysis, find out the top 10 children’s books from your book reviews dataset and analyze the star rating distribution for these children’s books.

You can identify the children’s books by using customers’ written reviews with the following code:

kids_books = (
df
.where("lower(review_body) LIKE '%child%' OR lower(review_body) LIKE '%kid%' OR lower(review_body) LIKE '%infant%'OR lower(review_body) LIKE '%Baby%'")
.select("customer_id", "product_id", "star_rating", "product_title", "year")
)

Plot the top 10 children’s books by number of customer reviews with the following code:

top_10_book_titles = kids_books.groupBy('product_title') \
                       .count().orderBy('count', ascending=False) \
                       .limit(10)
top_10_book_titles.show(10, False)

You get the following output:

Output shows list of book titles with their corresponding review count.

Analyze the customer rating distribution for these books with the following code:

top_10 = kids_books.groupBy('product_title', 'star_rating') \
           .count().join(top_10_book_titles, ['product_title'], 'leftsemi') \
           .orderBy('count', ascending=False) 
top_10.show(truncate=False)

You get the following output:

Output shows list of book titles along with star ratings and review counts.

To plot these results locally within your notebook, export the data from the Spark driver and cache it in your local notebook as a Pandas DataFrame. To achieve this, first register a temporary table with the following code:

top_10.createOrReplaceTempView("top_10_kids_books")

Use the local SQL magic to extract the data from this table with the following code:

%%sql -o top_10 -n -1
SELECT product_title, star_rating, count from top_10_kids_books
GROUP BY product_title, star_rating, count
ORDER BY count Desc

For more information about these magic commands, see the GitHub repo.

After you execute the code, you get a user-interface to interactively plot your results. The following pie chart shows the distribution of ratings:

Output shows pie chart depicting distribution of star ratings.

You can also plot more complex charts by using the local Matplotlib and Seaborn libraries available with EMR Notebooks. See the following code:

%%local 
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set()
plt.clf()
top_10['book_name'] = top_10['product_title'].str.slice(0,30)
colormap = sns.color_palette("hls", 8)
pivot_df = top_10.pivot(index= 'book_name', columns='star_rating', values='count')
pivot_df.plot.barh(stacked=True, color = colormap, figsize=(15,11))
plt.title('Top 10 children books',fontsize=16)
plt.xlabel('Number of reviews',fontsize=14)
plt.ylabel('Book',fontsize=14)
plt.subplots_adjust(bottom=0.2)

You get the following output:

Output shows stacked bar chart for top 10 children books with star ratings and review counts.

Summary

This post showed how to use the notebook-scoped libraries feature of EMR Notebooks to import and install your favorite Python libraries at runtime on your EMR cluster, and use these libraries to enhance your data analysis and visualize your results in rich graphical plots. The post also demonstrated how to use the pre-packaged local Python libraries available in EMR Notebook to analyze and plot your results.

 


About the Author

Parag Chaudhari is a software development engineer at AWS.

 

 

 

Birthday Week 2019 Wrap-up

Post Syndicated from Jake Anderson original https://blog.cloudflare.com/birthday-week-2019-wrap-up/

Birthday Week 2019 Wrap-up

Birthday Week 2019 Wrap-up

This week we celebrated Cloudflare’s 9th birthday by launching a variety of new offerings that support our mission: to help build a better Internet.  Below is a summary recap of how we celebrated Birthday Week 2019.

Cleaning up bad bots

Every day Cloudflare protects over 20 million Internet properties from malicious bots, and this week you were invited to join in the fight!  Now you can enable “bot fight mode” in the Firewall settings of the Cloudflare Dashboard and we’ll start deploying CPU intensive code to traffic originating from malicious bots.  This wastes the bots’ CPU resources and makes it more difficult and costly for perpetrators to deploy malicious bots at scale. We’ll also share the IP addresses of malicious bot traffic with our Bandwidth Alliance partners, who can help kick malicious bots offline. Join us in the battle against bad bots – and, as you can read here – you can help the climate too!

Browser Insights

Speed matters, and if you manage a website or app, you want to make sure that you’re delivering a high performing website to all of your global end users. Now you can enable Browser Insights in the Speed section of the Cloudflare Dashboard to analyze website performance from the perspective of your users’ web browsers.  

WARP, the wait is over

Several months ago we announced WARP, a free mobile app purpose-built to address the security and performance challenges of the mobile Internet, while also respecting user privacy.  After months of testing and development, this week we (finally) rolled out WARP to approximately 2 million wait-list customers.  We also enabled Warp Plus, a WARP experience that uses Argo routing technology to route your mobile traffic across faster, less-congested, routes through the Internet.  Warp and Warp Plus (Warp+) are now available in the iOS and Android App stores and we can’t wait for you to give it a try!

HTTP/3 Support

Last year we announced early support for QUIC, a UDP based protocol that aims to make everything on the Internet work faster, with built-in encryption. The IETF subsequently decided that QUIC should be the foundation of the next generation of the HTTP protocol, HTTP/3. This week, Cloudflare was the first to introduce support for HTTP/3 in partnership with Google Chrome and Mozilla.

Workers Sites

Finally, to wrap up our birthday week announcements, we announced Workers Sites. The Workers serverless platform continues to grow and evolve, and every day we discover new and innovative ways to help developers build and optimize their applications. Workers Sites enables developers to easily deploy lightweight static sites across Cloudflare’s global cloud platform without having to build out the traditional backend server infrastructure to support these sites.

We look forward to Birthday Week every year, as a chance to showcase some of our exciting new offerings — but we all know building a better Internet is about more than one week.  It’s an effort that takes place all year long, and requires the help of our partners, employees and especially you — our customers. Thank you for being a customer, providing valuable feedback and helping us stay focused on our mission to help build a better Internet.

Can’t get enough of this week’s announcements, or want to learn more? Register for next week’s Birthday Week Recap webinar to get the inside scoop on every announcement.

Analyze Google Analytics data using Upsolver, Amazon Athena, and Amazon QuickSight

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyze-google-analytics-data-using-upsolver-amazon-athena-and-amazon-quicksight/

In this post, we present a solution for analyzing Google Analytics data using Amazon Athena. We’re including a reference architecture built on moving hit-level data from Google Analytics to Amazon S3, performing joins and enrichments, and visualizing the data using Amazon Athena and Amazon QuickSight. Upsolver is used for data lake automation and orchestration, enabling customers to get started quickly.

Google Analytics is a popular solution for organizations who want to understand the performance of their web properties and applications. Google Analytics data is collected and aggregated to help users extract insights quickly.  This works great for simple analytics. It’s less than ideal, however, when you need to enrich Google Analytics data with other datasets to produce a comprehensive view of the customer journey.

Why analyze Google Analytics data on AWS?

Google Analytics has become the de-facto standard web analytics tool. It is offered for free at lower data volumes and provides tracking, analytics, and reporting.  It enables non-technical users to understand website performance by answering questions such as: where are users coming from? Which pages have the highest conversion rates? Where are users experiencing friction and abandoning their shopping cart?

While these questions can be answered within the Google Analytics UI, there are some limitations, such as:

  • Data sampling: Google Analytics standard edition displays sampled data when running ad hoc queries on time periods that contain more than 500,000 sessions. Large websites can easily exceed this number on a weekly or even daily basis. This can create reliability issues between different reports, as each query can be fed by a different sample of the data.
  • Difficulty integrating with existing AWS stack: Many customers have built or are in the process of building their data and analytics platform on AWS. Customers want to use the AWS analytics and machine learning capabilities with their Google Analytics data to enable new and innovative use cases.
  • Joining with external data sources: Seeing the full picture of a business’ online activity might require combining web traffic data with other sources. Google Analytics does not offer a simple way to either move raw data in or out of the system. Custom dimensions in Google Analytics can be used, but they are limited to 20 for the standard edition and are difficult to use.
  • Multi-dimensional analysis: Google Analytics custom reports and APIs are limited to seven dimensions per query. This limits the depth of analysis and requires various workarounds for more granular slicing and dicing.
  • Lack of alternatives: Google Analytics 360, which allows users to export raw data to Google BigQuery, carries a hefty annual fee. This can be prohibitive for organizations. And even with this upgrade, the native integration is only with BigQuery, which means users still can’t use their existing AWS stack.

Building or buying a new web analytics solution (including cookie-based tracking) is also cost-prohibitive, and can interrupt existing workflows that rely on Google Analytics data.

Customers are looking for a solution to enable their analysts and business users to incorporate Google Analytics data into their existing workflows using familiar AWS tools.

Moving Google Analytics data to AWS: Defining the requirements

To provide an analytics solution with the same or better level of reporting as Google Analytics, we designed our solution around the following tenets:

  1. Analytics with a low technical barrier to entry: Google Analytics is built for business users, and our solution is designed to provide a similar experience. This means that beyond ingesting the data, we want to automate the data engineering work that goes into making the data ready for analysis.  This includes data retention, partitioning, and compression. All of this work must be done under the hood and remain invisible to the user querying the data.
  2. Hit-level data: Google Analytics tracks clickstream activity based on Hits – the lowest level of interaction between a user and a webpage. These hits are then grouped into Sessions – hits within a given time period, and Users – groups of sessions (more details here). The standard Google Analytics API is limited to session and user-based queries, and does not offer any simple way of extracting hit-level data. Our solution, however, does provide access to this granular data.
  3. Unsampled data: By extracting the data from Google Analytics and storing it on Amazon S3, we are able to bypass the 500K sessions limitation. We also have access to unsampled data for any query at any scale.
  4. Data privacy: If sensitive data is stored in Google Analytics, relying on third-party ETL tools can create risks around data privacy, especially in the era of GDPR. Therefore, our solution encrypts data in transit and relies exclusively on processing within the customer’s VPC.

Solution overview

The solution is built on extracting hit-level data and storing it in a data lake architecture on Amazon S3. We then use Amazon Athena and Amazon QuickSight for analytics and reporting. Upsolver, an AWS premier solution provider, is used to automate ingestion, ETL, and data management on S3. Upsolver also orchestrates the entire solution with a simple-to-use graphical user interface. The following diagram shows the high-level architecture of our solution.

Reference architecture showing the flow of data across Google Analytics, Amazon Athena and Amazon QuickSight

Using Upsolver’s GA connector we extract unsampled, hit-level data from Google Analytics. This data is then automatically ingested according to accepted data lake best practices and stored in an optimized form on Amazon S3. The following best practices are applied to the data:

  • Store data in Apache Parquet columnar file format to improve read performance and reduce the amount of data scanned per query.
  • Partition data by event (hit) time rather than by API query time.
  • Perform periodic compaction, by which small files are merged into larger ones, improving performance and optimizing compression.
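Upsolver applies these optimizations automatically. For readers who want a concrete feel for what they involve, the following minimal Python sketch (an illustration only, assuming pandas, pyarrow, and s3fs are installed, and using hypothetical column and bucket names) writes hit-level records as Parquet files partitioned by event date; periodic compaction would then merge the many small files this produces into larger ones.

    import pandas as pd

    # Hypothetical hit-level records; in this solution they come from Google Analytics.
    hits = pd.DataFrame({
        "event_time": pd.to_datetime(["2019-09-01 10:02:03", "2019-09-02 11:04:05"]),
        "page_title": ["Home", "Checkout"],
        "user_id": ["u1", "u2"],
    })

    # Partition by event (hit) date rather than by extraction time,
    # and store the data in the columnar Parquet format.
    hits["event_date"] = hits["event_time"].dt.date.astype(str)
    hits.to_parquet(
        "s3://my-analytics-bucket/ga_hits/",  # hypothetical bucket
        engine="pyarrow",
        partition_cols=["event_date"],
        index=False,
    )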

Once data is stored on S3, we use Upsolver’s GUI to create structured fact tables from the Google Analytics data. Users can query them using Amazon Athena and Amazon Redshift. Upsolver provides simple to use templates to help users quickly create tables from their Google Analytics data.  Finally, we use Amazon QuickSight to create interactive dashboards to visualize the data.

The result is a complete view of our Google Analytics data. This view provides the level of self-service analytics that users have grown accustomed to, at any scale, and without the limitations outlined earlier.

Building the solution: Step by step guide

In this section, we walk through the steps to set up the environment, configure Upsolver’s Google Analytics plugin, extract the data, and begin exploring.

Step 1: Installation and permissions

  1. Sign up for Upsolver (can also be done via the AWS Marketplace).
  2. Allow Upsolver access to read data from Google Analytics and add new custom dimensions. Custom dimensions enable Upsolver to read non-sampled hit-level data directly from Google Analytics instead of creating parallel tracking mechanisms that aren’t as trustworthy.
  3. To populate the custom dimensions that were added to Google Analytics, allow Upsolver to run a small JavaScript code on your website. If you’re using GA360, this is not required.

Step 2: Review and clean the raw data

For supported data sources, Upsolver automatically discovers the schema and collects key statistics for every field in the table. Doing so gives users a glimpse into their data.

In the following screenshot, you can see schema-on-read information on the left side, stats per field and value distribution on the right side.

Screen shot of the Upsolver UI showing schema-on-read information on the left side, stats per field and value distribution on the right side

Step 3: Publishing to Amazon Athena

Upsolver comes with four templates for creating tables in your AWS based data lake according to the Google Analytics entity being analyzed:

  • Pageviews – used to analyze user flow and behavior on specific sections of the web property using metrics such as time on page and exit rate.
  • Events – user-defined interactions such as scroll depth and link clicks.
  • Sessions – monitor a specific journey in the web property (all pageviews and events).
  • Users – understand a user’s interaction with the web property or app over time.

All tables are partitioned by event time, which helps improve query performance.

Upsolver users can choose to run the templates as-is, modify them first or create new tables unique to their needs.

The following screenshot shows the schema produced by the Pageviews template:

Screen shot of the Upsolver UI showing the schema produced by the Pageviews template:

The following screenshot shows the Pageviews and Events tables as well as the Amazon Athena views for Sessions and Users generated by the Upsolver templates.

Screenshot showing the Pageviews and Events tables as well as the Athena views for Sessions and Users generated from the Upsolver templates.

The following are a couple example queries you may want to run to extract specific insights:

-- Popular page titles
SELECT page_title,
       Count(*) AS num_hits
FROM   ga_hits_pageviews
GROUP  BY page_title
ORDER  BY 2 DESC;

-- User aggregations from hit data
SELECT user_id,
       Count(*)                   AS num_hits,
       Count(DISTINCT session_id) AS num_of_sessions,
       Sum(session_duration)      AS total_sessions_time
FROM   ga_hits_pageviews
GROUP  BY user_id;
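If you’d rather run these queries programmatically than from the Athena console, the following boto3 sketch submits the first query and prints its results; the database name and the S3 output location are assumptions you would replace with your own values.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    QUERY = """
    SELECT page_title, COUNT(*) AS num_hits
    FROM ga_hits_pageviews
    GROUP BY page_title
    ORDER BY 2 DESC
    """

    # "ga_db" and the S3 output location below are placeholders.
    execution = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": "ga_db"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query finishes, then print the first page of results.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows[1:]:  # the first row contains the column headers
            print([col.get("VarCharValue") for col in row["Data"]])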

Step 4: Visualization in Amazon QuickSight

Now that the data has been ingested, cleansed, and written to S3 in a structured manner, we are ready to visualize it with Amazon QuickSight. Start by creating a dashboard to mimic the one provided by Google Analytics. But we don’t need to stop there. We can use QuickSight ML Insights to extract deeper insights from our data. We can also embed Amazon QuickSight visualizations into existing web portals and applications, making insights available to everyone.

Screenshot of QuickSight visualization showing several sections, one with a graph, several others with various statistics

Screen shot of QuickSight showing a global map with usage concentrations marked by bubbles, alongside a pie graph.

Screenshot of QuickSight showing a bar graph, alongside a table with various data values.

Conclusion

With minimal setup, we were able to extract raw hit-level Google Analytics data, prepare it, and store it in a data lake on Amazon S3. Using Upsolver, combined with Amazon Athena and Amazon QuickSight, we built a feature-complete solution for analyzing web traffic collected by Google Analytics on AWS.

Key technical benefits:

  • Schema-on-read means data consumers don’t need to model the data into a table structure, and they can instantly understand what their top dimensions are. For example, 85% of my users navigate my website using the Google Chrome browser.
  • Graphical user interface that enables self-service consumption of Google Analytics data.
  • Fast implementation using pre-defined templates that map raw data from Google Analytics to tables in the data lake.
  • Ability to replay historical Google Analytics data stored on Amazon S3.
  • Ability to partition the data on Amazon S3 by hit time, reducing the complexity of handling late-arriving events.
  • Optimize data on Amazon S3 automatically for improved query performance.
  • Automatically manage tables and partitions in AWS Glue Data Catalog.
  • Fully integrated with a suite of AWS native services – Amazon S3, Amazon Athena, Amazon Redshift and Amazon QuickSight.

Now that we have feature parity, we can begin to explore integrating other data sources, such as CRM, sales, and customer profile data, to build a true 360-degree view of the customer. Furthermore, you can now begin using AWS machine learning services to optimize traffic to your websites, forecast demand, and personalize the user experience.

We’d love to hear what you think. Please feel free to leave a comment with any feedback or questions you may have.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors


Roy Hasson is the global business development lead of analytics and data lakes at AWS.
He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan and enjoys cheering his team on and hanging out with his family.

 

 

 

Eran Levy is the director of marketing at Upsolver.

 

 

 

 

Introducing Browser Insights

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-browser-insights/

Introducing Browser Insights

Speed matters. We know that when your website or app gets faster, users have a better experience and you get more conversions and more revenue. At Cloudflare, we spend our days obsessing about speed and building new features to squeeze out as much performance as possible.

But to improve speed, you first need to measure it. That’s why we’re launching Browser Insights: a new tool that measures the performance of your website from the perspective of your users. Browser Insights lets you dive in to understand where, when, and why web pages are slow. And you can enable it today, for free, with one click.

Introducing Browser Insights

Why did we build Browser Insights?

Let’s say you run an e-commerce site, and you want to make your conversion rates better. You’ve noticed that there’s a lot of traffic from visitors in Peru, but they have worse conversion than users in North America. Maybe you theorize that it takes a long time to load your checkout page, which causes customers to drop off before checking out. How would you verify that this is happening?

There are a few ways you could do this: you could check your server logs to look at timing information, or you could load the page a few times in your browser to see what’s slow.

These approaches have a few downsides though:

  • If you only look at server-side data, you miss factors that impact the end-user experience — how long did it take for the web browser to load all the necessary scripts, execute them, and paint the page?
  • If you only measure from one computer (or a small number of them), you miss the diversity of the computing population — for example, “how does this work on a phone on a 3G connection?”

To solve these problems, we use Real User Monitoring. This gives us the best of both worlds: we can run a timer inside real web browsers. This timer captures how long it takes web pages to load, from your actual users.

How does it work?

Browser Insights can be enabled with the flip of a switch in the “Speed” section of the dashboard:

Introducing Browser Insights

When it’s enabled, we add a small snippet of JavaScript code to each HTML page load that uses the standard Performance API to collect timing info. Then we can start showing you metrics about how your web pages are performing in the real world:

Introducing Browser Insights

There’s a lot of info in this graph! At a high level, there are two main types of metrics:

  • Request-level metrics like TCP connection time, or Request time. These metrics are counted on every page load and are impacted by Internet infrastructure, like the mobile network of your end users, or the speed of your servers.
  • Page-level metrics like Page Load Time, which take into account the many requests needed to load a web page, plus the time taken to parse HTML and execute JavaScript.

For more information about what these times mean and how we chose them, see our companion blog post.

Digging into the data

In addition to seeing several metrics about your web page performance, it’s helpful to drill into the dimensions that impact performance like URL and Country. This means you can filter down to the performance of a specific page (like your home page or checkout page), and you can see the locations where your site loads the fastest and slowest.

Going back to our example above, we want to see how performance in Peru compares to North America:

Introducing Browser Insights

Sure enough, we can confirm that there’s significant traffic from Peru, but web pages take about 13s to load on average — compared with just 4.2 seconds for users in the US. Theory confirmed!  Now we can filter all of our metrics to just Peru to understand what’s happening better:

Introducing Browser Insights

Note that “Processing” has increased the most, all the way to 12 seconds. Request times are higher as well, likely because we are connecting to an origin server in the US. Web pages are made of many individual requests, so it makes sense that, when combined, they lead to slower load times. In this example, serving more content from cache would probably lead to significantly faster page loads.

What’s coming next?

Our launch today is just the tip of the iceberg for Browser Insights. In the near future we want to add much more information that will help you understand exactly what’s slowing down your website, and what you can do to make it faster. We plan to add:

  • More metrics and dimensions, including page-level metrics like Time to First Paint and more dimensions like browser and network type
  • Subresource analytics. The average web page loads over 100 subresources, and we can provide a waterfall chart to show exactly which one is slow.
  • A/B testing, to show you how potential configuration changes will impact the performance of your own traffic
  • Error collection to monitor issues at the network layer, in JavaScript, etc
  • Alerting so that you know when performance falls below a pre-defined threshold
  • Insights powered by Cloudflare that tell you why something might be slow – for example, how your cache hit ratio impacts page load time

Protecting user privacy

Cloudflare’s mission to help build a better Internet is based on the importance we place on establishing trust with our customers, our customers’ end users, and the Internet community globally. We have a transparent business model that aligns with the interests of our customers — we make money from protecting and speeding up our customers’ Internet properties. We do not sell our customers’ (or their end users’) data.

Browser Insights requires that end users’ browsers report timing information back to Cloudflare. We designed Browser Insights so that it reports only the bare-minimum information needed to show our customers how their websites are performing. The only metrics Browser Insights collects are about timing. We do not track individual end users across our customers’ Internet properties. We encourage you to open up the Inspector in your favorite web browser to see what we’re sending back!

Try Browser Insights today

Last May we announced the all-new Speed Page. Our mission with the Speed page is to show you how fast your website is, and what you can do to make it faster. Today, we’re excited to announce that the new Speed Page is available for everyone!

Browser Insights will be available on the Speed page in early access and we’ll be working hard to bring it to everyone as soon as possible in the coming weeks. Watch this space for updates!

Introducing Browser Insights


Subscribe to this blog for daily updates on all our Birthday Week announcements.

Upgrade your resume with the AWS Certified Big Data — Specialty Certification

Post Syndicated from Tina Kelleher original https://aws.amazon.com/blogs/big-data/upgrade-your-resume-with-the-aws-certified-big-data-specialty-certification/

While most cloud computing professionals are aware of the Foundational, Associate, and Professional AWS Certifications, it’s worth mentioning that AWS also offers specialty certifications. Anyone pursuing a career that includes data analysis, data lakes, and data warehouse solutions is a solid candidate to earn the AWS Certified Big Data — Specialty certification.

The AWS Certified Big Data — Specialty certification is a great option to help grow your career. AWS Certification shows prospective employers that you have the technical skills and expertise required to perform complex data analyses using core AWS Big Data services like Amazon EMR, Amazon Redshift, Amazon QuickSight, and more. This certification validates your understanding of data collection, storage, processing, analysis, visualization, and security.

You can learn more about the full range of industry-recognized credentials that AWS offers on the AWS Certification page.

Recommended knowledge and experience

In addition to having a solid passion for cloud computing, it’s recommended that those interested in taking the AWS Certified Big Data — Specialty exam meet the following criteria:

You can find a complete list of recommended knowledge and the exam content covered in the Exam Guide. If you can tick the boxes on each of these criteria, then you’re ready to start preparing for the AWS Certified Big Data — Specialty exam.

Recommended resources

While there are no training completion requirements, AWS offers several options to help you prepare for the exam with best practices and technical skill checks to self-assess your readiness.

In addition to these exam prep resources, you might also find useful information on the Getting Started with Big Data on AWS and Learn to Build on AWS: Big Data pages.

Conclusion

In this post, I provided an overview of the value in earning the AWS Certified Big Data — Specialty certification. I covered the recommended knowledge that is a strong indicator of having reached a level of experience that qualifies you as a solid candidate for this AWS certification. I also provided training resources to help you brush up on your knowledge of AWS Big Data services.

For more information on the training programs offered by AWS, visit the AWS Digital and Classroom Training Overview page. You might also find helpful information on the AWS Training FAQs page.

If you have any feedback or questions, please leave a comment… and good luck on your exam!

 


About the Author

Tina Kelleher is a program manager at AWS.

 

 

 

Secure your Amazon EMR cluster from unintentional network exposure with Block Public Access configuration

Post Syndicated from Vignesh Rajamani original https://aws.amazon.com/blogs/big-data/secure-your-amazon-emr-cluster-from-unintentional-network-exposure-with-block-public-access-configuration/

AWS security groups act as a network firewall that allows you to restrict access to your cluster to only whitelisted IP addresses. Proper management of security group rules is critical to protecting your application and data on the cluster. Amazon EMR strongly recommends creating restrictive security group rules that include only the necessary network ports, protocols, and IP addresses based on your application requirements.

While AWS account administrators can protect cloud network security in different ways, a new feature helps them prevent account users from launching clusters with misconfigured security group rules. Misconfiguration can open a broad range of cluster ports to unrestricted traffic from the public internet and expose cluster resources to outside threats.

This post discusses a new account level feature called Block Public Access (BPA) configuration that helps administrators enforce a common public access rule across all of their EMR clusters in a region.

Overview of Block Public Access configuration

BPA configuration is an account-level configuration that helps you centrally manage public network access to EMR clusters in a region. You can enable this configuration in a region and block your account users from launching clusters that allow unrestricted inbound traffic from any public IP address (source set to 0.0.0.0/0 for IPv4 and ::/0 for IPv6) through their ports. Your applications may require specific ports to be open to the internet. In that case, configure these ports (or port ranges) in the BPA configuration as exceptions to allow public access before you launch clusters.

When account users launch clusters in the region where you have enabled BPA configuration, EMR checks the port rules defined in this configuration and compares them with the inbound traffic rules specified in the security groups associated with the clusters. If these security groups have inbound rules that open ports to the public but you did not configure these ports as exceptions in the BPA configuration, then EMR fails the cluster creation and sends an exception to the user.

Enabling BPA configuration from the AWS Management Console

To enable BPA configuration, you need permission to call PutBlockPublicAccessConfiguration API.

  • Log in to the AWS Management Console. From the console, navigate to the Amazon EMR console.
  • From the navigation panel, choose Block Public Access.
  • Choose Change and select On to enable BPA.

By default, all ports are blocked except port 22 for SSH traffic. To allow more ports for public access, add them as exceptions.

  • Choose Add a port range.

Before launching your cluster, define these exceptions. The port number or range should be the only ones in the security group rules that have an inbound source IP address of 0.0.0.0/0 for IPv4 and ::/0 for IPv6.

  • Enter a port number or the range of ports for public access.
  • Choose Save Changes.

Block public access section with the "Change" hyperlink circled in red.

Block public access settings, under Exceptions section +Add a port range circled in red.

For information about configuring BPA using the AWS CLI, see Configure Block Public Access.
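If you manage account settings with scripts, the same configuration can also be applied through the EMR API. The following boto3 sketch mirrors the console steps above, blocking public access while keeping port 22 open as an exception; treat it as an illustrative sketch rather than a production script.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Block security group rules that open ports to 0.0.0.0/0 or ::/0,
    # keeping SSH (port 22) as the only exception.
    emr.put_block_public_access_configuration(
        BlockPublicAccessConfiguration={
            "BlockPublicSecurityGroupRules": True,
            "PermittedPublicSecurityGroupRuleRanges": [
                {"MinRange": 22, "MaxRange": 22},
            ],
        }
    )

    # Verify the current account-level setting for this region.
    current = emr.get_block_public_access_configuration()
    print(current["BlockPublicAccessConfiguration"])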

Summary

In this post, we discussed a new account-level feature on Amazon EMR called Block Public Access (BPA) configuration that helps administrators manage public access to their EMR clusters. You can enable BPA configuration today and prevent your EMR clusters in a region from being unintentionally exposed to the public network.

 


About the Author

Vignesh Rajamani is a senior product manager for EMR at AWS.

 

Federate Amazon QuickSight access with Okta

Post Syndicated from Loc Trinh original https://aws.amazon.com/blogs/big-data/federate-amazon-quicksight-access-with-okta/

Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization. As a fully managed service, Amazon QuickSight lets you easily create and publish interactive dashboards that can then be accessed from any device and embedded into your applications, portals, and websites.

Amazon QuickSight supports identity federation through Security Assertion Markup Language 2.0 (SAML 2.0) in both Standard and Enterprise editions. With federation, you can manage users using your enterprise identity provider (IdP) and pass them to Amazon QuickSight at log-in. Such IdPs include Microsoft Active Directory Federation Services, Ping One Federation Server, and Okta.

This post provides step-by-step guidance for how to use Okta to federate access to Amazon QuickSight.

Create an Okta application

Sign in to your Okta admin dashboard. You can create a free Okta Developer Edition account.

  1. From the Okta admin dashboard ribbon, choose Applications.
  2. If you are viewing the Developer Console, switch to Classic UI, as shown in the following screenshot.
  3. Choose Add Application.
  4. Search for Amazon Web Services and choose Add.
  5. Rename Application label to Amazon QuickSight and choose Next.
  6. For Sign-On Options, choose SAML 2.0.
  7. For Default Relay State, type https://quicksight.aws.amazon.com.
  8. Right-click on Identity Provider metadata and choose Save Link As…
  9. Save the XML file to disk and choose Done. You need to use this file in the next steps.

Create a SAML provider in AWS

Open a new window and sign in to the AWS Management Console.

  1. Open the IAM console.
  2. In the navigation pane, choose Identity Providers, Create Provider.
  3. For Provider Type, choose SAML and provide a Provider Name (for example, Okta).
  4. For Metadata Document, upload the XML file from the previous steps.
  5. Choose Next Step, Create.
  6. Locate the IdP that you just created and make note of the Provider ARN.
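If you prefer to script this step, a boto3 sketch along the following lines creates the same SAML provider from the metadata file downloaded from Okta; the file name and provider name are assumptions.

    import boto3

    iam = boto3.client("iam")

    # "metadata.xml" is the IdP metadata file saved from Okta in the previous section.
    with open("metadata.xml") as f:
        saml_metadata = f.read()

    response = iam.create_saml_provider(
        SAMLMetadataDocument=saml_metadata,
        Name="Okta",
    )

    # Note the provider ARN; it is needed for the federation role
    # and for the Okta application configuration later in this post.
    print(response["SAMLProviderArn"])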

Create a role for federated users

This section describes the steps for creating an IAM SAML 2.0 federation role. While Okta is used for single sign-on, there are two ways to provision users in Amazon QuickSight:

  • Grant the federation role permission to create new Amazon QuickSight users when a user visits for the first time.
  • Pre-provision Amazon QuickSight users using the API and add users to the appropriate groups. This is preferred for adding users to groups within Amazon QuickSight, because you can provision the user and add them to the groups at the same time.

The following steps demonstrate how to create a federation role with permission to create new Amazon QuickSight users. If you would rather pre-provision Amazon QuickSight users, instructions for using the API are at the end of this post.

  1. Open the IAM console.
  2. In the navigation pane, choose Roles, Create Role, Select type of trusted entity as SAML 2.0 federation.
  3. For SAML provider, select the IdP that you created in the previous steps (Okta).
  4. Select Allow programmatic and AWS Management Console access.
  5. Choose Next: Permissions, Create policy.
  6. In the Create policy window, navigate to the JSON tab and type the following:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRoleWithSAML",
                "Resource": "<YOUR SAML IDENTITY PROVIDER ARN>",
                "Condition": {
                    "StringEquals": {
                        "saml:aud": "https://signin.aws.amazon.com/saml"
                    }
                }
            },
            {
                "Action": [
                    "quicksight:CreateReader"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:quicksight::<YOUR ACCOUNT ID>:user/${aws:userid}"
                ]
            }
        ]
    }

The IAM policy above grants the federation role permission to self-provision an Amazon QuickSight reader with the quicksight:CreateReader action. Best practice is to grant users in your organization reader access, and then upgrade users from within the application. Instructions for upgrading users are at the end of this post.

If you would rather pre-provision Amazon QuickSight users using the API, do not include any actions in the permission policy.

  1. Choose Review Policy.
  2. For Name, enter a value (for example, QuicksightOktaFederatedPolicy) and choose Create policy.
  3. On the Create role page, choose Refresh and select your new policy.
  4. Choose Next: Tags and Next: Review.
  5. Provide a Role name (for example, QuicksightOktaFederatedRole) and Role description.
  6. Choose Create role.

Create an AWS access key for Okta

To create an access key for Okta, follow these steps.

  1. Open the IAM console.
  2. In the navigation pane, choose Users, Add user.
  3. For User name, enter a value (for example, OktaSSOUser).
  4. For Access type, choose Programmatic access.
  5. Choose Next: Permissions, Attach existing policies directly, and Create policy.
  6. On the Create policy page, navigate to the JSON tab and type the following:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:ListRoles",
                    "iam:ListAccountAliases"
                ],
                "Resource": "*"
            }
        ]
    }

  7. Choose Review Policy.
  8. For a Name, enter a value (for example, OktaListRolesPolicy) and choose Create policy.
  9. On the Create user page, choose Refresh and select your new policy.
  10. Choose Next: Tags, Next: Review, and Create user.
  11. To save your access key and secret key, choose Download .csv.

Configure the Okta application

Return to the window with your Okta admin dashboard.

  1. For Identity Provider ARN (Required only for SAML SSO), provide the ARN (for example, arn:aws:iam::<YOUR ACCOUNT ID>:saml-provider/Okta) of the IdP that you created in previous steps.
  2. Choose Done.
  3. From the Applications dashboard, choose Provisioning.
  4. Choose Configure API Integration.
  5. Select Enable API Integration.
  6. For Access Key and Secret Key, provide the access key and secret key that you downloaded in previous steps.
  7. Choose Test API Credentials, and then choose Save.
  8. From the SETTINGS pane, navigate to To App.
  9. Choose Edit.
  10. Enable Create Users and choose Save.
  11. Choose Assignments, Assign and then select the users or groups to which to grant federated access.
  12. Select the Roles and SAML User Roles to grant to the users, as shown in the following screenshot.
  13. Choose Save and Go Back, Done.

Launch Amazon QuickSight

Log in to your Okta Applications dashboard with a user that has been granted federated access (if you are using an admin account, switch to user mode). You should see a new application with your label (for example, Amazon QuickSight). Choose the application icon to launch Amazon QuickSight.

You can now manage your users and groups using Okta as your IdP and federate access to Amazon QuickSight.

Pre-provisioning Amazon QuickSight users

The outlined steps demonstrate how to grant users permission to self-provision Amazon QuickSight users when they visit Amazon QuickSight for the first time. If you would rather pre-provision Amazon QuickSight users, you can use the API to create users and groups and then add users to those groups.

  1. To create an Amazon QuickSight user, run the following AWS CLI command. Link the Amazon QuickSight user to your federated Okta username by providing the user-name parameter in the format <role name>\<email> (for example, QuicksightOktaFederatedRole\<YOUR EMAIL>).
    aws quicksight register-user \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --email=<YOUR EMAIL> \
        --user-name=<ROLE NAME>\<YOUR EMAIL> \
        --identity-type=QUICKSIGHT \
        --user-role=READER

  2. Optionally, create an Amazon QuickSight group.
    aws quicksight create-group \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --group-name="<YOUR GROUP NAME>" \
        --description="<YOUR GROUP DESCRIPTION>"

  3. Add users to groups.
    aws quicksight create-group-membership \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --group-name="<YOUR GROUP NAME>" \
        --member-name="<YOUR MEMBER USER NAME>"

By using the Amazon QuickSight API, you can manage users, groups, and group membership. After they’re created, groups automatically become available for use when modifying permissions to data sets, analyses, or dashboards by typing in the group name instead of a specific user. For other supported group and user management functions, see List of Actions by Function.
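The same pre-provisioning can be scripted with an AWS SDK instead of the AWS CLI. The following boto3 sketch mirrors the three commands above; the account ID, email address, role name, and group name are placeholders.

    import boto3

    quicksight = boto3.client("quicksight", region_name="us-east-1")

    ACCOUNT_ID = "111122223333"              # placeholder account ID
    ROLE_NAME = "QuicksightOktaFederatedRole"
    EMAIL = "user@example.com"               # placeholder email address

    # Register a reader whose user name matches the federated Okta identity.
    quicksight.register_user(
        AwsAccountId=ACCOUNT_ID,
        Namespace="default",
        IdentityType="QUICKSIGHT",
        Email=EMAIL,
        UserRole="READER",
        UserName=f"{ROLE_NAME}\\{EMAIL}",
    )

    # Optionally create a group and add the user to it.
    quicksight.create_group(
        AwsAccountId=ACCOUNT_ID,
        Namespace="default",
        GroupName="analytics-readers",
        Description="Readers for the analytics dashboards",
    )
    quicksight.create_group_membership(
        AwsAccountId=ACCOUNT_ID,
        Namespace="default",
        GroupName="analytics-readers",
        MemberName=f"{ROLE_NAME}\\{EMAIL}",
    )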

Managing users

You can upgrade users between reader and author or admin in the Manage users tab of the Manage QuickSight screen.

  1. In the Amazon QuickSight console, choose your user name in the upper-right corner and choose Manage QuickSight.
  2. In the navigation pane, choose Manage users.
  3. Locate the user to upgrade, and select the role to grant from the Role drop-down.

Deep-linking dashboards

Amazon QuickSight dashboards can be shared using the Okta application’s single sign-on URL so that users can be federated directly to specific dashboards.

To deep link to a specific Amazon QuickSight dashboard with single sign-on, first locate the Okta application’s single sign-on URL. This can be found by opening the metadata XML file that you downloaded in the Create an Okta application steps above. The URL is the value of the Location attribute in the md:SingleSignOnService element and ends with /sso/saml.

After you have the Okta application’s single sign-on URL, append ?RelayState= to the end of the URL followed by the URL to your Amazon QuickSight dashboard. For example, your deep link URL might look as follows:

https://my-test-org.okta.com/app/amazon_aws/abcdefg12345XYZ678/sso/saml?RelayState=https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/11111111-abcd-1234-efghi-111111111111
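A small Python sketch for assembling such a deep link is shown below; the single sign-on URL and dashboard ID are made-up examples. Percent-encoding the RelayState value is optional for simple URLs like the one above, but it keeps the link valid if the dashboard URL ever contains reserved characters.

    from urllib.parse import quote

    # Hypothetical values; replace with your Okta app's single sign-on URL
    # and your Amazon QuickSight dashboard URL.
    sso_url = "https://my-test-org.okta.com/app/amazon_aws/abcdefg12345XYZ678/sso/saml"
    dashboard_url = (
        "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/"
        "11111111-abcd-1234-efghi-111111111111"
    )

    deep_link = f"{sso_url}?RelayState={quote(dashboard_url, safe='')}"
    print(deep_link)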

By deep-linking dashboards, you can provide users with a way to use single sign-on and directly access specific dashboards.

Summary

This post provided a step-by-step guide for configuring Okta as your IdP, and using IAM roles to enable single sign-on to Amazon QuickSight. It also showed how users and groups can be managed using the Amazon QuickSight API.

Although this post demonstrated the integration of IAM and Okta, you can replicate this solution using your choice of SAML 2.0 IdPs. For other supported federation options, see Enabling Single Sign-On Access to Amazon QuickSight Using SAML 2.0.

If you have any questions or feedback, please leave a comment.

 


About the Authors

Loc Trinh is a solutions architect at Amazon Web Services.

 

 

 

 

Naresh Gautam is a senior solutions architect at Amazon Web Services.

 

 

 

Create advanced insights using Level Aware Aggregations in Amazon QuickSight

Post Syndicated from Arun Baskar original https://aws.amazon.com/blogs/big-data/create-advanced-insights-using-level-aware-aggregations-in-amazon-quicksight/

Amazon QuickSight recently launched Level Aware Aggregations (LAA), which enables you to perform calculations on your data to derive advanced and meaningful insights. In this blog post, we go through examples of applying these calculations to a sample sales dataset so that you can start using these for your own needs.

What are Level Aware Aggregations?

Level aware aggregations are aggregation calculations that can be computed at a desired level in the overall query evaluation order of QuickSight. Please check this link for details on QuickSight’s Order of Evaluation. Up until now, the only types of aggregations possible in QuickSight were Display-level and Table calculation aggregation types.

  • Display-level aggregations are aggregations that are defined by the dimensions and metrics present in the field wells of a QuickSight visual.
  • Table calculations are computed by windowing/rolling-up over the display-level aggregated values of the visual. Hence, by definition, these are calculated after the Display-level aggregations are computed.

With Level Aware Aggregations, QuickSight now allows you to aggregate values before the Display-level aggregation. For more information, please visit the Level Aware Aggregations documentation.

Customer use cases

Distribution of customers by lifetime orders

Customer question: How many customers have made one order, two orders, three orders, and so forth?

In this case, we first want to aggregate the total number of orders made by each customer, then use the output of that as a visual dimension. This isn’t feasible to compute without LAA.

Solution using LAA

1.) Compute the number of orders per customer.

Calculated field name : NumberOrdersPerCustomer

Calculated field expression : countOver({order_id}, [{Customer Id}], PRE_AGG)

This computes the number of orders per customer, before the display-level aggregation of the visual.

2.) Create the visual.

Create the visual with the above field NumberOrdersPerCustomer in the “X-Axis” well of the Field Wells. Add “Count Distinct” of “Customer Id” in the “Value” section of the Field Wells to create a histogram of the number of orders made by customers.

As we can see, there are around 5000 unique customers with one order, around 3500 customers with two orders, and so on.

Filter out customers based on lifetime spends

Customer Question: How do I filter out customers with life-time spend less than $100,000? My visual’s dimension (group by) and metric definitions are independent of total spend per customer.

If the grouping dimensions of the aforementioned aggregation (spend) are exactly the same as the group dimensions in the field well, the customer can achieve this using the aggregated filters feature. But that’s not always the case. As mentioned in the customer question, the visual’s definition can be different from the filter’s aggregation.

Solution using LAA

1.) Compute sum of sales per customer.

Calculated field name :  salesPerCustomer

Calculated field expression : sumOver(sales,[{customer_id}],PRE_AGG)

PRE_AGG indicates that the computation must occur before display-level aggregation.
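If it helps to reason about what PRE_AGG does, the following pandas sketch (not QuickSight syntax, just an analogy on a hypothetical orders table) computes the same per-customer total on the unaggregated rows, before any display-level grouping, and then applies the filter.

    import pandas as pd

    # Hypothetical unaggregated order rows.
    orders = pd.DataFrame({
        "customer_id": ["c1", "c1", "c2", "c3"],
        "segment": ["SMB", "SMB", "Enterprise", "SMB"],
        "sales": [40_000, 70_000, 250_000, 15_000],
    })

    # Equivalent of sumOver(sales, [{customer_id}], PRE_AGG): every row gets its
    # customer's total spend, before any display-level aggregation.
    orders["salesPerCustomer"] = orders.groupby("customer_id")["sales"].transform("sum")

    # A filter on salesPerCustomer > 100,000 then acts on rows, independently of
    # whatever dimensions the visual later groups by (here, segment).
    filtered = orders[orders["salesPerCustomer"] > 100_000]
    print(filtered.groupby("segment")["sales"].sum())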

2.) Create the visuals.

The visual on the left shows sum of sales per segment and the visual on the right shows the total number of customers. Note that there are no filters applied at this point.

3.) Create the filter on salesPerCustomer. 

Create a filter on top of the above field salesPerCustomer to select items greater than $100,000.

4.) Apply the filter.

The above image shows applying the filter on “salesPerCustomer” greater than $100,000.

With the filter applied, we have excluded the customers whose total spend is less than $100,000, regardless of what we choose to display in the visuals.

Fixed percent of total sales even with filters applied

Customer Question: How much is the contribution of each industry to the entire company’s profit (percent of total)? I don’t want the total to recompute when filters are applied.

The existing table calculation function percentOfTotal isn’t able to solve this problem, since filters on categories are applied before computing the total. Using percentOfTotal would recalculate the total every time filters are applied. We need a solution that doesn’t consider the filtering when computing the total.

Solution using LAA

1.) Compute total sales before filters through a calculated field.

Calculated field name : totalSalesBeforeFilters

Calculated field expression : sumOver(sales,[],PRE_FILTER)

PRE_FILTER indicates that this computation must be done prior to applying filters.

The partition dimension list (second argument) is empty since we want to compute the overall total.

2.) Compute the fixed percent of total sales.

Calculated field name : fixedPercentOfTotal

Calculated field expression : sum(sales) / min(totalSalesBeforeFilters)

Note: totalSalesBeforeFilters is the same for every row of the unaggregated data. Since we want to use it post-aggregation, we are using the aggregation min on top of it. If all values are the same, max or avg aggregations can be used as well as it serves the same purpose.

3.) Create the visual.

Add “industry” field to “Rows” well. Add “sales (SUM)” and “fixedPercentOfTotal“ to the ”values“ section. Now, the percent of total metric would remain fixed even if we filter out the data based on any underlying dimension or measure.

The visual shows sales per industry along with percent of total, computed using the table calculation percentOfTotal and using Level Aware Aggregation as described above. Both the percent of total values are currently the same since there aren’t any filters applied.

The visual shows the same metrics but with industries filtered to only 5 of them. As we can see, “Percent of total sales” got re-adjusted to represent only the filtered data, whereas “Fixed Percent of total sales” remains the same even after filtering. Both metrics represent valuable customer use cases that are now feasible through QuickSight.

Compare sales in a category to industry average

Customer question: How do I compare sales in a category to the industry average? I want the industry average to include all categories even after filtering.

Since we want the industry average to stay fixed even with filtering, we need PRE_FILTER aggregation to achieve this.

Solution using LAA

1.) Compute the industry average.

Calculated field name : IndustryAverage

Calculated field expression : avgOver(sumOver(sales,[{category}],PRE_FILTER),[],PRE_FILTER)

We first compute the sum of sales per category and then average it across all categories. It’s important to note here that we first computed a finer level aggregation and fed that into a coarser level aggregation.

2.) Compute the difference from IndustryAverage.

Calculated field name : FixedDifferenceFromIndustryAverage

Calculated field expression : sum(sales) - min(IndustryAverage)

As mentioned in one of the examples above, we use the min aggregation to retain the value when it is used post-aggregation.

3.) Create the visual.

Create the visual by adding “Category” in “X axis” field well and SUM(Sales), IndustryAverage and FixedDifferenceFromIndustryAverage as the values in a bar chart.

Visual shows total sales per category, the average across all industries and each category’s difference from average.

This visual shows the same metrics, but with categories filtered to include only 6 of them. As we can see, the industry average remained the same before and after filtering, keeping the difference the same whether you choose to show all categories, some of them, or just one.

Categorize customers based on lifetime spend

Customer question: How do I classify customers based on cumulative sales contribution? I then want to use that classification as my visual’s grouping.

The objective here is to create custom-sized bins to classify the customers. Even though we could do this classification after display-level aggregation, we wouldn’t be able to use it as a dimension/group-by in the visual.

Solution using LAA

1.) Compute sales per customer before display-level aggregation.

Calculated field name : salesPerCustomer

Calculated field expression : sumOver({sales amount},[{customer id}],PRE_AGG)

2.) Categorize Customers.

Calculated field name : Customer Category

Calculated field expression : ifelse(salesPerCustomer < 1000, "VERY_LOW", salesPerCustomer < 10000, "LOW", salesPerCustomer < 100000, "MEDIUM", "HIGH")

3.) Create the visual.

Create the visual by adding “Customer Category” to the “Y-axis” field well, “Count Distinct” of “customer id” to the value field well.

Above image shows the number of unique customers per Custom Category.

Filtering can be done on top of these categories as well to build other relevant visuals, since the categories are tagged before aggregation.

Above image shows the number of unique customers per custom category split by gender.

Availability

Level aware aggregations are available in both Standard and Enterprise editions, in all supported AWS Regions. For more information, see the Amazon QuickSight documentation.

 


About the Author

Arun Baskar is a software development engineer for QuickSight at Amazon Web Services.

 

 

 

 

Learn about AWS Services & Solutions – September AWS Online Tech Talks

Post Syndicated from Jenny Hang original https://aws.amazon.com/blogs/aws/learn-about-aws-services-solutions-september-aws-online-tech-talks/

Learn about AWS Services & Solutions – September AWS Online Tech Talks

AWS Tech Talks

Join us this September to learn about AWS services and solutions. The AWS Online Tech Talks are live, online presentations that cover a broad range of topics at varying technical levels. These tech talks, led by AWS solutions architects and engineers, feature technical deep dives, live demonstrations, customer examples, and Q&A with AWS experts. Register Now!

Note – All sessions are free and in Pacific Time.

Tech talks this month:

 

Compute:

September 23, 2019 | 11:00 AM – 12:00 PM PT – Build Your Hybrid Cloud Architecture with AWS – Learn about the extensive range of services AWS offers to help you build a hybrid cloud architecture best suited for your use case.

September 26, 2019 | 1:00 PM – 2:00 PM PT – Self-Hosted WordPress: It’s Easier Than You Think – Learn how you can easily build a fault-tolerant WordPress site using Amazon Lightsail.

October 3, 2019 | 11:00 AM – 12:00 PM PT – Lower Costs by Right Sizing Your Instance with Amazon EC2 T3 General Purpose Burstable Instances – Get an overview of T3 instances, understand what workloads are ideal for them, and understand how the T3 credit system works so that you can lower your EC2 instance costs today.

 

Containers:

September 26, 2019 | 11:00 AM – 12:00 PM PT – Develop a Web App Using Amazon ECS and AWS Cloud Development Kit (CDK) – Learn how to build your first app using CDK and AWS container services.

 

Data Lakes & Analytics:

September 26, 2019 | 9:00 AM – 10:00 AM PT – Best Practices for Provisioning Amazon MSK Clusters and Using Popular Apache Kafka-Compatible Tooling – Learn best practices on running Apache Kafka production workloads at a lower cost on Amazon MSK.

 

Databases:

September 25, 2019 | 1:00 PM – 2:00 PM PT – What’s New in Amazon DocumentDB (with MongoDB compatibility) – Learn what’s new in Amazon DocumentDB, a fully managed MongoDB compatible database service designed from the ground up to be fast, scalable, and highly available.

October 3, 2019 | 9:00 AM – 10:00 AM PT – Best Practices for Enterprise-Class Security, High-Availability, and Scalability with Amazon ElastiCache – Learn about new enterprise-friendly Amazon ElastiCache enhancements like customer managed key and online scaling up or down to make your critical workloads more secure, scalable and available.

 

DevOps:

October 1, 2019 | 9:00 AM – 10:00 AM PT – CI/CD for Containers: A Way Forward for Your DevOps Pipeline – Learn how to build CI/CD pipelines using AWS services to get the most out of the agility afforded by containers.

 

Enterprise & Hybrid:

September 24, 2019 | 1:00 PM – 2:30 PM PT – Virtual Workshop: How to Monitor and Manage Your AWS Costs – Learn how to visualize and manage your AWS cost and usage in this virtual hands-on workshop.

October 2, 2019 | 1:00 PM – 2:00 PM PT – Accelerate Cloud Adoption and Reduce Operational Risk with AWS Managed Services – Learn how AMS accelerates your migration to AWS, reduces your operating costs, improves security and compliance, and enables you to focus on your differentiating business priorities.

 

IoT:

September 25, 2019 | 9:00 AM – 10:00 AM PT – Complex Monitoring for Industrial with AWS IoT Data Services – Learn how to solve your complex event monitoring challenges with AWS IoT Data Services.

 

Machine Learning:

September 23, 2019 | 9:00 AM – 10:00 AM PT – Training Machine Learning Models Faster – Learn how to train machine learning models quickly and with a single click using Amazon SageMaker.

September 30, 2019 | 11:00 AM – 12:00 PM PT – Using Containers for Deep Learning Workflows – Learn how containers can help address challenges in deploying deep learning environments.

October 3, 2019 | 1:00 PM – 2:30 PM PT – Virtual Workshop: Getting Hands-On with Machine Learning and Ready to Race in the AWS DeepRacer League – Join DeClercq Wentzel, Senior Product Manager for AWS DeepRacer, for a presentation on the basics of machine learning and how to build a reinforcement learning model that you can use to join the AWS DeepRacer League.

 

AWS Marketplace:

September 30, 2019 | 9:00 AM – 10:00 AM PT – Advancing Software Procurement in a Containerized World – Learn how to deploy applications faster with third-party container products.

 

Migration:

September 24, 2019 | 11:00 AM – 12:00 PM PT – Application Migrations Using AWS Server Migration Service (SMS) – Learn how to use AWS Server Migration Service (SMS) for automating application migration and scheduling continuous replication, from your on-premises data centers or Microsoft Azure to AWS.

 

Networking & Content Delivery:

September 25, 2019 | 11:00 AM – 12:00 PM PT – Building Highly Available and Performant Applications using AWS Global Accelerator – Learn how to build highly available and performant architectures for your applications with AWS Global Accelerator, now with source IP preservation.

September 30, 2019 | 1:00 PM – 2:00 PM PT – AWS Office Hours: Amazon CloudFront – Just getting started with Amazon CloudFront and Lambda@Edge? Get answers directly from our experts during AWS Office Hours.

 

Robotics:

October 1, 2019 | 11:00 AM – 12:00 PM PT – Robots and STEM: AWS RoboMaker and AWS Educate Unite! – Come join members of the AWS RoboMaker and AWS Educate teams as we provide an overview of our education initiatives and walk you through the newly launched RoboMaker Badge.

 

Security, Identity & Compliance:

October 1, 2019 | 1:00 PM – 2:00 PM PT – Deep Dive on Running Active Directory on AWS – Learn how to deploy Active Directory on AWS and start migrating your Windows workloads.

 

Serverless:

October 2, 2019 | 9:00 AM – 10:00 AM PT – Deep Dive on Amazon EventBridge – Learn how to optimize event-driven applications, and use rules and policies to route, transform, and control access to these events that react to data from SaaS apps.

 

Storage:

September 24, 2019 | 9:00 AM – 10:00 AM PT – Optimize Your Amazon S3 Data Lake with S3 Storage Classes and Management Tools – Learn how to use the Amazon S3 Storage Classes and management tools to better manage your data lake at scale and to optimize storage costs and resources.

October 2, 2019 | 11:00 AM – 12:00 PM PT – The Great Migration to Cloud Storage: Choosing the Right Storage Solution for Your Workload – Learn more about AWS storage services and identify which service is the right fit for your business.

 

 

Implement perimeter security in EMR using Apache Knox

Post Syndicated from Varun Rao Bhamidimarri original https://aws.amazon.com/blogs/big-data/implement-perimeter-security-in-emr-using-apache-knox/

Perimeter security helps secure Apache Hadoop cluster resources for users accessing them from outside the cluster. It enables a single access point for all REST and HTTP interactions with Apache Hadoop clusters and simplifies client interaction with the cluster. For example, client applications must acquire Kerberos tickets using kinit or SPNEGO before interacting with services on Kerberos-enabled clusters. In this post, we walk through the setup of Apache Knox to enable perimeter security for EMR clusters.

It provides the following benefits:

  • Simplify authentication of various Hadoop services and UIs
  • Hide service-specific URLs and ports by acting as a proxy
  • Enable SSL termination at the perimeter
  • Ease management of published endpoints across multiple clusters

Overview

Apache Knox

Apache Knox provides a gateway to access Hadoop clusters using REST API endpoints. It simplifies client’s interaction with services on the Hadoop cluster by integrating with enterprise identity management solutions and hiding cluster deployment details.

In this post, we run the following setup:

  • Create a virtual private cloud (VPC) based on the Amazon VPC service.
  • Provision an Amazon EC2 Windows instance for the Active Directory domain controller.
  • Create an Amazon EMR security configuration for Kerberos and cross-realm trust.
  • Set up Knox on the EMR master node and enable LDAP authentication.

Visually, we are creating the following resources:

Figure 1: Provisioned infrastructure from CloudFormation

Prerequisites and assumptions

Before getting started, the following prerequisites must be met:

IMPORTANT: The templates use hardcoded user names and passwords and open security groups. They are not intended for production use without modification.

NOTE:

  • A single VPC has been used to simplify networking.
  • CloudFormation templates use hardcoded user names and passwords and open security groups for simplicity.

Implementation

Single-click solution deployment

If you don’t want to set up each component individually, you can use the single-step AWS CloudFormation template. The single-step template is a master template that uses nested stacks (additional templates) to launch and configure all the resources for the solution in one operation.

To launch the entire solution, choose the Launch Stack button below, which directs you to the console. Do not change to a different Region, because the template is designed to work only in the US-EAST-1 Region.

This template requires several parameters that you must provide. See the table below, noting the parameters marked with *, for which you have to provide values. The remaining parameters have default values and should not be edited.

For each of the following parameters, use the value described:

  1. Domain Controller Name: DC1
  2. Active Directory domain: awsknox.com
  3. Domain NetBIOS name: AWSKNOX (NetBIOS name of the domain, up to 15 characters).
  4. Domain admin user: User name for the account to be added as Domain administrator (awsadmin).
  5. Domain admin password *: Password for the domain admin user. Must be at least eight characters containing letters, numbers, and symbols – for example, CheckSum123.
  6. Key pair name *: Name of an existing EC2 key pair to enable access to the domain controller instance.
  7. Instance type: Instance type for the domain controller EC2 instance.
  8. LDAP Bind user name: LDAP Bind user name. Default value is: CN=awsadmin,CN=Users,DC=awsknox,DC=com
  9. EMR Kerberos realm: EMR Kerberos realm name. This is usually the VPC’s domain name in upper case letters, for example EC2.INTERNAL.
  10. Cross-realm trust password *: Password for the cross-realm trust, for example CheckSum123.
  11. Trusted Active Directory Domain: The Active Directory domain that you want to trust. This is the same as the Active Directory domain name, but in upper case letters. Default value is “AWSKNOX.COM”.
  12. Instance type: Instance type for the domain controller EC2 instance. Default: m4.xlarge
  13. Instance count: Number of core instances of the EMR cluster. Default: 2
  14. Allowed IP address: The client IP address that can reach your cluster. Specify an IP address range in CIDR notation (for example, 203.0.113.5/32). By default, only the VPC CIDR (10.0.0.0/16) can reach the cluster. Be sure to add your client IP range so that you can connect to the cluster using SSH.
  15. EMR applications: Comma-separated list of applications to install on the cluster. By default it selects “Hadoop,” “Spark,” “Ganglia,” “Hive” and “HBase”.
  16. LDAP search base: LDAP search base. Only value is: “CN=Users,DC=awshadoop,DC=com”
  17. LDAP search attribute: LDAP user search attribute. Only value is: “sAMAccountName”
  18. LDAP user object class: LDAP user object class value. Only value is: “person”
  19. LDAP group search base: LDAP group search base value. Only value is: “dc=awshadoop, dc=com”
  20. LDAP group object class: LDAP group object class. Only value is: “group”
  21. LDAP member attribute: LDAP member attribute. Only value is: “member”
  22. EMRLogDir *: An Amazon S3 bucket where the EMR logs are stored. Also provide “s3://” as the prefix.
  23. S3 Bucket: Amazon S3 bucket where the artifacts are stored. In this case, all the artifacts are stored in the “aws-bigdata-blog” public S3 bucket. Do not change this value.

Deploying each component individually

If you used the CloudFormation Template in the single-step solution, you can skip this section and start from the Access the Cluster section. This section describes how to use AWS CloudFormation templates to perform each step separately in the solution.

1.     Create and configure an Amazon VPC

In this step, we set up an Amazon VPC, a public subnet, an internet gateway, a route table, and a security group.

In order for you to establish a cross-realm trust between an Amazon EMR Kerberos realm and an Active Directory domain, your Amazon VPC must meet the following requirements:

  • The subnet used for the Amazon EMR cluster must have a CIDR block of fewer than nine digits (for example, 10.0.1.0/24).
  • Both DNS resolution and DNS hostnames must be enabled (set to “yes”).
  • The Active Directory domain controller must be the DNS server for instances in the Amazon VPC (this is configured in the next step).

To launch directly through the console, choose Launch Stack.

2.     Launch and configure an Active Directory domain controller

In this step, you use an AWS CloudFormation template to automatically launch and configure a new Active Directory domain controller and cross-realm trust.

Next, the template launches a Windows EC2 instance and installs and configures an Active Directory domain controller. In addition to launching and configuring the Active Directory domain controller and cross-realm trust, this AWS CloudFormation template also sets the domain controller as the DNS server (name server) for your Amazon VPC.

To launch directly through the console, choose Launch Stack.

3.     Launch and configure EMR cluster with Apache Knox

To launch a Kerberized Amazon EMR cluster, we must first create a security configuration containing the cross-realm trust configuration. For more details, see the blog post Use Kerberos Authentication to Integrate Amazon EMR with Microsoft Active Directory.

In addition to the steps described in that post, this template adds an EMR step that creates a Kerberos principal for Knox.

The CloudFormation script also updates the following parameters in the core-site.xml, hive-site.xml, hcatalog-webhcat-site.xml, and oozie-site.xml files. You can see these in the create_emr.py script. After the EMR cluster is created, it also runs a shell script as an EMR step. This shell script downloads and installs the Knox software on the EMR master node and creates a Knox topology file with the name emr-cluster-top.
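If you want to confirm that the Knox installation step completed, you can check for the topology file on the EMR master node. This is a quick sanity check assuming the default layout used by this setup, where Knox is installed under /home/knox/knox:

$ sudo su - knox
$ ls /home/knox/knox/conf/topologies/

You should see the emr-cluster-top topology file (an .xml file) listed.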

To launch directly through the console, choose Launch Stack.

Accessing the cluster

API access to Hadoop Services

One of the main reasons to use Apache Knox is to isolate the Hadoop cluster from direct user connectivity. Below, we demonstrate how you can interact with several Hadoop services such as WebHDFS, WebHCat, Oozie, HBase, Hive, and YARN applications through the Knox endpoint using REST API calls. The REST calls can be made from the EMR cluster or from outside it. However, in a production environment, the EMR cluster's security groups should be set to allow traffic only on Knox's port number, blocking traffic to all other applications.

For the purposes of this blog, we make the REST calls from the EMR cluster by SSH'ing to the master node of the EMR cluster using the LDAP credentials:

ssh awsadmin@<EMR-Master-Machine-Public-DNS>

Replace <EMR-Master-Machine-Public-DNS> with the public DNS of the EMR cluster's master node. You can find this value in the outputs of the CloudFormation stack you deployed in Step 3 above.

You are prompted for the ‘awsadmin’ LDAP password. Please use the password you selected during the CloudFormation stack creation.

NOTE: To connect, your client machine's IP address must fall within the CIDR range specified by the "Allowed IP address" CloudFormation parameter. If you cannot connect to the master node, check that the EMR cluster master instance's security group has a rule that allows traffic from your client. Otherwise, your organization's firewall may be blocking your traffic.

Demonstrating access to the WebHDFS service API:

Here we invoke the LISTSTATUS operation on WebHDFS via the Knox gateway. In this setup, Knox runs on port 8449. The following command returns a directory listing of the HDFS root directory.

curl -ku awsadmin 'https://localhost:8449/gateway/emr-cluster-top/webhdfs/v1/?op=LISTSTATUS'

You can use either "localhost" or the private DNS of the EMR master node.

You are prompted for the password. This is the same “Domain admin password” that was passed as the parameter into the CloudFormation stack.

Demonstrating access to the Resource Manager service API:

The Resource Manager REST API provides information about the Hadoop cluster, such as the cluster status and the applications running on it. Use the following command to get the cluster information.

curl -ikv -u awsadmin -X GET 'https://localhost:8449/gateway/emr-cluster-top/resourcemanager/v1/cluster'

You are prompted for the password. This is the same “Domain admin password” that was passed as the parameter into the CloudFormation stack.

Demonstrating connecting to Hive using Beeline through Apache Knox:

We can use Beeline, a JDBC client tool, to connect to HiveServer2. Here we connect to Hive through Knox using Beeline.

Use the following command to start the Beeline shell:

$ beeline

Use the following syntax to connect to Hive from Beeline:

!connect jdbc:hive2://<EMR-Master-Machine-Public-DNS>:8449/;transportMode=http;httpPath=gateway/emr-cluster-top/hive;ssl=true;sslTrustStore=/home/knox/knox/data/security/keystores/gateway.jks;trustStorePassword=CheckSum123

NOTE: You must update the <EMR-Master-Machine-Public-DNS> with the public DNS name of the EMR master node.

Demonstrating submitting a Spark job using Apache Livy through Apache Knox

You can use the following command to submit a Spark job to an EMR cluster. In this example, we run the SparkPi program that is available in spark-examples.jar.

curl -i -k -u awsadmin -X POST --data '{"file": "s3://aws-bigdata-blog/artifacts/aws-blog-emr-knox/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi", "args": ["100"]}' -H "Content-Type: application/json" https://localhost:8449/gateway/emr-cluster-top/livy/v1/batches

You can use either "localhost" or the private DNS of the EMR master node.

Securely accessing Hadoop Web UIs

In addition to providing API access to Hadoop clusters, Knox also provides a proxying service for Hadoop UIs. The available UIs are listed below:

  1. Resource Manager: https://<EMRClusterURL>:8449/gateway/emr-cluster-top/yarn/
  2. Ganglia: https://<EMRClusterURL>:8449/gateway/emr-cluster-top/ganglia/
  3. Apache HBase: https://<EMRClusterURL>:8449/gateway/emr-cluster-top/hbase/webui/master-status
  4. WebHDFS: https://<EMRClusterURL>:8449/gateway/emr-cluster-top/hdfs/
  5. Spark History: https://<EMRClusterURL>:8449/gateway/emr-cluster-top/sparkhistory/

On the first visit to any of the UIs above, you are prompted for login credentials. Enter the login user awsadmin and the password you specified as a parameter to your CloudFormation template.

You can now browse the UI as if you were directly connected to the cluster. Below is a sample of the YARN UI:

And the scheduler information in the YARN UI:

Ganglia:

Spark History UI:

Lastly, the HBase UI. You must provide the entire URL to the "master-status" page, as shown in the list above.

Troubleshooting

It's not always clear what the problem is when there is an error interacting with Apache Knox. Below are a few troubleshooting steps.

I cannot connect to the UI. I do not get any error codes.

  • Apache Knox may not be running. Check that it's running by logging in to the master node of your cluster and running “ps -ef | grep knox”. There should be a Knox process in the output.
ps -ef | grep knox
Knox 114022 1 0 Aug24 ? 00:04:21 /usr/lib/jvm/java/bin/java -Djava.library.path=/home/knox/knox/ext/native -jar /home/knox/knox/bin/gateway.jar

If the process is not running, start it by running “/home/knox/knox/bin/gateway.sh start” as the knox user (sudo su - knox).

  • Your browser may not have connectivity to the cluster. Even though you may be able to SSH to the cluster, a firewall rule or security group rule may be preventing traffic on the port that Knox is running on. You can route traffic through SSH by building an SSH tunnel and enabling port forwarding, as in the sketch below.
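For example, the following is a minimal sketch of an SSH tunnel that forwards the Knox port (8449 in this setup) from the master node to your local machine, using the same LDAP credentials as before; replace the placeholder with your master node's public DNS:

$ ssh -N -L 8449:localhost:8449 awsadmin@<EMR-Master-Machine-Public-DNS>

While the tunnel is open, you can reach the proxied UIs at https://localhost:8449/gateway/emr-cluster-top/yarn/ and the other paths listed above.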

I get an HTTP 400, 404, or 503 code when accessing a UI:

  • Ensure that the URL you are entering is correct. If you do not enter the correct path, then Knox provides an HTTP 404.
  • There may be an issue with the routing rules within Apache Knox, so that it does not know how to route the requests. The Knox logs are at INFO level by default and are available in /home/knox/knox/logs/. To change the logging level, change the following lines in /home/knox/knox/conf/gateway-log4j.properties

    from:

    log4j.logger.org.apache.knox.gateway=INFO
    #log4j.logger.org.apache.knox.gateway=DEBUG

    to:

    #log4j.logger.org.apache.knox.gateway=INFO
    log4j.logger.org.apache.knox.gateway=DEBUG

    The logs then provide much more information, such as how Knox is rewriting URLs, which can help you see whether Knox is translating URLs correctly.
  • You can use the following ldap, knoxcli, and curl commands to verify that the setup is correct. Run these commands as the knox user.
  • To verify the search base, search attribute, and user object class, run the following ldapsearch command:
    ldapsearch -h <Active-Directory-Domain-Private-IP-Address> -p 389 -x -D 'CN=awsadmin,CN=Users,DC=awsknox,DC=com' -w 'CheckSum123' -b 'CN=Users,DC=awsknox,DC=com' -z 5 '(objectClass=person)' sAMAccountName

  • Replace “<Active-Directory-Domain-Private-IP-Address>” with the private IP address of the Active Directory EC2 instance. You can get this IP address from the output of the second CloudFormation template.
  • To verify the values for the server host, port, user name, and password, run the following ldapwhoami command:
    ldapwhoami -h <Active-Directory-Domain-Private-IP-Address> -p 389 -x -D 'CN=awsadmin,CN=Users,DC=awsknox,DC=com' -w 'CheckSum123'

  • Replace “<Active-Directory-Domain-Private-IP-Address>” with the private IP address of the Active Directory EC2 instance. You can get this IP address from the output of the second CloudFormation template.
  • The command should succeed and display the identity of the bind user.

  • To verify whether the system LDAP bind is successful, run the following command:
    /home/knox/knox/bin/knoxcli.sh user-auth-test --cluster emr-cluster-top --u awsadmin --p 'CheckSum123'

  • Here “emr-cluster-top” is the topology file that defines the applications that are available and the endpoints that Knox connects to in order to serve each application.
  • The command should return the following output:

“System LDAP Bind successful!”

  • To verify whether LDAP authentication is successful, run the following command:
    /home/knox/knox/bin/knoxcli.sh user-auth-test --cluster emr-cluster-top --u awsadmin --p 'CheckSum123'

  • Here “emr-cluster-top” is the topology file name that we created.
  • The command should return the following output:

“LDAP authentication successful!”

  • Verify that WebHDFS is reachable directly using the service.
  • First, obtain a valid Kerberos TGT using the kinit command, and then call WebHDFS with curl:
    kinit -kt /mnt/var/lib/bigtop_keytabs/knox.keytab knox/<EMR-Master-Machine-Private-DNS>@EC2.INTERNAL
    curl --negotiate -u : http://<EMR-Master-Machine-Private-DNS>:50070/webhdfs/v1/?op=GETHOMEDIRECTORY

  • For example: EMR-Master-Machine-Private-DNS appears in this format: ip-xx-xx-xx-xx.ec2.internal
  • It should return a JSON object containing a “Path” variable of the user’s home directory.

Cleanup

Delete the CloudFormation stack to clean up all the resources created for this setup. If you used the nested stack, CloudFormation deletes all resources in one operation. If you deployed the templates individually, delete them in the reverse order of creation, deleting the VPC stack last.
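If you prefer the AWS CLI, you can delete a stack with a command like the following; the stack name is a placeholder:

$ aws cloudformation delete-stack --stack-name <your-stack-name>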

Conclusion

In this post, we went through the setup, configuration, and validation of perimeter security for EMR clusters using Apache Knox, which helps simplify authentication for various Hadoop services. In our next post, we will show you how to integrate Apache Knox and Apache Ranger to enable authorization and audits.

Stay tuned!

About the Authors


Varun Rao is an enterprise solutions architect. He works with enterprise customers in their journey to the cloud, with a focus on data strategy and security. In his spare time, he tries to keep up with his 4-year-old.

 

 

 

Mert Hocanin is a big data architect with AWS, covering several products, including EMR, Athena, and Managed Blockchain. Prior to working at AWS, he worked on Amazon.com's retail business as a Senior Software Development Engineer, building a data lake to process vast amounts of data from all over the company for reporting purposes. When not building and designing data lakes, Mert enjoys traveling and food.

 

 

 

Srikanth Kodali is a Sr. IoT data analytics architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on building IoT data and analytics solutions, helping them improve the value of their solutions when using AWS.

 

 

 

Run Spark applications with Docker using Amazon EMR 6.0.0 (Beta)

Post Syndicated from Paul Codding original https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/

The Amazon EMR team is excited to announce the public beta release of EMR 6.0.0 with Spark 2.4.3, Hadoop 3.1.0, Amazon Linux 2, and Amazon Corretto 8. With this beta release, Spark users can use Docker images from Docker Hub and Amazon Elastic Container Registry (Amazon ECR) to define environment and library dependencies. Using Docker, users can easily define their dependencies and use them for individual jobs, avoiding the need to install dependencies on individual cluster hosts.

This post shows you how to use Docker with the EMR release 6.0.0 Beta. You’ll learn how to launch an EMR release 6.0.0 Beta cluster and run Spark jobs using Docker containers from both Docker Hub and Amazon ECR.

Hadoop 3 Docker support

EMR 6.0.0 (Beta) includes Hadoop 3.1.0, which allows the YARN NodeManager to launch containers either directly on the host machine of the cluster, or inside a Docker container. Docker containers provide a custom execution environment in which the application’s code runs isolated from the execution environment of the YARN NodeManager and other applications.

These containers can include special libraries needed by the application, and even provide different versions of native tools and libraries such as R, Python, and Python libraries. This allows you to easily define the libraries and runtime dependencies that your applications need, using familiar Docker tooling.

Clusters running the EMR 6.0.0 (Beta) release are configured by default to allow YARN applications such as Spark to run using Docker containers. To customize this, use the configuration for Docker support defined in the yarn-site.xml and container-executor.cfg files available in the /etc/hadoop/conf directory. For details on each configuration option and how it is used, see Launching Applications Using Docker Containers.
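For example, you can inspect the Docker-related settings that the cluster ships with by looking at those files on a cluster node; this is just a read-only check:

$ grep -i docker /etc/hadoop/conf/yarn-site.xml
$ sudo cat /etc/hadoop/conf/container-executor.cfg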

You can choose to use Docker when submitting a job. On job submission, the following variables are used to specify the Docker runtime and Docker image used:

    • YARN_CONTAINER_RUNTIME_TYPE=docker
    • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={DOCKER_IMAGE_NAME}

When you use Docker containers to execute your YARN applications, YARN downloads the Docker image specified when you submit your job. For YARN to resolve this Docker image, it must be configured with a Docker registry. Options to configure a Docker registry differ based on how you chose to deploy EMR (using either a public or private subnet).

Docker registries

A Docker registry is a storage and distribution system for Docker images. For EMR 6.0.0 (Beta), the following Docker registries can be configured:

  • Docker Hub: A public Docker registry containing over 100,000 popular Docker images.
  • Amazon ECR: A fully-managed Docker container registry that allows you to create your own custom images and host them in a highly available and scalable architecture.

Deployment considerations

Docker registries require network access from each host in the cluster, as each host downloads images from the Docker registry when your YARN application is running on the cluster. How you choose to deploy your EMR cluster (launching it into a public or private subnet) may limit your choice of Docker registry due to network connectivity requirements.

Public subnet

With EMR public subnet clusters, nodes running YARN NodeManager can directly access any registry available over the internet, such as Docker Hub, as shown in the following diagram.

Private Subnet

With EMR private subnet clusters, nodes running YARN NodeManager don't have direct access to the internet. Docker images can be hosted in Amazon ECR and accessed through AWS PrivateLink, as shown in the following diagram.

For details on how to use AWS PrivateLink to allow access to ECR in a private subnet scenario, see Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR.

Configuring Docker registries

Docker must be configured to trust the specific registry used to resolve Docker images. The default trust registries are local (private) and centos (on public Docker Hub). You can override docker.trusted.registries in /etc/hadoop/conf/container-executor.cfg to use other public repositories or ECR. To override this configuration, use the EMR Classification API with the container-executor classification key.

The following example shows how to configure the cluster to trust both a public repository (your-public-repo) and an ECR registry (123456789123.dkr.ecr.us-east-1.amazonaws.com). When using ECR, replace this endpoint with your specific ECR endpoint.  When using Docker Hub, please replace this repository name with your actual repository name.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos, your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                "docker.privileged-containers.registries": "local,centos, your-public-repo,123456789123.dkr.ecr.us-east-1.amazonaws.com"
            }
        }
    ]
  }
]

To launch an EMR 6.0.0 (Beta) cluster with this configuration using the AWS Command Line Interface (AWS CLI), create a file named container-executor.json with the contents of the preceding JSON configuration.  Then, use the following commands to launch the cluster:

$ export KEYPAIR=<Name of your Amazon EC2 key-pair>
$ export SUBNET_ID=<ID of the subnet to which to deploy the cluster>
$ export INSTANCE_TYPE=<Name of the instance type to use>
$ export REGION=<Region to which to deploy the cluster deployed>

$ aws emr create-cluster \
    --name "EMR-6-Beta Cluster" \
    --region $REGION \
    --release-label emr-6.0.0-beta \
    --applications Name=Hadoop Name=Spark \
    --service-role EMR_DefaultRole \
    --ec2-attributes KeyName=$KEYPAIR,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=$SUBNET_ID \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE InstanceGroupType=CORE,InstanceCount=2,InstanceType=$INSTANCE_TYPE \
    --configuration file://container-executor.json

Using ECR

If you’re new to ECR, follow the instructions in Getting Started with Amazon ECR and verify you have access to ECR from each instance in your EMR cluster.
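A quick way to confirm that an instance can reach ECR is to list your repositories from that instance. This is a sketch that assumes the instance profile attached to the cluster has ECR read permissions:

$ aws ecr describe-repositories --region us-east-1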

To access ECR using the docker command, you must first generate credentials. To make sure that YARN can access images from ECR, pass a reference to those generated credentials using the container environment variable YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG.

Run the following command on one of the core nodes to get the login line for your ECR account.

$ aws ecr get-login --region us-east-1 --no-include-email

The get-login command generates the correct Docker CLI command to run to create credentials. Copy and run the output from get-login.

$ sudo docker login -u AWS -p <password> https://<account-id>.dkr.ecr.us-east-1.amazonaws.com

This command generates a config.json file in the /root/.docker folder.  Copy this file to HDFS so that jobs submitted to the cluster can use it to authenticate to ECR.

Execute the commands below to copy the config.json file to your home directory.

$ mkdir -p ~/.docker
$ sudo cp /root/.docker/config.json ~/.docker/config.json
$ sudo chmod 644 ~/.docker/config.json

Execute the commands below to put the config.json in HDFS so it may be used by jobs running on the cluster.

$ hadoop fs -put ~/.docker/config.json /user/hadoop/

At this point, YARN can access ECR as a Docker image registry and pull containers during job execution.
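Optionally, you can confirm that the credentials file is in place before submitting jobs:

$ hadoop fs -ls /user/hadoop/config.json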

Using Spark with Docker

With EMR 6.0.0 (Beta), Spark applications can use Docker containers to define their library dependencies, instead of requiring dependencies to be installed on the individual Amazon EC2 instances in the cluster. This integration requires configuration of the Docker registry, and definition of additional parameters when submitting a Spark application.

When the application is submitted, YARN invokes Docker to pull the specified Docker image and run the Spark application inside of a Docker container. This allows you to easily define and isolate dependencies. It reduces the time spent bootstrapping or preparing instances in the EMR cluster with the libraries needed for job execution.

When using Spark with Docker, make sure that you consider the following:

  • The docker package and CLI are only installed on core and task nodes.
  • The spark-submit command should always be run from a master instance on the EMR cluster.
  • The Docker registries used to resolve Docker images must be defined using the Classification API with the container-executor classification key to define additional parameters when launching the cluster:
    • docker.trusted.registries
    • docker.privileged-containers.registries
  • To execute a Spark application in a Docker container, the following configuration options are necessary:
    • YARN_CONTAINER_RUNTIME_TYPE=docker
    • YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={DOCKER_IMAGE_NAME}
  • When using ECR to retrieve Docker images, you must configure the cluster to authenticate itself. To do so, you must use the following configuration option:
    • YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={DOCKER_CLIENT_CONFIG_PATH_ON_HDFS}
  • Mount the /etc/passwd file into the container so that the user running the job can be identified in the Docker container.
    • YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro
  • Any Docker image used with Spark must have Java installed in the Docker image.

Creating a Docker image

Docker images are created using a Dockerfile, which defines the packages and configuration to include in the image.  The following two example Dockerfiles use PySpark and SparkR.

PySpark Dockerfile

Docker images created from this Dockerfile include Python 3 and the numpy Python package.  This Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8.

FROM amazoncorretto:8

RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development

RUN yum list python3*
RUN yum -y install python3 python3-devel python3-pip python3-virtualenv

RUN python -V
RUN python3 -V

ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3

RUN pip3 install --upgrade pip
RUN pip3 install numpy pandas

RUN python3 -c "import numpy as np"

SparkR Dockerfile

Docker images created from this Dockerfile include R and the randomForest CRAN package. This Dockerfile includes Amazon Linux 2 and the Amazon Corretto JDK 8.

FROM amazoncorretto:8

RUN java -version

RUN yum -y update
RUN amazon-linux-extras enable R3.4

RUN yum -y install R R-devel openssl-devel
RUN yum -y install curl

#setup R configs
RUN echo "r <- getOption('repos'); r['CRAN'] <- 'http://cran.us.r-project.org'; options(repos = r);" > ~/.Rprofile

RUN Rscript -e "install.packages('randomForest')"

For more information on Dockerfile syntax, see the Dockerfile reference documentation.

Using Docker images from ECR

Amazon Elastic Container Registry (ECR) is a fully-managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. When using ECR, the cluster must be configured to trust your instance of ECR, and you must configure authentication in order for the cluster to use Docker images from ECR.

In this example, our cluster must be created with the following additional configuration, to ensure the ECR registry is trusted. Please replace the 123456789123.dkr.ecr.us-east-1.amazonaws.com endpoint with your ECR endpoint.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                "docker.privileged-containers.registries": "local,centos, 123456789123.dkr.ecr.us-east-1.amazonaws.com"
            }
        }
    ]
  }
]

Using PySpark with ECR

This example uses the PySpark Dockerfile. The image will be tagged and uploaded to ECR. Once uploaded, you run the PySpark job and reference the Docker image from ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the PySpark Dockerfile example.

First, create a directory and a Dockerfile for our example.

$ mkdir pyspark

$ vi pyspark/Dockerfile

Paste the contents of the PySpark Dockerfile, and then run the following command to build the Docker image.

$ sudo docker build -t local/pyspark-example pyspark/

Create the emr-docker-examples ECR repository for our examples.

$ aws ecr create-repository --repository-name emr-docker-examples

Tag and upload the locally built image to ECR, replacing 123456789123.dkr.ecr.us-east-1.amazonaws.com with your ECR endpoint.

$ sudo docker tag local/pyspark-example 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
$ sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example

Use SSH to connect to the master node and prepare a Python script with the filename main.py. Paste the following content into the main.py file and save it.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
sc = spark.sparkContext

import numpy as np
a = np.arange(15).reshape(3, 5)
print(a)

To submit the job, reference the name of the Docker image. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using ECR, YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the config.json file containing the credentials used to authenticate to ECR.

$ DOCKER_IMAGE_NAME=123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-example
$ DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
$ spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
--num-executors 2 \
main.py -v

When the job has completed, take note of the YARN application ID, and use the following command to obtain the output of the PySpark job.

$ yarn logs --applicationId application_id | grep -C2 '\[\['
LogLength:55
LogContents:
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

Using SparkR with ECR

This example uses the SparkR Dockerfile. The image will be tagged and uploaded to ECR. Once uploaded, you run the SparkR job and reference the Docker image from ECR.

After you launch the cluster, use SSH to connect to a core node and run the following commands to build the local Docker image from the SparkR Dockerfile example.

First, create a directory and the Dockerfile for this example.

$ mkdir sparkr

$ vi sparkr/Dockerfile

Paste the contents of the SparkR Dockerfile, and then run the following command to build the Docker image.

$ sudo docker build -t local/sparkr-example sparkr/

Tag and upload the locally built image to ECR, replacing 123456789123.dkr.ecr.us-east-1.amazonaws.com with your ECR endpoint.

$ sudo docker tag local/sparkr-example 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:sparkr-example
$ sudo docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:sparkr-example

Use SSH to connect to the master node and prepare an R script with the filename sparkR.R. Paste the following contents into the sparkR.R file.

library(SparkR)
sparkR.session(appName = "R with Spark example", sparkConfig = list(spark.some.config.option = "some-value"))

sqlContext <- sparkRSQL.init(spark.sparkContext)
library(randomForest)
# check release notes of randomForest
rfNews()

sparkR.session.stop()

To submit the job, reference the name of the Docker image. Define the additional configuration parameters to make sure that the job execution uses Docker as the runtime. When using ECR, YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG must reference the config.json file containing the credentials used to authenticate to ECR.

$ DOCKER_IMAGE_NAME=123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:sparkr-example
$ DOCKER_CLIENT_CONFIG=hdfs:///user/hadoop/config.json
$ spark-submit --master yarn \
--deploy-mode cluster \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE_NAME \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG=$DOCKER_CLIENT_CONFIG \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro \
sparkR.R

When the job has completed, note the YARN application ID, and use the following command to obtain the output of the SparkR job. This example checks that the randomForest library is installed and prints its version and release notes.

$ yarn logs --applicationId application_id | grep -B4 -A10 "Type rfNews"
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Wishlist (formerly TODO):

* Implement the new scheme of handling classwt in classification.

* Use more compact storage of proximity matrix.

* Allow case weights by using the weights in sampling?

========================================================================
Changes in 4.6-14:

Using a Docker image from Docker Hub

To use Docker Hub, you must deploy your cluster to a public subnet and configure it to use Docker Hub as a trusted registry. In this example, the cluster needs the following additional configuration to make sure that the your-public-repo repository on Docker Hub is trusted; replace this repository name with your actual repository. A sketch of the push and submit workflow follows the configuration.

[
  {
    "Classification": "container-executor",
    "Configurations": [
        {
            "Classification": "docker",
            "Properties": {
                "docker.trusted.registries": "local,centos,your-public-repo ",
                "docker.privileged-containers.registries": "local,centos,your-public-repo"
            }
        }
    ]
  }
]
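With this configuration in place, the workflow mirrors the ECR examples, except that you push the image to Docker Hub and can skip the YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG variable for a public repository. The following is a minimal sketch that treats your-public-repo as your Docker Hub account or organization name and reuses the PySpark image built earlier; adjust the names to match your repository:

$ sudo docker login
$ sudo docker tag local/pyspark-example your-public-repo/pyspark-example:latest
$ sudo docker push your-public-repo/pyspark-example:latest
$ DOCKER_IMAGE_NAME=your-public-repo/pyspark-example:latest

You can then run the same spark-submit command shown in the PySpark example, dropping the two DOCKER_CLIENT_CONFIG settings.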

Beta limitations

EMR 6.0.0 (Beta) focuses on helping you get value from using Docker with Spark to simplify dependency management. You can also use EMR 6.0.0 (Beta) to get familiar with Amazon Linux 2, and Amazon Corretto JDK 8.

EMR 6.0.0 (Beta) supports the following applications:

  • Spark 2.4.3
  • Livy 0.6.0
  • ZooKeeper 3.4.14
  • Hadoop 3.1.0

This beta release is supported in the following Regions:

  • US East (N. Virginia)
  • US West (Oregon)

The following EMR features are currently not available with this beta release:

  • Cluster integration with AWS Lake Formation
  • Native encryption of Amazon EBS volumes attached to an EMR cluster

Conclusion

In this post, you learned how to use an EMR 6.0.0 (Beta) cluster to run Spark jobs in Docker containers and integrate with both Docker Hub and ECR. You’ve seen examples of both PySpark and SparkR Dockerfiles.

The EMR team looks forward to hearing about how you’ve used this integration to simplify dependency management in your projects. If you have questions or suggestions, feel free to leave a comment.


About the Authors

Paul Codding is a senior product manager for EMR at Amazon Web Services.

 

 

 

 

Ajay Jadhav is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Rentao Wu is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Stephen Wu is a software development engineer for EMR at Amazon Web Services.

 

 

 

 

Extract Oracle OLTP data in real time with GoldenGate and query from Amazon Athena

Post Syndicated from Sreekanth Krishnavajjala original https://aws.amazon.com/blogs/big-data/extract-oracle-oltp-data-in-real-time-with-goldengate-and-query-from-amazon-athena/

This post describes how you can improve performance and reduce costs by offloading reporting workloads from an online transaction processing (OLTP) database to Amazon Athena and Amazon S3. The architecture described here lets you implement a reporting system and understand the data you receive by being able to query it as it arrives. In this solution:

  • Oracle GoldenGate generates a new row on the target for every change on the source to create Slowly Changing Dimension Type 2 (SCD Type 2) data.
  • Athena allows you to run ad hoc queries on the SCD Type 2 data.

Principles of a modern reporting solution

Advanced database solutions use a set of principles to help them build cost-effective reporting solutions. Some of these principles are:

  • Separate the reporting activity from the OLTP. This approach provides resource isolation and enables databases to scale for their respective workloads.
  • Use query engines running on top of distributed file systems like Hadoop Distributed File System (HDFS) and cloud object stores, such as Amazon S3. The advent of query engines that can run on top of open-source HDFS and cloud object stores further reduces the cost of implementing dedicated reporting systems.

Furthermore, you can use these principles when building reporting solutions:

  • To reduce licensing costs of the commercial databases, move the reporting activity to an open-source database.
  • Use a log-based, real-time change data capture (CDC) data-integration solution, which can replicate OLTP data from source systems, preferably in real-time mode, and provide a current view of the data. You can enable data replication between the source and the target reporting systems using database CDC solutions. Transaction log-based CDC solutions capture database changes noninvasively from the source database and replicate them to the target datastore or file systems.

Prerequisites

If you use GoldenGate with Kafka and are considering cloud migration, you can benefit from this post. This post also assumes prior knowledge of GoldenGate and does not detail steps to install and configure GoldenGate. Knowledge of Java and Maven is also assumed. Ensure that a VPC with three subnets is available for manual deployment.

Understanding the architecture of this solution

The following workflow diagram (Figure 1) illustrates the solution that this post describes:

  1. Amazon RDS for Oracle acts as the source.
  2. A GoldenGate CDC solution produces data for Amazon Managed Streaming for Apache Kafka (Amazon MSK). GoldenGate streams the database CDC data to the consumer, and Kafka topics in an MSK cluster receive the data from GoldenGate.
  3. The Apache Flink application running on Amazon EMR consumes the data and sinks it into an S3 bucket.
  4. Athena analyzes the data through queries. You can optionally run queries from Amazon Redshift Spectrum.

Data Pipeline

Figure 1

Amazon MSK is a fully managed service for Apache Kafka that makes it easy to provision Kafka clusters with a few clicks, without the need to provision servers or storage, or to configure Apache ZooKeeper manually. Kafka is an open-source platform for building real-time streaming data pipelines and applications.

Amazon RDS for Oracle is a fully managed database that frees up your time to focus on application development. It manages time-consuming database administration tasks, including provisioning, backups, software patching, monitoring, and hardware scaling.

GoldenGate is a real-time, log-based, heterogeneous database CDC solution. GoldenGate supports data replication from any supported database to various target databases or big data platforms like Kafka. GoldenGate’s ability to write the transactional data captured from the source in different formats, including delimited text, JSON, and Avro, enables seamless integration with a variety of BI tools. Each row has additional metadata columns including database operation type (Insert/Update/Delete).

Flink is an open-source, stream-processing framework with a distributed streaming dataflow engine for stateful computations over unbounded and bounded data streams. EMR supports Flink, letting you create managed clusters from the AWS Management Console. Flink also supports exactly-once semantics with the checkpointing feature, which is vital to ensure data accuracy when processing database CDC data. You can also use Flink to transform the streaming data row by row or in batches using windowing capabilities.

S3 is an object storage service with high scalability, data availability, security, and performance. You can run big data analytics across your S3 objects with AWS query-in-place services like Athena.

Athena is a serverless query service that makes it easy to query and analyze data in S3. With Athena and S3 as a data source, you define the schema and start querying using standard SQL. There's no need for complex ETL jobs to prepare your data for analysis, which makes it easy for anyone with SQL skills to analyze large-scale datasets quickly.

The following diagram shows a more detailed view of the data pipeline:

  1. RDS for Oracle runs in a Single-AZ.
  2. GoldenGate runs on an Amazon EC2 instance.
  3. The MSK cluster spans across three Availability Zones.
  4. A Kafka topic is set up in MSK.
  5. Flink runs on an EMR cluster.
  6. Producer Security Group for Oracle DB and GoldenGate instance.
  7. Consumer Security Group for EMR with Flink.
  8. Gateway endpoint for S3 private access.
  9. NAT Gateway to download software components on GoldenGate instance.
  10. S3 bucket and Athena.

For simplicity, this setup uses a single VPC with multiple subnets to deploy resources.

Figure 2

Configuring single-click deployment using AWS CloudFormation

The AWS CloudFormation template included in this post automates the deployment of the end-to-end solution that this blog post describes. The template provisions all required resources, including RDS for Oracle, MSK, EMR, and an S3 bucket, and also adds an EMR step with a JAR file to consume messages from the Kafka topic on MSK. Here are the steps to launch the template and test the solution:

  1. Launch the AWS CloudFormation template in the us-east-1 Region.
  2. After successful stack creation, obtain the GoldenGate hub server public IP from the Outputs tab of the CloudFormation stack.
  3. Log in to the GoldenGate hub server using the IP address from step 2 as ec2-user, and then switch to the oracle user: sudo su - oracle
  4. Connect to the source RDS for Oracle database using the sqlplus client and provide the password (source): $ sqlplus source@<source-database-alias>
  5. Generate database transactions using the SQL statements available in the oracle user’s home directory.
    SQL> @s
    
     SQL> @s1
    
     SQL> @s2

  6. Query the STOCK_TRADES table from the Amazon Athena console. After you commit transactions on the source database, it takes a few seconds for the changes to become available to Athena for querying. A sample query is shown after this list.
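For reference, a query like the following shows the most recent captured changes. You can run it from the Athena console, or through the AWS CLI as sketched here; the query results bucket is a placeholder and the table is assumed to be in the default database:

$ aws athena start-query-execution \
    --query-string "SELECT trade_id, ticker_symbol, units, unit_price, op_type FROM stock_trades ORDER BY trade_date DESC LIMIT 10" \
    --query-execution-context Database=default \
    --result-configuration OutputLocation=s3://<your-athena-query-results-bucket>/ \
    --region us-east-1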

Manually deploying components

The following steps describe the configurations required to stream Oracle-changed data to MSK and sink it to an S3 bucket using Flink running on EMR. You can then query the S3 bucket using Athena. If you deployed the solution using AWS CloudFormation as described in the previous step, skip to the Testing the solution section.

 

  1. Prepare an RDS source database for CDC using GoldenGate. The RDS source database version is Enterprise Edition 12.1.0.2.14. For instructions on configuring the RDS database, see Using Oracle GoldenGate with Amazon RDS. This post does not consider capturing data definition language (DDL).
  2. Configure an EC2 instance for the GoldenGate hub server. Configure the GoldenGate hub server using the Oracle Linux server 7.6 (ami-b9c38ad3) image in the us-east-1 Region. The GoldenGate hub server runs the GoldenGate extract process that extracts changes in real time from the database transaction log files. The server also runs a replicat process that publishes database changes to MSK. The GoldenGate hub server requires the following software components:
  • Java JDK 1.8.0 (required for GoldenGate big data adapter).
  • GoldenGate for Oracle (12.3.0.1.4) and GoldenGate for big data adapter (12.3.0.1).
  • Kafka 1.1.1 binaries (required for GoldenGate big data adapter classpath).
  • An IAM role attached to the GoldenGate hub server to allow access to the MSK cluster for GoldenGate processes running on the hub server. Use the GoldenGate (12.3.0) documentation to install and configure GoldenGate for the Oracle database. The GoldenGate Integrated Extract parameter file is eora2msk.prm.
    EXTRACT eora2msk
    SETENV (NLSLANG=AL32UTF8)
    
    USERID ggadmin@<source-database-alias>, password ggadmin
    TRANLOGOPTIONS INTEGRATEDPARAMS (max_sga_size 256)
    EXTTRAIL /u01/app/oracle/product/ogg/dirdat/or
    LOGALLSUPCOLS
    
    TABLE SOURCE.STOCK_TRADES;

    The logallsupcols extract parameter ensures that a full database table row is generated for every DML operation on the source, including updates and deletes.

  3. Create a Kafka cluster using MSK and configure a Kafka topic. You can create the MSK cluster from the AWS Management Console, using the AWS CLI, or through an AWS CloudFormation template.
  • Use the list-clusters command to obtain a ClusterArn and a Zookeeper connection string after creating the cluster. You need this information to configure the GoldenGate big data adapter and Flink consumer. The following code illustrates the commands to run:
    $aws kafka list-clusters --region us-east-1
    {
        "ClusterInfoList": [
            {
                "EncryptionInfo": {
                    "EncryptionAtRest": {
                        "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:xxxxxxxxxxxx:key/717d53d8-9d08-4bbb-832e-de97fadcaf00"
                    }
                }, 
                "BrokerNodeGroupInfo": {
                    "BrokerAZDistribution": "DEFAULT", 
                    "ClientSubnets": [
                        "subnet-098210ac85a046999", 
                        "subnet-0c4b5ee5ff5ef70f2", 
                        "subnet-076c99d28d4ee87b4"
                    ], 
                    "StorageInfo": {
                        "EbsStorageInfo": {
                            "VolumeSize": 1000
                        }
                    }, 
                    "InstanceType": "kafka.m5.large"
                }, 
                "ClusterName": "mskcluster", 
                "CurrentBrokerSoftwareInfo": {
                    "KafkaVersion": "1.1.1"
                }, 
                "CreationTime": "2019-01-24T04:41:56.493Z", 
                "NumberOfBrokerNodes": 3, 
                "ZookeeperConnectString": "10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181", 
                "State": "ACTIVE", 
                "CurrentVersion": "K13V1IB3VIYZZH", 
                "ClusterArn": "arn:aws:kafka:us-east-1:xxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3", 
                "EnhancedMonitoring": "DEFAULT"
            }
        ]
    }

  • Obtain the IP addresses of the Kafka broker nodes by using the ClusterArn.
    $aws kafka get-bootstrap-brokers --region us-east-1 --cluster-arn arn:aws:kafka:us-east-1:xxxxxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3
    {
        "BootstrapBrokerString": "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092"
    }

  • Create a Kafka topic. The solution in this post uses the same name as the table name for the Kafka topic.
    ./kafka-topics.sh --create --zookeeper 10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181 --replication-factor 3 --partitions 1 --topic STOCK_TRADES

  4. Provision an EMR cluster with Flink. Create an EMR 5.25 cluster with Flink 1.8.0 (using the advanced options when creating the cluster), and enable SSH access to the master node. Create and attach a role to the EMR master node so that Flink consumers can access the Kafka topic in the MSK cluster.
  5. Configure the Oracle GoldenGate big data adapter for Kafka on the GoldenGate hub server. Download and install the Oracle GoldenGate big data adapter (12.3.0.1.0) using the Oracle GoldenGate download link. For more information, see the Oracle GoldenGate 12c (12.3.0.1) installation documentation. The following is the GoldenGate producer property file for Kafka (custom_kafka_producer.properties):
    #Bootstrap broker string obtained from Step 3
    bootstrap.servers= 10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092
    #bootstrap.servers=localhost:9092
    acks=1
    reconnect.backoff.ms=1000
    value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    # 16 KB batch size per partition
    batch.size=16384
    linger.ms=0

    The following is the GoldenGate properties file for Kafka (Kafka.props):

    gg.handlerlist = kafkahandler
    gg.handler.kafkahandler.type=kafka
    gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
    #The following resolves the topic name using the short table name
    #gg.handler.kafkahandler.topicName=SOURCE
    gg.handler.kafkahandler.topicMappingTemplate=${tableName}
    #The following selects the message key using the concatenated primary keys
    gg.handler.kafkahandler.keyMappingTemplate=${primaryKeys}
    gg.handler.kafkahandler.format=json_row
    #gg.handler.kafkahandler.format=delimitedtext
    #gg.handler.kafkahandler.SchemaTopicName=mySchemaTopic
    #gg.handler.kafkahandler.SchemaTopicName=oratopic
    gg.handler.kafkahandler.BlockingSend =false
    gg.handler.kafkahandler.includeTokens=false
    gg.handler.kafkahandler.mode=op
    goldengate.userexit.writers=javawriter
    javawriter.stats.display=TRUE
    javawriter.stats.full=TRUE
    
    gg.log=log4j
    #gg.log.level=INFO
    gg.log.level=DEBUG
    gg.report.time=30sec
    gg.classpath=dirprm/:/home/oracle/kafka/kafka_2.11-1.1.1/libs/*
    
    javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

    The following is the GoldenGate replicat parameter file (rkafka.prm):

    REPLICAT rkafka
    -- Trail file for this example is located in "AdapterExamples/trail" directory
    -- Command to add REPLICAT
    -- add replicat rkafka, exttrail AdapterExamples/trail/tr
    TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.props
    REPORTCOUNT EVERY 1 MINUTES, RATE
    GROUPTRANSOPS 10000
    MAP SOURCE.STOCK_TRADES, TARGET SOURCE.STOCK_TRADES;

  6. Create an S3 bucket and directory with a table name underneath for Flink to store (sink) Oracle CDC data.
  7. Configure a Flink consumer to read from the Kafka topic and write the CDC data to an S3 bucket. For instructions on setting up a Flink project using the Maven archetype, see Flink Project Build Setup. The following code example is the pom.xml file used with the Maven project. For more information, see Getting Started with Maven.
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
    
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-quickstart-java</artifactId>
      <version>1.8.0</version>
      <packaging>jar</packaging>
    
      <name>flink-quickstart-java</name>
      <url>http://www.example.com</url>
    
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <slf4j.version>@slf4j.version@</slf4j.version>
    <log4j.version>@log4j.version@</log4j.version>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
      </properties>
    
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.8.0</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.11</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
         <groupId>org.apache.flink</groupId>
         <artifactId>flink-connector-filesystem_2.11</artifactId>
         <version>1.8.0</version>
        </dependency>
    
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>1.8.0</version>
            <scope>compile</scope>
        </dependency>
         <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-s3-fs-presto</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
       <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-clients_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
    
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
    
        <dependency>
          <groupId>com.typesafe.akka</groupId>
          <artifactId>akka-actor_2.11</artifactId>
          <version>2.4.20</version>
        </dependency>
        <dependency>
           <groupId>com.typesafe.akka</groupId>
           <artifactId>akka-protobuf_2.11</artifactId>
           <version>2.4.20</version>
        </dependency>
</dependencies>

<build>
      <plugins>
         <plugin>
             <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.2.1</version>
                <executions>
                       <execution>
                          <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                             </goals>
                           <configuration>
                          <artifactSet>
                      <excludes>
    
                             <!-- Excludes here -->
                               </excludes>
    </artifactSet>
                    <filters>
                            <filter>
                                                                                     <artifact>org.apache.flink:*</artifact>
                            </filter>
                       </filters>
                 <transformers>
                   <!-- add Main-Class to manifest file -->
                                                                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                                                            <mainClass>flinkconsumer.flinkconsumer</mainClass>
                   </transformer>
                                                                             <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                                                         <resource>reference.conf</resource>
                                                                    </transformer>
                                                            </transformers>
                    <relocations>
                          <relocation>
                                                                <pattern>org.codehaus.plexus.util</pattern>
                                                                  <shadedPattern>org.shaded.plexus.util</shadedPattern>
                        <excludes>
                                                                      <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                                                                      <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                                                                  </excludes>
                                                               </relocation>
                                                            </relocations>
                                                            <createDependencyReducedPom>false</createDependencyReducedPom>
                                                    </configuration>
                                            </execution>
                                    </executions>
                            </plugin>
    <!-- Add the main class as a manifest entry -->
                            <plugin>
                                    <groupId>org.apache.maven.plugins</groupId>
                                    <artifactId>maven-jar-plugin</artifactId>
                                    <version>2.5</version>
                                    <configuration>
                                            <archive>
                                                    <manifestEntries>
                                                            <Main-Class>flinkconsumer.flinkconsumer</Main-Class>
                                                    </manifestEntries>
                                            </archive>
           </configuration>
                            </plugin>
    
                            <plugin>
                                    <groupId>org.apache.maven.plugins</groupId>
                                    <artifactId>maven-compiler-plugin</artifactId>
                                    <version>3.1</version>
                                    <configuration>
                                            <source>1.7</source>
                                            <target>1.7</target>
                                    </configuration>
                            </plugin>
                    </plugins>
    
    </build>
    <profiles>
                    <profile>
                            <id>build-jar</id>
                            <activation>
                                    <activeByDefault>false</activeByDefault>
                            </activation>
                    </profile>
            </profiles>
    
    
    </project>

    Compile the following Java program using mvn clean install and generate the JAR file:

    package flinkconsumer;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.util.serialization.DeserializationSchema;
    import org.apache.flink.streaming.util.serialization.SerializationSchema;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.slf4j.LoggerFactory;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import akka.actor.ActorSystem;
    import akka.stream.ActorMaterializer;
    import akka.stream.Materializer;
    import com.typesafe.config.Config;
    import org.apache.flink.streaming.connectors.fs.*;
    import org.apache.flink.streaming.api.datastream.*;
    import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
    import java.util.stream.Collectors;
    import java.util.Arrays;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import java.io.*;
    import java.net.BindException;
    import java.util.*;
    import java.util.Map.*;
    import java.util.Arrays;
    
    public class flinkconsumer{
    
        public static void main(String[] args) throws Exception {
            // create Streaming execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setBufferTimeout(1000);
            env.enableCheckpointing(5000);
            Properties properties = new Properties();
            properties.setProperty("bootstrap.servers", "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092");
            properties.setProperty("group.id", "flink");
            properties.setProperty("client.id", "demo1");
    
            // Consume records from the STOCK_TRADES topic on the Kafka cluster
            DataStream<String> message = env.addSource(new FlinkKafkaConsumer<>("STOCK_TRADES", new SimpleStringSchema(), properties));
            // Keep checkpoint state on the EMR cluster's HDFS
            env.setStateBackend(new FsStateBackend("hdfs://ip-10-0-3-12.ec2.internal:8020/flink/checkpoints"));
    
            RollingSink<String> sink = new RollingSink<String>("s3://flink-stream-demo/STOCK_TRADES");
            // sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd-HHmm"));
            // The bucket part file size in bytes.
            sink.setBatchSize(400);
            message.map(new MapFunction<String, String>() {
                private static final long serialVersionUID = -6867736771747690202L;
                @Override
                public String map(String value) throws Exception {
                    //return " Value: " + value;
                    return value;
                }
            }).addSink(sink).setParallelism(1);
            env.execute();
        }
    }

    Log in as a Hadoop user to an EMR master node, start Flink, and execute the JAR file:

    $ /usr/bin/flink run ./flink-quickstart-java-1.7.0.jar

  5. Create the stock_trades table from the Athena console. Each JSON document must be on a new line.
    CREATE EXTERNAL TABLE `stock_trades`(
      `trade_id` string COMMENT 'from deserializer', 
      `ticker_symbol` string COMMENT 'from deserializer', 
      `units` int COMMENT 'from deserializer', 
      `unit_price` float COMMENT 'from deserializer', 
      `trade_date` timestamp COMMENT 'from deserializer', 
      `op_type` string COMMENT 'from deserializer')
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
      's3://flink-cdc-demo/STOCK_TRADES'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'transient_lastDdlTime'='1561051196')

    For more information, see Hive JSON SerDe.

Testing the solution

To test that the solution works, complete the following steps:

  1. Log in to the source RDS instance from the GoldenGate hub server and perform insert, update, and delete operations on the stock_trades table:
    $sqlplus [email protected]
    SQL> insert into stock_trades values(6,'NEW',29,75,sysdate);
    SQL> update stock_trades set units=999 where trade_id=6;
    SQL> insert into stock_trades values(7,'TEST',30,80,SYSDATE);
    SQL> insert into stock_trades values (8,'XYZC', 20, 1800, sysdate);
    SQL> commit;

  2. Monitor the GoldenGate capture from the source database using the following stats command:
    [[email protected] 12.3.0]$ pwd
    /u02/app/oracle/product/ogg/12.3.0
    [[email protected] 12.3.0]$ ./ggsci
    
    Oracle GoldenGate Command Interpreter for Oracle
    Version 12.3.0.1.4 OGGCORE_12.3.0.1.0_PLATFORMS_180415.0359_FBO
    Linux, x64, 64bit (optimized), Oracle 12c on Apr 16 2018 00:53:30
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats eora2msk

  3. Monitor the GoldenGate replicat to a Kafka topic with the following:
    [[email protected] 12.3.0]$ pwd
    /u03/app/oracle/product/ogg/bdata/12.3.0
    [[email protected] 12.3.0]$ ./ggsci
    
    Oracle GoldenGate for Big Data
    Version 12.3.2.1.1 (Build 005)
    
    Oracle GoldenGate Command Interpreter
    Version 12.3.0.1.2 OGGCORE_OGGADP.12.3.0.1.2_PLATFORMS_180712.2305
    Linux, x64, 64bit (optimized), Generic on Jul 13 2018 00:46:09
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats rkafka

  4. Query the stock_trades table using the Athena console; a programmatic alternative using the AWS SDK is sketched below.
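
If you prefer to run the same check outside the console, the following is a minimal sketch that submits a query through the AWS SDK for Python (boto3). The database name and the results bucket are assumptions; substitute the values used in your environment:

import boto3

# Hedged sketch: submit the query to Athena and print the execution ID.
athena = boto3.client('athena')

response = athena.start_query_execution(
    QueryString="SELECT ticker_symbol, op_type, count(*) AS trades "
                "FROM stock_trades GROUP BY ticker_symbol, op_type",
    QueryExecutionContext={'Database': 'default'},          # assumed database name
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'}  # placeholder bucket
)
print(response['QueryExecutionId'])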

Summary

This post illustrates how you can offload reporting activity to Athena with S3 to reduce reporting costs and improve OLTP performance on the source database. It serves as a guide for setting up the solution in a staging environment.

Deploying this solution in a production environment may require additional considerations, for example, high availability of GoldenGate hub servers, different file encoding formats for optimal query performance, and security considerations. Additionally, you can achieve similar outcomes using technologies like AWS Database Migration Service instead of GoldenGate for database CDC and Kafka Connect for the S3 sink.

About the Authors

Sreekanth Krishnavajjala is a solutions architect at Amazon Web Services.

Vinod Kataria is a senior partner solutions architect at Amazon Web Services.

Creating custom Pinpoint dashboards using Amazon QuickSight, part 3

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/creating-custom-pinpoint-dashboards-using-amazon-quicksight-part-3/

Note: This post was written by Manan Nayar and Aprajita Arora, Software Development Engineers on the AWS Digital User Engagement team.


This is the third and final post in our series about creating custom visualizations of your Amazon Pinpoint metrics using Amazon QuickSight.

In our first post, we used the Metrics APIs to retrieve specific Key Performance Indicators (KPIs), and then created visualizations using QuickSight. In the second post, we used the event stream feature in Amazon Pinpoint to enable more in-depth analyses.

The examples in the first two posts used Amazon S3 to store the metrics that we retrieved from Amazon Pinpoint. This post takes a different approach, using Amazon Redshift to store the data. By using Redshift, you gain the ability to create visualizations over large data sets. This approach is useful when you have a large volume of event data, or when you need to store your data for long periods of time.

Step 1: Provision the storage

The first step in setting up this solution is to create the destinations where you’ll store the Amazon Pinpoint event data. Since we’ll be storing the data in Amazon Redshift, we need to create a new Redshift cluster. We’ll also create an S3 bucket, which will house the original event data that’s streamed from Amazon Pinpoint.

To create the Redshift cluster and the S3 bucket

  1. Create a new Redshift cluster. To learn more, see the Amazon Redshift Getting Started Guide.
  2. Create a new table in the Redshift cluster that contains the appropriate columns. Use the following query to create the table:
    create table if not exists pinpoint_events_table(
      rowid varchar(255) not null,
      project_key varchar(100) not null,
      event_type varchar(100) not null,
      event_timestamp timestamp not null,
      campaign_id varchar(100),
      campaign_activity_id varchar(100),
      treatment_id varchar(100),
      PRIMARY KEY (rowid)
    );
  3. Create a new Amazon S3 bucket. For complete procedures, see Create a Bucket in the Amazon S3 Getting Started Guide.

Step 2: Set up the event stream

This example uses the event stream feature of Amazon Pinpoint to send event data to S3. Later, we’ll create a Lambda function that sends the event data to your Redshift cluster when new event data is added to the S3 bucket. This method lets us store the original event data in S3, and transfer a subset of that data to Redshift for analysis.

To set up the event stream

  1. Sign in to the Amazon Pinpoint console at http://console.aws.amazon.com/pinpoint. In the list of projects, choose the project that you want to enable event streaming for.
  2. Under Settings, choose Event stream.
  3. Choose Edit, and then configure the event stream to use Amazon Kinesis Data Firehose. If you don’t already have a Kinesis Data Firehose stream, follow the link to create one in the Kinesis console. Configure the stream to send data to an S3 bucket. For more information about creating streams, see Creating an Amazon Kinesis Data Firehose Delivery Stream.
  4. Under IAM role, choose Automatically create a role. Choose Save. If you prefer to configure the event stream programmatically, see the sketch after these steps.
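
The console steps above are all you need, but the same configuration can also be applied with the AWS SDK. The following is a minimal sketch using boto3; the project ID, delivery stream ARN, and IAM role ARN are placeholders for values from your own account:

import boto3

# Hedged sketch: point the Pinpoint event stream at an existing
# Kinesis Data Firehose delivery stream.
pinpoint = boto3.client('pinpoint')

pinpoint.put_event_stream(
    ApplicationId='your-pinpoint-project-id',
    WriteEventStream={
        'DestinationStreamArn': 'arn:aws:firehose:us-east-1:111122223333:deliverystream/pinpoint-events',
        'RoleArn': 'arn:aws:iam::111122223333:role/pinpoint-event-stream-role'
    }
)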

Step 3: Create the Lambda function

In this section, you create a Lambda function that processes the raw event stream data, and then writes it to a table in your Redshift cluster.

To create the Lambda function

  1. Download the psycopg2 binary from https://github.com/jkehler/awslambda-psycopg2. This Python library lets you interact with PostgreSQL databases, such as Amazon Redshift. It contains certain libraries that aren’t included in Lambda.
    • Note: This Github repository is not an official AWS-managed repository.
  2. Within the awslambda-psycopg2-master folder, you’ll find a folder called psycopg2-37. Rename the folder to psycopg2 (you may need to delete the existing folder with that name), and then compress the entire folder to a .zip file.
  3. Create a new Lambda function from scratch, using the Python 3.7 runtime.
  4. Upload the psycopg2.zip file that you created in step 2 to Lambda.
  5. In Lambda, create a new function called lambda_function.py. Paste the following code into the function:
    import datetime
    import json
    import re
    import uuid
    import os
    import boto3
    import psycopg2
    from psycopg2 import Error
    
    cluster_redshift = "<clustername>"
    dbname_redshift = "<dbname>"
    user_redshift = "<username>"
    password_redshift = "<password>"
    endpoint_redshift = "<endpoint>"
    port_redshift = "5439"
    table_redshift = "pinpoint_events_table"
    
    # Get the file that contains the event data from the appropriate S3 bucket.
    def get_file_from_s3(bucket, key):
        s3 = boto3.client('s3')
        obj = s3.get_object(Bucket=bucket, Key=key)
        text = obj["Body"].read().decode()
    
        return text
    
    # If the object that we retrieve contains newline-delineated JSON, split it into
    # multiple objects.
    def clean_and_split(json_raw):
        json_delimited = re.sub('}\s{','}---X-DELIMITER---{',json_raw)
        json_clean = re.sub('\s+','',json_delimited)
        data = json_clean.split("---X-DELIMITER---")
    
        return data
    
    # Set all of the variables that we'll use to create the new row in Redshift.
    def set_variables(in_json):
    
        for line in in_json:
            content = json.loads(line)
            app_id = content['application']['app_id']
            event_type = content['event_type']
            event_timestamp = datetime.datetime.fromtimestamp(content['event_timestamp'] / 1e3).strftime('%Y-%m-%d %H:%M:%S')
    
            if (content['attributes'].get('campaign_id') is None):
                campaign_id = ""
            else:
                campaign_id = content['attributes']['campaign_id']
    
            if (content['attributes'].get('campaign_activity_id') is None):
                campaign_activity_id = ""
            else:
                campaign_activity_id = content['attributes']['campaign_activity_id']
    
            if (content['attributes'].get('treatment_id') is None):
                treatment_id = ""
            else:
                treatment_id = content['attributes']['treatment_id']
    
            write_to_redshift(app_id, event_type, event_timestamp, campaign_id, campaign_activity_id, treatment_id)
                
    # Write the event stream data to the Redshift table.
    def write_to_redshift(app_id, event_type, event_timestamp, campaign_id, campaign_activity_id, treatment_id):
        row_id = str(uuid.uuid4())
    
        query = ("INSERT INTO " + table_redshift + "(rowid, project_key, event_type, "
                + "event_timestamp, campaign_id, campaign_activity_id, treatment_id) "
                + "VALUES ('" + row_id + "', '"
                + app_id + "', '"
                + event_type + "', '"
                + event_timestamp + "', '"
                + campaign_id + "', '"
                + campaign_activity_id + "', '"
                + treatment_id + "');")
    
        # Initialize so the finally block can safely check the connection.
        conn = None
        try:
            conn = psycopg2.connect(user = user_redshift,
                                    password = password_redshift,
                                    host = endpoint_redshift,
                                    port = port_redshift,
                                    database = dbname_redshift)
    
            cur = conn.cursor()
            cur.execute(query)
            conn.commit()
            print("Updated table.")
    
        except (Exception, psycopg2.DatabaseError) as error :
            print("Database error: ", error)
        finally:
            if (conn):
                cur.close()
                conn.close()
                print("Connection closed.")
    
    # Handle the event notification that we receive when a new item is sent to the 
    # S3 bucket.
    def lambda_handler(event,context):
        print("Received event: \n" + str(event))
    
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        data = get_file_from_s3(bucket, key)
    
        in_json = clean_and_split(data)
    
        set_variables(in_json)

    In the preceding code, make the following changes (a parameterized version of the INSERT statement is sketched after this procedure):

    • Replace <clustername> with the name of the cluster.
    • Replace <dbname> with the name of the database.
    • Replace <username> with the user name that you specified when you created the Redshift cluster.
    • Replace <password> with the password that you specified when you created the Redshift cluster.
    • Replace <endpoint> with the endpoint address of the Redshift cluster.
  6. In IAM, update the execution role that’s associated with the Lambda function to include the GetObject permission for the S3 bucket that contains the event data. For more information, see Editing IAM Policies in the AWS IAM User Guide.
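
A note on the INSERT statement in the write_to_redshift function: building SQL by concatenating strings breaks as soon as a value contains a single quote, and it is generally unsafe. A minimal sketch of a parameterized alternative (same table and columns as above; psycopg2 substitutes the %s placeholders safely) looks like this:

query = ("INSERT INTO " + table_redshift +
         " (rowid, project_key, event_type, event_timestamp,"
         " campaign_id, campaign_activity_id, treatment_id)"
         " VALUES (%s, %s, %s, %s, %s, %s, %s);")

# psycopg2 quotes and escapes each value for you.
cur.execute(query, (row_id, app_id, event_type, event_timestamp,
                    campaign_id, campaign_activity_id, treatment_id))
conn.commit()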

Step 4: Set up notifications on the S3 bucket

Now that we’ve created the Lambda function, we’ll set up a notification on the S3 bucket. In this case, the notification will refer to the Lambda function that we created in the previous section. Every time a new file is added to the bucket, the notification will cause the Lambda function to run.

To create the event notification

  1. In S3, create a new bucket notification. The notification should be triggered when PUT events occur, and should trigger the Lambda function that you created in the previous section (a programmatic equivalent is sketched after this procedure). For more information about creating notifications, see Configuring Amazon S3 Event Notifications in the Amazon S3 Developer Guide.
  2. Test the event notification by sending a test campaign. If you send an email campaign, your Redshift database should contain events such as _campaign.send, _email.send, _email.delivered, and others. You can check the contents of the Redshift table by running the following query in the Query Editor in the Redshift console:
    select * from pinpoint_events_table;
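
If you would rather script the notification, the following is a minimal boto3 sketch. The bucket name and Lambda function ARN are placeholders, and the Lambda function's resource policy must also allow Amazon S3 to invoke it:

import boto3

# Hedged sketch: invoke the Lambda function whenever an object is PUT
# into the event-stream bucket.
s3 = boto3.client('s3')

s3.put_bucket_notification_configuration(
    Bucket='your-pinpoint-event-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111122223333:function:pinpoint-events-to-redshift',
            'Events': ['s3:ObjectCreated:Put']
        }]
    }
)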

Step 5: Add the data set in Amazon QuickSight

If your Lambda function is sending event data to Redshift as expected, you can use your Redshift database to create a new data set in Amazon QuickSight. QuickSight includes an automatic database discovery feature that helps you add your Redshift database as a data set with only a few clicks. For more information, see Creating a Data Set from a Database in the Amazon QuickSight User Guide.

Step 6: Create your visualizations

Now that QuickSight is retrieving information from your Redshift database, you can use that data to create visualizations. To learn more about creating visualizations in QuickSight, see Creating an Analysis in the Amazon QuickSight User Guide.

This brings us to the end of our series. While these posts focused on using Amazon QuickSight to visualize your analytics data, you can also use these same techniques to create visualizations using third-party applications. We hope you enjoyed this series, and we can't wait to see what you build using these examples!

How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime

Post Syndicated from Guillermo Menendez Corral original https://aws.amazon.com/blogs/big-data/how-to-migrate-from-ibm-netezza-to-amazon-redshift-with-no-downtime/

A large EMEA company recently decided to migrate their on-premises IBM Netezza data warehouse to Amazon Redshift. Given the volume of data (270TB uncompressed and more than 27K tables), the number of interconnected systems that made use of this data (more than 4.5K business processes), and the zero downtime requirements, we understood that the project would be a challenge. Since the company planned to decommission the data center where the data warehouse was deployed in less than a year’s time, there were also time constraints in place.

The data warehouse is a central piece for the company; it allows users across units to gather data and generate the daily reports required to run the business. In just a few years, business units accessing the cluster increased almost 3x, with 5x the initial number of users, executing 50x the number of daily queries for which the cluster had been designed. The legacy data warehouse was not able to scale to cover their business needs anymore, resulting in nightly ETL processes running outside of their time boundaries, and live queries taking too long.

The general dissatisfaction among the business users — along with the proximity of the data center decommissioning — moved the company to plan the migration, putting its IT department in charge of the definition of the new architecture, and the execution of the project.

Amazon Redshift, Amazon Web Services’ (AWS) fast, scalable OLAP data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake, was the perfect fit to solve their problems. Not only does Amazon Redshift provide full elasticity for future growth, and features such as concurrency scaling to cover high demand peaks, it also offers a whole ecosystem of analytics services to be easily integrated.

In this article, we explain how this customer performed a large-scale data warehouse migration from IBM Netezza to Amazon Redshift without downtime, by following a thoroughly planned migration process, and leveraging AWS Schema Conversion Tool (SCT) and Amazon Redshift best practices.

Preparing the migration

Large enterprise customers typically use data warehouse systems as a central repository for data analytics, aggregating operational data from heterogeneous transactional databases and running analytical workloads to serve analyst teams through business intelligence applications and reports. Using AWS, customers can benefit from the flexibility of having multiple compute resources processing different workloads, each workload scaling as the demand grows.

In this section, we describe the steps that we followed to prepare the migration of this data warehouse from IBM Netezza to Amazon Redshift.

Identifying workloads and dependencies

Customers typically have three different types of workloads running in their data warehouses:

  1. Batch processes: Long-running processes that require many resources and low concurrency, such as ETL jobs.
  2. Ad hoc queries: Short queries with high concurrency, such as analysts querying data.
  3. Business workloads: Typically mixed workloads, such as BI applications, reports, and dashboards.

In this customer’s case, they were building business data marts through complex ETL jobs, running statistical models, generating reports, and allowing analysts to run ad hoc queries. In essence, these applications are divided into two groups of workloads: batch and ad hoc queries. The on-premises platform was always saturated and struggling to deal with the level of concurrency demanded by these workloads while offering acceptable performance.

The following diagram shows the architecture that customer had before the migration:

By using Amazon Redshift, the customer is able to fulfill the requirements of every analytical workload. Within the old data warehouse, shown in the above diagram, two different managed service providers managed two sets of independent data and workloads. For the new architecture, the decision was to split the data warehouse into two different Amazon Redshift clusters to serve those different workloads, as described in the following section. Within each of these clusters, the customer is able to manage resources and concurrency for different applications under the same workload by configuring Amazon Redshift Workload Management (WLM). A typical WLM setup is to match every workload with a queue, so in this case, each cluster had two queues configured: batch and ad hoc, each with a different number of slots and assigned memory.
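
To make the queue setup concrete, the following is a minimal sketch of applying a two-queue WLM layout to a cluster parameter group with boto3. The parameter group name, query group labels, slot counts, and memory percentages are illustrative assumptions, not the customer's actual configuration:

import json
import boto3

# Hedged sketch: one queue for batch ETL, one for ad hoc analyst queries.
wlm_config = [
    {"query_group": ["batch"], "query_concurrency": 5, "memory_percent_to_use": 60},
    {"query_group": ["adhoc"], "query_concurrency": 15, "memory_percent_to_use": 40}
]

redshift = boto3.client('redshift')
redshift.modify_cluster_parameter_group(
    ParameterGroupName='primary-cluster-params',
    Parameters=[{
        'ParameterName': 'wlm_json_configuration',
        'ParameterValue': json.dumps(wlm_config)
    }]
)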

Sizing the target architecture

For heterogeneous migrations, like this one, a comprehensive analysis should be performed on the source database, collecting enough data to design the new architecture that supports both data and applications.

The AWS Schema Conversion Tool was the perfect companion for this project, as the customer was able to automate the reports and generate an assessment that helped estimate the migration complexity for different objects, e.g. data types, UDFs, and stored procedures.

In a typical database migration, customers categorize large tables by number of rows. However, when migrating to columnar databases, such as Amazon Redshift, it is essential to also assess table width (that is, number of columns) from the very beginning. While columnar databases are generally more efficient than row-based databases in storing data, wide tables with few rows may have a negative impact on columnar databases. To estimate the minimum table size required for each table in Amazon Redshift, use this formula from the AWS Knowledge Center.
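
As an illustration of why table width matters, the following sketch applies the minimum-table-size estimate as we understand it from that Knowledge Center article (1 MB block size multiplied by user columns plus three hidden system columns, by populated slices, and by table segments); treat the constants as assumptions and confirm them against the article before relying on the numbers:

# Hedged sketch: estimated minimum on-disk footprint of a table in Amazon Redshift.
def min_table_size_mb(user_columns, populated_slices, sorted_table=True):
    block_size_mb = 1                    # assumed 1 MB block size
    system_columns = 3                   # assumed hidden system columns per table
    segments = 2 if sorted_table else 1  # assumed: sorted tables store two segments
    return block_size_mb * (user_columns + system_columns) * populated_slices * segments

# A 150-column sorted table on a 16-slice cluster occupies at least ~4.8 GB,
# even if it only holds a handful of rows.
print(min_table_size_mb(user_columns=150, populated_slices=16))  # 4896 (MB)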

For this customer, there was a clear separation between core applications, and a large isolated business application with minimum data dependencies and a different set of users. As a result, one of the main architectural decisions was to use two different Amazon Redshift clusters:

  • Primary cluster: Holds the core schemas and most of the data, and serves most of the business applications. Due to high storage requirements and long batch processes that run here, the recommended Amazon Redshift node type for this cluster is the dense storage family.
  • Secondary cluster: Purpose-built cluster for a single application that demands high I/O. The recommended Amazon Redshift node type for this cluster is the dense compute family.

Planning the migration

There are several approaches when it comes to database migration. A must-have requirement for many migrations is to minimize downtime, which was the main driver for the migration pattern described in this post.

One of the challenges when migrating databases is to keep the data updated in both systems, capturing the changes on the source and applying them to the destination database during the migration. By definition, data warehouses shouldn't be used to run transactional (OLTP) workloads, but rather long-running ETL processes and analytical (OLAP) workloads. Usually, those ETL processes update data in batches, typically on a daily basis. This simplifies the migration: when the ETL processes that load data into the data warehouse run in parallel against both target systems during the migration, change data capture (CDC) is not required.

The following image summarizes how we planned this migration with the customer, following a parallel approach to minimize downtime on their production data warehouse:

The main steps of this process are (1) data migration, (2) technical validation, (3) data sync, and (4) business validation, as described in the following section.

Running the migration

Full data migration

The initial data migration was the first milestone of the project. The main requirements for this phase were: (1) minimize the impact on the data source, and (2) transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service or AWS Schema Conversion Tool).

For this heterogeneous migration, the customer used the AWS Schema Conversion Tool (SCT). The SCT enabled them to run the data migration, provisioning multiple virtual machines in the same data center where IBM Netezza was installed, each running an AWS SCT Data Extractor agent. These data extractors are Java processes that connect directly to the source database and migrate data in chunks to the target database.

Sizing data extraction agents

To estimate the number of data extractor agents needed for the migration, consider this rule of thumb: One data extractor agent per 1 TB of compressed data on the source. Another recommendation is to install extraction agents on individual computers.

For each agent, consider the following general hardware requirements:

  • CPU: 4 cores. Lots of transformations and a large number of packets to process during data migration.
  • RAM: 16 GB. Data chunks are kept in memory before dumping to disk.
  • Disk: 100 GB / ~500 IOPS. Intermediate results are stored on disk.
  • Network: at least 1 Gbit (10 Gbit recommended). While provisioning the resources, it is recommended to reduce the number of network hops from the source to the AWS SCT data extraction agents.

Follow this documentation to complete the installation steps for the data extraction agents.

Depending on the size of the data to be migrated and the network speed, you may also run data extractor agents on Amazon EC2. For large data warehouses, and to minimize downtime and optimize data transfer, we recommend deploying the data extraction agents as close as possible to the source. For example, in this migration project, 24 individual SCT extractor agents were installed in the on-premises data center to extract data concurrently and speed up the process. Due to the stress that these operations generate on the data source, every extraction phase was run during weekends and off-hours.

The diagram below depicts the migration architecture deployed during the data migration phases:

Creating data extraction tasks

The source tables were migrated in parallel, on a table-by-table basis, using the deployed SCT extraction agents. These extraction agents authenticate using a valid user on the data source, which allows you to adjust the resources available for that user during the extraction. Data was processed locally by SCT agents and uploaded to S3 through the network (via AWS Direct Connect). Note that other migration scenarios might require the use of AWS Snowball devices. Check the Snowball documentation to decide which transfer method is better for your scenario.

As part of the analysis performed while planning the migration, the customer identified large tables, for example, tables with more than 20 million rows or larger than 1 TB. In order to extract data from those tables, they used the virtual partitioning feature of AWS SCT, creating several sub-tasks and parallelizing the data extraction process for those tables. We recommend creating two groups of tasks for each schema that you migrate: one for small tables and another for large tables using virtual partitions.

These tasks can be defined and created before running, so everything is ready for the migration window. Visit the following documentation to create, run, and monitor AWS SCT Data Extraction Tasks.

Technical validation

Once the initial extracted data was loaded to Amazon Redshift, data validation tests were performed in parallel, using validation scripts developed by the partner teams involved in the migration. The goal at this stage is to validate production workloads, comparing IBM Netezza and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase are the following:

  • Count number of objects and rows on each table.
  • Compare the same random subset of data in both IBM Netezza and Amazon Redshift for all migrated tables, validating that data is exactly the same row by row.
  • Check incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Deal with tables with large varchar columns.
  • Confirm that processes do not crash when connected with target environment.
  • Validate daily batch job runs (job duration, number of rows processed).

You’ll find the right techniques to execute most of those activities in Top 10 Performance Tuning Techniques for Amazon Redshift.

Data synchronization

During this phase, the customer again migrated the tables and schemas that lost synchronization with the source during the technical validation phase. Because the same mechanism described in the Full data migration section was used, and because the ETL processes that generate the data marts were already running against the future system, the data was kept up to date after this synchronization phase.

Business validation

After the second data migration was successfully performed and the data movement was technically validated, the last remaining task was to involve the data warehouse users in the final validation. These users from different business units across the company accessed the data warehouse using a variety of tools and methods: JDBC/ODBC clients, Python code, PL/SQL procedures, custom applications, and so on. It was central to the migration to make sure that every end user had verified and adapted their processes to work seamlessly with Amazon Redshift before the final cut-over was performed.

This phase took around three months and consisted of several tasks:

  • Adapt business users’ tools, applications, and scripts to connect to Amazon Redshift endpoints.
  • Modify users' data load and dump procedures, replacing data movement to/from shared storage via ODBC/JDBC with COPY/UNLOAD operations from/to S3.
  • Modify any incompatible query, taking into account Amazon Redshift PostgreSQL implementation nuances.
  • Run business processes against both IBM Netezza and Amazon Redshift, compare results and execution times, and report any issue or unexpected result to the team in charge of the migration so that the case can be analyzed in detail.
  • Tune query performance, taking into account table sort keys and making extensive use of the EXPLAIN command in order to understand how Amazon Redshift plans and executes queries.

This business validation phase was key to have all end users aligned and ready for the final cut-over. Following Amazon Redshift best practices enabled end users to leverage the capabilities of their new data warehouse.

Soft cut-over

After all the migration and validation tasks had been performed, every ETL, business process, external system, and user tool was successfully connected and tested against Amazon Redshift.

This is when every process can be disconnected from the old data warehouse, which can be safely powered off and decommissioned.

Conclusion

In this blog post, we described the steps taken to perform a successful large-scale data warehouse migration from an on-premises IBM Netezza to Amazon Redshift. These same steps can be extrapolated to any other source data warehouse.

While this article describes a pure lift-and-shift migration, this is just the beginning of a transformation process towards a full-fledged corporate data lake. There are a series of next steps necessary in order to gain full advantage of the powerful analytics tools and services provided by AWS:

  • Activate the Amazon Redshift concurrency scaling feature on interactive query queues so clusters scale seamlessly during high-usage periods, without needing to provision for peak capacity.
  • Create a data lake in S3 and offload less accessed data, keeping warm and hot data on the Amazon Redshift clusters for higher performance.
  • Leverage Amazon Redshift Spectrum to be able to combine cold and hot data on analytic queries when required by the business needs.
  • Use Amazon Athena to be able to query cold data without affecting the data warehouse performance.

It is worth pointing out several takeaways that we feel are central to achieving a successful large-scale migration to Amazon Redshift:

  • Start with a PoC to make an accurate initial sizing of the Amazon Redshift cluster.
  • Create a detailed migration plan, which includes a clear procedure for every affected system.
  • Have end users fully aligned with the migration process, and make sure all their processes are validated before the final cutover is performed.
  • Follow Amazon Redshift best practices and techniques to leverage its full capabilities and performance.
  • Engage with the AWS account team from early stages and throughout the whole process. They are the point of contact with AWS specialists, Professional Services, and partners in order to bring the migration project to a successful conclusion.

We hope you found this post useful. Please feel free to leave a comment or question.

About the Authors

Guillermo Menendez Corral is a solutions architect at Amazon Web Services. He has over 12 years of experience designing and building software applications and currently provides architectural guidance to AWS customers, with a focus on analytics and machine learning.

Arturo Bayo is a big data consultant at Amazon Web Services. He promotes a data-driven culture in enterprise customers around EMEA, providing specialized guidance on business intelligence and data lake projects while working with AWS customers and partners to build innovative solutions around data and analytics.

Perform biomedical informatics without a database using MIMIC-III data and Amazon Athena

Post Syndicated from James Wiggins original https://aws.amazon.com/blogs/big-data/perform-biomedical-informatics-without-a-database-using-mimic-iii-data-and-amazon-athena/

Biomedical researchers require access to accurate, detailed data. The MIT MIMIC-III dataset is a popular resource. Using Amazon Athena, you can execute standard SQL queries against MIMIC-III without first loading the data into a database. Your analyses always reference the most recent version of the MIMIC-III dataset.

This post describes how to make the MIMIC-III dataset available in Athena and provide automated access to an analysis environment for MIMIC-III on AWS. We also compare a MIMIC-III reference bioinformatics study using a traditional database to that same study using Athena.

Overview

A dataset capturing a variety of measures longitudinally over time, across many patients, can drive analytics and machine learning toward research discovery and improved clinical decision-making. These features describe the MIT Laboratory of Computational Physiology (LCP) MIMIC-III dataset. In the words of LCP researchers:

“MIMIC-III is a large, publicly available database comprising de-identified health-related data associated with approximately 60K admissions of patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. …MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is publicly and freely available, it encompasses a diverse and large population of ICU patients, and it contains high temporal resolution data including lab results, electronic documentation, and bedside monitor trends and waveforms.”

Recently, the Registry of Open Data on AWS (RODA) program made the MIMIC-III dataset available through the AWS Cloud. You can now use the MIMIC-III dataset without having to download, copy, or pay to store it. Instead, you can analyze the MIMIC-III dataset in the AWS Cloud using AWS services like Amazon EC2, Athena, AWS Lambda, or Amazon EMR. AWS Cloud availability enables quicker and cheaper research into the dataset.

Services like Athena also offer you new analytical approaches to the MIMIC-III dataset. Using Athena, you can execute standard SQL queries against MIMIC-III without first loading the data into a database. Because you can reference the MIMIC-III dataset hosted in the RODA program, your analyses always reference the most recent version of the MIMIC-III dataset. Live hosting reduces upfront time and effort, eliminates data synchronization issues, improves data analysis, and reduces overall study costs.

Transforming MIMIC-III data

Historically, the MIMIC team distributed the MIMIC-III dataset in compressed (gzipped) CSV format. The choice of CSV format reflects the loading of MIMIC-III data into a traditional relational database for analysis.

In contrast, the MIMIC team provides the MIMIC-III dataset to the RODA program in the following formats:

  • The traditional CSV format
  • An Apache Parquet format optimized for modern data processing technologies such as Athena

Apache Parquet stores data by column, so queries that fetch specific columns can run without reading the whole table. This optimization helps improve the performance and lower the cost of many query types common to biomedical informatics.

To convert the original MIMIC-III CSV dataset to Apache Parquet, we created a data transformation job using AWS Glue. The MIMIC team schedules the AWS Glue job to run as needed, updating the Parquet files in the RODA program with any changes to the CSV dataset. These scheduled updates keep the Parquet files current without interfering with the MIMIC team’s CSV dataset creation procedures. Find the code for this AWS Glue job in the mimic-code GitHub repo. Use the same code as a basis to develop other CSV-to-Parquet conversions.

The code that converts the MIMIC-III NOTEEVENTS table offers a valuable resource. This table stores long medical notes made up of multiple lines, with commas, double-quotes, and other characters that can be challenging to transform. For example, you can use the following Spark read statement to interpret that CSV file:

df = spark.read.csv('s3://'+mimiccsvinputbucket+'/NOTEEVENTS.csv.gz',\
	header=True,\
	schema=schema,\
	multiLine=True,\
	quote='"',\
	escape='"')
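
A corresponding write step then stores the DataFrame as Parquet. The following is a hedged sketch rather than the exact job code from the mimic-code repository, and the output bucket variable is a placeholder:

# Write the parsed NOTEEVENTS data back to S3 in Parquet format.
df.write \
    .mode('overwrite') \
    .parquet('s3://' + mimicparquetoutputbucket + '/NOTEEVENTS/')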

Executing the indwelling arterial catheter study using MIMIC-III

For example, the MIMIC team provides code for the MIMIC indwelling arterial catheter (IAC) aline study. The study uses the MIMIC-III dataset to reproduce findings from the published study, The Association Between Indwelling Arterial Catheters and Mortality in Hemodynamically Stable Patients With Respiratory Failure.

You can obtain the aline study code, as well as many other analytic examples, in the MIT Laboratory of Computational Physiology MIMIC Code Repository in GitHub. Learn more about the provided code in the JAMIA paper, The MIMIC Code Repository: enabling reproducibility in critical care research.

The aline study code comprises about 1400 lines of SQL. It was developed for a PostgreSQL database, and is executed and further analyzed by Python and R code. We ran the aline study against the Parquet MIMIC-III dataset using Athena instead of PostgreSQL to answer the following questions:

  • What modifications would be required, if any, to run the study using Athena, instead of PostgreSQL?
  • How would performance differ?
  • How would cost differ?

Some SQL features used in the study work differently in Athena than they do in PostgreSQL. As a result, about 5% of the SQL statements needed modification. Specifically, the SQL statements require the following types of modification for use with Athena:

  • Instead of using materialized views, in Athena, create a new table. For instance, the CREATE MATERIALIZED VIEW statement can be written as CREATE TABLE.

  • Instead of using schema context statements like SET SEARCH_PATH, in Athena, explicitly declare the schema of referenced tables. For instance, instead of having a SET SEARCH_PATH statement before your query and referencing tables without a schema as follows:

SET SEARCH_PATH TO mimiciii;

left join chartevents ce

You declare the schema as a part of every table reference:

left join mimiciii.chartevents ce

  • Epoch time, as well as timestamp arithmetic, works differently in Athena than in PostgreSQL. So, instead of arithmetic on timestamps, use the date_diff() function and specify seconds as the unit to return.

extract(epoch from endtime-starttime)/24.0/60.0/60.0 

The preceding statement can be written as:

date_diff('second',starttime, endtime)/24.0/60.0/60.0

  • Athena uses the “double” data type instead of the “numeric” data type.

ROUND( cast(f.height_first as numeric), 2) AS height_first,

The preceding statement can be written as:

ROUND( cast(f.height_first as double), 2) AS height_first,

  • If VARCHAR fields are empty, they are not detected as NULL. Instead, compare a VARCHAR to an empty string. For instance, consider the following field:

where ce.value IS NOT NULL

If ce.value is a VARCHAR, it can be written as the following:

where ce.value <> ''

After making these minor edits to the aline study SQL statements, the study executes successfully using Athena instead of PostgreSQL. The output produced from both methods is identical.

Next, consider the speed of Athena compared to PostgreSQL in analyzing the aline study. As shown in the following graph, the aline study using Athena queries of the Parquet-formatted MIMIC-III dataset run 10 times faster than queries of the same data in an Amazon RDS PostgreSQL database.

Compare the costs of running the aline study in Athena to running it in PostgreSQL. RDS PostgreSQL databases bill in one-second increments for the time the database runs, while Athena queries bill per gigabyte of data each query scans. Because of this fundamental pricing difference, you must make some assumptions to compare costs.

You could assume that running the aline study in RDS PostgreSQL involves the following steps, taking up an eight-hour working day:

  1. Deploy a new RDS PostgreSQL database.
  2. Load the MIMIC-III dataset.
  3. Connect a Jupyter notebook.
  4. Run the aline study.
  5. Terminate the RDS database.

You’d then compare the cost of running an RDS PostgreSQL database for eight hours (at $0.37 per hour, db.m5.xlarge with 100-GB EBS storage) to the cost of running the aline study against a dataset using Athena (9.93 GB of data scanned at $0.005/GB). As the following graph shows, this results in an almost 60x cost savings.

Working with MIMIC-III in AWS

MIMIC-III is a publicly available dataset containing detailed information about the clinical care of patients. For this reason, the MIMIC team requires that you complete a training course and submit a formal request before gaining access. For more information, see Requesting access.

After you obtain access to the MIMIC-III dataset, log in to the PhysioNet website and, under your profile settings, provide your AWS account ID. Then click the request access link, under Files, on the MIMIC-III Clinical Database page.  You then have access to the MIMIC-III dataset in AWS.

Choose Launch Stack below to begin working with the MIMIC-III dataset in your AWS account. This launches an AWS CloudFormation template, which deploys an index of the MIMIC-III Parquet-formatted dataset into your AWS Glue Data Catalog. It also deploys an Amazon SageMaker notebook instance pre-loaded with the aline study code and many other MIMIC-provided analytic examples.

After the AWS CloudFormation template deploys, choose Outputs, and follow the link to access Jupyter Notebooks. From there, you can open and execute the aline study code by browsing the filesystem to the path ./mimic-code/notebooks/aline-aws/aline-awsathena.ipynb.

After you’ve completed your analysis, you can stop your Jupyter Notebook instance to suspend charges for compute.  Any time that you want to continue working, just start the instance and continue where you left off.

Conclusion

The RODA program provides quick and inexpensive access to the global biomedical research resources of the MIMIC-III dataset. Cloud-native analytic tools like AWS Glue and Athena accelerate research, and groups like the MIT Laboratory of Computational Physiology are pioneering data availability in modern data formats like Apache Parquet.

The analytic approaches and code demonstrated in this post can be applied generally to help increase agility and decrease cost. Consider using them as you enhance existing or perform new biomedical informatics research.

About the Authors


James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.

Alistair Johnson is a research scientist at the Massachusetts Institute of Technology. Alistair has extensive experience and expertise in working with ICU data, having published the MIMIC-III and eICU-CRD datasets. His current research focuses on clinical epidemiology and machine learning for decision support.

Creating custom Pinpoint dashboards using Amazon QuickSight, part 2

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/creating-custom-pinpoint-dashboards-using-amazon-quicksight-part-2/

Note: This post was written by Manan Nayar and Aprajita Arora, Software Development Engineers on the AWS Digital User Engagement team.


In our previous post, we discussed the process of visualizing specific, pre-aggregated Amazon Pinpoint metrics—such as delivery rate or open rate—using the Amazon Pinpoint Metrics APIs. In that example, we showed how to create a Lambda function that retrieves your metrics, and then make those metrics available for creating visualizations in Amazon QuickSight.

This post shows a different approach to exporting data from Amazon Pinpoint and using it to build visualizations. Rather than retrieve specific metrics, you can use the event stream feature in Amazon Pinpoint to export raw event data. You can use this data in Amazon QuickSight to create in-depth analyses of your data, as opposed to visualizing pre-calculated metrics. As an added benefit, when you use this solution, you don't have to modify any code, and the underlying data is updated every few minutes.

Step 1: Configure the event stream in Amazon Pinpoint

The Amazon Pinpoint event stream includes information about campaign events (such as campaign.send) and application events (such as session.start). It also includes response information related to all of the emails and SMS messages that you send from your Amazon Pinpoint project, regardless of whether they were sent from campaigns or on a transactional basis. When you enable event streams, Amazon Pinpoint automatically sends this data to your S3 bucket (via Amazon Kinesis Data Firehose) every few minutes.

To set up the event stream

  1. Sign in to the Amazon Pinpoint console at http://console.aws.amazon.com/pinpoint. In the list of projects, choose the project that you want to enable event streaming for.
  2. Under Settings, choose Event stream.
  3. Choose Edit, and then configure the event stream to use Amazon Kinesis Data Firehose. If you don’t already have a Kinesis Data Firehose stream, follow the link to create one in the Kinesis console. Configure the stream to send data to an S3 bucket. For more information about creating streams, see Creating an Amazon Kinesis Data Firehose Delivery Stream.
  4. Under IAM role, choose Automatically create a role. Choose Save.

Step 2: Add a data set in Amazon QuickSight

Now that you’ve started streaming your Amazon Pinpoint data to S3, you can set Amazon QuickSight to look for data in the S3 bucket. You connect QuickSight to sources of data by creating data sets.

To create a data set

    1. In a text editor, create a new file. Paste the following code:
      {
          "fileLocations": [
              {
                  "URIPrefixes": [ 
                      "s3://<bucketName>/"          
                  ]
              }
          ],
          "globalUploadSettings": {
              "format": "JSON"
          }
      }

      In the preceding code, replace <bucketName> with the name of the S3 bucket that you’re using to store the event stream data. Save the file as manifest.json.

    2. Sign in to the QuickSight console at https://quicksight.aws.amazon.com.
    3. Create a new S3 data set. When prompted, choose the manifest file that you created in step 1. For more information about creating S3 data sets, see Creating a Data Set Using Amazon S3 Files in the Amazon QuickSight User Guide.
    4. Create a new analysis. From here, you can start creating visualizations of your data. To learn more, see Creating an Analysis in the Amazon QuickSight User Guide.

Step 3: Set the refresh rate for the data set

You can configure your data sets in Amazon QuickSight to automatically refresh on a certain schedule. In this section, you configure the data set to refresh every day, one minute before midnight.

To set the refresh schedule

  1. Go to the QuickSight start page at https://quicksight.aws.amazon.com/sn/start. Choose Manage data.
  2. Choose the data set that you created in the previous section.
  3. Choose Schedule refresh. Follow the prompts to set up a daily refresh schedule. If you prefer to trigger refreshes programmatically, see the sketch after these steps.
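
As an alternative to the console schedule, you can also kick off a SPICE refresh from code. The following is a minimal boto3 sketch; the AWS account ID, data set ID, and region are placeholders for your own values:

import uuid
import boto3

# Hedged sketch: start an on-demand ingestion (refresh) for the data set.
quicksight = boto3.client('quicksight', region_name='us-east-1')

quicksight.create_ingestion(
    AwsAccountId='111122223333',
    DataSetId='your-pinpoint-dataset-id',
    IngestionId=str(uuid.uuid4())
)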

Step 4: Create your visualizations

From this point, you can start creating visualizations of your data. To learn more about creating visualizations, see Creating an Analysis in the Amazon QuickSight User Guide.

Build, secure, and manage data lakes with AWS Lake Formation

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/building-securing-and-managing-data-lakes-with-aws-lake-formation/

A data lake is a centralized store of a variety of data types for analysis by multiple analytics approaches and groups. Many organizations are moving their data into a data lake. In this post, I explore how you can use AWS Lake Formation to build, secure, and manage data lakes.

Traditionally, organizations have kept data in a rigid, single-purpose system, such as an on-premises data warehouse appliance. Similarly, they have analyzed data using a single method, such as predefined BI reports. Moving data between databases or for use with different approaches, like machine learning (ML) or improvised SQL querying, required “extract, transform, load” (ETL) processing before analysis. At best, these traditional methods have created inefficiencies and delays. At worst, they have complicated security.

By contrast, cloud-based data lakes open structured and unstructured data for more flexible analysis. Any amount of data can be aggregated, organized, prepared, and secured by IT staff in advance. Analysts and data scientists can then access it in place with the analytics tools of their choice, in compliance with appropriate usage policies.

Data lakes let you combine analytics methods, offering valuable insights unavailable through traditional data storage and analysis. In a retail scenario, ML methods discovered detailed customer profiles and cohorts on non-personally identifiable data gathered from web browsing behavior, purchase history, support records, and even social media. The exercise showed the deployment of ML models on real-time, streaming, interactive customer data.

Such models could analyze shopping baskets and serve up “next best offers” in the moment, or deliver instant promotional incentives. Marketing and support staff could explore customer profitability and satisfaction in real time and define new tactics to improve sales. Around a data lake, combined analytics techniques like these can unify diverse data streams, providing insights unobtainable from siloed data.

The challenges of building data lakes

Unfortunately, the complex and time-consuming process for building, securing, and starting to manage a data lake often takes months. Even building a data lake in the cloud requires many manual and time-consuming steps:

  • Setting up storage.
  • Moving, cleaning, preparing, and cataloging data.
  • Configuring and enforcing security policies for each service.
  • Manually granting access to users.

You want data lakes to centralize data for processing and analysis with multiple services. But organizing and securing the environment requires patience.

Currently, IT staff and architects spend too much time creating the data lake, configuring security, and responding to data requests. They could spend this time acting as curators of data resources, or as advisors to analysts and data scientists. Analysts and data scientists must wait for access to needed data throughout the setup.

The following diagram shows the data lake setup process:

Setting up storage

Data lakes hold massive amounts of data. Before doing anything else, you must set up storage to hold all that data. If you are using AWS, configure Amazon S3 buckets and partitions. If you are building the data lake on premises, acquire hardware and set up large disk arrays to store all the data.

Moving data

Connect to different data sources, both on-premises and in the cloud, then collect data from IoT devices. Next, collect and organize the relevant datasets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. You can use a collection of file transfer and ETL tools for these steps.

Cleaning and preparing data

Next, collected data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. You must clean, de-duplicate, and match related records.

Today, organizations accomplish these tasks using rigid and complex SQL statements that perform unreliably and are difficult to maintain. This complex process of collecting, cleaning, and transforming the incoming data requires manual monitoring to avoid errors. Many customers use AWS Glue for this task.

Configuring and enforcing policies

Customers and regulators require that organizations secure sensitive data. Compliance involves creating and applying data access, protection, and compliance policies. For example, you restrict access to personally identifiable information (PII) at the table, column, or row level, encrypt all data, and keep audit logs of who is accessing the data.

Today, you can secure data using access control lists on S3 buckets or third-party encryption and access control software. You create and maintain data access, protection, and compliance policies for each analytics service requiring access to the data. For example, if you are running analysis against your data lake using Amazon Redshift and Amazon Athena, you must set up access control rules for each of these services.

Many customers use AWS Glue Data Catalog resource policies to configure and control metadata access to their data. Some choose to use Apache Ranger. But these approaches can be painful and limiting. S3 policies provide at best table-level access. And you must maintain data and metadata policies separately. With Apache Ranger, you can configure metadata access to only one cluster at a time. Also, policies can become wordy as the number of users and teams accessing the data lake grows within an organization.

Making it easy to find data

Users with different needs, like analysts and data scientists, may struggle to find and trust relevant datasets in the data lake. To make it easy for users to find relevant and trusted data, you must clearly label the data in a data lake catalog. Provide users with the ability to access and analyze this data without making requests to IT.

Today, each of these steps involves a lot of manual work. Customer labor includes building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and ML. With all these steps, a fully productive data lake can take months to implement.

The wide range of AWS services provides all the building blocks of a data lake, including many choices for storage, computing, analytics, and security. In the nearly 13 years that AWS has operated Amazon S3, with exabytes of data now stored in it, S3 has become the clear first choice for data lakes. AWS Glue adds a data catalog and serverless transformation capabilities. Amazon EMR brings managed big data processing frameworks like Apache Spark and Apache Hadoop. Amazon Redshift Spectrum offers data warehouse functions directly on data in Amazon S3. Athena brings serverless SQL querying.

With all these services available, customers have been building data lakes on AWS for years. AWS runs over 10,000 data lakes on top of S3, many using AWS Glue for the shared AWS Glue Data Catalog and data processing with Apache Spark.

AWS has learned from the thousands of customers running analytics on AWS that most customers who want to do analytics also want to build a data lake. But many of you want this process to be easier and faster than it is today.

AWS Lake Formation (now generally available)

At AWS re:Invent 2018, AWS introduced Lake Formation: a new managed service to help you build a secure data lake in days. If you missed it, watch Andy Jassy’s keynote announcement. Lake Formation has several advantages:

  • Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster.
  • Enforce security policies across multiple services: After your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications.
  • Gain and manage new insights: With Lake Formation, you build a data catalog that describes available datasets and their appropriate business uses. This catalog makes your users more productive by helping them find the right dataset to analyze.

The following screenshot illustrates Lake Formation and its capabilities.

How to create a data lake

S3 forms the storage layer for Lake Formation. If you already use S3, you typically begin by registering existing S3 buckets that contain your data. Lake Formation can also create new buckets for the data lake and import data into them. AWS always stores this data in your account, and only you have direct access to it.

There is no lock-in to Lake Formation for your data. Because the data is stored in standard formats like CSV, ORC, or Parquet, you can use it with a wide variety of AWS or third-party analytics tools.

Lake Formation also optimizes the partitioning of data in S3 to improve performance and reduce costs. The raw data you load may reside in partitions that are too small (requiring extra reads) or too large (reading more data than needed). Lake Formation organizes your data by size, time, or relevant keys to allow fast scans and parallel, distributed reads for the most commonly used queries.

How to load data and catalog metadata

Lake Formation uses the concept of blueprints for loading and cataloging data. You can run blueprints one time for an initial load or set them up to be incremental, adding new data and making it available.

With Lake Formation, you can import data from MySQL, Postgres, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. You can also import from on-premises databases by connecting with Java Database Connectivity (JDBC).

Point Lake Formation to the data source, identify the location to load it into the data lake, and specify how often to load it. The blueprint discovers the source table schema, automatically converts data to the target data format, partitions the data based on the partitioning schema, and tracks data that has already been processed. You can customize all of these actions.

Blueprints rely on AWS Glue as a supporting service. AWS Glue crawlers connect to and discover the raw data to be ingested. AWS Glue code generation and jobs generate the ingest code to bring that data into the data lake. Lake Formation uses the same Data Catalog for organizing the metadata. AWS Glue stitches together crawlers and jobs and allows monitoring of individual workflows. In these ways, Lake Formation is a natural extension of AWS Glue capabilities.
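To make those underlying pieces concrete, here is a minimal boto3 sketch of the kind of crawler a blueprint provisions behind the scenes; the crawler name, IAM role, database, and S3 path are placeholders, not values that Lake Formation generates.

import boto3

glue = boto3.client("glue")

# Placeholder names: substitute your own crawler name, IAM role,
# catalog database, and S3 path.
glue.create_crawler(
    Name="raw-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="datalake_raw",
    Targets={"S3Targets": [{"Path": "s3://example-raw-data-bucket/orders/"}]},
)

# Run the crawler once. It infers the schema and registers tables in the
# shared Glue Data Catalog, which Lake Formation also uses for its metadata.
glue.start_crawler(Name="raw-orders-crawler")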

The following graphics show the Blueprint Workflow and Import screens:

How to transform and prepare data for analysis

In addition to supporting all the same ETL capabilities as AWS Glue, Lake Formation introduces new Amazon ML Transforms. This feature includes a fuzzy logic blocking algorithm that can de-duplicate 400M+ records in less than 2.5 hours, which is orders of magnitude better than earlier approaches.

To match and de-duplicate your data using Amazon ML Transforms, you first merge related datasets. Amazon ML Transforms divides these sets into training and testing samples, then scans for exact and fuzzy matches. You can provide more data and examples for greater accuracy, and then put the transform into production to process new data as it arrives in your data lake. The partitioning algorithm requires minimal tuning. The confidence level reflects the quality of the grouping, improving on earlier, more improvised algorithms. The following diagram shows this matching and de-duplicating workflow.

Amazon.com is currently using and vetting Amazon ML Transforms internally, at scale, for retail workloads. Lake Formation now makes these algorithms available to customers, so you can avoid the frustration of creating complex and fragile SQL statements to handle record matching and de-duplication. Amazon ML Transforms help improve data quality before analysis. For more information, see Fuzzy Matching and Deduplicating Data with Amazon ML Transforms for AWS Lake Formation.
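If you prefer to script this, the Glue API exposes these capabilities as FindMatches ML transforms. The following boto3 sketch is illustrative only; the table, role, primary key, and tuning value are assumptions for the example.

import boto3

glue = boto3.client("glue")

# Hypothetical catalog table and role. FIND_MATCHES is the Glue ML transform
# type behind this fuzzy matching and de-duplication capability.
response = glue.create_ml_transform(
    Name="dedupe-customer-records",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    InputRecordTables=[
        {"DatabaseName": "datalake_raw", "TableName": "customers"},
    ],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "customer_id",
            # Example tuning value: closer to 1.0 favors precision over recall.
            "PrecisionRecallTradeoff": 0.9,
        },
    },
)

# The transform still needs labeled examples and training before it can
# score matches; this call only creates it.
print(response["TransformId"])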

How to set access control permissions

Lake Formation lets you define policies and control data access with simple “grant and revoke permissions to data” sets at granular levels. You can assign permissions to IAM users, roles, groups, and Active Directory users using federation. You specify permissions on catalog objects (like tables and columns) rather than on buckets and objects.

You can easily view and audit all the data policies granted to a user—in one place. Search and view the permissions granted to a user, role, or group through the dashboard; verify permissions granted; and when necessary, easily revoke policies for a user. The following screenshots show the Grant permissions console:
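The same audit is available programmatically. Below is a small, illustrative boto3 sketch that filters the full permission list for a single principal; the user ARN is a placeholder, and pagination is omitted for brevity.

import boto3

lf = boto3.client("lakeformation")

# Placeholder principal to audit.
principal = "arn:aws:iam::123456789012:user/analyst"

# list_permissions returns every grant visible to the caller; filter
# client-side for one principal. Pagination (NextToken) is not handled here.
response = lf.list_permissions()
for grant in response["PrincipalResourcePermissions"]:
    if grant["Principal"]["DataLakePrincipalIdentifier"] == principal:
        print(grant["Resource"], grant["Permissions"])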

How to make data available for analytics

Lake Formation offers unified, text-based, faceted search across all metadata, giving users self-serve access to the catalog of datasets available for analysis. This catalog includes discovered schemas (as discussed previously) and lets you add attributes like data owners, stewards, and other business-specific attributes as table properties.

At a more granular level, you can also add data sensitivity level, column definitions, and other attributes as column properties. You can explore data by any of these properties. But access is subject to user permissions. See the following screenshot of the AWS Glue tables tab:
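Table properties live in the Glue Data Catalog, so one way you might attach a business attribute is to update the table definition through the Glue API. The following boto3 sketch is an illustration; the database, table, and the data_owner key are example values.

import boto3

glue = boto3.client("glue")

# Example database and table; fetch the current definition from the catalog.
table = glue.get_table(DatabaseName="datalake_db", Name="orders")["Table"]

# Add a business attribute as a table property (an example key).
parameters = dict(table.get("Parameters", {}))
parameters["data_owner"] = "analytics-team"

# update_table takes a TableInput, which accepts only a subset of the fields
# that get_table returns, so copy over the relevant ones.
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": table["StorageDescriptor"],
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": parameters,
}
glue.update_table(DatabaseName="datalake_db", TableInput=table_input)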

How to monitor activity

With Lake Formation, you can also see detailed alerts in the dashboard, and then download audit logs for further analytics.

Amazon CloudWatch publishes all data ingestion events and catalog notifications. In this way, you can identify suspicious behavior or demonstrate compliance with rules.

To monitor and control access using Lake Formation, first define the access policies, as described previously. Users who want to conduct analysis access data directly through an AWS analytics service, such as Amazon EMR for Spark, Amazon Redshift, or Athena. Or, they access data indirectly with Amazon QuickSight or Amazon SageMaker.

A service forwards the user credentials to Lake Formation for the validation of access permissions. Then Lake Formation returns temporary credentials granting access to the data in S3, as shown in the following diagrams. After a user gains access, actual reads and writes of data operate directly between the analytics service and S3. This approach removes the need for an intermediary in the critical data-processing path.
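From the analyst's point of view, this is invisible. For example, a query submitted through the Athena API with nothing but the caller's own IAM credentials returns only the databases, tables, and columns that principal has been granted. The sketch below is illustrative; the query, database, and results location are placeholders.

import boto3

athena = boto3.client("athena")

# The caller's IAM identity is all that is passed; Lake Formation determines
# which databases, tables, and columns this principal can actually read.
execution = athena.start_query_execution(
    QueryString="SELECT eventname, count(*) FROM cloudtrail_logs GROUP BY eventname",
    QueryExecutionContext={"Database": "datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(execution["QueryExecutionId"])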

The following screenshot and diagram show how to monitor and control access using Lake Formation.

Conclusion

With just a few steps, you can set up your data lake on S3 and start ingesting data that is readily queryable. To get started, go to the Lake Formation console and add your data sources. Lake Formation crawls those sources and moves the data into your new S3 data lake.

Lake Formation can automatically lay out the data in S3 partitions; change it into formats for faster analytics, like Apache Parquet and ORC; and increase data quality through machine-learned record matching and de-duplication.

From a single dashboard, you can set up all the permissions for your data lake. Those permissions are implemented for every service accessing this data – including analytics and ML services (Amazon Redshift, Athena, and Amazon EMR for Apache Spark workloads). Lake Formation saves you the hassle of redefining policies across multiple services and provides consistent enforcement of and compliance with those policies.

Learn how to start using AWS Lake Formation.


About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

Prajakta Damle is a Principal Product Manager at Amazon Web Services.

AWS Lake Formation – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-now-generally-available/

As soon as companies started to have data in digital format, it became possible for them to build a data warehouse: collecting data from their operational systems, such as customer relationship management (CRM) and enterprise resource planning (ERP) systems, and using this information to support their business decisions.

The reduction in costs of storage, together with an even greater reduction in complexity for managing large quantities of data, made possible by services such as Amazon S3, has allowed companies to retain more information, including raw data that is not structured, such as logs, images, video, and scanned documents.

This is the idea of a data lake: to store all your data in one, centralized repository, at any scale. We are seeing this approach with customers like Netflix, Zillow, NASDAQ, Yelp, iRobot, FINRA, and Lyft. They can run their analytics on this larger dataset, from simple aggregations to complex machine learning algorithms, to better discover patterns in their data and understand their business.

Last year at re:Invent we introduced in preview AWS Lake Formation, a service that makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analytics and machine learning. I am happy to share that Lake Formation is generally available today!

With Lake Formation you have a central console to manage your data lake, for example to configure the jobs that move data from multiple sources, such as databases and logs, to your data lake. Having such a large and diversified amount of data makes configuring the right access permission also critical. You can secure access to metadata in the Glue Data Catalog and data stored in S3 using a single set of granular data access policies defined in Lake Formation. These policies allow you to define table and column-level data access.

One of the things I like most about Lake Formation is that it works with your data already in S3! You can easily register your existing data with Lake Formation, and you don't need to change the existing processes that load your data to S3. Since the data remains in your account, you have full control.
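As an illustration, registering an existing bucket is a single API call with boto3; the bucket ARN below is a placeholder.

import boto3

lf = boto3.client("lakeformation")

# Register an existing S3 location with Lake Formation. The data stays where
# it is; Lake Formation only starts managing access to it.
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-existing-data-bucket",
    UseServiceLinkedRole=True,
)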

You can also use Glue ML Transforms to easily deduplicate your data. Deduplication is important to reduce the amount of storage you need, but also to make analyzing your data more efficient, because you have neither the overhead nor the possible confusion of looking at the same data twice. This problem is trivial if duplicate records can be identified by a unique key, but becomes very challenging when you have to do a “fuzzy match”. A similar approach can be used for record linkage, that is, when you are looking for similar items in different tables, for example to do a “fuzzy join” of two databases that do not share a unique key.

In this way, implementing a data lake from scratch is much faster, and managing a data lake is much easier, making these technologies available to more customers.

Creating a Data Lake
Let’s build a data lake using the Lake Formation console. First I register the S3 buckets that are going to be part of my data lake. Then I create a database and grant permission to the IAM users and roles that I am going to use to manage my data lake. The database is registered in the Glue Data Catalog and holds the metadata required to analyze the raw data, such as the structure of the tables that are going to be automatically generated during data ingestion.

Managing permissions is one of the most complex tasks for a data lake. Consider for example the huge amount of data that can be part of it, the sensitive, mission-critical nature of some of the data, and the different structured, semi-structured, and unstructured formats in which data can reside. Lake Formation makes it easier with a central location where you can give IAM users, roles, groups, and Active Directory users (via federation) access to databases, tables, optionally allowing or denying access to specific columns within a table.

To simplify data ingestion, I can use blueprints that create the necessary workflows, crawlers, and jobs on AWS Glue for common use cases. Workflows orchestrate your data loading workloads by building dependencies between Glue entities, such as triggers, crawlers, and jobs, and let you visually track the status of the different nodes in the workflows on the console, making it easier to monitor progress and troubleshoot issues.

Database blueprints help load data from operational databases. For example, if you have an e-commerce website, you can ingest all your orders in your data lake. You can load a full snapshot from an existing database, or incrementally load new data. In case of an incremental load, you can select a table and one or more of its columns as bookmark keys (for example, a timestamp in your orders) to determine previously imported data.

Log file blueprints simplify ingesting the logging formats used by Application Load Balancers, Elastic Load Balancers, and AWS CloudTrail. Let's take a closer look at how that works.

Security is always a top priority, and I want to be able to have a forensic log of all management operations across my account, so I choose the CloudTrail blueprint. As the source, I select a trail collecting my CloudTrail logs from all regions into an S3 bucket. In this way, I'll be able to query account activity across all my AWS infrastructure. This works similarly for a larger organization with multiple AWS accounts: they just need to apply the trail to their whole organization when configuring it in the CloudTrail console.

I then select the target database and the S3 location for the data lake. As the data format I use Parquet, a columnar storage format that will make querying the data faster and cheaper. The import frequency can range from hourly to monthly, with the option to choose the day of the week and the time. For now, I want to run the workflow on demand. I can do that from the console or programmatically, for example using any AWS SDK or the AWS Command Line Interface (CLI).
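For reference, an on-demand run could look like the following boto3 sketch; the workflow name is a placeholder for the one the blueprint creates in AWS Glue.

import boto3

glue = boto3.client("glue")

# The blueprint materializes an AWS Glue workflow; start a run on demand
# and check its status.
workflow_name = "cloudtrail-ingest-workflow"
run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]

run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)
print(run["Run"]["Status"])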

Finally, I give the workflow a name, the IAM role to use during execution, and a prefix for the tables that will be automatically created by this workflow.

I start the workflow from the Lake Formation console and select to view the workflow graph. This opens the AWS Glue console, where I can visually see the steps of the workflow and monitor the progress of this run.

When the workflow is completed, a new table is available in my data lake database. The source data remains as logs in the S3 bucket where CloudTrail delivers them, but now I have it consolidated, in Parquet format and partitioned by date, in my data lake S3 location. To optimize costs, I can set up an S3 lifecycle policy that automatically expires data in the source S3 bucket after a safe amount of time has passed.
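Such a lifecycle policy might look like the following boto3 sketch; the bucket name, prefix, and 90-day retention are example values, so choose a retention period that matches your own compliance needs.

import boto3

s3 = boto3.client("s3")

# Expire raw CloudTrail objects 90 days after creation; by then the workflow
# has already consolidated them into the data lake in Parquet format.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-cloudtrail-logs-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-cloudtrail",
                "Filter": {"Prefix": "AWSLogs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)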

Securing Access to the Data Lake
Lake Formation provides secure and granular access to data stored in the data lake, via a new grant/revoke permissions model that augments IAM policies. It is simple to set up these permissions, for example using the console:

I simply select the IAM user or role I want to grant access to. Then I select the database and optionally the tables and the columns I want to provide access to. It is also possible to select which type of access to provide. For this demo, simple select permissions are sufficient.

Accessing the Data Lake
Now I can query the data using tools like Amazon Athena or Amazon Redshift. For example, I open the query editor in the Athena console. First, I want to use my new data lake to look into which source IP addresses are most common in my AWS Account activity:

SELECT sourceipaddress, count(*)
FROM my_trail_cloudtrail
GROUP BY  sourceipaddress
ORDER BY  2 DESC;

Looking at the result of the query, you can see which AWS API endpoints I use the most. Next, I'd like to check which user identity types are used. That information is stored in JSON format inside one of the columns. I can use some of the JSON functions available in Amazon Athena to get it in my SQL statements:

SELECT json_extract_scalar(useridentity, '$.type'), count(*)
FROM "mylake"."my_trail_cloudtrail"
GROUP BY  json_extract_scalar(useridentity, '$.type')
ORDER BY  2 DESC;

Most of the time, AWS services are the ones creating activity in my trail. These queries are just an example, but they quickly give me deeper insight into what is happening in my AWS account.

Think of what a similar impact could be for your business! Using database and log file blueprints, you can quickly create workflows to ingest data from multiple sources within your organization, set the right column-level permissions on who can access any information collected, clean and prepare your data using machine learning transforms, and correlate and visualize the information using tools like Amazon Athena, Amazon Redshift, and Amazon QuickSight.

Customizing Data Access with Column-Level Permissions
To follow data privacy guidelines and compliance requirements, the mission-critical data stored in a data lake often needs custom views for different stakeholders inside the company. Let's compare the visibility of two IAM users in my AWS account: one that has full permissions on a table, and one that has select access to only a subset of the columns of the same table.

I already have a user with full access to the table containing my CloudTrail data, it’s called danilop. I create a new limitedview IAM user and I give it access to the Athena console. In the Lake Formation console, I only give this new user select permissions on three of the columns.
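The equivalent grant can also be made through the Lake Formation API. The following boto3 sketch is illustrative; the account ID is a placeholder, and the three column names are examples of CloudTrail fields rather than necessarily the ones used in the screenshots.

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on three example columns of one table to the limitedview user.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:user/limitedview"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "mylake",
            "Name": "my_trail_cloudtrail",
            "ColumnNames": ["eventsource", "eventname", "sourceipaddress"],
        }
    },
    Permissions=["SELECT"],
)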

To verify the different access to the data in the table, I log in with one user at a time and go to the Athena console. On the left I can explore which tables and columns the logged-in user can see in the Glue Data Catalog. Here’s a comparison for the two users, side-by-side:

The limited user has access only to the three columns that I explicitly configured, and to the four columns used for partitioning the table, whose access is required to see any data. When I query the table in the Athena console with a select * SQL statement, logged in as the limitedview user, I only see data from those seven columns:

Available Now
There is no additional cost in using AWS Lake Formation; you pay for the use of the underlying services such as Amazon S3 and AWS Glue. One of the core benefits of Lake Formation is the security policies it introduces. Previously, you had to use separate policies to secure data and metadata access, and these policies only allowed table-level access. Now you can give each user access, from a central location, to only the columns they need to use.

AWS Lake Formation is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Redshift integration with Lake Formation requires Redshift cluster version 1.0.8610 or higher; your clusters should have been automatically updated by the time you read this. Support for Apache Spark with Amazon EMR will follow over the next few months.

I only scratched the surface of what you can do with Lake Formation. Building and managing a data lake for your business is now much easier. Let me know how you are using these new capabilities!

Danilo