Tag Archives: Analytics

How the Yahoo! JAPAN Smart Devices Team is improving voice user interfaces with Amazon QuickSight business intelligence

Post Syndicated from Kazuhide Fujita original https://aws.amazon.com/blogs/big-data/how-the-yahoo-japan-smart-devices-team-is-improving-voice-user-interfaces-with-amazon-quicksight-business-intelligence/

This is a guest blog post by Kazuhide Fujita, Product Manager at Yahoo! JAPAN.

Yahoo! JAPAN is a large internet search and media company, with Yahoo! JAPAN’s web portal being one of the most commonly used websites in Japan. Our smart devices team is responsible for building and improving Yahoo! JAPAN apps for voice user interfaces (VUI) such as Amazon Alexa and Google Assistant. We see VUI as a market that will grow exponentially in the future, and we want to be ready to lead the consumer experience with such devices. In this post, I discuss how we’re using Amazon QuickSight business intelligence (BI) to help our product teams improve these services.

Enhanced access to insights at lower cost

To continuously improve our services, we use data to understand how consumers are interacting with the software and to identify growth trends. However, the data we get directly from smart device makers is limited. So, we built our own log system to capture more granular data, such as the types of commands customers are using, the time of day they use the application, and how frequently they use it.

Early on, we used Amazon Elasticsearch Service (Amazon ES) and Kibana to analyze data. Although this solution was very capable, it came at a higher price point than we were targeting. Another option was to export data directly to Microsoft Excel for ad hoc analysis. However, this was very time consuming and limited us to working with extracts of historical data rather than the latest information.

We decided to look for a solution that would suit the full spectrum of our needs while being cost-effective for our specific use case. While we were searching, our data team made the decision to standardize on a data lake architecture using Amazon Simple Storage Service (Amazon S3) and Amazon Athena. This approach provided a high level of flexibility and scalability. To visualize our data, it made sense to use QuickSight, the serverless BI solution with pay-per-session pricing on AWS.

Unifying data to understand customers better

This system has proven to be a good fit for our needs. The data lake allows us to accumulate different types of data from monitoring many KPIs and VUI products. For example, we might want to know the number of active users over a given period, and then drill down into how active those users were in the 2 weeks from when they registered. The data lake makes this possible. It’s easy to maintain even though the data is very diverse. For aggregating and performing calculations on the data, we use Athena because it provides optimal performance for complex queries thanks to the distributed computing model.
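
To illustrate the kind of ad hoc analysis this setup supports, the following Athena query sketches how daily active users and command volume could be computed from log data in the data lake. The vui_logs table and its columns are hypothetical names used for illustration only, not our actual schema.

-- Hypothetical example: daily active users and total commands over the last 14 days,
-- assuming a table vui_logs(user_id, command_type, event_time) partitioned by dt.
SELECT dt,
       COUNT(DISTINCT user_id) AS active_users,
       COUNT(*) AS total_commands
FROM vui_logs
WHERE dt >= CAST(current_date - INTERVAL '14' DAY AS VARCHAR)
GROUP BY dt
ORDER BY dt;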

For ad hoc analysis, dashboards, and reporting, QuickSight connects seamlessly to our data lake. QuickSight makes it easy to view trends in customer behavior such as the time of usage, method of interaction, typical settings, and so on. The following screenshot shows a sample dashboard in QuickSight.

For example, the default wake word for Alexa-powered devices is to say the name of the voice assistant: “Hey, Alexa.” However, Japanese customers may prefer to say “ohayō,” which means “good morning” in Japanese. Which setting customers prefer could be an important trend for us to know when we configure our offerings. With QuickSight, it’s easy to compare trends for this type of behavior across other user characteristics.

This is only one small example of the kinds of insights we glean by using QuickSight. Another use case involves initiatives to increase product usage through marketing or incentives. We can measure the outcome of these programs in QuickSight by tracking whether they result in an uptick in usage relative to the communications we send out.

The freedom to focus on what matters to the business

One of the big advantages of using QuickSight and other AWS services is that we don’t have to worry about maintaining on-premises systems for our data lake and analytics. It’s easy to manage and we can focus on gaining insights and improving our products—not running data center infrastructure. Building our end-to-end data-to-insights pipeline on AWS ensures that we can easily apply security and governance policies to all our data.

Overall, QuickSight provides us with the flexibility to analyze all kinds of data quickly, so we can aim to be the market leader in the VUI marketplace. We’re excited to see what the future holds for this powerful tool—and to apply the knowledge we gain to improving our services.


About the Author

Kazuhide Fujita is the Skill Team Product Manager, Smart Device Division, at Yahoo Japan Corporation.

Implementing multi-tenant patterns in Amazon Redshift using data sharing

Post Syndicated from Rajesh Francis original https://aws.amazon.com/blogs/big-data/implementing-multi-tenant-patterns-in-amazon-redshift-using-data-sharing/

Software service providers offer subscription-based analytics capabilities in the cloud, known as Analytics as a Service (AaaS), and customers are increasingly turning to AaaS for business insights. A multi-tenant storage strategy allows the service providers to build a cost-effective architecture to meet increasing demand.

Multi-tenancy means a single instance of software and its supporting infrastructure is shared to serve multiple customers. For example, a software service provider could generate data that is housed in a single data warehouse cluster, but accessed securely by multiple customers. This storage strategy offers an opportunity to centralize management of data, simplify ETL processes, and optimize costs. However, service providers have to constantly balance between cost and providing a better user experience for their customers.

With the new data sharing feature, you can use Amazon Redshift to scale and meet both objectives of managing costs by simplifying storage and ETL pipelines while still providing consistent performance to customers.  You can ingest data into a cluster designated as a producer cluster, and share this live data with one or more consumer clusters. Clusters accessing this shared data are isolated from each other, therefore performance of a producer cluster isn’t impacted by workloads on consumer clusters. This enables consuming clusters to get consistent performance based on individual compute capacity.

In this post, we focus on various AaaS patterns, and discuss how you can use data sharing in a multi-tenant architecture to scale for virtually unlimited users. We discuss detailed steps to use data sharing with different storage strategies.

Multi-tenant storage patterns

Multi-tenant storage patterns help simplify the architecture and long-term maintenance of the analytics platform. In a multi-tenant strategy, data is stored centrally in a single cluster for all tenants, enabling simplification of the ETL ingestion pipeline and data management. The previously published whitepaper SaaS Storage Strategies covers various storage models and their benefits for a single-cluster scenario.

The three strategies you can choose from are:

  • Pool model – Data is stored in a single database schema for all tenants, and a new column (tenant_id) is used to scope and control access to individual tenant data. Access to the multi-tenant data is controlled using views built on the tables.
  • Bridge model – Storage and access to data for each tenant is controlled at individual schema level in the same database.
  • Silo model – Storage and access control to data for each tenant is maintained in separate databases.

The following diagram illustrates the architecture of these multi-tenant storage strategies.

In the following sections, we discuss how these multi-tenant strategies can be implemented using the Amazon Redshift data sharing feature with a multi-cluster architecture.

Scaling your multi-tenant architecture using data sharing

AaaS providers implementing multi-tenant architectures were previously limited to resources of a single cluster to meet the compute and concurrency requirements of users across all the tenants. As the number of tenants increased, you could either turn on concurrency scaling or create additional clusters. However, the addition of new clusters means additional ingestion pipelines and increased operational overhead.

With data sharing in Amazon Redshift, you can easily and securely share data across clusters. Data ingested into the producer cluster is shared with one or more consumer clusters, which allows total separation of ETL and BI workloads. Several consumer clusters can read data from the managed storage of a producer cluster. This enables instant, granular, and high-performance access without data copies and movement. Workloads accessing shared data are isolated from each other and the producer. You can distribute workloads across multiple clusters while simplifying and consolidating the ETL ingestion pipeline into one main producer cluster, providing optimal price for performance.

Consumer clusters can in turn be producers for the data sets they own. Customers can optimize costs even further by collocating multiple tenants on the same consumer cluster. For instance, you can group low volume tier 3 tenants into a single consumer cluster to provide a lower-cost offering, while high volume tier 1 tenants get their own isolated compute clusters. Consumer clusters can be created in the same account as the producer or in a different AWS account. This gives you separate billing for the consumer clusters: you can charge back consumer cluster usage to the business group that uses it, or even allow your customers to use their own Amazon Redshift cluster in their account, so they pay for usage of the consumer cluster. The following diagram shows the difference in ETL and consumer access patterns in a multi-tenant architecture using data sharing versus a single cluster approach without data sharing.

Multi-tenant architecture with data sharing compared to single cluster approach

Creating a multi-tenant architecture for an AaaS solution

For this post, we use a simple data model with a fact and a dimension table to demonstrate how to leverage data sharing to design a scalable multi-tenant AaaS solution. We cover detailed steps involved for each storage strategy using this data model. The tables are as follows:

  • Customer – dimension table containing customer details
  • Sales – fact table containing sales transactions

We use two Amazon Redshift ra3.4xlarge clusters, with two nodes each, and designate one cluster as the producer and the other as the consumer.

The high-level steps involved in enabling data sharing across clusters are as follows (a condensed end-to-end sketch follows the list):

  1. Create a data share in the producer cluster and assign database objects to the data share.
  2. From the producer cluster, grant usage on the data share to consumer clusters, identified by namespace or AWS account.
  3. From the consumer cluster, create an external database using the data share from the producer.
  4. Query the tables in the data share through the external shared database in the consumer cluster. Grant access to other users to access this shared database and objects.
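
The following condensed sketch strings these four steps together end to end. The data share, schema, database, and group names here are placeholders for illustration; the walkthroughs later in this post use model-specific names.

-- Producer cluster: create a datashare and add objects to it
CREATE DATASHARE demo_share;
ALTER DATASHARE demo_share ADD SCHEMA demo_schema;
ALTER DATASHARE demo_share ADD ALL TABLES IN SCHEMA demo_schema;

-- Producer cluster: grant usage to a consumer cluster, identified by namespace (or AWS account)
GRANT USAGE ON DATASHARE demo_share TO NAMESPACE '<consumercluster_namespace>';

-- Consumer cluster: create a database from the datashare, query it with
-- three-part notation, and grant access to local users or groups
CREATE DATABASE demo_db FROM DATASHARE demo_share OF NAMESPACE '<producercluster_namespace>';
SELECT * FROM demo_db.demo_schema.demo_table;
CREATE GROUP demo_readers;
GRANT USAGE ON DATABASE demo_db TO GROUP demo_readers;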

Creating producer and consumer Amazon Redshift clusters

Let’s start by creating two Amazon Redshift ra3.4xlarge clusters with two nodes each: one for the producer and the other for the consumer.

  1. On the Amazon Redshift console, create two clusters of the RA3 instance type, and name them ds-producer and ds-consumer-c1, respectively.
  2. Next, log in to Amazon Redshift using the query editor. You can also use a SQL client tool like DBeaver, SQL Workbench, or Aginity Workbench. For configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Get the cluster namespace of the producer and consumer clusters from the console. We will use the namespaces to create the tenant table and to create and access the data shares. You can also get the cluster namespaces by logging into each of the clusters and executing the SELECT CURRENT_NAMESPACE statement in the query editor.

Be sure to replace the corresponding namespaces in the code sections wherever producercluster_namespace, consumercluster1_namespace, and consumercluster2_namespace are referenced.

The following screenshot shows the namespace on the Amazon Redshift console.

Now that we have created the clusters, we can go through the detailed steps for the three models: first the pool model, followed by the bridge model, and finally the silo model.

Pool model

The pool model represents an all-in, multi-tenant model where all tenants share the same storage constructs; it provides the most benefit in simplifying the AaaS solution.

With this model, data storage is centralized in one cluster database, and data is stored for all tenants in the same set of data models. To scope and control access to tenant data, we introduce a column (tenant_id) that serves as a unique identifier for each tenant.

Security management to prevent cross-tenant access is one of the main aspects to address with the pool model. We can implement row-level security and provide secure access to the data by creating database views, and set application-level policies by creating groups with specific access and assigning users to the groups. The following diagram illustrates the pool model architecture.

To create a multi-tenant solution using the pool model, you create data shares for the pool model in the producer cluster, and share data with the consumer cluster. We provide more detail on these steps in the following sections.

Creating data shares for the pool model in the producer cluster

To create data shares for the pool model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and run the following script.

Note that we have a tenant table to store a unique identifier for each tenant or consumer.

We add a column (tenant_id) to the sales and customer tables to uniquely identify tenant data. This tenant_id references the tenant_id in the tenant table to uniquely identify the tenant and consumer records. See the following code:

/**********************************************/
/*       datasharing datasetup pool model     */
/**********************************************/

-- Create Schema
create schema sales;

CREATE TABLE IF NOT EXISTS sales.tenant ( 
t_tenantid int8 not null,
t_name varchar(50) not null,
t_namespace varchar(50),
t_account varchar(16)
)
DISTSTYLE AUTO
SORTKEY AUTO;

-- Create tables for multi-tenant sales schema
drop table if exists sales.customer;
CREATE TABLE IF NOT EXISTS sales.customer(
  c_tenantid int8 not null,
  c_custid int8 not null,
  c_name varchar(25) not null,
  c_region varchar(40) not null,
  Primary Key(c_tenantid, c_custid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;

CREATE TABLE IF NOT EXISTS sales.sales (
  s_tenantid int8 not null,
  s_orderid int8 not null,
  s_custid int8 not null,
  s_totalprice numeric(12,2) not null,
  s_orderdate date not null,
  Primary Key(s_tenantid, s_orderid)
) 
DISTSTYLE AUTO
SORTKEY AUTO;
  2. Set up the tenant table with the details for each consumer cluster, and ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use INSERT statements:
    -- Ingest data 
    insert into sales.tenant values
    (0, 'primary', '<producercluster_namespace>',''),
    (1, 'tenant1', '<consumercluster1_namespace>',''),
    (2, 'tenant2', '<consumercluster2_namespace>','');
    
    insert into sales.customer values
    (1, 1, 'Customer 1', 'NorthEast'),
    (1, 2, 'Customer 2', 'SouthEast'),
    (2, 1, 'Customer 3', 'NorthWest'),
    (2, 2, 'Customer 4', 'SouthEast');
    
    truncate table sales.sales;
    insert into sales.sales values
    (1, 1, 1, 2434.33, '2020-11-21'),
    (1, 2, 2, 54.90, '2020-5-5'),
    (1, 3, 2, 9678.99, '2020-3-8'),
    (2, 1, 2, 452.24, '2020-1-23'),
    (2, 2, 1, 76523.10, '2020-11-3'),
    (2, 3, 1, 6745.20, '2020-10-01');
    
    select count(*) from sales.tenant;
    select count(*) from sales.customer;
    select count(*) from sales.sales;

Securing data on the producer cluster by restricting access

In the pool model, no external user has direct access to underlying tables. All access is restricted using views.

  1. Create a view for each of the fact and dimension tables to include a condition to filter records from the consumer tenant’s namespace. In our example, we create v_customersales to combine sales fact and customer dimension tables with a restrictive filter for tenant.namespace = current_namespace. See the following code:
    /**********************************************/
    /* We will create late binding views          */
    /* but materialized views could also be used  */
    /**********************************************/
    
    create or replace view sales.v_customer as
    select * 
    from sales.customer c, sales.tenant t
    where c.c_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_sales as
    select * 
    from sales.sales s, sales.tenant t
    where s.s_tenantid = t.t_tenantid 
    and t.t_namespace  = current_namespace;
    
    create or replace view sales.v_customersales as 
    select c_tenantid, c_name, c_region, 
    	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
    	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
    	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
    	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
    	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
    	t.t_namespace
    from sales.tenant t, sales.customer c, sales.sales s
    where t.t_tenantid = c.c_tenantid 
    and c.c_tenantid = s.s_tenantid 
    and c.c_custid = s.s_custid 
    and t.t_namespace = current_namespace 
    WITH NO SCHEMA BINDING;
    
    select * from sales.v_customersales;
          
    

Now that we have database objects created in the producer cluster, we can share the data with the consumer clusters.

Sharing data with the consumer cluster

To share data with the consumer cluster, complete the following steps:

  1. Create a data share for the sales data:
    /***************************************************/
    /* Create Datashare and add objects to the share    */
    /****************************************************/
    CREATE DATASHARE salesshare;
    

  2. Enter the following code to alter the data share, add the sales schema, and add the views in the sales schema to be shared with the consumer cluster:
    /************************************************************/
    /* Add objects at desired granularities: schemas, tables,   */
    /* views include materialized, and SQL UDFs                 */
    /************************************************************/
    ALTER DATASHARE salesshare ADD SCHEMA sales;  -- add the schema before adding its objects
    
    /*For pool model, we share only the views and not tables */
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customer;
    ALTER DATASHARE SalesShare ADD TABLE sales.v_customersales;
    

For the pool model, we share only the views with the consumer cluster and not the tables. The ALTER statement ADD TABLE is used to add both views and tables.

  3. Grant usage on the sales data share to the namespace of the BI consumer cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster. See the following code:
    /********************************************************************/
    /* Grant access to consumer clusters                                */
    /* login to Consumer BI Cluster and get the Namespace from          */
    /* the Redshift console or using SELECT CURRENT_NAMESPACE           */
    /********************************************************************/
    SELECT CURRENT_NAMESPACE;
    
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE salesshare TO NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE salesshare TO ACCOUNT '<Consumer_AWSAccount>';
    

  4. View data shares that are shared from the producer cluster:
    SELECT * FROM SVV_DATASHARES;

The following screenshot shows the output.

You can also see the data shares and their detailed objects and consumers using the following commands:

SHOW DATASHARES;
DESC DATASHARE salesshare;
select * from SVV_DATASHARE_OBJECTS;
select * from SVV_DATASHARE_CONSUMERS;

Viewing and querying data shares for the pool model from the consumer cluster

To view and query data shares from the consumer cluster, complete the following steps:

  1. Log in to the consumer cluster as an admin user and view the data share objects:
    /**********************************************************/
    /* Login to Consumer cluster as awsuser:                  */
    /* View datashares and create local database for querying */
    /**********************************************************/
    select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the results.

  2. Create a new database from the data share of the producer cluster:
    /**********************************************************/
    /* Create a local database and schema reference           */
    /**********************************************************/
    CREATE DATABASE sales_db FROM DATASHARE salesshare
    OF NAMESPACE '<producercluster_namespace>';

  3. Optionally, you can create an external schema in the consumer cluster pointing to the schema in the database of the producer cluster.

Creating a local external schema in the consumer cluster allows schema-level access controls within the consumer cluster, and uses a two-part notation when referencing shared data objects (localschema.table vs. external_db.producerschema.table). See the following code:

/*********************************************************/
/* Create External Schema - Optional                     */
/* reason for schema: give specific access to schema     */
/* using shared alias get access to a secondary database */
/*********************************************************/
CREATE EXTERNAL SCHEMA sales_schema 
FROM REDSHIFT DATABASE 'sales_db' SCHEMA 'sales';
  4. Now you can query the shared data from the producer cluster by using three-part notation (database.schema.table):
    select * from sales_db.sales.customer;
    select * from sales_db.sales.v_customersales;

  5. From the tenant1 consumer cluster, you can view the databases and schemas that are accessible to tenant1:
    select * from SVV_REDSHIFT_DATABASES;
    

The following screenshot shows the results.

Creating local consumer users and controlling access

You can control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create tenant1_group, grant usage on the local database sales_db and schema sales_schema to the group, and assign the user tenant1_user to the tenant1_group:
    /********************************************************/
    /* Consumers can create own users and assign privileges */
    /* Create tenant1_group and assign privileges to read   */
    /* sales_db and the sales_schema                        */
    /* Create tenant1_user in tenant1_group                 */
    /********************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE sales_db TO GROUP tenant1_group;
    GRANT USAGE ON SCHEMA sales_schema TO GROUP tenant1_group;
    

  2. Now, log in as tenant1_user to consumer cluster 1 and select data from the views v_customer and v_customersales:
    /*******************************************************/
    /* select from view returns only sales records related */
    /* to Consumer A namespace                             */
    /*******************************************************/
    select * from sales_db.sales.v_customer;
    select * from sales_db.sales.v_customersales;
    

You should see only the data relevant to tenant 1 and not the data that is associated with tenant 2.

Creating materialized views to optimize performance

Consumer clusters can have their own database objects that are local to the consumer. You can also create materialized views on the data share objects and control when to refresh the dataset for your consumers. This provides another level of isolation from the producer cluster, and ensures that the consumer clusters run against their local dataset.

  1. Log in as an admin user on consumer cluster 1 and enter the following code to create a materialized view for customersales. This will create a local view that can be periodically refreshed from the consumer cluster.

/*******************************************************/
/* Create materialized view in consumer cluster        */
/*******************************************************/
-- Create a local schema to hold the materialized view
create schema if not exists tenant1_sales;

create materialized view tenant1_sales.mv_customersales as 
select c_tenantid, c_name, c_region, 
	date_part(w, to_date(s_orderdate,'YYYY-MM-DD')) as "week", 
	date_part(mon, to_date(s_orderdate,'YYYY-MM-DD')) as "month", 
	date_part(dow, to_date(s_orderdate,'YYYY-MM-DD')) as "dow",
	date_part(yr, to_date(s_orderdate,'YYYY-MM-DD')) as "year",
	date_part(d, to_date(s_orderdate,'YYYY-MM-DD')) as "dom", 
	t.t_namespace
from sales_db.sales.tenant t, sales_db.sales.customer c, sales_db.sales.sales s
where t.t_tenantid = c.c_tenantid 
and c.c_tenantid = s.s_tenantid 
and c.c_custid = s.s_custid 
and t.t_namespace = current_namespace;

select top 100 * from tenant1_sales.mv_customersales;

REFRESH MATERIALIZED VIEW tenant1_sales.mv_customersales;

With the preceding steps, we have demonstrated how you can control access to the tenant data in the same datastore using views, how data shares help efficiently share data between producer and consumer clusters with transaction consistency, and how a local materialized view can further isolate your customers’ BI workloads and provide a consistent, performant user experience. In the next section, we discuss the bridge model.

Bridge model

In the bridge model, data for each tenant is stored in its own schema in a database and contains a similar set of tables. A data share is created for each schema and shared with the corresponding consumer. This is an appealing balance between the silo and pool models, providing both data isolation and ETL consolidation. With Amazon Redshift, you can create up to 9,900 schemas. For more information, see Quotas and limits in Amazon Redshift.

With data sharing, separate consumer clusters can be provisioned to use the same managed storage from the producer cluster. Consumer clusters have all the capabilities of a producer cluster, and can in turn be producer clusters for data objects they own. Consumers can’t share data that is already shared with them. Without data sharing, queries from all customers are directed to a single cluster. The following diagram illustrates the bridge model.

To create a multi-tenant architecture using bridge model, complete the steps in the following sections.

Creating database schemas and tables for the bridge model in the producer cluster

As we did in the pool model, the first step is to create the database schema and tables. We log in to the producer cluster as an admin user and create separate schemas for each tenant. For our post, we create two schemas, tenant1 and tenant2, to store data for two tenants.

  1. Log in to the producer cluster as the admin user.
  2. Use the following script to create two schemas, tenant1 and tenant2, and create tables for the customer dimension and sales facts under each of the two schemas:
    /****************************************/
    /* Bridge -  Data Model */
    /****************************************/
    -- Create schemas tenant1 and tenant2
    create schema tenant1;
    create schema tenant2;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
      
    
    -- Create tables for tenant2
    CREATE TABLE IF NOT EXISTS tenant2.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant2.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    

  3. Ingest data into the customer dimension and sales fact tables. Using the COPY command is the recommended way to ingest data into Amazon Redshift, but for illustration purposes, we use the INSERT statement:
    -- ingest data for tenant1
    -- ingest customer data
    insert into tenant1.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    -- ingest sales data
    insert into tenant1.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    
    select count(*) from tenant1.customer;
    select count(*) from tenant1.sales;
    
    
    -- ingest data for tenant2
    -- ingest customer data
    insert into tenant2.customer values
    (1, 'Customer 3', 'NorthWest'),
    (2, 'Customer 4', 'SouthEast');
    
    -- ingest sales data
    truncate table tenant2.sales;
    insert into tenant2.sales values
    (1, 2, 452.24, '2020-1-23'),
    (2, 1, 76523.10, '2020-11-3'),
    (3, 1, 6745.20, '2020-10-01');
    
    
    select count(*) from tenant2.customer;
    select count(*) from tenant2.sales;
    

Creating data shares and granting usage to the consumer cluster

In the following code, we create two data shares, tenant1share and tenant2share, to share the database objects under the two schemas with the respective consumer clusters.

  1. Create the two data shares:
    /******************************************************************/
    /*   Create Datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1share;
    CREATE DATASHARE tenant2share;

  2. Alter each data share and add the schema for the respective tenant to be shared with the consumer cluster:
    ALTER DATASHARE tenant1share ADD SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD SCHEMA tenant2;

  3. Alter each data share and add all tables in the schema to be shared with the consumer cluster:
    ALTER DATASHARE tenant1share ADD ALL TABLES IN SCHEMA tenant1;
    ALTER DATASHARE tenant2share ADD ALL TABLES IN SCHEMA tenant2;

Getting the namespace of the first consumer cluster

  1. Log in to the consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant1 schema for Tenant1 BI Cluster */
    /* login to tenant1 BI Cluster and get the Namespace 
     * or get the Namespace from the Redshift console */
    SELECT CURRENT_NAMESPACE;

  2. Grant usage on the data share for the first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    -- Grant usage on the datashare to the first consumer cluster
    -- Namespace refers to the namespace GUID of the consumer cluster 
    GRANT USAGE ON DATASHARE tenant1share TO NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant1share TO ACCOUNT '<Consumer_AWSAccount>';

Getting the namespace of the second consumer cluster

  1. Log in to the second consumer cluster and get the namespace from the console or by running the select current_namespace command:
    /* Grant access to tenant2 schema for Tenant2 BI Cluster */
    /* login to tenant2 BI Cluster and get the Namespace     */
    SELECT CURRENT_NAMESPACE;
    

  2. Grant usage on the data share for the second tenant to the namespace of the second consumer cluster you just got from the previous step:
    -- Grant usage on the datashare to the second consumer cluster
    GRANT USAGE ON DATASHARE tenant2share TO NAMESPACE '<consumercluster2_namespace>';
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant2share TO ACCOUNT '<Consumer_AWSAccount>';

  3. To view data shares from the producer cluster, enter the following code:
    /******************************************************************/
    /*   View Datashares created, Datashare objects and consumers     */
    /******************************************************************/
    select * from SVV_DATASHARES;
    select * from SVV_DATASHARE_OBJECTS;
    select * from SVV_DATASHARE_CONSUMERS;

The following screenshot shows the commands in the query editor.

The following screenshot shows the query results.

Accessing data using the consumer cluster from the data share

To access data using the consumer cluster, complete the following steps:

  1. Log in to the first consumer cluster, ds-consumer-c1, as an admin user.
  2. View data share objects from the SVV_DATASHARE_OBJECTS system view:
    /********************************************************/
    /* Consumer cluster as adminuser                         */
    /* List the shares available and review their contents  */
    /********************************************************/
    -- You can view datashare objects associated with the cluster
    -- using either of the two commands
    SHOW DATASHARES;
    select * from SVV_DATASHARES;

The following screenshot shows the query results.

--View objects shared in inbound share for consumer
select * from SVV_DATASHARE_OBJECTS;

The following screenshot shows the query results.

--View namespace or clusters granted usage to a datashare
select * from svv_datashare_consumers;
  3. Create a local database in the first consumer cluster from the data share (you can later add an external schema to provide controlled access to a specific schema):
    /*******************************************************/
    /* Create a local database and schema reference        */
    /* to the share objects                                */
    /*******************************************************/
    CREATE DATABASE tenant1_db FROM DATASHARE tenant1share
    OF NAMESPACE '<producercluster_namespace>';

  4. Query the database tables using three-part notation (database.schema.table):
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;

  5. Optionally, you can create an external schema.

There are two reasons to create an external schema: either to enable two-part notation access to the tables from the consumer cluster, or to provide restricted access to the specific schemas for selected users, when multiple schemas are shared from the producer cluster. See the following code for our external schema:

/* Create External Schema */
CREATE EXTERNAL SCHEMA tenant1_schema FROM REDSHIFT DATABASE 'tenant1_db' SCHEMA 'tenant1';
  6. If you created the local external schema, you can use the following two-part notation to query the database tables:
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;

  7. You can view the shared databases by querying the SVV_REDSHIFT_DATABASES table:
    select * from SVV_REDSHIFT_DATABASES;

The following screenshot shows the query results.

Creating consumer users for managing access

Still logged in as an admin user to the consumer cluster, you can create other users who have access to the database objects.

  1. Create users and groups, and assign users and object privileges to the groups with the following code:
    /*******************************************************/
    /* Consumer can create own users and assign privileges */
    /* Create tenant1_user and assign privileges to        */
    /* read datashare from tenant1 schema                  */
    /*******************************************************/
    create group tenant1_group;
    create user tenant1_user password 'Redshift#123!' in group tenant1_group;
    
    GRANT USAGE ON DATABASE tenant1_db TO tenant1_user;
    GRANT USAGE ON SCHEMA tenant1_schema TO GROUP tenant1_group;
    

Now tenant1_user can log in and query the shared tables from the tenant1 schema.

  2. Log in to the consumer cluster as tenant1_user and query the tables:
    /************************************************************/
    /* Consumer cluster as tenant1_user:                        */
    /* query the shared tables using three-part notation        */
    /************************************************************/
    
    select * from tenant1_db.tenant1.customer;
    select * from tenant1_db.tenant1.sales;
    
    
    /************************************************************/
    /* If you have created an external schema                   */
    /* you can use the two-part notation to query tables.       */
    /************************************************************/
    
    select * from tenant1_schema.customer;
    select * from tenant1_schema.sales;
    

Revoking access to a data share (optional)

  1. At any point, if you want to revoke access to the data share, you can use the REVOKE USAGE command:
    /*************************************************************/
    /* To revoke access at any time use the REVOKE USAGE command */
    /*************************************************************/
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    REVOKE USAGE ON DATASHARE Salesshare FROM NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digit long
    REVOKE USAGE ON DATASHARE Salesshare FROM ACCOUNT '<Consumer_AWSAccount>';
    

Silo model

The third option is to store data for each tenant in separate databases within a cluster. If you need your data isolated from other tenants, you can use the silo model, in which each database may have distinct data models, monitoring, management, and security footprints.

Amazon Redshift supports cross-database queries, which allow you to simplify data organization. You can store common or granular datasets used across all tenants in a centralized database, and use the cross-database query capability to join relevant data for each tenant.
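
For illustration, the following hypothetical cross-database query joins the tenant1 sales table (created later in this section) with a calendar reference table kept in a centralized database. The common_db database and its reference.calendar table are assumptions for this example and aren't created in this post.

-- Hypothetical cross-database join: enrich tenant1 sales with a shared calendar table
SELECT s.s_orderid, s.s_totalprice, c.fiscal_quarter
FROM tenant1_silodb.tenant1_siloschema.sales s
JOIN common_db.reference.calendar c
  ON s.s_orderdate = c.calendar_date;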

The steps to create a data share in the silo model are similar to the bridge model; however, unlike the bridge model (where a data share is created for each schema), the silo model has a data share created for each database. The following diagram illustrates the architecture of the silo model.

Creating data shares for the silo model in the producer cluster

To create data shares for the silo model in the producer cluster, complete the following steps:

  1. Log in to the producer cluster as an admin user and create separate databases for each tenant:
    /*****************************************************/
    /** Silo Model – Create databases for the 2 tenants **/
    /*****************************************************/
    create database tenant1_silodb;
    
    create database tenant2_silodb;
    

  2. Log in again to the producer cluster with the database name and user ID for the database that you want to share (tenant1_silodb) and create the schema and tables:
    /***********************************************************/
    /* login to tenant1_silodb and create schema and tables   */
    /***********************************************************/
    create schema tenant1_siloschema;
    
    -- Create tables for tenant1
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.customer (
      c_custid int8 not null ,
      c_name varchar(25) not null,
      c_region varchar(40) not null,
      Primary Key(c_custid)
    ) diststyle ALL sortkey(c_custid);
    
    CREATE TABLE IF NOT EXISTS tenant1_silodb.tenant1_siloschema.sales (
      s_orderid int8 not null,
      s_custid int8 not null,
      s_totalprice numeric(12,2) not null,
      s_orderdate date not null,
      Primary Key(s_orderid)
    ) distkey(s_orderid) sortkey(s_orderdate, s_orderid) ;
    
    insert into tenant1_siloschema.customer values
    (1, 'Customer 1', 'NorthEast'),
    (2, 'Customer 2', 'SouthEast');
    
    truncate table tenant1_siloschema.sales;
    insert into tenant1_siloschema.sales values
    (1, 1, 2434.33, '2020-11-21'),
    (2, 2, 54.90, '2020-5-5'),
    (3, 2, 9678.99, '2020-3-8');
    

  3. Create a data share with a name for the first tenant (for example, tenant1_silodbshare):
    /******************************************************************/
    /*   Create datashare and add database objects to the datashare   */
    /******************************************************************/
    CREATE DATASHARE tenant1_silodbshare;
    

  4. Run ALTER DATASHARE commands to add the schema, and all tables in the schema, to be shared with the consumer cluster:
    ALTER DATASHARE tenant1_silodbshare ADD SCHEMA tenant1_siloschema;
    ALTER DATASHARE tenant1_silodbshare ADD ALL TABLES IN SCHEMA tenant1_siloschema;

  5. Grant usage on the data share for the first tenant to the namespace of the BI cluster. You can get the namespace of the BI cluster from the console or using the SELECT CURRENT_NAMESPACE statement in the BI cluster:
    --Namespace refers to the namespace GUID of the consumer cluster in the account
    GRANT USAGE ON DATASHARE tenant1_silodbshare TO NAMESPACE '<consumercluster1_namespace>';
    --Account numbers are 12 digit long (optional)
    --GRANT USAGE ON DATASHARE tenant1_silodbshare TO ACCOUNT '<AWS-Account>';
    

Viewing and querying data shares for the silo model from the consumer cluster

To view and query your data shares, complete the following steps:

  1. Log in to the consumer cluster as an admin user.
  2. Create a new database from the data share of the producer cluster:
    CREATE DATABASE tenant1_silodb FROM DATASHARE tenant1_silodbshare
    OF NAMESPACE '<producercluster_namespace>';

Now you can start querying the shared data from the producer cluster by using three-part notation (database.schema.table). If you create an external schema, you can also use two-part notation to query the tables.

  3. Query the data with the following code:
    select * from tenant1_silodb.tenant1_siloschema.customer;
    select * from tenant1_silodb.tenant1_siloschema.sales;
    

  4. Optionally, you can create an external schema pointing to the schema in the database of the producer cluster. This allows you to query shared tables using two-part notation. See the following code:
    CREATE EXTERNAL SCHEMA tenant1_siloschema FROM REDSHIFT DATABASE 'tenant1_silodb' SCHEMA 'tenant1_siloschema';
    
    --With this Schema, you can access using two-part notation to select from data share tables
    select * from tenant1_siloschema.customer;
    select * from tenant1_siloschema.sales;
    

  5. You can repeat the same steps for tenant2 to share the tenant2 database with the second consumer cluster (see the sketch following this step). You can also control access to users in your consumer cluster by creating users and groups, and assigning access to the data share objects.
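
The following is a minimal sketch of those repeated steps for tenant2, assuming the same naming conventions as the tenant1 walkthrough; the tenant2 schema, data share, group, and user shown here aren't created elsewhere in this post.

-- Producer cluster: connect to the tenant2_silodb database (a datashare belongs to the
-- database in which it is created), then share the tenant2 schema and its tables
CREATE DATASHARE tenant2_silodbshare;
ALTER DATASHARE tenant2_silodbshare ADD SCHEMA tenant2_siloschema;
ALTER DATASHARE tenant2_silodbshare ADD ALL TABLES IN SCHEMA tenant2_siloschema;
GRANT USAGE ON DATASHARE tenant2_silodbshare TO NAMESPACE '<consumercluster2_namespace>';

-- Second consumer cluster: create a local database from the share and grant access
CREATE DATABASE tenant2_silodb FROM DATASHARE tenant2_silodbshare
OF NAMESPACE '<producercluster_namespace>';
CREATE GROUP tenant2_group;
CREATE USER tenant2_user PASSWORD 'Redshift#123!' IN GROUP tenant2_group;
GRANT USAGE ON DATABASE tenant2_silodb TO tenant2_user;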

System views to view data shares

We have introduced new system tables and views to easily identify the data shares and related objects. You can use three different groups of system views to view the data share objects:

  • Views starting with SVV_DATASHARES – contain details of data shares and the objects in a data share:
    • SVV_DATASHARES – View a list of data shares created on the cluster and data shares shared with the cluster
    • SVV_DATASHARE_OBJECTS – View a list of objects in all data shares created on the cluster or shared with the cluster
    • SVV_DATASHARE_CONSUMERS – View a list of consumers for data shares created on the cluster
  • Views starting with SVV_REDSHIFT – contain details on both local and remote Redshift databases:
    • SVV_REDSHIFT_DATABASES – List of all databases that a user has access to
    • SVV_REDSHIFT_SCHEMAS – List of all schemas that a user has access to
    • SVV_REDSHIFT_TABLES – List of all tables that a user has access to
    • SVV_REDSHIFT_COLUMNS – List of all columns that a user has access to
    • SVV_REDSHIFT_FUNCTIONS – List of all functions that a user has access to
  • Views starting with SVV_ALL – contain local and remote databases, external schemas (including Spectrum and federated query), and external schema references to shared data. If you create external schemas in the consumer cluster, you need to use the SVV_ALL views to look at the objects (see the example query after this list):
    • SVV_ALL_SCHEMAS – Union of the list of all schemas from the SVV_REDSHIFT_SCHEMAS view and a consolidated list of all external tables and schemas that the user has access to
    • SVV_ALL_TABLES – List of all tables that a user has access to
    • SVV_ALL_COLUMNS – List of all columns that a user has access to
    • SVV_ALL_FUNCTIONS – List of all functions that a user has access to
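
For example, an administrator on a consumer cluster could list the shared tables visible through a local database with a query along these lines; the sales_db filter value comes from the pool model example above.

-- List the tables visible through the shared sales_db database on a consumer cluster
SELECT database_name, schema_name, table_name, table_type
FROM svv_all_tables
WHERE database_name = 'sales_db'
ORDER BY schema_name, table_name;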

Considerations for choosing a storage strategy

You can adopt a storage strategy or choose a hybrid approach based on business, technical, and operational requirements. Before deciding on a strategy, consider the quotas and limits for various objects in Amazon Redshift, and the number of databases per cluster or number of schemas per database to check if it meets your requirements. The following table summarizes these considerations.

  Consideration                      Pool                                                  Bridge                       Silo
  Separation of tenant data          Views                                                 Schema                       Database
  ETL pipeline complexity            Low                                                   Low                          Medium
  Limits                             100,000 tables (RA3 4xlarge and 16xlarge clusters)    9,900 schemas per database   60 databases per cluster
  Chargeback to consumer accounts    Yes                                                   Yes                          Yes
  Scalability                        High                                                  High                         High

Conclusion

In this post, we discussed how you can use the new data sharing feature of Amazon Redshift to implement an AaaS solution with a multi-tenant architecture while meeting SLAs for consumers using separate Amazon Redshift clusters. We demonstrated three types of models providing various levels of isolation for the tenant data. We compared and contrasted the models and provided guidance on when to choose an implementation model. We encourage you to try the data sharing feature to build your AaaS or software as a service (SaaS) solutions.


About the Authors

Rajesh Francis is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build scalable analytics solutions.

Neeraja Rentachintala is a Principal Product Manager with Amazon Redshift. Neeraja is a seasoned product management and GTM leader, bringing over 20 years of experience in product vision, strategy, and leadership roles in data products and platforms. Neeraja has delivered products in analytics, databases, data integration, application integration, AI/machine learning, and large-scale distributed systems across on-premises and cloud environments, serving Fortune 500 companies as part of ventures including MapR (acquired by HPE), Microsoft SQL Server, Oracle, Informatica, and Expedia.com.

Jeetesh Srivastva is a Sr. Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS analytics services. He has delivered on-premises and cloud-based analytics solutions for customers in the banking and finance and hospitality industry verticals.

Querying a Vertica data source in Amazon Athena using the Athena Federated Query SDK

Post Syndicated from Kelly Ragan original https://aws.amazon.com/blogs/big-data/querying-a-vertica-data-source-in-amazon-athena-using-the-athena-federated-query-sdk/

The ability to query data and perform ad hoc analysis across multiple platforms and data stores with a single tool brings immense value to the big data analytical arena. As organizations build out data lakes with increasing volumes of data, there is a growing need to combine that data with large amounts of data in other data stores. As the variety of data increases, it becomes paramount to have a query tool to bridge two or more data stores with a single query.

Even though data lakes became popular for analytic workloads recently, it’s not uncommon to have data warehouses in addition to data lakes for various reporting and business intelligence (BI) use cases. It becomes imperative to be able to seamlessly query the data stored in the data warehouse and the data lake. To address this issue, Amazon Athena has released a feature called Athena Federated Query. Athena is an interactive query service provided by AWS that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Vertica is a columnar MPP database platform that can be deployed in the cloud or on premises, and supports exabyte scale data warehouses. With Athena Federated Query and the Vertica connector, you can now run analytical queries over a data warehouse on Vertica and a data lake in Amazon S3.

Athena Federated Query includes pre-built connectors to a variety of AWS services and databases, as well as an SDK to build custom connectors to other databases and data stores. With this feature, federated queries can pull data from a data lake in an S3 bucket and from an external data source, and then combine it into a single result set in Athena. These connectors are an extension of the Athena query engine, which translates content between Athena and the external data source. Pre-built connectors exist for Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), and Amazon Relational Database Service (Amazon RDS), as well as a JDBC connector for Amazon Redshift, MySQL, and PostgreSQL. For other types of relational databases, you can use the Athena Federated Query SDK to create a custom connector.

In this post, we demonstrate how to deploy the custom connector between Athena and a Vertica database built using the Athena Federated Query SDK. After deploying the custom connector, we demonstrate issuing federated queries and moving data from Vertica to a data lake using CREATE TABLE AS (CTAS) with a federated query.

AWS services used in the solution

The Athena Federated Query SDK is an open-source framework to build custom connectors, and comes with a connector publish tool that deploys the connector executables in an application to the AWS Serverless Application Repository. The Athena Federated Query uses an AWS Lambda function that in turn uses the application deployed to the AWS Serverless Application Repository.

A custom connector is composed of a Lambda function that utilizes three components:

  • MetadataHandler – An interface that exposes metadata information of schemas, tables, and columns from the underlying data store to Athena
  • RecordHandler – An interface that provides hooks to read data from the external source and share it with the Athena query engine in Apache Arrow columnar format
  • CompositeHandler – For managing and running the MetadataHandler and RecordHandler together

The Lambda function connects to the external data store using an appropriate connection protocol and sends the parsed SQL statement. In the case of Vertica, it is JDBC. The RecordHandler processes the result set produced by the external data store and passes the rows to Athena for final processing. Multiple Lambda functions are called by Athena depending on the Lambda concurrency settings to read the result set in parallel. A spill bucket is used to handle result sets that exceed the Lambda function’s capacity to process them.

The JDBC connection established by the Lambda function to the external database is used to send the parsed SQL statement and retrieve the result set rows from the external database. This scenario works well in terms of bandwidth for smaller databases and result sets. However, you might have Vertica deployments with petabyte or exabyte data warehouses. Typical queries return result sets on the order of 10, 20, 30 gigabytes, or more. Due to the bandwidth issue with a JDBC connection, the solution presented in this post modifies the Athena Federated Query SDK to implement a different route for the transmission of large result sets from Vertica to the Athena server for final processing.

The alternate solution utilizes the Vertica EXPORT command as a wrapper around the parsed SQL statement. You can use the EXPORT command to write a result set for a SQL statement directly to an S3 bucket using Vertica’s highly parallelized write to Amazon S3 using partitioning. This solution modifies the SDK to allow Athena to read the result set in the S3 bucket, determine the number of partitions, and call subsequent Lambda functions to parallelize the read of the result set. This produces an efficient way to move a multi-gigabyte result set from Vertica to Athena with parallelized writes from Vertica to Amazon S3 and parallelized reads from Amazon S3 to Athena. When connecting to the Vertica database, the SDK uses AWS Secrets Manager to retrieve a user ID and password for a service account on the Vertica database.

Solution architecture

The following diagram shows the solution architecture for the Vertica custom connector when deployed to AWS.

The connector components are as follows:

  1. A user issues a federated SQL query in Athena against a table in Vertica.
  2. Athena parses the query and calls a Lambda function.
  3. The Lambda function makes a call to Secrets Manager to get the user ID and password for connecting to Vertica.
  4. The connector sends an EXPORT statement wrapper with the embedded SQL statement to Vertica through the JDBC connection. For example, see the following code:
    EXPORT TO PARQUET (directory = 's3://<bucket_name>/<folder_name>', 
    Compression='Snappy', fileSizeMB=64) OVER() as   
    SELECT  
    ORDER_ID,  
    ITEM,  
    CUSTOMER_ID,
    ORDERED_DATE
    FROM SCHEMA1.ORDERS  
    WHERE CUSTOMER_ID = 2;
    

  5. Vertica processes the SQL query and writes the result set to the S3 bucket specified in the EXPORT command. Vertica parallelizes the write to the S3 bucket based on the fileSizeMB parameter, creating as many partitions as needed for the result set.
  6. Athena calls a Lambda function to scan the S3 bucket in order to determine the number of files to read for the result set.
  7. Athena invokes multiple Lambda functions depending on the number of partitions using Amazon S3 Select. This allows Athena to parallelize the read of the S3 files.
  8. Athena combines the result set returned from Vertica with data scanned from the data lake, and returns the combined result set to the user.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • Amazon EC2 IAM role permissions – The AWS Identity and Access Management (IAM) role of the Amazon Elastic Compute Cloud (Amazon EC2) machines hosting the Vertica database must be given write permissions to the VerticaExport S3 bucket, which is created when deploying the connector.
  • Secrets Manager – The Vertica connection credentials are stored in Secrets Manager. The secret name is prefixed with Vertica– and the secret value is the connection credentials.
  • Lambda IAM role permissions – When the Lambda function is deployed to the AWS Serverless Application Repository, it creates a custom IAM role for the function to run. The custom role has the following IAM permissions in order to successfully perform the read and write functions associated with the MetadataHandler and RecordHandler:
    • AWSLambdaBasicExecutionRole
    • AWSLambdaVPCAccessExecutionRole
    • For Secrets Manager, GetSecretValue for secrets with a prefix given in SecretNameOrPrefix
    • For Amazon S3, list, read, and write permissions for SpillBucket and ExportBucket, and list permissions for all S3 buckets
    • For Athena, GetQueryExecution

Demonstration tables

To demonstrate the Athena Vertica connector capabilities, we use the following components:

  • A Vertica database running in our AWS environment.
  • A Vertica table called orders containing details of customer orders.
  • An Athena table called customer, which has an S3 bucket as a data source. This table contains information regarding customers.

The following screenshot shows the details of the customer table in Amazon S3.


The following screenshot shows the details of the orders table in Vertica.


Setting up the Athena Vertica connector project

To set up your connector project, complete the following steps:

  1. Create an S3 bucket in your AWS account. This is the bucket where the result set from Vertica is exported.
  2. Create another S3 bucket in your AWS account. This is the bucket where the code for the connector is stored and retrieved.
  3. Grant the IAM role of the EC2 machines hosting the Vertica database read and write permissions to the S3 result set bucket, allowing Vertica to export data to the bucket.
  4. Clone the GitHub repo in your local folder.
  5. Open the project in your preferred IDE.
  6. From the athena-query-federation directory, run mvn clean install.
  7. From the athena-vertica directory, run mvn clean install.
  8. From the athena-vertica directory, run ../tools/publish.sh <s3_code_bucket_name> athena-vertica [region] to publish the connector to your private AWS Serverless Application Repository.
  9. Upon successful completion of the script, the connector’s serverless application is published to the AWS Serverless Application Repository.

Deploying the connector

To deploy your connector, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Published Applications.
  2. On the Private Applications tab, select Show apps that create custom IAM roles or resource policies in order to see deployed applications.
  3. Choose the VerticaAthenaConnector serverless app.
  4. For AthenaCatalogName, enter the name of the connector Lambda function used when querying the Vertica tables (avc).
  5. For SecretNameOrPrefix, enter the prefix used to store the Vertica credentials in Secrets Manager (the default is Vertica-).
  6. For SpillBucket, enter the S3 bucket name where data is spilled in case the result set data volume crosses a certain limit (test-spill-bucket).
  7. For VerticaExportBucket, enter the S3 bucket where the result set from Vertica is exported (test-export-bucket).
  8. For VpcId, enter your VPC ID.


  9. For SpillPrefix, enter athena-spill.
  10. For SubnetIds, enter your subnet IDs.
  11. For VerticaConnectionString, enter the connection string of the Vertica database in the following format:
    jdbc:vertica://<host_name>:<port>/<database>?user=${vertica-username}&password=${vertica-password}
    

    where vertica-username and vertica-password are the secret names of the Vertica user credentials stored in AWS Secrets Manager.

  12. Select I acknowledge that this app creates custom IAM roles.


  13. Choose Deploy.

Upon successful deployment, a Lambda function with the name given for AthenaCatalogName is deployed in your AWS environment. We use this function to issue federated queries to Vertica. The connector is now deployed and ready to use.

Using the connector

On the Athena console, you can query Vertica tables as shown in the following code. The value for <lambda_function> corresponds to the function you created in the previous section.

SELECT  
ORDER_ID,   
CUSTOMER_ID,
PRODUCT_ID,
PRODUCT_NAME,
ORDER_DATE
FROM "lambda:<lambda_function>".SCHEMA1.ORDERS  
WHERE CUSTOMER_ID <= 3461
ORDER BY ORDER_ID DESC

In this example, we named the function avc. The following screenshot shows our query results.


This demonstrates that the newly deployed connector read the user-requested columns from the Vertica source table, wrapped an EXPORT statement around the SQL statement, and ran it in Vertica. The results of this query were exported to the specified S3 bucket (test-export-bucket) in Parquet format. The connector then invoked multiple Lambda functions to read the data from the S3 bucket using Amazon S3 Select and displayed it on the Athena console. Note that the connector currently exports Vertica timestamp and timestamptz data types as a varchar data type. Therefore, we need to use the date_parse(string, format) function to convert the timestamp columns into the correct data type.
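
For example, assuming the ORDER_DATE column arrives as a varchar serialized as yyyy-MM-dd HH:mm:ss (the format string below is an assumption and should match how your Vertica timestamps are exported), a conversion looks like the following sketch:

SELECT
ORDER_ID,
date_parse(ORDER_DATE, '%Y-%m-%d %H:%i:%s') AS ORDER_TIMESTAMP
FROM "lambda:<lambda_function>".SCHEMA1.ORDERS
WHERE CUSTOMER_ID <= 3461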

We can also create an Athena table using CTAS with the result set of the Vertica query using the following query:

CREATE TABLE default.vertica_customers_table AS (
SELECT  
ORDER_ID,
CUSTOMER_ID,
PRODUCT_ID,
PRODUCT_NAME,
ORDER_DATE
FROM "lambda:<lambda_function>".SCHEMA1.ORDERS  
WHERE CUSTOMER_ID <= 3461
);

We can then use the newly created table to query the data as shown in the following screenshot.


In addition, we can also query and join the customer data in Amazon S3 and orders data in Vertica using the following sample query:

WITH
customer_data AS (
  SELECT
    CUSTOMER_NAME,
    CUSTOMER_ID
  FROM default.customer
),
orders_data AS (
  SELECT
    ORDER_ID,
    PRODUCT_NAME,
    CUSTOMER_ID
  FROM "lambda:<lambda_function>".schema1.orders
)
SELECT a.CUSTOMER_ID, b.ORDER_ID, b.PRODUCT_NAME
FROM customer_data a
INNER JOIN orders_data b
ON a.CUSTOMER_ID = b.CUSTOMER_ID
WHERE lower(b.PRODUCT_NAME) LIKE 'pencil'
ORDER BY b.ORDER_ID DESC

This query joins the orders data in Vertica with the customer data in the S3 bucket on the customer_id column and displays the results on the Athena console.


This demonstrates the ease of performing analytics across multiple platforms and data stores.

Conclusion

In this post, we introduced the Athena Vertica connector, its solution architecture, and demonstrated how to deploy the connector using the Athena Federated Query SDK. We saw how to run SQL queries on the Vertica data source. We also learned that we can use the connector to perform extract, transform, and load operations on the data in the Vertica tables and Amazon S3, enabling us to perform faster and better analytics across multiple platforms and data sources.

For more information about Athena Federated Query, see the GitHub repo.

Special Acknowledgement

Special acknowledgement goes to the Intuit Data Engineering staff Denise McInerney – Data Architect, Sanjay Rane – Group Engineering Manager – Data, and Kannan Nagarajan – Database Architect. They helped design, review, and support the development of the custom connector and architecture.


About the Authors

Kelly Ragan is a Senior Data Architect, Strategic Accounts Team, AWS Professional Services. He helps customers solve big data problems and wrestle with large-scale data warehouses. In his spare time, he enjoys snow skiing, bicycling, and camping in the Pacific Northwest.

Rohit Masur is an Associate Big Data Consultant, Data and Analytics Team, AWS Professional Services. He helps customers architect and implement solutions on AWS to get business value out of data. In his spare time, he enjoys reading books, going on long walks, and exploring new hiking trails in the Bay Area.

Automating AWS service logs table creation and querying them with Amazon Athena

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/automating-aws-service-logs-table-creation-and-querying-them-with-amazon-athena/

I was working with a customer who was just getting started using AWS, and they wanted to understand how to query their AWS service logs that were being delivered to Amazon Simple Storage Service (Amazon S3). I introduced them to Amazon Athena, a serverless, interactive query service that allows you to easily analyze data in Amazon S3 and other sources. Together, we used Athena to query service logs, and were able to create tables for AWS CloudTrail logs, Amazon S3 access logs, and VPC flow logs. As I was walking the customer through the documentation and creating tables and partitions for each service log in Athena, I thought there had to be an easier and faster way to allow customers to query their logs in Amazon S3, which is the focus of this post.

This post demonstrates how to use AWS CloudFormation to automatically create AWS service log tables, partitions, and example queries in Athena. We also use the SQL query editor in Athena to query the AWS service log tables that AWS CloudFormation created.

Athena best practices

This solution is appropriate for ad hoc use and queries the raw log files. These raw files can range from compressed JSON to uncompressed text formats, depending on how they were configured to be sent to Amazon S3. If you need to query over hundreds of GBs or TBs of data per day in Amazon S3, performing ETL on your raw files and transforming them to a columnar file format like Apache Parquet can lead to increased performance and cost savings. You can save on your Amazon S3 storage costs by using snappy compression for Parquet files stored in Amazon S3. To learn more about Athena best practices, see Top 10 Performance Tuning Tips for Amazon Athena.
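
As a rough sketch of that conversion, a CTAS statement like the following writes query results to Parquet with Snappy compression; the target table name, the column list, and the S3 location are placeholders rather than part of this solution's templates:

CREATE TABLE cloudtrail_logs_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://<your-bucket>/converted/cloudtrail/'
) AS
SELECT eventtime, eventsource, eventname, awsregion, sourceipaddress
FROM cloudtrail_logs;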

Table partition strategies

There are a few important considerations when deciding how to define your table partitions. Mainly you should ask: what types of queries will I be writing against my data in Amazon S3? Do I only need to query data for that day and for a single account, or do I need to query across months of data and multiple accounts? In this post, we talk about how to query across a single, partitioned account.

By partitioning data, you can restrict the amount of data scanned per query, thereby improving performance and reducing cost. When creating a table schema in Athena, you set the location of where the files reside in Amazon S3, and you can also define how the table is partitioned. The location is a bucket path that leads to the desired files. If you query a partitioned table and specify the partition in the WHERE clause, Athena scans the data only for that partition. For more information, see Table Location in Amazon S3 and Partitioning Data. You can then define partitions in Athena that map to the data residing in Amazon S3.

Let’s look at an example to see how defining a location and partitioning our table can improve performance and reduce costs. In the following tree diagram, we’ve outlined what the bucket path may look like as logs are delivered to your S3 bucket, starting from the bucket name and going all the way down to the day.


Outlined in red is where we set the location for our table schema, and Athena then scans everything after the CloudTrail folder. We then outlined our partitions in blue. This is where we can specify the granularity of our queries. In this case, we partition our table down to the day, which is very granular because we can tell Athena exactly where to look for our data. This is also the most performant and cost-effective option because it results in scanning only the required data and nothing else.
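
To make that concrete, the following sketch adds a single day's partition to the cloudtrail_logs table (created later in this post) and then queries only that partition. The partition keys (region, year, month, day) and the bucket path mirror the tree diagram above and are assumptions about how your logs are laid out:

ALTER TABLE cloudtrail_logs ADD IF NOT EXISTS
PARTITION (region = 'us-east-1', year = '2020', month = '11', day = '24')
LOCATION 's3://<bucket>/AWSLogs/<ACCOUNT-ID>/CloudTrail/us-east-1/2020/11/24/';

SELECT eventname, count(*) AS event_count
FROM cloudtrail_logs
WHERE region = 'us-east-1' AND year = '2020' AND month = '11' AND day = '24'
GROUP BY eventname
ORDER BY event_count DESC;

Because the WHERE clause names the partition columns, Athena scans only that day's folder instead of everything under the table location.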

If you have to query multiple accounts and Regions, you should back off the location to AWSLogs and then create a non-partitioned CloudTrail table. This allows you to write queries across all your accounts and Regions, but the trade-off is that your queries take much longer and are more expensive due to Athena having to scan all the data that comes after AWSLogs every query. However, querying multiple accounts is beyond the scope of this post.

Prerequisites

Before you get started, you should have the following prerequisites:

  • Service logs already being delivered to Amazon S3
  • An AWS account with access to your service logs

Deploying the automated solution in your AWS account

The following steps walk you through deploying a CloudFormation template that creates saved queries for you to run (Create Table, Create Partition, and example queries for each service log).

  1. Choose Launch Stack:

  2. Choose Next.
  3. For Stack name, enter a name for your stack.

You don’t need to have every AWS service log that the template asks for. If you don’t have CloudFront logs for example, you can leave the PathParameter as is. If you need CloudFront logs in the future, you can simply update the Create Table statement with the correct Amazon S3 location in Athena.

  4. For each service log table you want to create, follow the steps below:
  • Replace <_BUCKET_NAME> with the name of your S3 bucket that holds each AWS service log. You can use the same bucket name if it’s used to hold more than one type of service log.
  • Replace <Prefix> with your own folder prefix in Amazon S3. If you don’t have a prefix, make sure to remove it from the path parameters.
  • Replace <ACCOUNT-ID> and <REGION> with the desired account ID and Region.


  5. Choose Next.
  6. Enter any tags you wish to assign to the stack.
  7. Choose Next.
  8. Verify the parameters are correct and choose Create stack at the bottom.

Verify the stack has been created successfully. The stack takes about 1 minute to create the resources.

Querying your tables

You’re now ready to start querying your service logs.

  1. On the Athena console, on the Saved queries tab, search for the service log you want to interact with.


  2. Choose Create Table – CloudTrail Logs to run the SQL statement in the Athena query editor.

Make sure the location for Amazon S3 is correct in your SQL statement and verify you have the correct database selected.

  3. Choose Run query or press Tab+Enter to run the query.


The table cloudtrail_logs is created in the selected database. You can repeat this process to create other service log tables.

For partitioned tables like cloudtrail_logs, you must add partitions to your table before querying.

  4. On the Saved queries tab, choose Create Partition – CloudTrail.
  5. Update the Region, year, month, and day you want to partition. Choose Run query or press Tab+Enter to run the query.


After you run the query, you have successfully added a partition to your cloudtrail_logs table. Let’s look at some of the example queries we can run now.

  6. On the Saved queries tab, choose Query – CloudTrail Logs.

This is a base template included to begin querying your CloudTrail logs.

  7. Highlight the query and choose Run query.

You can see the base query template uses the WHERE clause to leverage partitions that have been loaded.


Let’s say we have a spike in API calls from AWS Lambda and we want to see the users that the calls were coming from in a specific time range as well as the count for each user. Our query looks like the following code:

SELECT useridentity.sessioncontext.sessionissuer.username AS "User",
       count(eventname) AS "Lambda API Calls"
FROM cloudtrail_logs
WHERE eventsource = 'lambda.amazonaws.com'
       AND eventtime BETWEEN '2020-11-24T18:00:00Z' AND '2020-11-24T21:00:00Z'
GROUP BY useridentity.sessioncontext.sessionissuer.username
ORDER BY count(eventname) DESC

Or if we wanted to check our S3 Access Logs to make sure only authorized users are accessing certain prefixes:

SELECT *
FROM s3_access_logs
WHERE key='prefix/images/example.jpg'
        AND requester != 'arn:aws:iam::accountid:user/username'

Cost of solution and cleaning up

Deploying the CloudFormation template doesn’t cost anything. You’re only charged for the amount of data scanned by Athena. Remember to use the best practices we discussed earlier when querying your data in Amazon S3. For more pricing information, see Amazon Athena pricing and Amazon S3 pricing.

To clean up the resources that were created, delete the CloudFormation stack you created earlier. This also deletes the saved queries in Athena.

Summary

In this post, we discussed how we can use AWS CloudFormation to easily create AWS service log tables, partitions, and starter queries in Athena by entering bucket paths as parameters. We used CloudTrail and Amazon S3 access logs as examples, but you can replicate these steps for other service logs that you may need to query by visiting the Saved queries tab in Athena. Feel free to check out the video as well, where I go over how we store logs in Amazon S3 and then give a quick demo on how to deploy the solution.

For more information about service logs, see Easily query AWS service logs using Amazon Athena.


About the Author

Michael Hamilton is a Solutions Architect at Amazon Web Services and is based out of Charlotte, NC. He has a focus in analytics and enjoys helping customers solve their unique use cases. When he’s not working, he loves going hiking with his wife, kids, and a 2-year-old German shepherd.

Building Product Intelligence Platform with Cloudflare Workers

Post Syndicated from Robert Cepa original https://blog.cloudflare.com/building-product-intelligence-platform-with-cloudflare-workers/

Building Product Intelligence Platform with Cloudflare Workers


“You can only improve what you can measure.”

We try to make Cloudflare’s onboarding experience as accessible as possible. For this reason, many customers are able to set up Cloudflare, configure their accounts and products, and discover additional products entirely on their own in our dashboard. Our Customer Onboarding team builds the dashboard experiences that make this possible.

The Onboarding team is data-driven, so we use data to validate our ideas. Rather than shipping the implementation of some idea right away, we run A/B tests with a small percentage of our customers. The results of these tests tell us what we should do with our idea next – either ship it to everyone, try to improve it (and run the test again), or discard it. This practice helps us with hedging our efforts so we don’t waste time on an idea that isn’t fruitful, and it provides us a method to reliably gather more information about needs of our customers. We use a third-party analytics tool to produce data for these A/B tests. This tool helps us to collect and analyse data about how our customers interact with the experiences that we build.

The onboarding experience in the dashboard is just one of many places where our customers interact with Cloudflare. Other Product teams, Customer Success team, and Marketing team build their own experiences in the dashboard and beyond, and they use their own analytics tools that best suit their needs.  Each of those teams has different goals, but we all have one thing in common – we want to understand our customers.

For example, knowing how our customers interact with campaigns and emails that Marketing teams build can help us on the Onboarding team to build a better, more personalized onboarding experience. Similarly, understanding how our customers interact with the onboarding experience in the dashboard can help our Marketing team to create more personalized emails and campaigns.

Using multiple third-party analytics tools across multiple teams created many challenges related to data integrity, security, privacy and performance. In this blogpost, we are going to talk about how we used Cloudflare Workers to build our product intelligence platform to overcome these challenges, serving hundreds of millions of requests per month from over 200 cities over the world, close to our customers, all without having to configure and maintain infrastructure.

Motivation: Data integrity, security, privacy, and performance

In the past, teams at Cloudflare used third-party scripts provided by analytics platforms like Google Analytics and Heap to measure user behavior. These scripts presented multiple challenges:

Data Integrity

In the product analytics world, an “event” is any user interaction with the product. Because we were using third-party scripts to send event data to varied analytics destinations, it was hard to make sure that these event data are consistent across all these destinations. In our case, our analytics tools categorized event data in different ways, creating confusion for our teams. For example, if a Cloudflare customer purchased our Workers product, Heap would send an event named “Purchase Workers”, while Google Analytics called it “Product Purchase Success” with a data attribute label: “workers”. Nobody trusted this data, so they sought out more reliable sources, such as billing databases.

Security and privacy

Third-party analytics vendors use third-party scripts to track end-user behavior. We take security and data privacy very seriously, and these scripts pose risks to us and our customers. They are hard to audit, and make it hard to ensure they don’t send data we don’t want them to send. They also change over time, and can be buggy, inefficient, and hard to test.

Performance

We want to give Cloudflare dashboard users a highly performant experience, but third-party scripts can cause slowdowns. For example, they can have a significant size because they try to do a lot of things automatically. Having to load and parse too much JavaScript can extend page load and render times, delay user interaction, and drain more battery. They can also fire too many network requests to multiple servers.

Vendor lock-in

Cloudflare’s dashboard codebase is massive, and hardcoded tracking calls tied to a specific analytics vendor makes that vendor difficult to replace. Moreover, adding a new vendor would require significant effort to add that vendor’s tracking calls everywhere, and would have a negative impact on performance on the frontend.

We wanted to solve these problems by creating a system that would decouple what we measure from how we measure it. The requirements were:

  • Unified API: a single API that all clients adhere to regardless of the vendor they primarily use. Engineers don’t need to understand how analytics vendors work and what data they require.
  • Secure and compliant: we fully own and control the code, protecting our customers from vulnerabilities in third-party code. We fully control how our data is measured, distributed, and stored.
  • Performant: lightweight, fast, and non-blocking on the frontend. Move as much logic as we can to the backend.
  • Flexible: ability to add/replace/remove vendors with relatively small effort on the backend, and no effort on the frontend.

We chose to use Cloudflare Workers, which deploys serverless code on the edge across the globe, as our backend infrastructure. Workers offers the following advantages:

  • Nimbleness through serverless development: Our team is small, and analytics wasn’t our primary focus at the time, so we wanted to create something quickly without having to worry about setting up and maintaining the infrastructure. With Workers, we never have to look at things like system health status, or load balancing and scaling, or how fast it is across the world. Everything is included in the package, and works really well.
  • JavaScript support: Since we work on user experiences, we are mostly UI-engineering focused and use React+TypeScript every day. Our team can write frontend and backend code in the same language, which reduces cognitive load.
  • Dogfooding opportunities: We help to test Workers at scale, which makes the product stronger.

Iteration #1: Sparrow and Trace Worker

Our analytics platform has gone through multiple iterations. The first version had two components – a JavaScript SDK called Sparrow, and a corresponding worker we call Trace.

Sparrow

The Sparrow SDK turns various data about product events into a consistent format, so internal users don’t need to understand API requirements further down the data pipeline.

Sparrow has 2 main features:

// tracks page visits
sparrow.pageview(pathname: string);

// tracks user interaction
sparrow.track(event: string, properties: Record<string, any>);

The pageview function can be run whenever a page loads in an application, which allows us to track where users navigate.

The track function is more generic. We can send any event name with any metadata. For example, the event name can be “purchase product” with properties: { product: “workers” }.
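
A call site for these two functions might look like the following sketch; the import path is an assumption, and the event name and properties are purely illustrative:

import sparrow from './sparrow'; // hypothetical import path

// Record a navigation to the dashboard home page
sparrow.pageview('/home');

// Record a product purchase with arbitrary metadata
sparrow.track('purchase product', { category: 'billing', product: 'workers' });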

Both functions create a JSON object with the following interface and send it to the Trace Worker, which forwards it along to various analytics platforms:

{
  event: string,
  deviceId: string,
  userId?: string,
  properties: Record<string, any>
}

Trace Worker

The Trace Worker receives event data from Sparrow, checks payload correctness to make sure the request came from valid sources, and fans out the data to all connected analytics providers. The following diagram shows the pipeline.

Building Product Intelligence Platform with Cloudflare Workers

Any third-party vendor can be added to Trace Worker, as long as that vendor provides a REST API. How the data are parsed, transformed, and sent to those APIs is implemented by us in our custom functions we call trackers. Trackers aim to replicate the behavior of third-party scripts provided by these vendors. Why are we doing this when we can just use third-party scripts? The main reasons are security and data privacy.

  • We use allowlists to explicitly define event names and event properties that can be sent further upstream. This helps us prevent sending potentially sensitive information from cookies, URL query parameters, or data payloads. Every event and data property that is not in the allowlist is ignored (a minimal sketch of this filtering follows this list).
  • On top of that, all allowed properties are sanitized by our internal data scrubber.
  • Always HTTPS: Some third-party scripts still use non-secure HTTP protocol. Trace Worker runs on HTTPS, and we make sure that outgoing requests are also using HTTPS.
  • We fully control the code, which means there are no surprises – the code cannot update without us knowing it.
  • Because the logic lives in the worker, our customers are not exposed to unnecessary client-side risks from using eval or document.write.
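
The allowlist check from the first item above can be as small as the following sketch; the event names, property names, and the TrackingEvent shape here are illustrative, not the exact internals of our trackers:

interface TrackingEvent {
  event: string;
  deviceId: string;
  userId?: string;
  properties: Record<string, any>;
}

// Hypothetical allowlists; the real ones are defined per vendor
const ALLOWED_EVENTS = new Set(['pageview', 'purchase product']);
const ALLOWED_PROPERTIES = new Set(['category', 'label', 'product']);

function applyAllowlist(payload: TrackingEvent): TrackingEvent | null {
  // Events that are not explicitly allowed are dropped entirely
  if (!ALLOWED_EVENTS.has(payload.event)) {
    return null;
  }

  // Properties that are not explicitly allowed are removed from the payload
  const properties: Record<string, any> = {};
  for (const [key, value] of Object.entries(payload.properties)) {
    if (ALLOWED_PROPERTIES.has(key)) {
      properties[key] = value;
    }
  }

  return { ...payload, properties };
}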

Another benefit is performance – because most of our analytics framework’s logic lives in the worker, there’s less JavaScript we need to send to the client, which means faster load times! SparrowJS on its own is super lightweight.

Here’s the simplified implementation of Trace Worker:

import trackers from './trackers';
import { generateContext } from './utils';
import { sanitize } from './sanitizer';

addEventListener('fetch', event => {
  event.respondWith(handle(event));
});

async function handle(event: FetchEvent) {
  try {
    const payload = sanitize(await event.request.json());

    const context = await generateContext(event);

    // fan out to all trackers without blocking the response
    event.waitUntil(Promise.allSettled(trackers.map(tracker => tracker[payload.event === 'pageview' ? 'pageview' : 'event'](payload, context))));

    return new Response('OK', {
      status: 200,
      statusText: 'OK'
    });
  } catch (err) {
    // logging
    return new Response('Something went wrong', {
      status: 500,
      statusText: 'Internal Server Error'
    });
  }
}

trackers is an array of tracking functions for each third-party vendor. Under the hood, they transform the incoming requests from Sparrow and send them to each vendor’s REST APIs. For example, this is a simplified implementation of Google Analytics tracker that transforms Sparrow payloads to adhere to Measurement Protocol:

const URL = 'https://www.google-analytics.com/collect';

export async function event(event: TrackingEvent, context: Context) {
  return fetch(URL, {
    method: 'POST',
    body: new URLSearchParams({
      ...createCommonParams(context),
      t: 'event',
      ea: context.data.event,
      ec: context.data.properties.category || 'Uncategorized Event',
      el: context.data.properties.label,
    }).toString()
  });
}

export async function pageview(event: TrackingEvent, context: Context) {
  return fetch(URL, {
    method: 'POST',
    body: new URLSearchParams({
      ...createCommonParams(context),
      t: 'pageview',
      dp: context.data.event
    }).toString()
  });
}

function createCommonParams(context: Context) {
  return {
    tid: context.gaId,
    v: '1',
    cid: context.data.deviceId,
    uid: context.data.userId,
    ...context.data.properties
  }
}

Similarly, Heap tracker implements its own transformation for https://heapanalytics.com/api/track.
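
For comparison, a Heap tracker following the same pattern might look roughly like the sketch below; the payload fields (app_id, identity, event, properties) and the heapId config value are assumptions based on Heap's public server-side track API rather than our actual implementation:

const HEAP_URL = 'https://heapanalytics.com/api/track';

export async function event(event: TrackingEvent, context: Context) {
  return fetch(HEAP_URL, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      app_id: context.heapId, // hypothetical config value, analogous to gaId above
      identity: context.data.userId || context.data.deviceId,
      event: context.data.event,
      properties: context.data.properties
    })
  });
}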

As you may have noticed, Trace Worker is not your typical service worker. There is no origin service – Trace Worker is the service, except it runs everywhere in the network.

Problem: Nobody (still) trusts the data

Iteration #1 of our data analytics’ platform worked well for a while, but as more product teams used Sparrow to run their own analyses, we started getting reports of data not looking right. The reports were along the lines of:

  • “Google Analytics underreports Heap by x percent…”
  • “Product purchases are not consistent with DB…”
  • “Signup conversion dropped by x percent, but we don’t think that’s actually happening…”

At the same time, we added another vendor – Amplitude, which made these problems even more complicated, because we now had three systems out of sync.

Due to the distributed nature of our analytics platform, we had a lot of potential breaking points. To find a solution, we needed to answer questions like:

Dashboard/Sparrow problems

  • Are product teams using Sparrow correctly?
  • Do we have any hard redirects that cause request cancellation? Should we try Beacon API and see what changes?
  • How does Trace Worker respond?

Trace Worker problems

  • Are we not catching some exceptions?
  • Are we exceeding runtime limits?
  • Are we hitting firewall/DDoS protection?

Third-party vendors problems

  • Are they silently rejecting or not storing some payloads sent from Trace Worker? Google Analytics always responds with 200 OK to any request. How are other vendors handling requests?
  • Are they having internal issues? How can we know, since these systems are blackboxes?
  • Can we recover any lost data?

If we release Trace Worker and our event volume takes a nosedive, it’s a pretty strong clue that it’s caused by us and we should roll back.

But what if nothing unusual happens for a while, and then page views drop by 20%? 10%? 5%? Is it us, or third-party vendors, or just a nice sunny day in Europe and so people are not sitting behind their desks? As we had no source of truth to compare these data against, this was impossible to answer.

Regardless, we knew we had to get serious about observability before we even begin asking these questions. At the time, there was no wrangler tail or Workers analytics (there are now!). Also, even though we used Sentry, an app monitoring platform, our logger was a very basic wrapper around Sentry’s REST API, because there was no full-blown Sentry SDK for Workers runtime – the current SDKs use globals, causing race-conditions in Workers.

The goals were:

1. Get better at diagnosing our own problems – catch every exception in Trace Worker and every non-200 HTTP response from third-party vendors, and log it to some visible source, with some helpful stack-traces and other metadata like request headers and body.

2. Be able to isolate problems that may be happening outside of our codebase – have a single source of truth for all incoming/outgoing requests which we can query and compare against data in our third-party tools. This would help us discover dropped requests that weren’t represented with an error state.

Iteration #2: New Sentry SDK for Workers

To meet these goals, we implemented and open-sourced a new Sentry SDK for Workers called toucan-js. Toucan adheres to the Sentry unified API guidelines, so the interface is familiar from other SDKs (node/browser). It currently supports capturing errors and messages with stack-traces that can be enhanced with source maps, breadcrumbs, request data/headers/cookies, tags, and extra metadata.

Since we replaced our simple Sentry logger with toucan-js, every single log started having:

  • Full request payload and some allowed headers
  • Stack-trace with source maps
  • Response status code and body (if applicable)

If anything goes wrong, we have all the information we need to reproduce and fix it – request body, headers, stack-trace, and all necessary context.

At the same time, we started sending cloned requests to the /debug/collect endpoint in Google Analytics Tracker that, combined with Sentry alerts, helped us find many dropped requests due to schema adherence problems such as “The value provided for parameter ‘cid’ is invalid.”
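
A sketch of that validation step is shown below; the response shape (hitParsingResult, parserMessage) follows the publicly documented Measurement Protocol validation endpoint, and the captureMessage call assumes the toucan-js instance described above:

import Toucan from 'toucan-js';

const DEBUG_URL = 'https://www.google-analytics.com/debug/collect';

// Send the same Measurement Protocol payload to the validation endpoint
// and report any hits that Google would reject
async function validateHit(body: string, toucan: Toucan) {
  const response = await fetch(DEBUG_URL, { method: 'POST', body });
  const result: any = await response.json();
  for (const hit of result.hitParsingResult || []) {
    if (!hit.valid) {
      toucan.captureMessage(JSON.stringify(hit.parserMessage));
    }
  }
}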

Iteration #3: The single source of truth

Better Sentry logs helped us with major drifts, but the data were still slightly off. While we observed some intermittent HTTP errors in Sentry, when we compared the number of these alerts with differences between different analytics platforms the numbers didn’t add up.

Due to this uncertainty, we decided to own the data layer, and create our own database – the single source of truth of all incoming payloads sent from Sparrow to Trace Worker before any transformation.

In order for us to trust the data in this ‘single source of truth’ database, the database needed to receive Sparrow payloads from outside of Trace Worker, preferably from a system that sits right in front of it, with minimal logic, that changes rarely, and that is highly available. Ideally, this system was to do three things – grab the incoming request payload, log it, and forward it to Trace Worker. These payloads should be logged raw, untouched, corresponding to whatever is sent from clients (SparrowJS).

The nice thing about this solution is that even if Trace Worker gets a bad release, we will not lose any data. Another strong case for us owning the data is that incidents in third-party vendors will not affect us anymore, because the solution will open the door for backfilling of dropped requests.

We considered Workers KV — Cloudflare’s low latency key-value store hosted at the network edge — for our storage needs, but being able to query the data was really important for us, because we wanted to diagnose complex problems quickly and select the data based on some property. For this reason, we went in a different direction.

Google BigQuery was our storage solution

We decided to use Google BigQuery for our ‘single source of truth’ database because:

  • It was designed for big data
  • It lets us use SQL to query what we need
  • We can use REST API in our new system to send the logs

Of course, Google BigQuery is a columnar database. How would we use it to store JSON data?

The first option was to write some kind of transformer that would map every object property to a column, but that was against our requirement of a system with minimal logic. The set of allowed characters we could use to name a column was also limited, so we wouldn’t be able to map column names back to original properties.

Due to these limitations we decided to store raw json strings, and use JSON functions to build views on top of these data.

First, we created a partitioned-by-day table called raw with the following schema:

Field name Type
eventId STRING
timestamp TIMESTAMP
data STRING

Sparrow’s payloads are stored in the data field as stringified JSON.

We don’t run queries against this table directly. Instead, we built a view called raw_normalized that looks something like this:

select
  json_extract_scalar(data, '$.event') as event,
  json_extract_scalar(data, '$.deviceId') as deviceId,
  json_extract_scalar(data, '$.userId') as userId,
  json_extract_scalar(data, '$.properties.category') as category,
  json_extract_scalar(data, '$.properties.productName') as productName
from raw;

With this setup, we can write complex SQL queries while retaining the original JSON values. To demonstrate on a simple example, when we insert a row with data being:

{
  event: "purchase product",
  deviceId: "desktop1",
  userId: "michelle1",
  properties: { category: "billing", productName: "workers" }
}

and then run:

select * from raw_normalized where event = 'purchase product';

we get:

event deviceId userId category productName
purchase product desktop1 michelle1 billing workers

We had our data layer prepared. But how to actually push the data into BigQuery?

Dispatcher Worker

We created another worker, the Dispatcher, that sits in front of Trace Worker! As we said earlier, the sole purpose of Dispatcher Worker is to:

  1. Read the incoming request body
  2. Send it to BigQuery
  3. Forward the incoming request to Trace Worker

The architecture changed to:

Building Product Intelligence Platform with Cloudflare Workers

Here’s a simplified implementation:

import Toucan from 'toucan-js';
import { BigQueryClient } from './bigquery';

const bigQuery = new BigQueryClient({
  serviceAccountEmail: SERVICE_ACCOUNT_EMAIL,
  serviceAccountSecret: SERVICE_ACCOUNT_SECRET,
  projectId: PROJECT_ID,
  datasetId: DATASET_ID
});
 
addEventListener('fetch', event => {
  const toucan = new Toucan({ dsn: DSN, event });
 
  // do the work without blocking the response
  event.waitUntil(dispatch(event, bigQuery, toucan));

  event.respondWith(new Response('OK', {
    status: 200,
    statusText: 'OK'
  }));
});
 
async function dispatch(event: FetchEvent, bigQuery: BigQueryClient, toucan: Toucan) {
  try {
    // Original request to be sent to Trace Worker
    const requestOriginal = event.request;

    // We clone the request here to allow multiple uses of Body
    const requestClone = requestOriginal.clone();

    // read the request payload
    const payload = await requestClone.text();

    // create a timestamp
    const timestamp = Date.now();

    // send to BQ
    const bigQueryResponse = await bigQuery.insertRow({ timestamp, json: payload });

    // log failed BigQuery inserts
    if (!bigQueryResponse.ok) {
      toucan.captureException(await HttpError.fromResponse(bigQueryResponse));
    }

    // send to Trace Worker
    const traceResponse = await fetch(TRACE_WORKER_URL, requestOriginal);

    // log failed Trace Worker requests
    if (!traceResponse.ok) {
      toucan.captureException(await HttpError.fromResponse(traceResponse));
    }
  } catch (err) {
    toucan.captureException(err);
  }
}

BigQueryClient is a lightweight SDK we implemented to be able to send data to BigQuery. Internally, it builds a request and sends it to Google Cloud Platform using their Stream API. We won’t go into details, but we want to briefly cover how we handle authentication.

Google Cloud APIs use the OAuth 2.0 protocol for authenticating both user accounts and service accounts. In short, the protocol involves building a signed JWT (JSON Web Token), sending it to Google Authorization Server to obtain access token, and sending that access token with all requests to GCP API.

We tried a few libraries to help us build that JWT (such as jsonwebtoken), but they were too slow due to their RSA implementation, and we were hitting runtime limits. So we implemented our own JWT builder using SubtleCrypto, which is a web standard that is also implemented in Cloudflare Workers!

JSON Web Tokens consist of 3 parts:

  1. Header
  2. Body
  3. Signature

First, we build the header:

const tokenHeader = base64UrlEncode(JSON.stringify({
  alg: 'RS256',
  typ: 'JWT'
}));

Then we build the token body, and concatenate with the header to build the token base:

const nowSeconds = Date.now() / 1000;
const tokenTtl = 3600;
const expire = nowSeconds + tokenTtl;

const tokenBody = base64UrlEncode(JSON.stringify({
  iss: SERVICE_ACCOUNT_EMAIL,
  scope: 'https://www.googleapis.com/auth/bigquery.insertdata',
  aud: 'https://www.googleapis.com/oauth2/v4/token',
  exp: expire,
  iat: nowSeconds
}));
 
const tokenBase = `${tokenHeader}.${tokenBody}`;

All that’s left is signing the token base:

const signature = base64UrlEncode(
  arrayBufferToString(
    await crypto.subtle.sign(
      {
        name: 'RSASSA-PKCS1-v1_5',
        hash: { name: 'SHA-256' }
      },
      SERVICE_ACCOUNT_JWK,
      new TextEncoder().encode(tokenBase)
    )
  )
);
 
const jwt = `${tokenBase}.${signature}`;

Once we have the JWT, our SDK sends a request to the Authorization Server to retrieve the access token:

const token = await (await fetch('https://www.googleapis.com/oauth2/v4/token', {
  method: 'POST',
  body: 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=' + jwt,
  headers: { 'content-type': 'application/x-www-form-urlencoded' }
})).json();

The access token doesn’t need to be requested with every FetchEvent – it can be reused until it expires. Caching the token helps with performance, because RSA encryption is costly.

BigQueryClient stores the access token in a global variable, so all requests handled by the same isolate can reuse it. In fact, that's the reason we initialize BigQueryClient outside of addEventListener. The SDK manages the token internally and handles the OAuth2 ceremony for its clients – the first insert call generates a JWT to retrieve and store an access token, and subsequent calls reuse the access token from memory.
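
A minimal sketch of that caching behavior, assuming a hypothetical buildJwt() helper that wraps the SubtleCrypto signing shown above, might look like this:

declare function buildJwt(): Promise<string>; // hypothetical helper wrapping the signing code above

// Module-scope cache, reused by all requests handled by the same isolate
let cachedToken: { value: string; expiresAt: number } | undefined;

async function getAccessToken(): Promise<string> {
  const nowSeconds = Date.now() / 1000;

  // Reuse the token until shortly before it expires
  if (cachedToken && cachedToken.expiresAt - 60 > nowSeconds) {
    return cachedToken.value;
  }

  const jwt = await buildJwt();
  const token: any = await (await fetch('https://www.googleapis.com/oauth2/v4/token', {
    method: 'POST',
    body: 'grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion=' + jwt,
    headers: { 'content-type': 'application/x-www-form-urlencoded' }
  })).json();

  cachedToken = { value: token.access_token, expiresAt: nowSeconds + token.expires_in };
  return cachedToken.value;
}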

With all the pieces put together, this is the state of our analytics pipeline today.

Future work: Risk management

Logging all data gives us great visibility and makes debugging easier. We now have a clear picture of where in the pipeline the problems are, and we have all possible information to fix them. We can react to problems pretty well, but we would like to get better at preventing problems in production before they happen.

We currently have two environments: staging and production. Our staging environment is behind Cloudflare Access, only accessible to Cloudflare employees. When we merge our changes, the CI pipeline automatically deploys them to the staging environment where we can test these changes before they get to production.

While our staging environment helps us with catching catastrophic errors early, it’s not too great for finding errors that may cause partial data drops, because the staging traffic is generally very low, which makes it harder to spot changes in data patterns. From this perspective, our releases are very risky.

To reduce these risks, we need to test our changes with production traffic. We can achieve that with a variant of blue-green deployment approach, ensuring we have two production environments, as identical as possible. We could use another Worker, sitting in front of Blue and Green Dispatcher Workers, receiving live traffic and duplicating all requests to blue/green variants. One of these variants, green for example, would be using live code and live data sources, and the other one, blue, would be in the final stage of testing, with our new code, logging to the staging data sources. Once we validate the data in staging data sources, we can easily cut-over by switching data sources between green and blue.
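
A sketch of that request-duplicating Worker could be as simple as the following; the two URLs are placeholders for wherever the green and blue Dispatcher Workers happen to be routed:

// Placeholder routes for the two production Dispatcher deployments
const GREEN_URL = 'https://dispatcher-green.example.com';
const BLUE_URL = 'https://dispatcher-blue.example.com';

addEventListener('fetch', event => {
  const green = event.request;
  const blue = green.clone();

  // Mirror traffic to the blue (candidate) deployment without blocking the response
  event.waitUntil(fetch(BLUE_URL, blue));

  // The green (live) deployment serves the real response
  event.respondWith(fetch(GREEN_URL, green));
});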

Future work: Durability

Outages of third-party vendors don’t affect us anymore, since we essentially own the data layer and can backfill dropped events once a vendor is fixed and back online.

This presents us with a question: How can we automate this with Workers?

How can we make sure that everything in our source of truth will eventually get stuffed to the analytics tools?

Can we implement a true distributed and durable streaming platform like Kafka at the edge?

We think we can! More about this next time.

Using AWS DevOps Tools to model and provision AWS Glue workflows

Post Syndicated from Nuatu Tseggai original https://aws.amazon.com/blogs/devops/provision-codepipeline-glue-workflows/

This post provides a step-by-step guide on how to model and provision AWS Glue workflows utilizing a DevOps principle known as infrastructure as code (IaC) that emphasizes the use of templates, source control, and automation. The cloud resources in this solution are defined within AWS CloudFormation templates and provisioned with automation features provided by AWS CodePipeline and AWS CodeBuild. These AWS DevOps tools are flexible, interchangeable, and well suited for automating the deployment of AWS Glue workflows into different environments such as dev, test, and production, which typically reside in separate AWS accounts and Regions.

AWS Glue workflows allow you to manage dependencies between multiple components that interoperate within an end-to-end ETL data pipeline by grouping together a set of related jobs, crawlers, and triggers into one logical run unit. Many customers using AWS Glue workflows start by defining the pipeline using the AWS Management Console and then move on to monitoring and troubleshooting using either the console, AWS APIs, or the AWS Command Line Interface (AWS CLI).

Solution overview

The solution uses COVID-19 datasets. For more information on these datasets, see the public data lake for analysis of COVID-19 data, which contains a centralized repository of freely available and up-to-date curated datasets made available by the AWS Data Lake team.

Because the primary focus of this solution showcases how to model and provision AWS Glue workflows using AWS CloudFormation and CodePipeline, we don’t spend much time describing intricate transform capabilities that can be performed in AWS Glue jobs. As shown in the Python scripts, the business logic is optimized for readability and extensibility so you can easily home in on the functions that aggregate data based on monthly and quarterly time periods.

The ETL pipeline reads the source COVID-19 datasets directly and writes only the aggregated data to your S3 bucket.

The solution exposes the datasets in the following tables:

Table Name Description Dataset location Provider
countrycode Lookup table for country codes s3://covid19-lake/static-datasets/csv/countrycode/ Rearc
countypopulation Lookup table for the population of each county s3://covid19-lake/static-datasets/csv/CountyPopulation/ Rearc
state_abv Lookup table for US state abbreviations s3://covid19-lake/static-datasets/json/state-abv/ Rearc
rearc_covid_19_nyt_data_in_usa_us_counties Data on COVID-19 cases at US county level s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-counties/ Rearc
rearc_covid_19_nyt_data_in_usa_us_states Data on COVID-19 cases at US state level s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-states/ Rearc
rearc_covid_19_testing_data_states_daily Data on COVID-19 cases at US state level s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/ Rearc
rearc_covid_19_testing_data_us_daily US total test daily trend s3://covid19-lake/rearc-covid-19-testing-data/csv/us_daily/ Rearc
rearc_covid_19_testing_data_us_total_latest US total tests s3://covid19-lake/rearc-covid-19-testing-data/csv/us-total-latest/ Rearc
rearc_covid_19_world_cases_deaths_testing World total tests s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/ Rearc
rearc_usa_hospital_beds Hospital beds and their utilization in the US s3://covid19-lake/rearc-usa-hospital-beds/ Rearc
world_cases_deaths_aggregates Monthly and quarterly aggregate of the world s3://<your-S3-bucket-name>/covid19/world-cases-deaths-aggregates/ Aggregate

Prerequisites

This post assumes you have the following:

  • Access to an AWS account
  • The AWS CLI (optional)
  • Permissions to create a CloudFormation stack
  • Permissions to create AWS resources, such as AWS Identity and Access Management (IAM) roles, Amazon Simple Storage Service (Amazon S3) buckets, and various other resources
  • General familiarity with AWS Glue resources (triggers, crawlers, and jobs)

Architecture

The CloudFormation template glue-workflow-stack.yml defines all the AWS Glue resources shown in the following diagram.


Figure: AWS Glue workflow architecture diagram

Modeling the AWS Glue workflow using AWS CloudFormation

Let’s start by exploring the template used to model the AWS Glue workflow: glue-workflow-stack.yml

We focus on two resources in the following snippet:

  • AWS::Glue::Workflow
  • AWS::Glue::Trigger

From a logical perspective, a workflow contains one or more triggers that are responsible for invoking crawlers and jobs. Building a workflow starts with defining the crawlers and jobs as resources within the template and then associating it with triggers.

Defining the workflow

This is where the definition of the workflow starts. In the following snippet, we specify the type as AWS::Glue::Workflow and the property Name as a reference to the parameter GlueWorkflowName.

Parameters:
  GlueWorkflowName:
    Type: String
    Description: Glue workflow that tracks all triggers, jobs, crawlers as a single entity
    Default: Covid_19

Resources:
  Covid19Workflow:
    Type: AWS::Glue::Workflow
    Properties: 
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName

Defining the triggers

This is where we define each trigger and associate it with the workflow. In the following snippet, we specify the property WorkflowName on each trigger as a reference to the logical ID Covid19Workflow.

These triggers allow us to create a chain of dependent jobs and crawlers as specified by the properties Actions and Predicate.

The trigger t_Start utilizes a type of SCHEDULED, which means that it starts at a defined time (in our case, one time a day at 8:00 AM UTC). Every time it runs, it starts the job with the logical ID Covid19WorkflowStarted.

The trigger t_GroupA utilizes a type of CONDITIONAL, which means that it starts when the resources specified within the property Predicate have reached a specific state (when the list of Conditions specified equals SUCCEEDED). Every time t_GroupA runs, it starts the crawlers with the logical IDs CountyPopulation and Countrycode, per the Actions property, which contains a list of actions.

  TriggerJobCovid19WorkflowStart:
    Type: AWS::Glue::Trigger
    Properties:
      Name: t_Start
      Type: SCHEDULED
      Schedule: cron(0 8 * * ? *) # Runs once a day at 8 AM UTC
      StartOnCreation: true
      WorkflowName: !Ref GlueWorkflowName
      Actions:
        - JobName: !Ref Covid19WorkflowStarted

  TriggerCrawlersGroupA:
    Type: AWS::Glue::Trigger
    Properties:
      Name: t_GroupA
      Type: CONDITIONAL
      StartOnCreation: true
      WorkflowName: !Ref GlueWorkflowName
      Actions:
        - CrawlerName: !Ref CountyPopulation
        - CrawlerName: !Ref Countrycode
      Predicate:
        Conditions:
          - JobName: !Ref Covid19WorkflowStarted
            LogicalOperator: EQUALS
            State: SUCCEEDED

Provisioning the AWS Glue workflow using CodePipeline

Now let’s explore the template used to provision the CodePipeline resources: codepipeline-stack.yml

This template defines an S3 bucket that is used as the source action for the pipeline. Any time source code is uploaded to a specified bucket, AWS CloudTrail logs the event, which is detected by an Amazon CloudWatch Events rule configured to start running the pipeline in CodePipeline. The pipeline orchestrates CodeBuild to get the source code and provision the workflow.

For more information on any of the available source actions that you can use with CodePipeline, such as Amazon S3, AWS CodeCommit, Amazon Elastic Container Registry (Amazon ECR), GitHub, GitHub Enterprise Server, GitHub Enterprise Cloud, or Bitbucket, see Start a pipeline execution in CodePipeline.

We start by deploying the stack that sets up the CodePipeline resources. This stack can be deployed in any Region where CodePipeline and AWS Glue are available. For more information, see AWS Regional Services.

Cloning the GitHub repo

Clone the GitHub repo with the following command:

$ git clone https://github.com/aws-samples/provision-codepipeline-glue-workflows.git

Deploying the CodePipeline stack

Deploy the CodePipeline stack with the following command:

$ aws cloudformation deploy \
--stack-name codepipeline-covid19 \
--template-file cloudformation/codepipeline-stack.yml \
--capabilities CAPABILITY_NAMED_IAM \
--no-fail-on-empty-changeset \
--region <AWS_REGION>

When the deployment is complete, you can view the pipeline that was provisioned on the CodePipeline console.


Figure: CodePipeline console

The preceding screenshot shows that the pipeline failed. This is because we haven’t uploaded the source code yet.

In the following steps, we zip and upload the source code, which triggers another (successful) run of the pipeline.

Zipping the source code

Zip the source code containing Glue scripts, CloudFormation templates, and Buildspecs file with the following command:

$ zip -r source.zip . -x images/\* *.history* *.git* *.DS_Store*

You can omit *.DS_Store* from the preceding command if you are not a Mac user.

Uploading the source code

Upload the source code with the following command:

$ aws s3 cp source.zip s3://covid19-codepipeline-source-<AWS_ACCOUNT_ID>-<AWS_REGION>

Make sure to provide your account ID and Region in the preceding command. For example, if your AWS account ID is 111111111111 and you’re using Region us-west-2, use the following command:

$ aws s3 cp source.zip s3://covid19-codepipeline-source-111111111111-us-west-2

Now that the source code has been uploaded, view the pipeline again to see it in action.


Figure: CodePipeline console displaying stage “Deploy” in-progress

Choose Details within the Deploy stage to see the build logs.


Figure: CodeBuild console displaying build logs

To modify any of the commands that run within the Deploy stage, feel free to modify: deploy-glue-workflow-stack.yml

Try uploading the source code a few more times. Each time it’s uploaded, CodePipeline starts and runs another deploy of the workflow stack. If nothing has changed in the source code, AWS CloudFormation automatically determines that the stack is already up to date. If something has changed in the source code, AWS CloudFormation automatically determines that the stack needs to be updated and proceeds to run the change set.

Viewing the provisioned workflow, triggers, jobs, and crawlers

To view your workflows on the AWS Glue console, in the navigation pane, under ETL, choose Workflows.


Figure: Navigate to Workflows

To view your triggers, in the navigation pane, under ETL, choose Triggers.


Figure: Navigate to Triggers

To view your crawlers, under Data Catalog, choose Crawlers.

Glue console showing crawlers

Figure: Navigate to Crawlers

To view your jobs, under ETL, choose Jobs.

Glue console showing jobs

Figure: Navigate to Jobs

Running the workflow

The workflow runs automatically at 8:00 AM UTC. To start the workflow manually, you can use either the AWS CLI or the AWS Glue console.

To start the workflow with the AWS CLI, enter the following command:

$ aws glue start-workflow-run --name Covid_19 --region <AWS_REGION>

To start the workflow on the AWS Glue console, on the Workflows page, select your workflow and choose Run on the Actions menu.

Glue console run workflow

Figure: AWS Glue console start workflow run

To view the run details of the workflow, choose the workflow on the AWS Glue console and choose View run details on the History tab.

Glue console view run details of a workflow

Figure: View run details

The following screenshot shows a visual representation of the workflow as a graph with your run details.

Glue console showing visual representation of the workflow as a graph.

Figure: AWS Glue console displaying details of successful workflow run
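If you prefer to start the workflow and track its status from a script rather than the console, the following is a minimal boto3 sketch equivalent to the CLI command shown earlier; the Region is a placeholder.

# A minimal sketch: start the Covid_19 workflow and poll its run status.
# Assumes the same credentials and Region used with the AWS CLI above.
import time
import boto3

glue = boto3.client("glue", region_name="us-west-2")  # use your Region

run_id = glue.start_workflow_run(Name="Covid_19")["RunId"]

while True:
    run = glue.get_workflow_run(Name="Covid_19", RunId=run_id)["Run"]
    print(run["Status"], run.get("Statistics", {}))  # RUNNING, COMPLETED, STOPPED, or ERROR
    if run["Status"] != "RUNNING":
        break
    time.sleep(30)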

Cleaning up

To avoid additional charges, delete the stacks created by the CloudFormation templates and the contents of the buckets you created.

1. Delete the contents of the covid19-dataset bucket with the following command:

$ aws s3 rm s3://covid19-dataset-<AWS_ACCOUNT_ID>-<AWS_REGION> --recursive

2. Delete your workflow stack with the following command:

$ aws cloudformation delete-stack --stack-name glue-covid19 --region <AWS_REGION>

To delete the contents of the covid19-codepipeline-source bucket, it’s simplest to use the Amazon S3 console because it makes it easy to delete multiple versions of the object at once.

3. Navigate to the S3 bucket named covid19-codepipeline-source-<AWS_ACCOUNT_ID>-<AWS_REGION>.

4. Choose List versions.

5. Select all the files to delete.

6. Choose Delete and follow the prompts to permanently delete all the objects.

S3 console delete all object versions

Figure: Amazon S3 console delete all object versions

7. Delete the contents of the covid19-codepipeline-artifacts bucket:

$ aws s3 rm s3://covid19-codepipeline-artifacts-<AWS_ACCOUNT_ID>-<AWS_REGION> --recursive

8. Delete the contents of the covid19-cloudtrail-logs bucket:

$ aws s3 rm s3://covid19-cloudtrail-logs-<AWS_ACCOUNT_ID>-<AWS_REGION> --recursive

9. Delete the pipeline stack:

$ aws cloudformation delete-stack --stack-name codepipeline-covid19 --region <AWS_REGION>

Conclusion

In this post, we stepped through how to use AWS DevOps tooling to model and provision an AWS Glue workflow that orchestrates an end-to-end ETL pipeline on a real-world dataset.

You can download the source code and template from this GitHub repository and adapt it as you see fit for your data pipeline use cases. Feel free to leave comments letting us know about the architectures you build for your environment. To learn more about building ETL pipelines with AWS Glue, see the AWS Glue Developer Guide and the AWS Data Analytics learning path.

About the Authors

Nuatu Tseggai

Nuatu Tseggai is a Cloud Infrastructure Architect at Amazon Web Services. He enjoys working with customers to design and build event-driven distributed systems that span multiple services.

Suvojit Dasgupta

Suvojit Dasgupta is a Sr. Customer Data Architect at Amazon Web Services. He works with customers to design and build complex data solutions on AWS.

How EMX reduced data pipeline costs by 85% with Amazon Athena

Post Syndicated from Gary Bouton original https://aws.amazon.com/blogs/big-data/how-emx-reduced-data-pipeline-costs-by-85-with-amazon-athena/

This is a guest blog post by Gary Bouton and Louis Ashner from EMX. In their own words, “ENGINE Media Exchange (EMX) is a leading marketing technology company, leveraging a patented, end-to-end tech stack purpose-built to meet the demands of today’s digital marketplace. The company creates both open- and closed-loop solutions designed to unify advertisers, platforms, and publishers across digital media channels—including advanced TV, video, display, search, and social.”

While recognized as an independent solutions provider for the digital media landscape, EMX also serves as the technology and programmatic division for its parent, ENGINE—a global data-driven marketing company serving advertising’s most recognized brands.

In the past, we used typical legacy data warehouse solutions for our data pipeline. We needed massive clusters to house all our raw data, as well as advanced pipelines to import that data into the final database. We then needed to query the raw data clusters to aggregate the data and move it into a separate cluster for more frequent querying. This process was not only time consuming, but also quite expensive, because legacy database clusters aren't cheap to run at the scale needed to house that much data.

Then came Amazon Athena, which allowed us to not only simplify our pipeline, but also save on costs significantly. We were able to simply route the raw data straight to Amazon Simple Storage Service (Amazon S3) at minimal storage costs, then query the data with Athena to move aggregate data into small Amazon Redshift clusters for querying more frequently. Athena’s querying is not only quick, but the query costs are mere pennies when the table is set up correctly with partitions and the query utilizes them properly. Additionally, we can increase our data retention time, because the cost of storing data in Amazon S3 is significantly cheaper than an ever-growing legacy data warehousing solution.

In short, Athena allowed us to simplify our data pipeline while saving 85% on data storage costs at the same time.

This post discusses the following:

  • Why EMX digital chose Athena for its backend ETL workflow
  • How EMX manages Athena performance and run time
  • How EMX continues to scale Athena with new products and create coherent workflows
  • The benefits of this solution for EMX

Advantages of a robust backend ETL workflow when dealing with fast and furious big data

For most companies, data is an ever-growing challenge. The volume, velocity, variety, and veracity of data can limit performance and become a financial burden.

Data is very important to us at EMX; we process over 450,000 requests per second, which we clean, audit, and deliver for reporting and optimizations to keep our clients informed and current in an ever-changing ad space.

To do this, we need a backend system that is robust, timely, available, and cost-effective enough to meet the demands of our split-second decision-making.

Why Athena is the right tool for EMX

As detailed below, Athena’s pay-per-query pricing model, performance and reliability at scale, and ease of use made it the right tool for us.

  • Scale – EMX processes over 2 TB of raw, and sometimes unstructured, data every hour for reporting and optimizations. Because Athena runs these jobs without us managing cluster optimizations directly, the team can focus on research, product development, and ad hoc tasks instead of spending time estimating the processing and computational power needed to complete jobs on time.
  • Cost – Cost per query is at least four times cheaper than other backend ETL tools, and its on-demand nature means we only pay for what we use. We're no longer paying ever-increasing costs to keep up a system that isn't being used. The per-query cost feedback from Athena also lets us tune and optimize our logic, not only to reduce that cost further but also to test new ways to split up and run our production ETL jobs.
  • Resilience – We threw everything but the kitchen sink at Athena while building out our production pipeline, and were impressed by how rarely the service failed. Even though we don't directly own the resources behind Athena, it has always had high availability. In the instances where availability was hampered, it was easy and straightforward to add workflow hooks that retry failed jobs when a queue becomes available.
  • Ease of use – Unlike most competitor offerings, Athena works out of the box. It's easily customized using the Athena GUI, or you can build your own roles, rules, database structures, and projections. The documentation for tuning Presto performance on AWS is clear and straightforward, which keeps the learning curve small for any new user.
  • Data transformations – Athena’s robust Presto query language allows us to perform regex, quartile, and percentile statistics without resorting to an outside transformation step in Python or other languages. Going further, using window functions inside those same queries allows us to do some of the heavy mathematical lifting we would have needed to do outside of the backend process, thus saving cost and time. With Athena, these extra vital steps see no difference in cost or performance to our backend pipeline and allow us to condense complicated parts into one step.

Why we continue to grow with Athena

We continue to grow with Athena for the following reasons:

  • Future scale – Athena and its team keep improving and adding resources that support our ever-growing data needs, which have increased by 200% since Athena’s implementation. This has served as the bedrock to our backend solutions.
  • Improvements – The Sales and Engineering team at AWS has always been open to feedback and has turned that into better error reporting, work groups for Athena, changes in policy, and workload management through roles. This has allowed us to split Athena resources with workgroups from production-level work to running ad hoc jobs in future updates.
  • Cost is king – Every dollar we have saved through Athena has been put into products that make Athena work better for us. Using Athena has allowed us to improve our front-end delivery products: from building our own workflows right into Athena, to working with the right compression for Athena ingestion, to offloading more work that would have gone to a traditional ETL box. Cost for us is not just dollars but the time it takes to manage; the time saved lets us stay on the bleeding edge in developing new tools to deliver the data Athena helps us serve.

The following sections detail how EMX uses Athena to build, manage, and orchestrate its backend ETL work with minimal coding and maintenance.

Solution architecture

The following diagram shows the architecture EMX uses.


How we use Athena

Our custom scripts stream batched data each minute from auction servers directly to raw S3 buckets. The data is dropped in gzip format into datetime-partitioned S3 buckets. This partition structure helps us limit our Athena query scan. For example, the partitioned buckets look like the following screenshot.


When the data has reached these partitioned buckets, EMX uses Apache Airflow to schedule various jobs across Athena. The following screenshot shows our DAG for our most-used pipeline.


Before beginning to run Athena queries, we run two checks on our data in Amazon S3:

  • Check if all the expected data has arrived and is in the bucket.
  • Check logic match rules and clean illegal fields in the data.

On the success of both tasks, we start adding the latest partition to an Athena table.

When the partition is added to the table, we start running the query. We poll the query status every 10 seconds until the query completes, and it returns a status of success, failed, or canceled. Depending on the status returned, further tasks are then forked.

At times, we have noticed queries fail with the error Query resource exhausted at this scale, which usually goes away on triggered retries. For the same reason, we have a retry mechanism in place on the execute_athena_sql task. If the retry fails, it alerts the team and data is copied over to a debug bucket for further investigation. If it succeeds, it moves ahead with further transformation.
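The following is a simplified sketch of that execute-and-poll step in Python with boto3, not our exact Airflow task; the database name, output location, and the dt partition column are placeholders, while the campaigns_stream table, the 10-second poll interval, and the single retry come from the pipeline described above.

# A simplified sketch of the execute-and-poll pattern described above.
# Database, output location, and the dt partition column are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # use your Region

def run_query(sql, database, output_location, poll_seconds=10, retries=1):
    """Submit a query, poll every poll_seconds until it finishes, and retry once on failure."""
    for attempt in range(retries + 1):
        query_id = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]

        while True:
            state = athena.get_query_execution(QueryExecutionId=query_id)[
                "QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                break
            time.sleep(poll_seconds)

        if state == "SUCCEEDED":
            return query_id
        # FAILED or CANCELLED: loop again if a retry is left
    raise RuntimeError("Query did not succeed after {} attempts".format(retries + 1))

run_query(
    "SELECT domain, count(*) AS requests FROM campaigns_stream "
    "WHERE dt = '2021-01-01' GROUP BY domain",
    database="emx_demo_db",                          # placeholder
    output_location="s3://my-athena-results/emx/",   # placeholder
)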

For further transformation, we get the output of the Athena query back in Amazon S3, and then we add the business rules to enrich the data in Amazon S3.

Based on the pipeline logic, the data is then copied from Amazon S3 to different data stores, one of them being Amazon Redshift.

The last step is to clean up the metadata that was generated by the Athena query.

The following is an example from one of our pipelines. This Athena table is projected on top of the partitioned buckets, and the table is also partitioned by datetime so that the table can be read off the data directly when it’s ready. The following screenshot is what the sample table campaigns_stream looks like, which reads the data from the aforementioned bucket.


As soon as our scheduled jobs are triggered, the job runs data checks, data matches, and complex SQL queries on this table using the latest date partition, which was loaded in the last DAG task; this limits the data scan and reduces costs. The results are generated and pushed to the S3 reporting bucket to be picked up by other processes. The results can be generated in different formats, like CSV, Apache Avro, and Apache Parquet, using the CTAS or INSERT INTO command.
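As a concrete illustration of the CTAS route, the sketch below writes an aggregate result to the reporting bucket as Parquet. Apart from the campaigns_stream table name, the schema, bucket, and column names are placeholders, and it reuses the run_query helper from the earlier sketch.

# A sketch of writing aggregate results as Parquet with CTAS.
# Everything except the campaigns_stream table name is a placeholder.
ctas_sql = """
CREATE TABLE reporting.domain_counts_20210101
WITH (
    format = 'PARQUET',
    external_location = 's3://my-reporting-bucket/domain_counts/dt=2021-01-01/'
) AS
SELECT domain, count(*) AS requests
FROM campaigns_stream
WHERE dt = '2021-01-01'
GROUP BY domain
"""

run_query(ctas_sql, database="emx_demo_db",
          output_location="s3://my-athena-results/emx/")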

For example, running a simple count query for each domain scans approximately 1.65 TB of data and returns the results in less than 600 seconds, without needing us to set up or manage any infrastructure.


When the query is complete and the output files in the S3 reporting bucket are ready, they’re picked up by our DAG and pushed into data storage like Amazon Redshift.

Optimization on Athena

By default, Athena has a soft limit of 20 concurrent DML queries (including CTAS). When we have multiple jobs running in parallel, we may hit that limit, delaying our time-sensitive pipelines and jobs. To overcome this, we allocated a fixed time window in each hour for our most critical pipelines, and other jobs with lower priority run later.

For example, our production pipelines get priority 1, with a window from minute 0 to minute 15 of every hour. We're aware that we can request a limit increase from AWS, but we instead decided to use this opportunity to improve the resilience and robustness of our system.

Conclusion

“Build, don’t buy” has been EMX’s motto. It drives our innovation forward, much like Athena continues to be able to solve all the questions we ask of it. We build boutique and large-scale solutions for our advertising clients, which require a malleable and robust ETL backend that takes the work and cost to a manageable level. We built an ever-scaling, cost-effective, and highly available ETL backend with Athena.

Our successes with Athena are shown through both time and cost savings, including:

  • 30% of the time used on maintenance of a traditional ETL structure is now moved into Athena improvement, which sees improved feedback in reduced costs that we can pass on to our clients
  • Four times lower cost per query than competitors has allowed us to put money into different tools for storage and modeling, giving us even more room to drive more revenue for our clients at lower cost
  • 10 times less technical debt in Athena setup, research, staging, and production, which goes back into other future-thinking projects

What we can do with data is only limited by the time we need in herding, cleaning, and delivering this data for insights and development. Since moving 100% of our ETL backend systems onto Athena, we have increased product delivery and systems optimization four-fold in a single year. Athena and the Athena team continue to grow with us even as our data needs begin to soar exponentially, adding more tools that reduce workflows, management, and job distribution in the AWS ecosystem itself. This partnership between EMX and Athena has resulted in increased cooperation and more business with us and our growing list of clients.

Our “Why” is building the tools for the future, and Athena personifies our “Why” in delivering what EMX is about: scale, on time, and delivery of data optimized for the modern era.


About the Authors

Gary Bouton is VP of Data Engineering at ENGINE Media Exchange and leads their Data Engineering and Data Science Product teams. Pipeline implementation is led by Director of Data Pipeline Rahul Gupta, Senior Engineer Nader S. Gharawi, and Data Science Engineer Raghav Gupta. Data model implementation is led by Senior Data Scientist Gabrielle Agrocostea and Data Scientist Heena Otia.

 

Louis Ashner is EVP of Technology at ENGINE Media Exchange. He has a passion for making the Internet faster, and is an ad-tech pioneer with more than 10 years of experience working with digital advertising, including real-time bidding and programmatic advertising. His 9 patents in networking optimization and data caching are used to power EMX’s proprietary ad exchange.

Data monetization and customer experience optimization using telco data assets: Part 1

Post Syndicated from Vikas Omer original https://aws.amazon.com/blogs/big-data/part-1-data-monetization-and-customer-experience-optimization-using-telco-data-assets/

The landscape of the telecommunications industry is changing rapidly. For telecom service providers (TSPs), revenue from core voice and data services continues to shrink due to regulatory pressure and emerging OTT players that offer an attractive alternative. Despite increasing customer demand for bandwidth, speed, and efficiency, TSPs are finding that the ROI from implementing new access technologies like 5G is insubstantial.

To overcome the risk of being relegated to a utility or dumb pipe, TSPs today are looking to diversify, adopting alternative business models to generate new revenue streams.

In recent times, adopting customer experience (CX) and data monetization initiatives has been a key theme across all industries. Although many Tier-1 TSPs are leading this transformation by using new technologies to improve CX and improve profitability, many TSPs have yet to embark on this challenging but rewarding journey.

Building and implementing a CX management and data monetization strategy

Data monetization is often misunderstood as making dollars by selling data, but what it really means is driving revenue by increasing the top line or the bottom line. The benefits can be tangible or intangible, internal or external, and are realized by making use of data assets.

According to Gartner, most data and analytics leaders are looking to increase investments in business intelligence (BI) and analytics (see the following study results).

The preceding visualization is from “The 2019 CIO Agenda: Securing a New Foundation for Digital Business”, published October 15, 2018.

Although external monetization opportunities are limited due to strict regulations, a plethora of opportunities exist for TSPs to monetize data both internally (regulated, but much less so than external use) and externally via a marketplace (highly regulated). If TSPs can shift their mindset from selling data to using data insights for monetization and improving CX, they can adopt a significant number of use cases to realize an immediate positive impact.

Tapping and utilizing insights around customer behavior acts like a Swiss Army Knife for businesses. You can use these insights to drive CX, hyper-personalization and localization, micro-segmentation, subscriber retention, loyalty and rewards programs, network planning and optimization, internal and external data monetization, and more. The following are some use cases that can be driven using CX and data monetization strategies:

  • Segmentation/micro-segmentation (cross-sell, up-sell, targeted advertising, enhanced market locator); for example:
    • Identify targets for consuming baby products or up-selling a kids-related TV channel
    • Identify females in the age range of 18-35 to target for high-end beauty products or apparel

You can build hundreds of such segments.

  • Personalized loyalty and reward programs (incentivize customers with what they like). For example, movie tickets or discounts for a movie lover, or food coupons and deals for a food lover.
  • CX-driven network optimization (allocate more resources to streaming hotspots with high-value customers).
  • Identifying potential partners for joint promotions. For example, bundling device offers with a music app subscription.
  • Hyper-personalization. For example, personalized recommendations for on-portal apps and websites.
  • Next best action and next best offer. For example, intelligent bundling and packaging of offerings.

Challenges with driving CX and data monetization

In this digital era, TSPs consider data analytics a strategic pillar in their quest to evolve into a true data-driven organization. Although many TSPs are harnessing the power of data to drive and improve CX, there are technological gaps and challenges to baseline and formulate internal and external data monetization strategies. Some of these challenges include:

  • Non-overlapping technology investments for CX and data monetization due to misaligned business and IT initiatives
  • Huge CAPEX requirements to process massive volumes of data
  • Inability to unearth hidden insights due to siloed data initiatives
  • Inability to marry various datasets together due to missing pieces around data standardization techniques
  • Lack of user-friendly tools and techniques to discover, ingest, process, correlate, analyze, and consume the data
  • Inability to experiment and innovate with agility and low cost

In this two-part series, I demonstrate a working solution with an AWS CloudFormation template for how a TSP can use existing data assets to generate new revenue streams and improve and personalize CX using AWS services. I also include key pieces of information around data standardization, baselining an analytics data model to marry different datasets in the data warehouse, self-service analytics, metadata search, and media dictionary framework.

In this post, you deploy the stack using a CloudFormation template and follow simple steps to transform, enrich, and bring multiple datasets together so that they can be correlated and queried.

In part 2, you learn how advanced business users can query enriched data and derive meaningful insights using Amazon Redshift and Amazon Redshift Spectrum or Amazon Athena, enable self-service analytics for business users, and publish ready-made dashboards via Amazon QuickSight.

Solution overview

The main ingredient of this solution is Packet Switch (PS) probe data embedded with a deep packet inspection (DPI) engine, which can reveal a lot of information about user interests and usage behavior. This data is transformed and enriched with DPI media and device dictionaries, along with other standard telco transformations to deduce insights, profile and micro-segment subscribers. Enriched data is made available along with other transformed dimensional attributes (CRM, subscriptions, media, carrier, device and network configuration management) for rich slicing and dicing.

For example, the following QuickSight visualizations depict a use case to identify music lovers ages 18-55 with Apple devices. You can also generate micro-segments by capturing the top X subscribers by consumption or adding KPIs like recency and frequency.

The following diagram illustrates the workflow of the solution.

For this post, AWS CloudFormation sets up the required folder structure in Amazon Simple Storage Service (Amazon S3) and provides the sample data and dictionary files. Most of the data included as part of the CloudFormation template is dummy data and consists of the following:

  • CRM
  • Subscription and subscription mapping
  • Network 3G & 4G configuration management
  • Operator PLMN
  • DPI and device dictionary
  • PS probe data

Descriptions of all the input datasets and attributes are available with AWS Glue Data Catalog tables and as part of Amazon Redshift metadata for all tables in Amazon Redshift.

The workflow for this post includes the following steps:

  1. Catalog all the files in the AWS Glue Data Catalog using the following AWS Glue data crawlers:
    1. DPI data crawler (to crawl incoming PS probe DPI data)
    2. Dimension data crawler (to crawl all dimension data)
  2. Update attribute descriptions in the Data Catalog (this step is optional).
  3. Create the Amazon Redshift schema, tables, procedures, and metadata using an AWS Lambda function.
  4. Process each data source file using separate AWS Glue Spark jobs. These jobs enrich, transform, and apply business filtering rules before ingesting data into an Amazon Redshift cluster.
  5. Trigger Amazon Redshift hourly and daily aggregation procedures using Lambda functions to aggregate data from the raw table into hourly and daily tables.

Part 2 includes the following steps:

  1. Catalog the processed raw, aggregate, and dimension data in the Data Catalog using the DPI processed data crawler.
  2. Interactively query data directly from Amazon S3 using Amazon Athena.
  3. Enable self-service analytics using QuickSight to prepare and publish insights based on data residing in the Amazon Redshift cluster.

The workflow can change depending on the complexity of the environment and your use case, but the fundamental idea remains the same. For example, your use case could be processing PS probe DPI data in real time rather than in batch mode, keeping hot data in Amazon Redshift, storing cold and historical data on Amazon S3, or archiving data in Amazon S3 Glacier for regulatory compliance. Amazon S3 offers several storage classes designed for different use cases. You can move the data among these different classes based on Amazon S3 lifecycle properties. For more information, see Amazon S3 Storage Classes.
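For example, a lifecycle rule along these lines moves older objects to S3 Standard-IA and then to S3 Glacier; the bucket name, prefix, and day thresholds here are illustrative only.

# A sketch of an S3 lifecycle rule; bucket, prefix, and day thresholds are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-dpi-data-bucket",                      # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-dpi-data",
                "Filter": {"Prefix": "dpi/processed/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)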

Prerequisites

For this walkthrough, you should have the following prerequisites:

For more information about AWS Regions and where AWS services are available, see Region Table.

Creating your resources with AWS CloudFormation

To get started, create your resources with the following CloudFormation stack.

  1. Click the Launch Stack button below:
  2. Leave the parameters at their default, with the following exceptions:
    1. Enter RedshiftPassword and S3BucketNameParameter parameters, which aren’t populated by default.
    2. An Amazon S3 bucket name is globally unique, so enter a unique bucket name for S3BucketNameParameter.

The following screenshot shows the parameters for our use case.

  1. Choose Next.
  2. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create stack.

It takes approximately 10 minutes to deploy the stack. For more information about the key resources deployed through the stack, see Data Monetization and Customer Experience (CX) Optimization using telco data assets: Amazon CloudFormation stack details. You can view all the resources on the AWS CloudFormation console. For instructions, see Viewing AWS CloudFormation stack data and resources on the AWS Management Console.

The CloudFormation stack we provide in this post serves as a baseline and is not a production-grade solution.

Building a Data Catalog using AWS Glue

You start by discovering sample data stored on Amazon S3 through an AWS Glue crawler. For more information, see Populating the AWS Glue Data Catalog. To catalog data, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Crawlers.
  2. Select DPIRawDataCrawler and choose Run crawler.
  3. Select DimensionDataCrawler and choose Run crawler.
  4. Wait for the crawlers to show the status Stopping.

The tables added against the DimensionDataCrawler and DPIRawDataCrawler crawlers should show 9 and 1, respectively.
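If you prefer to script these crawler runs instead of using the console, the following is a minimal boto3 sketch; the crawler names match the steps above and the Region is a placeholder.

# A sketch that starts both crawlers and waits for them to return to READY.
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # use your Region

crawlers = ("DPIRawDataCrawler", "DimensionDataCrawler")
for name in crawlers:
    glue.start_crawler(Name=name)

while True:
    states = {name: glue.get_crawler(Name=name)["Crawler"]["State"] for name in crawlers}
    print(states)                                     # RUNNING, STOPPING, or READY
    if all(state == "READY" for state in states.values()):
        break
    time.sleep(30)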

  1. In the navigation pane, choose Tables.
  2. Verify the following 10 tables are created under the cemdm database:
    • d_crm_demographics
    • d_device
    • d_dpi_dictionary
    • d_network_cm_3g
    • d_network_cm_4g
    • d_operator_plmn
    • d_tac
    • d_tariff_plan
    • d_tariff_plan_desc
    • raw_dpi_incoming

Updating attribute descriptions in the Data Catalog

The AWS Glue Data Catalog has a comment field to store the metadata under each table in the AWS Glue database. Anybody who has access to this database can easily understand attributes coming from different data sources through metadata provided in the comment field. The CloudFormation stack includes a CSV file that contains a description of all the attributes from the source files. This file is used to update the comment field for all the Data Catalog tables this stack deployed. This step is not mandatory to proceed with the workflow. However, if you want to update the comment field against each table, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the GlueCatalogUpdate function.
  3. Configure a test event by choosing Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which implies that the Data Catalog attribute description is complete.

Attributes of the table under the Data Catalog database should now have descriptions in the Comment column. For example, the following screenshot shows the d_operator_plmn table.
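You can also read these descriptions programmatically. The following is a small boto3 sketch that prints each column of d_operator_plmn with its comment; the database and table names come from the steps above.

# A sketch that prints column names and Comment metadata for one Data Catalog table.
import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="cemdm", Name="d_operator_plmn")["Table"]
for column in table["StorageDescriptor"]["Columns"]:
    print(column["Name"], "-", column.get("Comment", ""))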

Creating Amazon Redshift schema, tables, procedures, and metadata

To create schema, tables, procedures, and metadata in Amazon Redshift, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the RedshiftDDLCreation function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that the schema, table, procedures, and metadata generation is complete.

Running AWS Glue ETL jobs

AWS Glue provides the serverless, scalable, and distributed processing capability to transform and enrich your datasets. To run AWS Glue extract, transform, and load (ETL) jobs, complete the following steps:

  1. On the AWS Glue console, in the navigation pane, choose Jobs.
  2. Select the following jobs (one at a time) and choose Run job on the Actions menu:
    • d_customer_demographics
    • d_device
    • d_dpi_dictionary
    • d_location
    • d_operator_plmn
    • d_tac
    • d_tariff_plan
    • d_tariff_plan_desc
    • f_dpi_enrichment

You can run all these jobs in parallel.
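If you want to start them from a script rather than the console, a sketch like the following kicks off every job listed above; checking completion afterward uses get_job_run, as noted in the comment.

# A sketch that starts all the ETL jobs listed above in parallel.
import boto3

glue = boto3.client("glue")

jobs = [
    "d_customer_demographics", "d_device", "d_dpi_dictionary", "d_location",
    "d_operator_plmn", "d_tac", "d_tariff_plan", "d_tariff_plan_desc",
    "f_dpi_enrichment",
]

run_ids = {job: glue.start_job_run(JobName=job)["JobRunId"] for job in jobs}

# Check status later with, for example:
#   glue.get_job_run(JobName=job, RunId=run_ids[job])["JobRun"]["JobRunState"]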

All dimension data jobs should finish successfully within 3 minutes, and the fact data enrichment job should finish within 5 minutes.

  1. Verify the jobs are complete by selecting each job and checking Run status on the History tab.

Aggregating hourly and daily DPI data in Amazon Redshift

To aggregate hourly and daily sample data in Amazon Redshift using Lambda functions, complete the following steps:

  1. On the Lambda console, in the navigation pane, choose Functions.
  2. Choose the RedshiftDPIHourlyAgg function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that hourly aggregation is complete.

  1. In the navigation pane, choose Functions.
  2. Choose the RedshiftDPIDailyAgg function.
  3. Choose Configure test events.
  4. For Event name, enter Test.
  5. Choose Create.
  6. Choose Test.

You should see a message that the test succeeded, which means that daily aggregation is complete.

Both the hourly and daily Lambda functions are hardcoded with the date and hour used to aggregate the sample data. To make them generic, uncomment the indicated lines of code and comment out the hardcoded ones. Both functions also accept offset parameters that decide how far back in time to run the aggregations. However, this isn't required for this walkthrough.

You can schedule these functions with CloudWatch. However, this is not required for this walkthrough.
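If you do decide to schedule them, the following sketch creates CloudWatch Events rules that invoke the two functions on a schedule; the rule names, cron expressions, and function ARNs are illustrative assumptions rather than values from the stack.

# A sketch for scheduling the aggregation functions with CloudWatch Events.
# Rule names, cron expressions, and function ARNs are illustrative placeholders.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

schedules = {
    "dpi-hourly-agg": ("cron(5 * * * ? *)", "arn:aws:lambda:REGION:ACCOUNT_ID:function:RedshiftDPIHourlyAgg"),
    "dpi-daily-agg": ("cron(30 0 * * ? *)", "arn:aws:lambda:REGION:ACCOUNT_ID:function:RedshiftDPIDailyAgg"),
}

for rule_name, (expression, function_arn) in schedules.items():
    rule_arn = events.put_rule(Name=rule_name, ScheduleExpression=expression)["RuleArn"]
    events.put_targets(Rule=rule_name, Targets=[{"Id": "agg-lambda", "Arn": function_arn}])
    # The function also needs a resource-based policy so the rule can invoke it.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=rule_name + "-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )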

So far, we have completed the following:

  1. Deployed the CloudFormation stack.
  2. Cataloged sample raw data by running DimensionDataCrawler and DPIRawDataCrawler AWS Glue crawlers.
  3. Updated attribute descriptions in the AWS Glue Data Catalog by running the GlueCatalogUpdate Lambda function.
  4. Created Amazon Redshift schema, tables, stored procedures, and metadata through the RedshiftDDLCreation Lambda function.
  5. Ran all AWS Glue ETL jobs to transform raw data and load it into their respective Amazon Redshift tables.
  6. Aggregated hourly and daily data from enriched raw data into hourly and daily Amazon Redshift tables by running the RedshiftDPIHourlyAgg and RedshiftDPIDailyAgg Lambda functions.

Cleaning up

If you don’t plan to proceed to the part 2 of this series, and want to avoid incurring future charges, delete the resources you created by deleting the CloudFormation stack.

Conclusion

In this post, I demonstrated how you can easily transform, enrich, and bring multiple telco datasets together in an Amazon Redshift data warehouse cluster. You can correlate these datasets to produce multi-dimensional insights from several angles, like subscriber, network, device, subscription, roaming, and more.

In part 2 of this series, I demonstrate how you can enable data analysts, scientists, and advanced business users to query data from Amazon Redshift or Amazon S3 directly.

As always, AWS welcomes feedback. This is a wide space to explore, so reach out to us if you want a deep dive into building this solution and more on AWS. Please submit comments or questions in the comments section.


About the Author

Vikas Omer is an analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM), and data monetization, with over 11 years of experience in the telecommunications industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Validate, evolve, and control schemas in Amazon MSK and Amazon Kinesis Data Streams with AWS Glue Schema Registry

Post Syndicated from Brian Likosar original https://aws.amazon.com/blogs/big-data/validate-evolve-and-control-schemas-in-amazon-msk-and-amazon-kinesis-data-streams-with-aws-glue-schema-registry/

Data streaming technologies like Apache Kafka and Amazon Kinesis Data Streams capture and distribute data generated by thousands or millions of applications, websites, or machines. These technologies serve as a highly available transport layer that decouples the data-producing applications from data processors. However, the sheer number of applications producing, processing, routing, and consuming data can make it hard to coordinate and evolve data schemas, like adding or removing a data field, without introducing data quality issues and downstream application failures. Developers often build complex tools, write custom code, or rely on documentation, change management, and Wikis to protect against schema changes. This is quite error prone because it relies too heavily on human oversight. A common solution with data streaming technologies is a schema registry that provides for validation of schema changes to allow for safe evolution as business needs adjust over time.

AWS Glue Schema Registry, a serverless feature of AWS Glue, enables you to validate and reliably evolve streaming data against Apache Avro schemas at no additional charge. Through Apache-licensed serializers and deserializers, the Glue Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

This post explains the benefits of using the Glue Schema Registry and provides examples of how to use it with both Apache Kafka and Kinesis Data Streams.

With the Glue Schema Registry, you can eliminate defensive coding and cross-team coordination, improve data quality, reduce downstream application failures, and use a registry that is integrated across multiple AWS services. Each schema can be versioned within the guardrails of a compatibility mode, providing developers the flexibility to reliably evolve schemas. Additionally, the Glue Schema Registry can serialize data into a compressed format, helping you save on data transfer and storage costs.

Although there are many ways to leverage the Glue Schema Registry (including using the API to build your own integrations), in this post, we show two use cases. The Schema Registry is a free feature that can significantly improve data quality and developer productivity. If you use Avro schemas, you should be using the Schema Registry to supplement your solutions built on Apache Kafka (including Amazon MSK) or Kinesis Data Streams. The following diagram illustrates this architecture.

AWS Glue Schema Registry features

Glue Schema Registry has the following features:

  • Schema discovery – When a producer registers a schema change, metadata can be applied as a key-value pair to provide searchable information for administrators or developers. This metadata can indicate the original source of the data (source=MSK_west), the team name to contact (owner=DataEngineering), or AWS tags (environment=Production). You could potentially encrypt a field in your data on the producing client and use metadata to specify to potential consumer clients which public key fingerprint to use for decryption.
  • Schema compatibility – The versioning of each schema is governed by a compatibility mode. If a new version of a schema is requested to be registered that breaks the specified compatibility mode, the request fails and an exception is thrown. Compatibility checks enable developers building downstream applications to have a bounded set of scenarios to build applications against, which helps to prepare for the changes without issue. Commonly used modes are FORWARD, BACKWARD, and FULL. For more information about mode definitions, see Schema Versioning and Compatibility.
  • Schema validation – Glue Schema Registry serializers work to validate that the schema used during data production is compatible. If it isn’t, the data producer receives an exception from the serializer. This ensures that potentially breaking changes are found earlier in development cycles, and can also help prevent unintentional schema changes due to human error.
  • Auto-registration of schemas – If configured to do so, the producer of data can auto-register schema changes as they flow in the data stream. This is especially useful for use cases where the source of the data is change data capture from a database.
  • IAM support – Thanks to integrated AWS Identity and Access Management (IAM) support, only authorized producers can change certain schemas. Furthermore, only those consumers authorized to read the schema can do so. Schema changes are typically performed deliberately and with care, so it’s important to use IAM to control who performs these changes. Additionally, access control to schemas is important in situations where you might have sensitive information included in the schema definition itself. In the examples that follow, IAM roles are inferred via the AWS SDK for Java, so they are inherited from the Amazon Elastic Compute Cloud (Amazon EC2) instance’s role that the application runs in. IAM roles can also be applied to any other AWS service that could contain this code, such as containers or Lambda functions.
  • Integrations and other support – The provided serializers and deserializers are currently for Java clients using Apache Avro for data serialization. The GitHub repo also contains support for Apache Kafka Streams, Apache Kafka Connect, and Apache Flink—all licensed using the Apache License 2.0. We’re already working on additional language and data serialization support, but we need your feedback on what you’d like to see next.
  • Secondary deserializer – If you have already registered schemas in another schema registry, there’s an option for specifying a secondary deserializer when performing schema lookups. This allows for migrations from other schema registries without having to start anew. If the schema ID being used isn’t known to the Glue Schema Registry, it’s looked for in the secondary deserializer.
  • Compression – Using the Avro format already reduces message size due to its compact, binary format. Using a schema registry can further reduce data payload by no longer needing to send and receive schemas with each message. Glue Schema Registry libraries also provide an option for zlib compression, which can reduce data requirements even further by compressing the payload of the message. This varies by use case, but compression can reduce the size of the message significantly.

Example schema

For this post, we use the following schema to begin each of our use cases:

{
 "namespace": "Customer.avro",
 "type": "record",
 "name": "Customer",
 "fields": [
 {"name": "first_name", "type": "string"},
 {"name": "last_name", "type": "string"}
 ]
}

Using AWS Glue Schema Registry with Amazon MSK and Apache Kafka

You can use the following Apache Kafka producer code to produce Apache Avro formatted messages to a topic with the preceding schema:

package com.amazon.gsrkafka;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import com.amazonaws.services.schemaregistry.serializers.avro.AWSKafkaAvroSerializer;
import com.amazonaws.services.schemaregistry.serializers.avro.AWSAvroSerializer;
import com.amazonaws.services.schemaregistry.utils.AvroRecordType;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import software.amazon.awssdk.services.glue.model.Compatibility;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Parser;
import java.util.Properties;
import java.io.IOException;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.File;

public class gsrkafka {
private static final Properties properties = new Properties();
private static final String topic = "test";
public static void main(final String[] args) throws IOException {
// Set the default synchronous HTTP client to UrlConnectionHttpClient
System.setProperty("software.amazon.awssdk.http.service.impl", "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService");
properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, AWSKafkaAvroSerializer.class.getName());
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, AWSKafkaAvroSerializer.class.getName());
properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "liko-schema-registry");
properties.put(AWSSchemaRegistryConstants.SCHEMA_NAME, "customer");
properties.put(AWSSchemaRegistryConstants.COMPATIBILITY_SETTING, Compatibility.FULL);
properties.put(AWSSchemaRegistryConstants.SCHEMA_AUTO_REGISTRATION_SETTING, true);
Schema schema_customer = new Parser().parse(new File("Customer.avsc"));
GenericRecord customer = new GenericData.Record(schema_customer);

try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<String, GenericRecord>(properties)) {
final ProducerRecord<String, GenericRecord> record = new ProducerRecord<String, GenericRecord>(topic, customer);
customer.put("first_name", "Ada");
customer.put("last_name", "Lovelace");
customer.put("full_name", "Ada Lovelace");
producer.send(record);
System.out.println("Sent message");
Thread.sleep(1000L);

customer.put("first_name", "Sue");
customer.put("last_name", "Black");
customer.put("full_name", "Sue Black");
producer.send(record);
System.out.println("Sent message");
Thread.sleep(1000L);

customer.put("first_name", "Anita");
customer.put("last_name", "Borg");
customer.put("full_name", "Anita Borg");
producer.send(record);
System.out.println("Sent message");
Thread.sleep(1000L);

customer.put("first_name", "Grace");
customer.put("last_name", "Hopper");
customer.put("full_name", "Grace Hopper");
producer.send(record);
System.out.println("Sent message");
Thread.sleep(1000L);

customer.put("first_name", "Neha");
customer.put("last_name", "Narkhede");
customer.put("full_name", "Neha Narkhede");
producer.send(record);
System.out.println("Sent message");
Thread.sleep(1000L);
producer.flush();
System.out.println("Successfully produced 5 messages to a topic called " + topic);
} catch (final InterruptedException | SerializationException e) {
e.printStackTrace();
}
}
}

Use the following Apache Kafka consumer code to look up the schema information while consuming from a topic to learn the schema details:

package com.amazon.gsrkafka;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import com.amazonaws.services.schemaregistry.deserializers.avro.AWSKafkaAvroDeserializer;
import com.amazonaws.services.schemaregistry.deserializers.avro.AWSAvroDeserializer;
import com.amazonaws.services.schemaregistry.utils.AvroRecordType;
import com.amazonaws.services.schemaregistry.utils.AWSSchemaRegistryConstants;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.util.Collections;
import java.util.Properties;
import java.io.IOException;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.File;


public class gsrkafka {
private static final Properties properties = new Properties();
private static final String topic = "test";
public static void main(final String[] args) throws IOException {
// Set the default synchronous HTTP client to UrlConnectionHttpClient
System.setProperty("software.amazon.awssdk.http.service.impl", "software.amazon.awssdk.http.urlconnection.UrlConnectionSdkHttpService");
properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
properties.put(ConsumerConfig.GROUP_ID_CONFIG, "gsr-client");
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, AWSKafkaAvroDeserializer.class.getName());
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, AWSKafkaAvroDeserializer.class.getName());
properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "liko-schema-registry");
properties.put(AWSSchemaRegistryConstants.AVRO_RECORD_TYPE, AvroRecordType.GENERIC_RECORD.getName());

try (final KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<String, GenericRecord>(properties)) {
consumer.subscribe(Collections.singletonList(topic));
while (true) {
final ConsumerRecords<String, GenericRecord> records = consumer.poll(1000);
for (final ConsumerRecord<String, GenericRecord> record : records) {
final GenericRecord value = record.value();
System.out.println("Received message: value = " + value);
}
			}
} catch (final SerializationException e) {
e.printStackTrace();
}
}
}

Using AWS Glue Schema Registry with Kinesis Data Streams

You can use the following Kinesis Producer Library (KPL) code to publish messages in Apache Avro format to a Kinesis data stream with the preceding schema:

private static final String SCHEMA_DEFINITION = "{\"namespace\": \"Customer.avro\",\n"
+ " \"type\": \"record\",\n"
+ " \"name\": \"Customer\",\n"
+ " \"fields\": [\n"
+ " {\"name\": \"first_name\", \"type\": \"string\"},\n"
+ " {\"name\": \"last_name\", \"type\": \"string\"}\n"
+ " ]\n"
+ "}";

KinesisProducerConfiguration config = new KinesisProducerConfiguration();
config.setRegion("us-west-1");

//[Optional] configuration for Schema Registry.

GlueSchemaRegistryConfiguration schemaRegistryConfig = 
new GlueSchemaRegistryConfiguration("us-west-1");

schemaRegistryConfig.setCompression(true);

config.setGlueSchemaRegistryConfiguration(schemaRegistryConfig);

///Optional configuration ends.

final KinesisProducer producer = 
new KinesisProducer(config);

final ByteBuffer data = getDataToSend();

com.amazonaws.services.schemaregistry.common.Schema gsrSchema = 
new Schema(SCHEMA_DEFINITION, DataFormat.AVRO.toString(), "demoSchema");

ListenableFuture<UserRecordResult> f = producer.addUserRecord(
config.getStreamName(), TIMESTAMP, Utils.randomExplicitHashKey(), data, gsrSchema);

private static ByteBuffer getDataToSend() {
org.apache.avro.Schema avroSchema = 
new org.apache.avro.Schema.Parser().parse(SCHEMA_DEFINITION);

GenericRecord user = new GenericData.Record(avroSchema);
user.put("first_name", "Grace");
user.put("last_name", "Hopper");

ByteArrayOutputStream outBytes = new ByteArrayOutputStream();
Encoder encoder = EncoderFactory.get().directBinaryEncoder(outBytes, null);
new GenericDatumWriter<>(avroSchema).write(user, encoder);
encoder.flush();
return ByteBuffer.wrap(outBytes.toByteArray());
}

On the consumer side, you can use the Kinesis Client Library (KCL) (v2.3 or later) to look up schema information while retrieving messages from a Kinesis data stream:

GlueSchemaRegistryConfiguration schemaRegistryConfig = 
new GlueSchemaRegistryConfiguration(this.region.toString());

 GlueSchemaRegistryDeserializer glueSchemaRegistryDeserializer = 
new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.builder().build(), schemaRegistryConfig);

 RetrievalConfig retrievalConfig = configsBuilder.retrievalConfig().retrievalSpecificConfig(new PollingConfig(streamName, kinesisClient));
 retrievalConfig.glueSchemaRegistryDeserializer(glueSchemaRegistryDeserializer);
 
Scheduler scheduler = new Scheduler(
configsBuilder.checkpointConfig(),
configsBuilder.coordinatorConfig(),
configsBuilder.leaseManagementConfig(),
configsBuilder.lifecycleConfig(),
configsBuilder.metricsConfig(),
configsBuilder.processorConfig(),
retrievalConfig
);

 public void processRecords(ProcessRecordsInput processRecordsInput) {
MDC.put(SHARD_ID_MDC_KEY, shardId);
try {
log.info("Processing {} record(s)", 
processRecordsInput.records().size());
processRecordsInput.records()
.forEach(
r -> 
log.info("Processed record pk: {} -- Seq: {} : data {} with schema: {}", 
r.partitionKey(), r.sequenceNumber(), recordToAvroObj(r).toString(), r.getSchema()));
} catch (Throwable t) {
log.error("Caught throwable while processing records. Aborting.");
Runtime.getRuntime().halt(1);
} finally {
MDC.remove(SHARD_ID_MDC_KEY);
}
 }
 
 private GenericRecord recordToAvroObj(KinesisClientRecord r) {
byte[] data = new byte[r.data().remaining()];
r.data().get(data, 0, data.length);
org.apache.avro.Schema schema = new org.apache.avro.Schema.Parser().parse(r.schema().getSchemaDefinition());
DatumReader datumReader = new GenericDatumReader<>(schema);

BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(data, 0, data.length, null);
return (GenericRecord) datumReader.read(null, binaryDecoder);
 }

Example of schema evolution

As a producer, let’s say you want to add an additional field to our schema:

{
 "namespace": "Customer.avro",
 "type": "record",
 "name": "Customer",
 "fields": [
 {"name": "first_name", "type": "string"},
 {"name": "last_name", "type": "string"},
 {"name": "full_name", "type": ["string", “null”], “default”: null}
]
}

Regardless of whether you’re following the Apache Kafka or Kinesis Data Streams example, you can use the previously provided producer code to publish new messages using this new schema version with the full_name field. This is simply a concatenation of first_name and last_name.

This schema change added an optional field (full_name), which is indicated by the type field having an option of null in addition to string with a default of null. In adding this optional field, we’ve created a schema evolution. This qualifies as a FORWARD compatible change because the producer has modified the schema and the consumer can read without updating its version of the schema. It’s a good practice to provide a default for a given field. This allows for its eventual removal if necessary. If it’s removed by the producer, the consumer uses the default that it knew for that field from before the removal.

This change is also a BACKWARD compatible change, because if the consumer changes the schema it expects to receive, it can use that default to fill in the value for the field it isn’t receiving. By being both FORWARD and BACKWARD compatible, it is therefore a FULL compatible change. The Glue Schema Registry serializers default to BACKWARD compatible, so we have to add a line declaring it as FULL.

In looking at the full option set, you may find FORWARD_ALL, BACKWARD_ALL, and FULL_ALL. These typically only come into play when you want to change data types for a field whose name you don’t change. The most common observed compatibility mode is BACKWARD, which is why it’s the default.

As a consumer application, however, you don’t want to have to recompile your application to handle the addition of a new field. If you want to reference the customer by full name, that’s your choice in your app instead of being forced to consume the new field and use it. When you consume the new messages you’ve just produced, your application doesn’t crash or have problems, because it’s still using the prior version of the schema, and that schema change is compatible with your application. To experience this in action, run the consumer code in one window and don’t interrupt it. As you run the producer application again, this time with messages following the new schema, you can still see output without issue, thanks to the Glue Schema Registry.

Conclusion

In this post, we discussed the benefits of using the Glue Schema Registry to register, validate, and evolve schemas for data streams as business needs change. We also provided examples of how to use Glue Schema Registry with Apache Kafka and Kinesis Data Streams.

For more information and to get started, see AWS Glue Schema Registry.


About the Authors

Brian Likosar is a Senior Streaming Specialist Solutions Architect at Amazon Web Services. Brian loves helping customers capture value from real-time streaming architectures, because he knows life doesn’t happen in batch. He’s a big fan of open-source collaboration, theme parks, and live music.

 

 

Larry Heathcote is a Senior Product Marketing Manager at Amazon Web Services for data streaming and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys walking his Samoyed Sasha in the mornings so she can look for squirrels to bark at.

 

 

The best new features for data analysts in Amazon Redshift in 2020

Post Syndicated from Helen Anderson original https://aws.amazon.com/blogs/big-data/the-best-new-features-for-data-analysts-in-amazon-redshift-in-2020/

This is a guest post by Helen Anderson, data analyst and AWS Data Hero

Every year, the Amazon Redshift team launches new and exciting features, and 2020 was no exception. New features to improve the data warehouse service and add interoperability with other AWS services were rolling out all year.

I am part of a team that for the past 3 years has used Amazon Redshift to store source tables from systems around the organization and usage data from our software as a service (SaaS) product. Amazon Redshift is our one source of truth. We use it to prepare operational reports that support the business and for ad hoc queries when numbers are needed quickly.

When AWS re:Invent comes around, I look forward to the new features, enhancements, and functionality that make things easier for analysts. If you haven’t tried Amazon Redshift in a while, or even if you’re a longtime user, these new capabilities are designed with analysts in mind to make it easier to analyze data at scale.

Amazon Redshift ML

The newly launched preview of Amazon Redshift ML lets data analysts use Amazon SageMaker over datasets in Amazon Redshift to solve business problems without the need for a data scientist to create custom models.

As a data analyst myself, I find this one of the most interesting announcements to come out of re:Invent 2020. Analysts generally use SQL to query data and present insights, but they don't often do data science work as well. Now there is no need to wait for a data scientist or to learn a new language to create predictive models.

For information about what you need to get started with Amazon Redshift ML, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.
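
To give a sense of what this looks like in practice, the following is a hedged sketch of creating a model with SQL through the Amazon Redshift Data API. The cluster, table, column, IAM role, and S3 bucket names are all placeholders, and the CREATE MODEL statement follows the Redshift ML preview syntax rather than code from this post.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# A hypothetical churn model trained from a SELECT statement. Redshift ML
# hands the training data off to SageMaker and registers a prediction
# function that can then be called from SQL.
create_model_sql = """
CREATE MODEL customer_churn
FROM (SELECT age, plan_type, monthly_usage, churned FROM customer_activity)
TARGET churned
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
"""

response = rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=create_model_sql,
)
print(response["Id"])  # statement ID you can poll with describe_statement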

Federated queries

As analysts, we often have to join datasets that aren’t in the same format and sometimes aren’t ready for use in the same place. By using federated queries to access data in other databases or Amazon Simple Storage Service (Amazon S3), you don’t need to wait for a data engineer or ETL process to move data around.

re:Invent 2019 featured some interesting talks from Amazon Redshift customers who were tackling this problem. Now that federated queries over operational databases like Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL are generally available and querying Amazon RDS for MySQL and Amazon Aurora MySQL is in preview, I’m excited to hear more.

For a step-by-step example to help you get started, see Build a Simplified ETL and Live Data Query Solution Using Redshift Federated Query.
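
As a rough illustration, the statement below registers an operational PostgreSQL database as an external schema so it can be joined with local Amazon Redshift tables in a single query. Every identifier, endpoint, and ARN here is a placeholder; treat it as a sketch of the federated query setup, not a drop-in script.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Map an Amazon RDS for PostgreSQL database into Redshift as an external
# schema; credentials come from AWS Secrets Manager.
federated_schema_sql = """
CREATE EXTERNAL SCHEMA postgres_orders
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'orders-db.abc123xyz456.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-db-creds';
"""

rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=federated_schema_sql,
)

# After that, a query can join live operational rows with warehouse tables,
# for example: SELECT ... FROM postgres_orders.orders o JOIN dim_customer c ...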

SUPER data type

Another problem we face as analysts is that the data we need isn’t always in rows and columns. The new SUPER data type makes JSON data easy to use natively in Amazon Redshift with PartiQL.

PartiQL is an extension that helps analysts get up and running quickly with structured and semistructured data so you can unnest and query using JOINs and aggregates. This is really exciting for those who deal with data coming from applications that store data in JSON or unstructured formats.

For use cases and a quickstart, see Ingesting and querying semistructured data in Amazon Redshift (preview).
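
The following hedged sketch shows the general shape of working with SUPER and PartiQL: store a raw JSON event, then navigate and unnest it with dot and FROM-clause syntax. Table, cluster, and field names are placeholders based on the preview documentation, not code from this post.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

statements = [
    # A SUPER column holds the semistructured payload as-is.
    "CREATE TABLE IF NOT EXISTS app_events (event_id INT, payload SUPER);",
    # JSON_PARSE converts a JSON string into a SUPER value.
    """INSERT INTO app_events SELECT 1, JSON_PARSE('{"customer": {"id": 42, "plan": "pro"}, "clicks": [{"page": "home"}, {"page": "pricing"}]}');""",
    # Dot notation navigates nested objects; iterating in the FROM clause
    # unnests the clicks array so it can be joined and aggregated.
    "SELECT e.payload.customer.id, c.page FROM app_events e, e.payload.clicks c;",
]

for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )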

Partner console integration

The preview of the native console integration with partners announced at AWS re:Invent 2020 will also make data analysis quicker and easier. Although analysts might not be doing the ETL work themselves, this new release makes it easier to move data from platforms like Salesforce, Google Analytics, and Facebook Ads into Amazon Redshift.

Matillion, Sisense, Segment, Etleap, and Fivetran are launch partners, with other partners coming soon. If you’re an Amazon Redshift partner and would like to integrate into the console, contact [email protected].

RA3 nodes with managed storage

Previously, when you added Amazon Redshift nodes to a cluster, both storage and compute were scaled up together. This all changed with the 2019 announcement of RA3 nodes, which scale storage and compute independently.

In 2020, the Amazon Redshift team introduced RA3.xlplus nodes, which offer even more compute sizing options to address a broader set of workload requirements.
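
If you want to try the new node size, a cluster can be launched on RA3 nodes the same way as any other node type. The following boto3 sketch uses placeholder identifiers and credentials; it simply shows ra3.xlplus being passed as the node type.

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Placeholder identifiers and credentials. Managed storage comes with the
# RA3 node family, so compute can later be resized without reloading data.
redshift.create_cluster(
    ClusterIdentifier="analytics-ra3",
    NodeType="ra3.xlplus",
    ClusterType="multi-node",
    NumberOfNodes=2,
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe123!",
    DBName="dev",
)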

AQUA for Amazon Redshift

As analysts, we want our queries to run quickly so we can spend more time empowering the users of our insights and less time watching data slowly return. AQUA, the Advanced Query Accelerator for Amazon Redshift, tackles this problem at the infrastructure level by bringing the stored data closer to the compute power.

This hardware-accelerated cache enables Amazon Redshift to run up to 10 times faster as it scales out and processes data in parallel across many nodes. Each node accelerates compression, encryption, and data processing tasks like scans, aggregates, and filtering. Analysts should still try their best to write efficient code, but the power of AQUA will speed up the return of results considerably.

AQUA is available on Amazon Redshift RA3 instances at no additional cost. To get started with AQUA, sign up for the preview.

The following diagram shows Amazon Redshift architecture with an AQUA layer.

Figure 1: Amazon Redshift architecture with AQUA layer

Automated performance tuning

For analysts who haven’t used sort and distribution keys, the learning curve can be steep. A table created with the wrong keys can mean results take much longer to return.

Automatic table optimization tackles this problem by using machine learning to select the best keys and tune the physical design of tables. Letting Amazon Redshift determine how to improve cluster performance reduces manual effort.
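
For existing tables, you can opt in to automatic optimization explicitly. The following is a small hedged sketch using the Redshift Data API; the table and cluster names are placeholders, and the ALTER TABLE statements follow the automatic table optimization documentation rather than anything shown in this post.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Hand distribution and sort key choices for an existing table over to
# Amazon Redshift's automatic table optimization.
for sql in (
    "ALTER TABLE sales ALTER DISTSTYLE AUTO;",
    "ALTER TABLE sales ALTER SORTKEY AUTO;",
):
    rsd.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )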

Summary

These are just some of the Amazon Redshift announcements made in 2020 to help analysts get query results faster. Some of these features help you get access to the data you need, whether it’s in Amazon Redshift or somewhere else. Others are under-the-hood enhancements that make things run smoothly with less manual effort.

For more information about these announcements and a complete list of new features, see What’s New in Amazon Redshift.


About the Author

Helen Anderson is a Data Analyst based in Wellington, New Zealand. She is well known in the data community for writing beginner-friendly blog posts, teaching, and mentoring those who are new to tech. As a woman in tech and a career switcher, Helen is particularly interested in inspiring those who are underrepresented in the industry.

How FanDuel Group secures personally identifiable information in a data lake using AWS Lake Formation

Post Syndicated from Damian Grech original https://aws.amazon.com/blogs/big-data/how-fanduel-group-secures-personally-identifiable-information-in-a-data-lake-using-aws-lake-formation/

This post is co-written with Damian Grech from FanDuel

FanDuel Group is an innovative sports-tech entertainment company that is changing the way consumers engage with their favorite sports, teams, and leagues. The premier gaming destination in the US, FanDuel Group consists of a portfolio of leading brands across gaming, sports betting, daily fantasy sports, advance-deposit wagering, and TV/media, including FanDuel, Betfair US, and TVG. FanDuel Group has a presence across 50 states and over 8.5 million customers. The company is based in New York with offices in California, New Jersey, Florida, Oregon, and Scotland. FanDuel Group is a subsidiary of Flutter Entertainment plc, the world’s largest sports betting and gaming operator with a portfolio of globally recognized brands and a constituent of the FTSE 100 index of the London Stock Exchange.

In this post, we discuss how FanDuel used AWS Lake Formation and Amazon Redshift Spectrum to restrict access to personally identifiable information (PII) in their data lake.

The challenge

In 2018, a series of mergers led to the creation of FanDuel Group, and the combined data engineering team found themselves operating three data warehouses running on Amazon Redshift. The team decided to create a new single platform to replace the three separate warehouses, consisting of a data warehouse containing the core business data model and a data lake to catalog and hold all other types of data. FanDuel's vision was to create a unified data platform that served their data requirements. This included the ability to ingest and organize real-time and batch datasets, and to secure and govern PII.

Because the end-users of the existing data warehouses were familiar with Amazon Redshift, it was critical that they be able to access the data lake using Amazon Redshift. Other important architecture considerations included a simplified user experience, the ability to scale to huge data volumes, and a robust security model to provision relevant data to analysts and data scientists.

To accomplish the vision, FanDuel decided to modernize the data platform and introduce Amazon Simple Storage Service (Amazon S3)-based data lakes. Data lakes are a logical construct that allows data to be stored in its native format using open data formats. With a data lake architecture, FanDuel can enable data analysts to analyze large volumes of data without significant modeling. Data lakes also allow FanDuel to store structured and unstructured data.

Some of the data to be stored in the data lake was customer PII, so access to this category of data needed to be carefully restricted to only employees who required access to perform their job functions. To address these security challenges, FanDuel first tested a tag-based approach on Amazon S3 to restrict access to the PII data. The idea was to write two copies of each dataset, one with PII and one without, tag the files where PII is stored, and secure the files using AWS Identity and Access Management (IAM) policies. This approach was complex and needed 100–200 hours of development time for every data source that was ingested.

Solution overview

FanDuel decided to use Lake Formation and Redshift Spectrum to solve this challenge. The following architectural diagram shows how FanDuel secured their data lake.

The solution includes the following steps:

  1. The FanDuel team registered the S3 location in Lake Formation.

After the location is registered, Lake Formation takes control of the data lake, thereby eliminating the need to set up complicated policies in IAM.

  2. FanDuel built AWS Glue ETL jobs to extract data from sources, including MySQL databases and flat files. They used AWS Glue to cleanse and transform raw data to form refined datasets stored in Parquet-formatted files. They also used AWS Glue crawlers to register the cleansed datasets in the Data Catalog.
  3. The team used Lake Formation to set up column-based permissions using two roles (a boto3 sketch of these grants follows the GRANT statements below):
    1. LimitedPIIAnalyst – Granted access to all columns. Only analysts who needed access to PII data were assigned this role.
    2. NonPIIAnalyst – Granted access to non-PII columns. By default, analysts using the data lake were assigned this role.
  4. FanDuel created two external schemas using Redshift Spectrum: one using the NonPIIAnalyst role, and one using the LimitedPIIAnalyst role. The following code is an example of the DDL that uses the roles that were set up in Lake Formation:
    CREATE EXTERNAL SCHEMA nonpii_data_lake FROM DATA CATALOG
    DATABASE 'fanduel_data_lake' REGION 'us-east-1'
    IAM_ROLE 'arn:aws:iam::123456789012:role/NonPIIAnalyst';
    
    CREATE EXTERNAL SCHEMA limitedpii_data_lake FROM DATA CATALOG
    DATABASE 'fanduel_data_lake' REGION 'us-east-1'
    IAM_ROLE 'arn:aws:iam::123456789012:role/LimitedPIIAnalyst';
    

FanDuel could already manage access permissions by adding or removing users from a group in Amazon Redshift, so they already had a group consisting of only the analysts who should be permitted access to PII. The following code grants this group access to the limitedpii_data_lake schema, which effectively means only this group can query the data lake using the LimitedPIIAnalyst role:

GRANT USAGE ON SCHEMA nonpii_data_lake TO base_group;
GRANT SELECT ON ALL TABLES IN SCHEMA nonpii_data_lake TO base_group;
GRANT USAGE ON SCHEMA limitedpii_data_lake TO pii_permitted_group;
GRANT SELECT ON ALL TABLES IN SCHEMA limitedpii_data_lake TO pii_permitted_group;
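
The post sets up the column-based permissions through the Lake Formation console. As a rough sketch of how the same grants could be expressed programmatically, the following boto3 calls grant SELECT on all non-PII columns to one role and on all columns to the other. The table name and excluded column names are placeholders; only the database name matches the DDL above.

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Non-PII role: every column except the PII ones (placeholder column names).
lf.grant_permissions(
    Principal={"DataLakePrincipalArn": "arn:aws:iam::123456789012:role/NonPIIAnalyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "fanduel_data_lake",
            "Name": "customers",
            "ColumnWildcard": {"ExcludedColumnNames": ["email", "date_of_birth"]},
        }
    },
    Permissions=["SELECT"],
)

# Limited-PII role: all columns of the same table.
lf.grant_permissions(
    Principal={"DataLakePrincipalArn": "arn:aws:iam::123456789012:role/LimitedPIIAnalyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "fanduel_data_lake",
            "Name": "customers",
            "ColumnWildcard": {},
        }
    },
    Permissions=["SELECT"],
)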

Benefits

The ability to extend queries to the data lake with Redshift Spectrum and have column-level access control provides superior control over the S3 tag-based permissions approach that was originally considered. This architecture provided the following benefits for FanDuel:

  • FanDuel could offer new capabilities to data analysts. For example, data analysts could quickly access raw data with PII and combine it with existing data in Amazon Redshift. Lake Formation provided a single view for monitoring the data access patterns.
  • Lake Formation column-level access control allowed them to secure PII data, which otherwise would have taken a complex S3 tag-based approach. This saved 100–200 hours of development time for every new data source and data footprint, because the original approach required creating two files (one with PII and another without PII), tagging files, and setting up permissions based on tags.
  • The ability to extend access from Amazon Redshift to the data lake with appropriate access control has allowed FanDuel to reduce data stored in Amazon Redshift.

Conclusion

FanDuel will leverage its new data platform to ingest additional data sources with real-time data so analysts and data scientists can gain insights and improve customer experience.

Questions or feedback? Send an email to [email protected].


About the Authors

Damian Grech is a Data Engineering Senior Manager at FanDuel. Damian has over 15 years of experience in software delivery and has worked with organizations ranging from large enterprises to start-ups at their infant stages. In his spare time, you can find him either experimenting in the kitchen or trailing the Scottish Highlands.

Shiv Narayanan is Global Business Development Manager for Data Lakes and Analytics solutions at AWS. He works with AWS customers across the globe to strategize, build, develop and deploy modern data platforms. Shiv loves music, travel, food and trying out new tech.

Sidhanth Muralidhar is a Senior Technical Account Manager at Amazon Web Services. He works with large enterprise customers who run their workloads on AWS. He is passionate about working with customers and helping them in their cloud journey. In his spare time, he loves to play and watch football.

Mythbusting the Analytics Journey

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/mythbusting-the-analytics-journey-58d692ea707e

Part of our series on who works in Analytics at Netflix — and what the role entails

by Alex Diamond

This Q&A aims to mythbust some common misconceptions about succeeding in analytics at a big tech company.

This isn’t your typical recruiting story. I wasn’t actively looking for a new job and Netflix was the only place I applied. I didn’t know anyone who worked there and just submitted my resume through the Jobs page 🤷🏼‍♀️ . I wasn’t even entirely sure what the right role fit would be and originally applied for a different position, before being redirected to the Analytics Engineer role. So if you find yourself in a similar situation, don’t be discouraged!

How did you come to Netflix?

Movies and TV have always been one of my primary sources of joy. I distinctly remember being a teenager, perching my laptop on the edge of the kitchen table to “borrow” my neighbor’s WiFi (back in the days before passwords 👵🏻), and streaming my favorite Netflix show. I felt a little bit of ✨magic✨ come through the screen each time, and that always stuck with me. So when I saw the opportunity to actually contribute in some way to making the content I loved, I jumped at it. Working in Studio Data Science & Engineering (“Studio DSE”) was basically a dream come true.

Not only did I find the subject matter interesting, but the Netflix culture seemed to align with how I do my best work. I liked the idea of Freedom and Responsibility, especially if it meant having autonomy to execute projects all the way from inception through completion. Another major point of interest for me was working with “stunning colleagues”, from whom I could continue to learn and grow.

What was your path to working with data?

My road-to-data was more of a stumbling-into-data. I went to an alternative high school for at-risk students and had major gaps in my formal education — not exactly a head start. I then enrolled at a local public college at 16. When it was time to pick a major, I was struggling in every subject except one: Math. I completed a combined math bachelors + masters program, but without any professional guidance, networking, or internships, I was entirely lost. I had the piece of paper, but what next? I held plenty of jobs as a student, but now I needed a career.

A visual representation of all the jobs I had in high school and college: From pizza, to gourmet rice krispie treats, to clothing retail, to doors and locks

After receiving a grand total of *zero* interviews from sending out my resume, the natural next step was…more school. I entered a PhD program in Computer Science and shortly thereafter discovered I really liked the coding aspects more than the theory. So I earned the honor of being a PhD dropout.

A visual representation of all the hats I’ve worn

And here’s where things started to click! I used my newfound Python and SQL skills to land an entry-level Business Intelligence Analyst position at a company called Big Ass Fans. They make — you guessed it — very large industrial ventilation fans. I was given the opportunity to branch out and learn new skills to tackle any problem in front of me, aka my “becoming useful” phase. Within a few months I’d picked up BI tools, predictive modeling, and data ingestion/ETL. After a few years of wearing many different proverbial hats, I put them all to use in the Analytics Engineer role here. And ever since, Netflix has been a place where I can do my best work, put to use the skills I’ve gathered over the years, and grow in new ways.

What does an ordinary day look like?

As part of the Studio DSE team, our work is focused on aiding the movie-making process for our Netflix Originals, leading all the way up to a title’s launch on the service. Despite the affinity for TV and movies that brought me here, I didn’t actually know very much about how they got made. But over time, and by asking lots of questions, I’ve picked up the industry lingo! (Can you guess what “DOOD” stands for?)

My main stakeholders are members of our Studio team. They’re experts on the production process and an invaluable resource for me, sharing their expertise and providing context when I don’t know what something means. True to the “people over process” philosophy, we adapt alongside our stakeholders’ needs throughout the production process. That means the work products don’t always fit what you might imagine a traditional Analytics Engineer builds — if such a thing even exists!

A typical production lifecycle

On an ordinary day, my time is generally split evenly across:

  • 🤝📢 Speaking with stakeholders to understand their primary needs
  • 🐱💻 Writing code (SQL, Python)
  • 📊📈 Building visual outputs (Tableau, memos, scrappy web apps)
  • 🤯✍️ Brainstorming and vision planning for future work

Some days have more of one than the others, but variety is the spice of life! The one constant is that my day always starts with a ridiculous amount of coffee. And that it later continues with even more coffee. ☕☕☕

What advice would you give to someone just starting their career in data?

🐾 Dip your toes in things. As you try new things, your interests will evolve and you’ll pick up skills across a broad span of subject areas. The first time I tried building the front-end for a small web app, it wasn’t very pretty. But it piqued my interest and after a few times it started to become second nature.

💪 Find your strengths and weaknesses. You don’t have to be an expert in everything. Just knowing when to reach out for guidance on something allows you to uplevel your skills in that area over time. My weakness is statistics: I can use it when needed but it’s just not a subject that comes naturally to me. I own that about myself and lean on my stats-loving peers when needed.

🌸 Look for roles that allow you to grow. As you grow in your career, you’ll provide impact to the business in ways you didn’t even expect. As a business intelligence analyst, I gained data science skills. And in my current Analytics Engineer role, I’ve picked up a lot of product management and strategic thinking experience.

This is what I look like.

☝️ One Last Thing

I started off my career with the vague notion of, “I guess I want to be a data scientist?” But what that’s meant in practice has really varied depending on the needs of each job and project. It’s ok if you don’t have it all figured out. Be excited to try new things, lean into strengths, and don’t be afraid of your weaknesses — own them.

If this post resonates with you and you’d like to explore opportunities with Netflix, check out our analytics site, search open roles, and learn about our culture. You can also find more stories like this here.


Mythbusting the Analytics Journey was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Harness the power of your data with AWS Analytics

Post Syndicated from Rahul Pathak original https://aws.amazon.com/blogs/big-data/harness-the-power-of-your-data-with-aws-analytics/

2020 has reminded us of the need to be agile in the face of constant and sudden change. Every customer I’ve spoken to this year has had to do things differently because of the pandemic. Some are focusing on driving greater efficiency in their operations and others are experiencing a massive amount of growth. Across the board, I see organizations looking to use their data to make better decisions quickly as changes occur. Such agility requires that they integrate terabytes to petabytes and sometimes exabytes of data that were previously siloed in order to get a complete view of their customers and business operations. Traditional on-premises data analytics solutions can’t handle this approach because they don’t scale well enough and are too expensive. As a result, we’re seeing an acceleration in customers looking to modernize their data and analytics infrastructure by moving to the cloud.

Customer data in the real world

To analyze these vast amounts of data, many companies are moving all their data from various silos into a single location, often called a data lake, to perform analytics and machine learning (ML). These same companies also store data in purpose-built data stores for the performance, scale, and cost advantages they provide for specific use cases. Examples of such data stores include data warehouses—to get quick results for complex queries on structured data—and technologies like Elasticsearch—to quickly search and analyze log data to monitor the health of production systems. A one-size-fits-all approach to data analytics no longer works because it inevitably leads to compromises.

To get the most from their data lakes and these purpose-built stores, customers need to move data between these systems easily. For instance, clickstream data from web applications can be collected directly in a data lake and a portion of that data can be moved out to a data warehouse for daily reporting. We think of this concept as inside-out data movement.

Similarly, customers also move data in the other direction:  from the outside-in. For example, they copy query results for sales of products in a given region from their data warehouse into their data lake to run product recommendation algorithms against a larger data set using ML.

Finally, in other situations, customers want to move data from one purpose-built data store to another: around-the-perimeter. For example, they may copy the product catalog data stored in their database to their search service in order to make it easier to look through their product catalog and offload the search queries from the database.

As data in these data lakes and purpose-built stores continues to grow, it becomes harder to move all this data around. We call this data gravity.

To make decisions with speed and agility, customers need to be able to use a central data lake and a ring of purpose-built data services around that data lake. They also need to acknowledge data gravity by easily moving the data they need between these data stores in a secure and governed way.

To meet these needs, customers require a data architecture that supports the following:

  • Building a scalable data lake rapidly.
  • Using a broad and deep collection of purpose-built data services that provide the performance required for use-cases like interactive dashboards and log analytics.
  • Moving data seamlessly between the data lake and purpose-built data services and between those purpose-built data services.
  • Ensuring compliance via a unified way to secure, monitor, and manage access to data.
  • Scaling systems at low cost without compromising on performance.

We call this modern approach to analytics the Lake House Architecture.

Lake House Architecture on AWS

A Lake House Architecture acknowledges the idea that taking a one-size-fits-all approach to analytics eventually leads to compromises. It is not simply about integrating a data lake with a data warehouse, but rather about integrating a data lake, a data warehouse, and purpose-built stores and enabling unified governance and easy data movement. The diagram below shows the Lake House Architecture on AWS.

Let’s take a look at how the Lake House Architecture on AWS and some of the new capabilities we announced at re:Invent 2020 help our customers meet each of the requirements above.

Scalable data lakes

Amazon Simple Storage Service (Amazon S3) is the best place to build a data lake because it has unmatched durability, availability, and scalability; the best security, compliance, and audit capabilities; the fastest performance at the lowest cost; the most ways to bring data in; and the most partner integrations.

However, setting up and managing data lakes involves a lot of manual and time-consuming tasks such as loading data from diverse sources, monitoring data flows, setting up partitions, turning on encryption and managing keys, reorganizing data into columnar format, and granting and auditing access. To help make this easier, we built AWS Lake Formation. Lake Formation helps our customers build secure data lakes in the cloud in days instead of months. Lake Formation collects and catalogs data from databases and object storage, moves the data into an Amazon S3 data lake, cleans and classifies data using ML algorithms, and secures access to sensitive data.

In addition, we just announced three new capabilities for AWS Lake Formation in preview: ACID transactions, governed tables for concurrent updates and consistent query results, and query acceleration through automatic file compaction. The preview introduces new APIs that support atomic, consistent, isolated, and durable (ACID) transactions using a new data lake table type, called a governed table. Governed tables allow multiple users to concurrently insert, delete, and modify rows across tables, while still allowing other users to simultaneously run analytical queries and ML models on the same data sets. Automatic file compaction combines small files into larger files to make queries up to seven times faster.

Purpose-built analytics services

AWS offers the broadest and deepest portfolio of purpose-built analytics services, including Amazon Athena, Amazon EMR, Amazon Elasticsearch Service, Amazon Kinesis, and Amazon Redshift. These services are all built to be best-of-breed, which means you never have to compromise on performance, scale, or cost when using them. For example, Amazon Redshift delivers up to three times better price performance than other cloud data warehouses, and Apache Spark on EMR runs 1.7 times faster than standard Apache Spark 3.0, which means petabyte-scale analysis can be run at less than half of the cost of traditional on-premises solutions.

We’re always innovating to meet our customer’s needs with new capabilities and features in these purpose-built services. For example, to help with additional cost savings and deployment flexibility, today we announced the general availability of Amazon EMR on Amazon Elastic Kubernetes Service (EKS). This offers a new deployment option of fully managed Amazon EMR on Amazon EKS. Until now, customers had to choose between running managed Amazon EMR on EC2 and self-managing their own Apache Spark on Amazon EKS. Now, analytical workloads can be consolidated with microservices and other Kubernetes-based applications on the same Amazon EKS cluster, enabling improved resource utilization, simpler infrastructure management, and the ability to use a single set of tools for monitoring.
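
For a sense of what this looks like, a Spark job can be submitted to an existing EMR on EKS virtual cluster with a single API call. The following boto3 sketch uses placeholder IDs, role ARNs, and script locations; it illustrates the shape of the StartJobRun call rather than a complete setup.

import boto3

emr_containers = boto3.client("emr-containers", region_name="us-east-1")

# Submit a PySpark job to a virtual cluster that maps to an Amazon EKS
# namespace. All identifiers below are placeholders.
response = emr_containers.start_job_run(
    name="spark-etl-on-eks",
    virtualClusterId="abcdefghijklmnop1234567890",
    executionRoleArn="arn:aws:iam::123456789012:role/EMRContainersJobRole",
    releaseLabel="emr-6.2.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://my-bucket/scripts/etl_job.py",
            "sparkSubmitParameters": "--conf spark.executor.instances=2",
        }
    },
)
print(response["id"])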

For faster data warehousing performance, we announced the general availability of Automatic Table Optimizations (ATO) for Amazon Redshift. ATO simplifies performance tuning of Amazon Redshift data warehouses by using ML to automate optimization tasks such as setting distribution and sort keys to give you the best possible performance without the overhead of manual performance tuning.

We also announced the preview of Amazon QuickSight Q to make it even easier and faster for your business users to get insights from your data. QuickSight Q uses ML to generate a data model that automatically understands the meaning and relationships of business data. It enables users to ask ad hoc questions about their business data in human-readable language and get accurate answers in seconds. As a result, business users can now get answers to their questions instantly without having to wait for modeling by thinly-staffed business intelligence (BI) teams.

Seamless data movement

With data stored in a number of different systems, customers need to be able to easily move that data between all of their services and data stores:  inside-out, outside-in, and around-the-perimeter. No other analytics provider makes it as easy to move data at scale to where it’s needed most. AWS Glue is a serverless data integration service that allows you to easily prepare data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration, so insights can be gained in minutes instead of months. Amazon Redshift and Athena both support federated queries, the ability to run queries across data stored in operational databases, data warehouses, and data lakes to provide insights across multiple data sources with no data movement and no need to set up and maintain complex extract, transform, and load (ETL) pipelines.

To make it even easier to combine and replicate data across multiple data stores, last week we announced the preview of AWS Glue Elastic Views. AWS Glue Elastic Views lets developers create materialized views across a wide variety of databases and data stores using familiar SQL, reducing the time it takes to combine and replicate data across data stores from months to minutes. AWS Glue Elastic Views handles copying and combining data from source to target data stores, continuously monitors for changes in source data stores, and updates the materialized views automatically to ensure data accessed is always up-to-date.

We also announced the preview of Amazon Redshift data sharing. Data sharing provides a secure and easy way to share live data across multiple Amazon Redshift clusters inside the organization and externally without the need to make copies or the complexity of moving around data. Customers can use data sharing to run analytics workloads that use the same data in separate compute clusters in order to meet the performance requirements of each workload and track usage by each business group. For example, customers can set up a central ETL cluster and share data with multiple BI clusters to provide workload isolation and chargeback.
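
As a hedged sketch of the central-ETL-plus-BI-clusters pattern described above, the statements below create a datashare on a producer cluster and surface it as a database on a consumer cluster. The share, table, cluster, and namespace GUIDs are placeholders, and the SQL follows the data sharing preview syntax.

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Producer side: expose a schema and table from the central ETL cluster.
producer_sql = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD TABLE public.daily_sales;",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '11111111-2222-3333-4444-555555555555';",
]
for sql in producer_sql:
    rsd.execute_statement(
        ClusterIdentifier="etl-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )

# Consumer side: a BI cluster sees the shared data as a local database.
rsd.execute_statement(
    ClusterIdentifier="bi-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';",
)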

Unified governance

One of the most important pieces of a modern analytics architecture is the ability to authorize, manage, and audit access to data. Enabling such a capability can be challenging because managing security, access control, and audit trails across all the data stores in an organization is complex and time-consuming. It’s also error-prone because it requires manually maintaining access control lists and audit policies across all storage systems, each with different security, data access, and audit mechanisms.

With capabilities like centralized access control and policies combined with column and row-level filtering, AWS gives customers the fine-grained access control and governance to manage access to data across a data lake and purpose-built data stores from a single point of control.

Today, we also announced the preview of row-level security for AWS Lake Formation, which makes it even easier to control access for all the people and applications that need to share data. Row-level security allows for filtering and setting data access policies at the row level. For example, you can now set a policy that gives a regional sales manager access to only the sales data for their region. This level of filtering eliminates the need to maintain different copies of data lake tables for different user groups, saving you operational overhead and unnecessary storage costs.

Performance and cost-effectiveness

At AWS, we’re committed to providing the best performance at the lowest cost across all analytics services and we continue to innovate to improve the price-performance of our services. In addition to industry-leading price performance for services like Amazon Redshift and Amazon EMR, Amazon S3 intelligent tiering saves customers up to 40% on storage costs for data stored in a data lake, and Amazon EC2 provides access to over 350 instance types, up to 400Gbps ethernet networking, and the ability to choose between On-Demand, Reserved, and Spot instances. We announced Amazon EMR support for Amazon EC2 M6g instances powered by AWS Graviton2 processors in October, providing up to 35% lower cost and up to 15% improved performance. Our customers can also take advantage of AWS Savings Plans, a flexible pricing model that provides savings of up to 72% on AWS compute usage.

To set the foundation for the new scale of data, we also shared last week that the AQUA (Advanced Query Accelerator) for Amazon Redshift preview is now open to all customers and will be generally available in January 2021. AQUA is a new distributed and hardware-accelerated cache that brings compute to the storage layer, and delivers up to ten times faster query performance than other cloud data warehouses. AQUA is available on Amazon Redshift RA3 instances at no additional cost, and customers can take advantage of the AQUA performance improvements without any code changes.

Learn More and Get Started Today

Whatever a customer is looking to do with data, AWS Analytics can offer a solution. We provide the broadest and deepest portfolio of purpose-built analytics services to realize a Lake House Architecture. Our portfolio includes the most scalable data lakes, purpose-built analytics services, seamless data movement, and unified governance – all delivered with the best performance at the lowest cost.

Read all the Analytics announcements from AWS re:Invent 2020 at What’s New at AWS re:Invent? and apply for the analytics previews using the links below. Also dive deeper by watching the more than 40 analytics sessions that were part of AWS re:Invent 2020. Simply visit the session catalog and choose the Analytics track to review past sessions and add upcoming ones to your calendar.

Finally, take advantage of the AWS Data Lab. The AWS Data Lab offers opportunities for customers and AWS technical resources to engage and accelerate data and analytics modernization initiatives.


About the Author

Rahul Pathak is Vice President of Analytics at AWS and is responsible for Amazon Athena, Amazon Elasticsearch Service, EMR, Glue, Lake Formation, and Redshift. Over his nine years at AWS, he has focused on managed database and analytics services. Rahul has over twenty years of experience in technology and has co-founded two companies, one focused on digital media analytics and the other on IP geolocation. He holds a degree in Computer Science from MIT and an Executive MBA from the University of Washington.

Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR

Post Syndicated from Fei Lang original https://aws.amazon.com/blogs/big-data/amazon-emr-studio-preview-a-new-notebook-first-ide-experience-with-amazon-emr/

We’re happy to announce Amazon EMR Studio (Preview), an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Studio uses AWS Single Sign-On (AWS SSO), and allows you to log in directly with your corporate credentials without signing in to the AWS Management Console.

With EMR Studio, you can run notebook code on Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), and debug your applications. For more information about Amazon EMR on Amazon EKS, see What is Amazon EMR on EKS.

EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing with the performance-optimized Apache Spark runtime that Amazon EMR provides. You can also install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and Bitbucket, or run parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow or Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Administrators can set up EMR clusters that can be used by EMR Studio users, or create predefined AWS CloudFormation templates for Amazon EMR and allow you to simply choose a template for creating your own cluster.

In this post, we discuss the benefits that EMR Studio offers and we introduce to you some of its capabilities. To learn more about creating and using EMR Studios, see Use Amazon EMR Studio.

Benefits of using EMR Studio

EMR Studio offers the following benefits:

  • Set up a unified experience to develop and diagnose EMR Spark applications – Administrators can set up EMR Studio to allow you to log in using your corporate credentials without having to sign in to the AWS console. You get a single unified environment to interactively explore, process, and visualize data using notebooks, build and schedule pipelines, and debug applications without having to log in to EMR clusters.
  • Use fully managed Jupyter notebooks – With EMR Studio, you can develop analytics and data science applications in R, Python, Scala, and PySpark with fully managed Jupyter notebooks. You can take advantage of distributed processing using the performance-optimized Amazon EMR runtime for Apache Spark with Jupyter kernels and applications running on EMR clusters. You can attach notebooks to an existing cluster that uses Amazon EC2 instances, or to an EMR on EKS virtual cluster. You can also start your own clusters using templates pre-configured by administrators.
  • Collaborate with others using code repositories – From the EMR Studio notebooks environment, you can connect to code repositories such as AWS CodeCommit, GitHub, and Bitbucket to collaborate with peers.
  • Run custom Python libraries and kernels – From EMR Studio, you can install custom Python libraries or Jupyter kernels required for your applications directly to the EMR clusters.
  • Automate workflows using pipelines – EMR Studio makes it easy to move from prototyping to production. You can create EMR Studio notebooks that can be programmatically invoked with parameters, and use APIs to run the parameterized notebooks. You can also use orchestration tools such as Apache Airflow or Amazon MWAA to run notebooks in automated workflows.
  • Simplified debugging – With EMR Studio, you can debug jobs and access logs without logging in to the cluster. EMR Studio provides native application interfaces such as Spark UI and YARN Timeline. When a notebook is run in EMR Studio, the application logs are uploaded to Amazon Simple Storage Service (Amazon S3). As a result, you can access logs and diagnose applications even after your EMR cluster is terminated. You can quickly locate the job to debug by filtering based on the cluster or time when the application was run.

In the following section, we demonstrate some of the capabilities of Amazon EMR Studio using a sample notebook. For our sample notebook, we use the open-source, real-time COVID-19 US daily case reports provided by Johns Hopkins University CSSE from the following GitHub repo.

Notebook-first IDE experience with AWS SSO integration

EMR Studio makes it simple to interact with applications on an EMR cluster. After an administrator sets up EMR Studio and provides the access URL (which looks like https://es-*************************.emrstudio.us-east-1.amazonaws.com), you can log in to EMR Studio with your corporate credentials.

After you log in to EMR Studio, you get started by creating a Workspace. A Workspace is a collection of one or more notebooks for a project. The Workspaces and the notebooks that you create in EMR Studio are automatically saved in an Amazon S3 location.

Now, we create a Workspace by completing the following steps:

  1. On the EMR Studio Dashboard page, choose Create Workspace.
  2. On the Create a Workspace page, enter a Workspace name and a Description.

Naming the Workspace helps identify your project. Your workspace is automatically saved, and you can find it later on the Workspaces page. For this post, we name our Workspace EMR-Studio-WS-Demo1.

  3. On the Subnet drop-down menu, choose a subnet for your Workspace.

Each subnet belongs to the same Amazon Virtual Private Cloud (Amazon VPC) as your EMR Studio. Your administrator may have set up one or more subnets to use for your EMR clusters. You should choose a subnet that matches the subnet where you use EMR clusters. If you’re not sure about which subnet to use, contact your administrator.

  4. For S3 location, choose the Amazon S3 location where EMR Studio backs up all notebook files in the Workspace.

This location is where your Workspace and all the notebooks in the Workspace are automatically saved.

  5. In the Advanced configuration section, you can attach an EMR cluster to your Workspace.

For this post, we skip this step. EMR Studio allows you to create Workspaces and notebooks without attaching to an EMR cluster. You can attach an EMR cluster later when you’re ready to run your notebooks.

  6. Choose Create Workspace.

Fully managed environment for managing and running Jupyter-based notebooks

EMR Studio provides a fully managed environment to help organize and manage Workspaces. Workspaces are the primary building blocks of EMR Studio, and they preserve the state of your notebooks. You can create different Workspaces for each project. From within a Workspace, you can create notebooks, link your Workspace to a code repository, and attach your Workspace to an EMR cluster to run notebooks. Your Workspaces, along with the notebooks and settings they contain, are automatically saved in the Amazon S3 location that you specify.

If you created the workspace EMR-Studio-WS-Demo1 by following the preceding steps, it appears on the Workspaces page with the name EMR-Studio-WS-Demo1 along with status Ready, creation time, and last modified timestamp.

The following list describes each possible Workspace status:

  • Starting – The Workspace is being prepared, but is not yet ready to use.
  • Ready – You can open the Workspace to use the notebook editor. When a Workspace has a Ready status, you can open or delete it.
  • Attaching – The Workspace is being attached to a cluster.
  • Attached – The Workspace is attached to an EMR cluster. If a Workspace is not attached to an EMR cluster, you need to attach it to one before you can run any notebook code in the Workspace.
  • Idle – The Workspace is stopped and currently idle. When you launch an idle Workspace, its status changes from Idle to Starting to Ready.
  • Stopping – The Workspace is being stopped.
  • Deleting – When you delete a Workspace, it's marked for deletion, and EMR Studio automatically deletes it. After a Workspace is deleted, it no longer shows in the list of Workspaces.

You can choose the Workspace that you created (EMR-Studio-WS-Demo1) to open it. This opens a new web browser tab with the JupyterLab interface. The icon-denoted tabs on the left sidebar allow you to access tool panels such as the file browser or JupyterLab command palette. To learn more about the EMR Studio Workspace interface, see Understand the Workspace User Interface.

EMR Studio automatically creates an empty notebook with the same name as the Workspace. Because we named our Workspace EMR-Studio-WS-Demo1, the notebook EMR-Studio-WS-Demo1.ipynb is created automatically. In the following screenshot, no cluster or kernel is specified in the top right corner, because we didn't choose to attach any cluster while creating the Workspace. You can write code in your new notebook, but before you run your code, you need to attach it to an EMR cluster and specify a kernel. To attach your Workspace to a cluster, choose the EMR clusters icon on the left panel.

Linking Git-based code repositories with your Workspace

You can collaborate with your peers by sharing notebooks as code via code repositories. EMR Studio supports Git-based services such as AWS CodeCommit, GitHub, and Bitbucket.

This capability provides the following benefits:

  • Version control – Record code changes in a version control system so you can review the history of your changes and selectively revert them.
  • Collaboration – Share code with team members working in different Workspaces through remote Git-based repositories. Workspaces can clone or merge code from remote repositories and push changes back to those repositories.
  • Code reuse – Many Jupyter notebooks that demonstrate data analysis or machine learning techniques are available in publicly hosted repositories, such as GitHub. You can associate your Workspace with a GitHub repository to reuse the Jupyter notebooks contained in a repository.

To link Git repositories to your Workspace, you can link an existing repository or create a new one. When you link an existing repository, you choose from a list of Git repositories associated with the AWS account in which your EMR Studio was created.

We add a new repository by completing the following steps:

  1. Choose the Git icon.
  2. For Repository name¸ enter a name (for example, emr-notebook).
  3. For Git repository URL, enter the URL for the Git repo (for this post, we use the sample notebook at https://github.com/emrnotebooks/notebook_execution).
  4. For Git credentials, select your credentials. Because we’re using a public repo, we select Use a public repository without credentials.
  5. Choose Add repository.

After it’s added, we can see the repo on the Git repositories drop-down menu.

  6. Choose the repo to link to the Workspace.

You can link up to three Git repositories with an EMR Studio Workspace. For more information, see Link Git-Based Repositories to an EMR Studio Workspace.

  7. Choose the File browser icon to locate the Git repo we just linked.

Attaching and detaching Workspaces to and from EMR clusters

EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance-optimized EMR runtime for Apache Spark. You can attach your Workspace to an EMR cluster and get distributed data processing using Spark or custom kernels. You can use primary node capacity to run non-distributed applications.

In addition to using Amazon EMR clusters running on Amazon EC2, you can attach a Workspace to an Amazon EMR on EKS virtual cluster to run notebook code. For more information about how to use an Amazon EMR on EKS cluster in EMR Studio, see Use an Amazon EMR on EKS Cluster to Run Notebook Code.

Before you can run your notebooks, you must attach your Workspace to an EMR cluster. For more information about clusters, see Create and Use Clusters with EMR Studio.

To run the Git repo notebooks that we linked in the previous step, complete the following steps:

  1. Choose the EMR clusters icon on the left panel.
  2. Attach the Workspace to an existing EMR cluster running on Amazon EC2 instances.
  3. Open the notebook demo_pyspark.ipynb from the Git repo emr-notebook that we linked to the Workspace.

In the upper right corner of the Workspace UI, we can see the ID of the EMR cluster being attached to our Workspace, as well as the kernel selected to run the notebook.

  4. Record the value of the cluster ID (for example, <j-*************>).

We use this value later to locate the EMR cluster for application debugging purposes.

You can also detach the Workspace from the cluster in the Workspace UI and re-attach it to another cluster. For more information, see Detach a Cluster from Your Workspace.

Being able to easily attach and detach to and from any EMR cluster allows you to move any workload from prototyping into production. For example, you can start your prototype development by attaching your workspace to a development EMR cluster and working with test datasets. When you’re ready to run your notebook with larger production datasets, you can detach your workspace from the development EMR cluster and attach it to a larger production EMR cluster.

Installing and loading custom libraries and kernels

You can install notebook-scoped libraries with a PySpark kernel in EMR Studio. The libraries installed are isolated to your notebook session and don’t interfere with libraries installed via EMR bootstrap actions, or libraries installed by other EMR Studio notebook sessions that may be running on the same EMR cluster. After you install libraries for your Workspace, they’re available for other notebooks in the Workspace in the same session.

Our sample notebook demo_pyspark.ipynb is a Python script. It uses real-time COVID-19 US daily case reports as input data. The following parameters are defined in the first cell:

  • DATE – The given date used when the notebook job is started.
  • TOP_K – The top k US states with confirmed COVID-19 cases. We use this to plot Graph a.
  • US_STATES – The names of the specific US states being checked for the fatality rates of COVID-19 patients. We use this to plot Graph b.

The parameters can be any of the Python data types.

Running this notebook plots two graphs:

  • Graph a – Visualizes the top k US states with the most COVID-19 cases on a given date
  • Graph b – Visualizes the fatality rates among specific US states on a given date

In our notebook, we install notebook-scoped libraries by running the following code from within a notebook cell:

sc.install_pypi_package("pandas==0.25.1")
sc.install_pypi_package("requests==2.24.0")
sc.install_pypi_package("numpy==1.19.1")
sc.install_pypi_package("kiwisolver==1.2.0")
sc.install_pypi_package("matplotlib==3.3.0")

We use these libraries in the subsequent cells for the further data analysis and visualization steps in the notebook.

The following set of parameters is used to run the notebook:

{"DATE": "10-15-2020",
 "TOP_K": 6,
"US_STATES": ["Wisconsin", "Texas", "Nevada"]}

Running all the notebook cells generates two graphs. Graph a shows the top six US states with confirmed COVID-19 cases on October 15, 2020.

Graph b shows the fatality rates of COVID-19 patients in Texas, Wisconsin, and Nevada on October 15, 2020.

EMR Studio also allows you to install Jupyter notebook kernels and Python libraries on a cluster primary node, which makes your custom environment available to any EMR Studio Workspace attached to the cluster. To install the sas_kernel kernel on a cluster primary node, run the following code within a notebook cell:

!/emr/notebook-env/bin/pip install sas_kernel

The following screenshot shows your output.

For more information about how to install kernels and use libraries, see Installing and Using Kernels and Libraries.

Diagnosing applications and jobs with EMR Studio

In EMR Studio, you can quickly debug jobs and access logs for both active and stopped clusters, without logging in to the cluster or setting up a web proxy through an SSH connection. You can use native application interfaces such as Spark UI and YARN Timeline Service directly from EMR Studio. EMR Studio also allows you to quickly locate the cluster or job to debug by using filters such as cluster state, creation time, and cluster ID. For more information, see Diagnose Applications and Jobs with EMR Studio.

Now, we show you how to open a native application interface to debug the notebook job that already finished.

  1. On the EMR Studio page, choose Clusters.

A list appears with all the EMR clusters launched under the same AWS account. You can filter the list by cluster state, cluster ID, or creation time range by entering values in the provided fields.

  2. Choose the cluster ID of the EMR cluster that we attached to the Workspace EMR-Studio-WS-Demo1 for running notebook demo_pyspark.ipynb.
  3. For Spark job debugging, on the Launch application UIs menu, choose Spark History Server.

The following screenshot shows you the Spark job debugging UI.

We can traverse the details for our notebook application by checking actual logs from the Spark History Server, as in the following screenshot.

  4. For Yarn application debugging, on the Launch application UIs menu, choose Yarn Timeline Server.

The following screenshot shows the Yarn debugging UI.

Orchestrating analytics notebook jobs to build ETL production pipelines

EMR Studio makes it easy for you to move any analytics workload from prototyping to production. With EMR Studio, you can run parameterized notebooks as part of scheduled workflows using orchestration services like AWS Step Functions and Apache Airflow or Amazon MWAA.

In this section, we show a simple example of how to orchestrate running notebook workflows using Apache Airflow.

We have a fully tested notebook under an EMR Studio Workspace, and want to schedule a workflow that runs the notebook on an on-demand EMR cluster every 10 minutes.

Record the value of the Workspace ID (for example, e-*****************************) and the notebook file path relative to the home directory within the Workspace (for example, demo.ipynb or my_folder/demo.ipynb).

The workflow that we create takes care of the following tasks (a minimal sketch of the notebook step follows the list):

  1. Create an EMR cluster.
  2. Wait until the cluster is ready.
  3. Start running a notebook defined by the Workspace ID, notebook file path, and the cluster created.
  4. Wait until the notebook is complete.
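
The task that actually runs the notebook can call the Amazon EMR StartNotebookExecution API. The following boto3 sketch shows roughly what such a task body could look like, with placeholder Workspace and cluster IDs and the same notebook parameters used earlier; the full DAG definition in the GitHub repo remains the reference.

import time

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Placeholder Workspace (editor) ID and cluster ID; use the values recorded
# earlier in this section, and adjust the service role to the one configured
# in your account.
execution = emr.start_notebook_execution(
    EditorId="e-XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    RelativePath="demo_pyspark.ipynb",
    NotebookExecutionName="scheduled-demo-run",
    NotebookParams='{"DATE": "10-15-2020", "TOP_K": 6, "US_STATES": ["Wisconsin", "Texas", "Nevada"]}',
    ExecutionEngine={"Id": "j-XXXXXXXXXXXXX", "Type": "EMR"},
    ServiceRole="EMR_Notebooks_DefaultRole",
)
execution_id = execution["NotebookExecutionId"]

# Poll until the execution reaches a terminal state.
status = "STARTING"
while status not in ("FINISHED", "FAILED", "STOPPED"):
    time.sleep(30)
    status = emr.describe_notebook_execution(NotebookExecutionId=execution_id)[
        "NotebookExecution"
    ]["Status"]

print(f"Notebook execution {execution_id} ended with status {status}")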

The following screenshot is the tree view of this example DAG. The DAG definition is available on the GitHub repo. Make sure you replace any placeholder values with the actual ones before using.

When you open the Gantt chart of one of the successful runs, you can see the timeline of our workflow. The time spent creating the cluster and starting a notebook execution is negligible compared to the time spent waiting for the cluster to be ready and waiting for the notebook to finish, which meets our SLA expectations.

This example is just a starting point. Try it out and extend it with more sophisticated workflows that suit your needs.

Summary

In this post, we highlighted some of the capabilities of EMR Studio, such as the ability to log in via AWS SSO, access fully managed Jupyter notebooks, link Git-based code repositories, change clusters, load custom Python libraries and kernels, diagnose clusters and jobs using native application UIs, and orchestrate notebook jobs using Apache Airflow or Amazon MWAA.

There is no additional charge for using EMR Studio in public preview, and you only pay for the use of the EMR cluster or other AWS services such as AWS Service Catalog. For more information, see the EMR Studio FAQs.

EMR Studio is available on Amazon EMR release version 6.2 and later, in the US East (N. Virginia), US West (Oregon), and EU (Ireland) Regions for public preview. For the latest Region availability for the public preview, see Considerations.

If you have questions or suggestions, feel free to leave a comment.


About the Authors

Fei Lang is a Senior Big Data Architect at Amazon Web Services. She is passionate about building the right big data solution for customers. In her spare time, she enjoys the scenery of the Pacific Northwest, going for a swim, and spending time with her family.

Shuang Li is a Senior Product Manager for Amazon EMR at AWS. She holds a doctoral degree in Computer Science and Engineering from Ohio State University.

Ray Liu is a Software Development Engineer at AWS. Besides work, he enjoys traveling and spending time with family.

Kendra Ellis is a Programmer Writer at AWS.

Cloudflare’s privacy-first Web Analytics is now available for everyone

Post Syndicated from Jon Levine original https://blog.cloudflare.com/privacy-first-web-analytics/

In September, we announced that we’re building a new, free Web Analytics product for the whole web. Today, I’m excited to announce that anyone can now sign up to use our new Web Analytics — even without changing your DNS settings. In other words, Cloudflare Web Analytics can now be deployed by adding an HTML snippet (in the same way many other popular web analytics tools are), making it easier than ever to use privacy-first tools to understand visitor behavior.

Why does the web need another analytics service?

Popular analytics vendors have business models driven by ad revenue. Using them implies a bargain: they track visitor behavior and create buyer profiles to retarget your visitors with ads; in exchange, you get free analytics.

At Cloudflare, our mission is to help build a better Internet, and part of that is to deliver essential web analytics to everyone with a website, without compromising user privacy. For free. We’ve never been interested in tracking users or selling advertising. We don’t want to know what you do on the Internet — it’s not our business.

Our customers have long relied on Cloudflare’s Analytics because we’re accurate, fast, and privacy-first. In September we released a big upgrade to analytics for our existing customers that made them even more flexible.

However, we know that there are many folks who can’t use our analytics, simply because they’re not able to onboard to use the rest of Cloudflare for Infrastructure — specifically, they’re not able to change their DNS servers. Today, we’re bringing the power of our analytics to the whole web. By adding a simple HTML snippet to your website, you can start measuring your web traffic — similar to other popular analytics vendors.

What can I do with Cloudflare Web Analytics?

We’ve worked hard to make our analytics as powerful and flexible as possible — while still being fast and easy to use.

When measuring analytics about your website, the most common questions are “how much traffic did I get?” and “how many people visited?” We answer this by measuring page views (the total number of times a page was loaded) and visits (the number of times someone landed on a page from another website).

With Cloudflare Web Analytics, it’s easy to switch between measuring page views or visits. Within each view, you can see top pages, countries, device types and referrers.

My favorite thing is the ability to add global filters, and to quickly drill into the most important data with actions like “zoom” and “group by”. Say you publish a new blog post, and you want to see the top sites that send you traffic right after you email your subscribers about it. It’s easy to zoom into the time period right after you sent the email, and group by to see the top pages. Then you can add a filter to just that page — and then finally view top referrers for that page. It’s magic!

Best of all, our analytics is free. We don’t have limits based on the amount of traffic you can send it. Thanks to our ABR technology, we can serve accurate analytics for websites that get anywhere from one to one billion requests per day.

How does the new Web Analytics work?

Traditionally, Cloudflare Analytics works by measuring traffic at our edge. This has some great benefits; namely, it catches all traffic, even from clients that block JavaScript or don’t load HTML. At the edge, we can also block bots, add protection from our WAF, and measure the performance of your origin server.

The new Web Analytics works like most other measurement tools: by tracking visitors on the client. We’ve long had client-side measuring tools with Browser Insights, but these were only available to orange-cloud users (i.e. Cloudflare customers).

Today, for the first time, anyone can get access to our client-side analytics — even if you don’t use the rest of Cloudflare. Just add our JavaScript snippet to any website, and we can start collecting metrics.

How do I sign up?

We’ve worked hard making our onboarding as simple as possible.

First, enter the name of your website. It’s important to use the domain name that your analytics will be served on — we use this to filter out any unwanted “spam” analytics reports.

(At this time, you can only add analytics from one website to each Cloudflare account. In the coming weeks we’ll add support for multiple analytics properties per account.)

Next, you’ll see a script tag that you can copy onto your website. We recommend adding this just before the closing </body> tag on the pages you want to measure.

And that’s it! After you release your website and start getting visits, you’ll be able to see them in analytics.

What does privacy-first mean?

Being privacy-first means we don’t track individual users for the purposes of serving analytics. We don’t use any client-side state (like cookies or localStorage) for analytics purposes. Cloudflare also doesn’t track users over time via their IP address, User Agent string, or any other immutable attributes for the purposes of displaying analytics — we consider “fingerprinting” even more intrusive than cookies, because users have no way to opt out.

The concept of a “visit” is key to this approach. Rather than count unique IP addresses, which would require storing state about what each visitor does, we can simply count the number of page views that come from a different site. This provides a perfectly usable metric that doesn’t compromise on privacy.

What’s next

This is just the start for our privacy-first Analytics. We’re excited to integrate more closely with the rest of Cloudflare, and give customers even more detailed stats about performance and security (not just traffic). We’re also hoping to make our analytics even more powerful as a standalone product by building support for alerts, real-time updates, and more.

Please let us know if you have any questions or feedback, and happy measuring!

Bringing machine learning to more builders through databases and analytics services

Post Syndicated from Swami Sivasubramanian original https://aws.amazon.com/blogs/big-data/bringing-machine-learning-to-more-builders-through-databases-and-analytics-services/

Machine learning (ML) is becoming more mainstream, but even with the increasing adoption, it’s still in its infancy. For ML to have the broad impact that we think it can have, it has to get easier to do and easier to apply. We launched Amazon SageMaker in 2017 to remove the challenges from each stage of the ML process, making it radically easier and faster for everyday developers and data scientists to build, train, and deploy ML models. SageMaker has made ML model building and scaling more accessible to more people, but there’s a large group of database developers, data analysts, and business analysts who work with databases and data lakes where much of the data used for ML resides. These users still find it too difficult and involved to extract meaningful insights from that data using ML.

This group is typically proficient in SQL but not Python, and must rely on data scientists to build the models needed to add intelligence to applications or derive predictive insights from data. And even when you have the model in hand, there’s a long and involved process to prepare and move data to use the model. The result is that ML isn’t being used as much as it can be.

To meet the needs of this large and growing group of builders, we’re integrating ML into AWS databases, analytics, and business intelligence (BI) services.

AWS customers generate, process, and collect more data than ever to better understand their business landscape, market, and customers. And you don’t just use one type of data store for all your needs. You typically use several types of databases, data warehouses, and data lakes to fit your use case. Because all these use cases could benefit from ML, we’re adding ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can train models on their data or add inference results right from their database, without having to export and process their data or write large amounts of ETL code.

Machine Learning for database developers

At re:Invent last year, we announced ML integrated inside Amazon Aurora for developers working with relational databases. Previously, adding ML using data from Aurora to an application was a very complicated process. First, a data scientist had to build and train a model, then write the code to read data from the database. Next, you had to prepare the data so it can be used by the ML model. Then, you called an ML service to run the model, reformat the output for your application, and finally load it into the application.

Now, with a simple SQL query in Aurora, you can add ML to an enterprise application. When you run an ML query in Aurora using SQL, it can directly access a wide variety of ML models from Amazon SageMaker and Amazon Comprehend. The integration between Aurora and each AWS ML service is optimized, delivering up to 100 times better throughput when compared to moving data between Aurora and SageMaker or Amazon Comprehend without this integration. Because the ML model is deployed separately from the database and the application, each can scale up or scale out independently of the other.

In addition to making ML available in relational databases, combining ML with certain types of non-relational database models can also lead to better predictions. For example, database developers use Amazon Neptune, a purpose-built, high-performance graph database, to store complex relationships between data in a graph data model. You can query these graphs for insights and patterns and apply the results to implement capabilities such as product recommendations or fraud detection.

However, human intuition and analyzing individual queries are not enough to discover the full breadth of insights available from large graphs. ML can help, but, as was the case with relational databases, it requires you to do a significant amount of heavy lifting upfront to prepare the graph data and then select the best ML model to run against that data. The entire process can take weeks.

To help with this, today we announced the general availability of Amazon Neptune ML to provide database developers access to ML purpose-built for graph data. This integration is powered by SageMaker and uses the Deep Graph Library (DGL), a framework for applying deep learning to graph data. It does the hard work of selecting the graph data needed for ML training, automatically choosing the best model for the selected data, exposing ML capabilities via simple graph queries, and providing templates to allow you to customize ML models for advanced scenarios. The following diagram illustrates this workflow.

And because the DGL is purpose-built to run deep learning on graph data, you can improve the accuracy of most predictions by over 50% compared to traditional ML techniques.

Machine Learning for data analysts

At re:Invent last year, we announced ML integrated inside Amazon Athena for data analysts. With this integration, you can access more than a dozen built-in ML models or use your own models in SageMaker directly from ad-hoc queries in Athena. As a result, you can easily run ad-hoc queries in Athena that use ML to forecast sales, detect suspicious logins, or sort users into customer cohorts.

Similarly, data analysts also want to apply ML to the data in their Amazon Redshift data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day. These Amazon Redshift users want to run ML on their data in Amazon Redshift without having to write a single line of Python. Today we announced the preview of Amazon Redshift ML to do just that.

Amazon Redshift now enables you to run ML algorithms on Amazon Redshift data without manually selecting, building, or training an ML model. Amazon Redshift ML works with Amazon SageMaker Autopilot, a service that automatically trains and tunes the best ML models for classification or regression based on your data while allowing full control and visibility.

When you run an ML query in Amazon Redshift, the selected data is securely exported from Amazon Redshift to Amazon Simple Storage Service (Amazon S3). SageMaker Autopilot then performs data cleaning and preprocessing of the training data, automatically creates a model, and applies the best model. All the interactions between Amazon Redshift, Amazon S3, and SageMaker are abstracted away and automatically occur. When the model is trained, it becomes available as a SQL function for you to use. The following diagram illustrates this workflow.

Rackspace Technology, a leading end-to-end multicloud technology services company, and Slalom, a modern consulting firm focused on strategy, technology, and business transformation, are both users of Amazon Redshift ML in preview.

Nihar Gupta, General Manager for Data Solutions at Rackspace Technology, says “At Rackspace Technology, we help companies elevate their AI/ML operations. The seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.”

And Marcus Bearden, Practice Director at Slalom, shared “We hear from our customers that they want to have the skills and tools to get more insight from their data, and Amazon Redshift is a popular cloud data warehouse that many of our customers depend on to power their analytics. The new Amazon Redshift ML feature will make it easier for SQL users to get new types of insight from their data with machine learning, without learning new skills.”

Machine Learning for business analysts

To bring ML to business analysts, we launched new ML capabilities in Amazon QuickSight earlier this year called ML Insights. ML Insights uses SageMaker Autopilot to enable business analysts to perform ML inference on their data and visualize it in BI dashboards with just a few clicks. You can get results for different use cases that require ML, such as anomaly detection to uncover hidden insights by continuously analyzing billions of data points, and forecasting to predict growth and other business trends. In addition, QuickSight can also give you an automatically generated summary in plain language (a capability we call auto-narratives), which interprets and describes what the data in your dashboard means. See the following screenshot for an example.

Customers like Expedia Group, Tata Consultancy Services, and Ricoh Company are already benefiting from ML out of the box with QuickSight. These human-readable narratives enable you to quickly interpret the data in a shared dashboard and focus on the insights that matter most.

In addition, customers have also been interested in asking questions of their business data in plain language and receiving answers in near-real time. Although some BI tools and vendors have attempted to solve this challenge with Natural Language Query (NLQ), the existing approaches require that you first spend months in advance preparing and building a model on a pre-defined set of data, and even then, you still have no way of asking ad hoc questions when those questions require a new calculation that wasn’t pre-defined in the data model. For example, the question “What is our year-over-year growth rate?” requires that “growth rate” be pre-defined as a calculation in the model. With today’s BI tools, you need to work with your BI teams to create and update the model to account for any new calculation or data, which can take days or weeks of effort.

Last week, we announced Amazon QuickSight Q. ‘Q’ gives business analysts the ability to ask any question of all their data and receive an accurate answer in seconds. To ask a question, you simply type it into the QuickSight Q search bar using natural language and business terminology that you’re familiar with. Q uses ML (natural language processing, schema understanding, and semantic parsing for SQL code generation) to automatically generate a data model that understands the meaning of and relationships between business data, so you can get answers to your business questions without waiting weeks for a data model to be built. Because Q eliminates the need to build a data model, you’re also not limited to asking only a specific set of questions. See the following screenshot for an example.

Best Western Hotels & Resorts is a privately-held hotel brand with a global network of approximately 4,700 hotels in over 100 countries and territories worldwide. “With Amazon QuickSight Q, we look forward to enabling our business partners to self-serve their ad hoc questions while reducing the operational overhead on our team for ad hoc requests,” said Joseph Landucci, Senior Manager of Database and Enterprise Analytics at Best Western Hotels & Resorts. “This will allow our partners to get answers to their critical business questions quickly by simply typing and searching their questions in plain language.”

Summary

For ML to have a broad impact, we believe it has to get easier to do and easier to apply. Database developers, data analysts, and business analysts who work with databases and data lakes have found it too difficult and involved to extract meaningful insights from their data using ML. To meet the needs of this large and growing group of builders, we’ve added ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can all use ML more easily without the need to be an ML expert. These capabilities put ML in the hands of every data professional so that they can get the most value from their data.


About the Authors

Swami Sivasubramanian is Vice President at AWS in charge of all Amazon AI and Machine Learning services. His team’s mission is “to put machine learning capabilities in the hands of every developer and data scientist.” Swami and the AWS AI and ML organization work on all aspects of machine learning, from ML frameworks (TensorFlow, Apache MXNet, and PyTorch) and infrastructure, to Amazon SageMaker (an end-to-end service for building, training and deploying ML models in the cloud and at the edge), and finally AI services (Transcribe, Translate, Personalize, Forecast, Rekognition, Textract, Lex, Comprehend, Kendra, etc.) that make it easier for app developers to incorporate ML into their apps with no ML experience required.

Previously, Swami managed AWS’s NoSQL and big data services. He managed the engineering, product management, and operations for AWS database services that are the foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, and a few other big data services in the works. Swami has been awarded more than 250 patents, authored 40 refereed scientific papers and journals, and participates in several academic circles and conferences.

Herain Oberoi leads Product Marketing for AWS’s Databases, Analytics, BI, and Blockchain services. His team is responsible for helping customers learn about, adopt, and successfully use AWS services. Prior to AWS, he held various product management and marketing leadership roles at Microsoft and a successful startup that was later acquired by BEA Systems. When he’s not working, he enjoys spending time with his family, gardening, and exercising.

Preview: Amazon Lookout for Metrics, an Anomaly Detection Service for Monitoring the Health of Your Business

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/preview-amazon-lookout-for-metrics-anomaly-detection-service-monitoring-health-business/

We are excited to announce Amazon Lookout for Metrics, a new service that uses machine learning (ML) to detect anomalies in your metrics, helping you proactively monitor the health of your business, diagnose issues, and find opportunities quickly – with no ML experience required.

Lookout for Metrics uses the same technology used by Amazon to detect anomalous changes in data that are otherwise hard to find, while reducing the number of false detections. It also groups similar findings together, ranks them by severity, and provides information to determine the root cause of the anomalies.

It can be used across a wide variety of metrics such as revenue, web page views, daily active users, churn rate, transaction volume, mobile app installations, and more. Lookout for Metrics is available in preview starting today.

Why Use Amazon Lookout for Metrics for Anomaly Detection?
Organizations across all industries are looking to improve efficiency in their business through technology and automation. While challenges may vary, what’s common is that being able to identify defects and opportunities early and often can lead to material cost savings, higher margins, and better customer experience. Traditionally, organizations rely on manual audits of large amounts of data, which is not scalable and is prone to human error. Others use rule-based methods based on arbitrary ranges, which are often static, do not easily adapt to seasonality changes, and lead to too many false detections.

When anomalies are detected, developers, analysts, and business owners can spend weeks trying to identify the root cause of the change. These are situations where ML can be an effective and transformational tool. However, ML algorithms need to be carefully selected, trained, tested, and deployed for each type of data – requiring a skilled team of ML experts.

Amazon has a long history of being a data-driven company, with a growing number of businesses that need to stay on top of the health of their business, operations, and customer experience. A key part of this effort over the years has involved building and improving ML technology to detect anomalies in key performance indicators (KPIs) such as website visits from different traffic channels, number of products added to the shopping cart, number of orders placed, revenue for every product category, and more.

Amazon Lookout for Metrics puts the same ML technology used by Amazon in the hands of every developer. It finds anomalies in your data, groups them intelligently, helps you visualize aggregated results, and automates alerts.

Because it’s a fully managed service, it takes care of the whole ML process so you can get started quickly and focus on your core business. And most importantly, the service improves model performance continually by incorporating your real-time feedback on the accuracy and relevance of the anomalies and root cause analysis.

How Amazon Lookout for Metrics Works
You can get started with Lookout for Metrics with just a few clicks in the AWS Management Console. Without having to write any code, you connect your data to the service through the built-in data source integrations; next Lookout for Metrics trains a custom model for your data; and finally, it begins detecting anomalies for you to review and start taking action on.

Lookout for Metrics continuously monitors data stored in Amazon Simple Storage Service (S3), Amazon Relational Database Service (RDS), Amazon Redshift, Amazon CloudWatch, or SaaS integrations supported by Amazon AppFlow such as Salesforce, Marketo, Google Analytics, Slack, Zendesk, and many more.

During this phase, you can flag each field in your dataset as a measure (or KPI), dimension, or timestamp. For example, if you want to monitor abnormal changes in page views for every device type separately, then you would select page_views as the measure and device_type as the dimension.

Once your data source is configured and connected, Lookout for Metrics inspects and prepares the data for analysis and selects the right algorithm to build the most accurate anomaly detection model. This detector runs on your data at a configurable cadence (every few minutes, hourly, daily, and so on) and provides a threshold dial that allows you to adjust its sensitivity.
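
If you prefer to script this setup instead of using the console, the calls below sketch how a detector and a metric set for the page_views example might be created with boto3. This is a rough sketch only: the bucket path, IAM role, column names, CSV settings, and templated path format are assumptions, so check the Lookout for Metrics API reference for the exact parameters available in the preview.

# Hypothetical sketch: create a detector that runs hourly and a metric set that
# tracks page_views as the measure and device_type as a dimension.
# Bucket, role ARN, and column names are placeholders, not values from this post.
import boto3

lookout = boto3.client("lookoutmetrics", region_name="us-east-1")

detector = lookout.create_anomaly_detector(
    AnomalyDetectorName="page-views-detector",
    AnomalyDetectorConfig={"AnomalyDetectorFrequency": "PT1H"},  # hourly cadence
)

lookout.create_metric_set(
    AnomalyDetectorArn=detector["AnomalyDetectorArn"],
    MetricSetName="page-views-by-device",
    MetricList=[{"MetricName": "page_views", "AggregationFunction": "SUM"}],
    DimensionList=["device_type"],
    TimestampColumn={"ColumnName": "timestamp", "ColumnFormat": "yyyy-MM-dd HH:mm:ss"},
    MetricSetFrequency="PT1H",
    MetricSource={
        "S3SourceConfig": {
            "RoleArn": "arn:aws:iam::123456789012:role/LookoutMetricsS3Role",  # placeholder
            "TemplatedPathList": ["s3://my-metrics-bucket/pageviews/{{yyyyMMdd}}/{{HH}}/"],
            "FileFormatDescriptor": {"CsvFormatDescriptor": {"ContainsHeader": True, "Delimiter": ","}},
        }
    },
)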

When detecting an anomaly, Lookout for Metrics helps you focus on what matters the most by assigning a severity score to aid prioritization. To help you find the root cause, it intelligently groups anomalies that may be related to the same incident and summarizes the different sources of impact (as shown below).

Moreover, you can configure an automatic action such as sending a notification via Amazon Simple Notification Service (SNS), Datadog, PagerDuty, Webhooks, or Slack. Or you can trigger a Lambda function, for example to temporarily hide a product on your e-commerce site when a potential pricing error is detected.

Domain knowledge and expertise can often play an important role in determining if a sudden change in a metric is expected or is an anomaly. Lookout for Metrics allows you to provide real-time feedback on the relevance of the detected anomalies, enabling a powerful human-in-the-loop mechanism. This information is fed back to the anomaly detection model to improve its accuracy.

Who’s Using Amazon Lookout for Metrics Today?
Digitata Networks offers intelligent customer, network and site-centric solutions that assist Mobile Network Operators to monitor, audit, control and automate different aspects of their network. Company CTO Nico Kruger has been pleased with the results he’s seen so far from using Lookout for Metrics.

“We discovered the improved accuracy and insights that Lookout for Metrics can bring to our existing solution and we are thrilled to use the service…we can quickly identify opportunities in addition to finding issues,” he said.

Playrix, one of the leading mobile game developers in the world, known for high-quality games such as Township, Fishdom, and Gardenscapes, is another customer that’s been working with the new service. “We experimented with our user acquisition data to understand how the service works and it quickly identified and grouped anomalies enabling us to work faster and better,” said Mikhail Artyugin, Playrix technical director.

“Lookout for Metrics has saved our team many hours of manual investigation and now notifications are viewed as actionable rather than noise, allowing our teams to easily focus on strategic priorities with less technical overhead,” he added.

“Working with almost a billion impressions every day to capture insights and intent for our customers, we need quick feedback on real data anomalies,” said Brian Ecker, a senior staff engineer at NextRoll, a marketing and data technology company with the mission to provide innovative solutions to companies to keep them growing.

“After working with the Lookout for Metrics team, we saw the improved accuracy that the new service can bring to our existing anomaly detection process and we are thrilled to start using it.”

It’s also worth noting that APN partners such as TensorIoT, Quantiphi, and Provectus have expertise in Lookout for Metrics and can help customers leverage its functionalities.

Available in Preview
Amazon Lookout for Metrics is now available in preview in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).

You can interact with the service using the AWS Management Console, the AWS SDKs, and the AWS CLI. Find out more in the technical documentation and get started quickly by joining the preview at the following link.

Request preview access to Amazon Lookout for Metrics here.

Alex

Ingesting Jira data into Amazon S3

Post Syndicated from Vishwa Gupta original https://aws.amazon.com/blogs/big-data/ingesting-jira-data-into-amazon-s3/

Consolidating data from a work management tool like Jira and integrating this data with other data sources like ServiceNow, GitHub, Jenkins, and Time Entry Systems enables end-to-end visibility of different aspects of the software development lifecycle and helps keep your projects on schedule and within budget.

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. Many of our customers choose to build their data lakes on Amazon S3. They find the flexible, pay-as-you-go, cloud model ideal when dealing with vast amounts of heterogeneous data.

This post discusses some of the use cases for ingesting Jira data into an Amazon S3 data lake, the ingestion data flow, and a conceptional approach to ingesting data. We also provide the relevant Python code.

Use cases

Business use cases for Jira data ingestion range from proactive project monitoring to detection and resolution of project effort and cost variances to identification of non-compliance with SDLC processes. In this section, we provide an illustrative but not exhaustive list of use cases and their benefits.

Cognitive project monitoring

Cognitive project monitoring use cases include the following:

  • Automated analytics – You can prevent and reduce project schedule and budget variances by proactively monitoring metrics by combining data from Jira, GitHub, Jenkins, and Time Entry Systems.
  • Automated status reporting – You can use Amazon SageMaker machine learning (ML) models to derive prescriptive metrics by looking at data across various sources. This could reduce a project manager’s time spent stitching data and generating reports, and provide a holistic view of project-tracking metrics.

Automated project compliance and governance

You can analyze user behavior to detect potentially suspicious patterns by building a baseline of user activity. You create this based on primary data from HR (such as role, location, and work hours) and IT infrastructure (such as an assigned asset’s IP address).

Possible business outcomes include the following:

  • Proactively identify user IDs and passwords being shared with other users
  • Detect insider threats, such as abnormal login times, unauthorized access to Jira, and incorrect access permissions in Jira, GitHub, Jenkins, or Time Entry Systems
  • Identify compromised accounts based on frequent logins from unassigned assets or unusual successive authentications
  • Identify theft of corporate IPs based on unusual printing volume, printing project-related documents, and emailing organization-related documents and code to external accounts

Accelerated migration from Jira to another project management product

You can also use AWS Glue and its Data Catalog metadata to map between two products. This could accelerate your data migration.

Overview of solution

One of the most common approaches to ingest data from Jira into AWS is to create a Python module, which is used in AWS Glue or AWS Lambda. The following diagram shows the high-level approach for an end-to-end solution. In this solution, an AWS Glue development endpoint and its associated SageMaker Jupyter notebook instance are used to create the Jira Python module, providing a Jupyter notebook experience with interactive testing and debugging. The scope of this post is limited to the following steps:

  • Setting up access for Jira
  • Using the Python model with AWS Lambda or AWS Glue
  • Incrementally pulling changed data from Jira with JQL (Jira Query Language)
  • Ingesting data to the AWS serverless data lake

Ingesting data from Jira into Amazon S3

The Jira server exposes data using REST APIs and open authorization (OAuth) authentication methods. It uses a three-legged OAuth approach (also called the OAuth dance) to acquire access to the resources served by the APIs. For more information about the following steps, see OAuth for REST APIs.

Generating an RSA public/private key pair

Consumer key and consumer secret details are required for interacting with API endpoints. You store the details inside encrypted SSM parameters.

To use macOS or Linux, run the following OpenSSL commands in the terminal (anywhere in the file system):

openssl genrsa -out jira_privatekey.pem 1024
openssl req -newkey rsa:1024 -x509 -key jira_privatekey.pem -out jira_publickey.cer -days 365
openssl pkcs8 -topk8 -nocrypt -in jira_privatekey.pem -out jira_privatekey.pcks8
openssl x509 -pubkey -noout -in jira_publickey.cer > jira_publickey.pem

To use Windows, download OpenSSL and run it using the path to the bin folder. Create a new environment variable named OPENSSL_CONF with the value "path_to"\openssl.cnf. Run the commands as admin:

"path_to_openssl"\bin\openssl genrsa -out jira_privatekey.pem 1024
"path_to_openssl"\bin\openssl req -newkey rsa:1024 -x509 -key jira_privatekey.pem -out jira_publickey.cer -days 365
"path_to_openssl"\bin\openssl pkcs8 -topk8 -nocrypt -in jira_privatekey.pem -out jira_privatekey.pcks8
"path_to_openssl"\bin\openssl x509 -pubkey -noout -in jira_publickey.cer > jira_publickey.pem

Configuring a REST API-based consumer in Jira

For full instructions on configuring your REST API-based consumer, see Step 2: Configure your client application as an OAuth consumer in OAuth for REST APIs. Be sure to complete the following steps:

  1. In the Link applications section, select Create incoming link.
  2. For Public key, enter the public key you created earlier.

Performing the OAuth dance

In this step, you go through the process of getting the access token from the resource so the consumer can access the resource.

  1. Create the following parameters in the AWS Systems Manager Parameter Store (a boto3 sketch for creating these programmatically appears after this list):
    1. jira_access_private_key – Stores the private key in AWS Systems Manager as a parameter.
    2. jira_access_urls – Stores URLs to access Jira. These URLs are constructed based on display URLs defined in JIRA by adding the additional tags:
      • request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
      • access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
      • authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
      • data_url – https://jiratoawss3.atlassian.net/rest/api/2/search
    3. jira_access_secrets – Stores secrets to access Jira. Initially, only two values are present in this SSM parameter; it’s updated later with access_token. You need the following two parameters to start:
      1. consumer_key
      2. consumer_secret
  2. Download the notebook file and upload it to the SageMaker notebook instance of the AWS Glue development endpoint:
    1. Set up an AWS Glue development endpoint:
      1. In the AWS Glue console, choose Dev endpoints, then choose Add endpoint.
      2. Specify an endpoint name, such as demo-endpoint.
      3. Choose an IAM role with permissions similar to the IAM role that you use to run AWS Glue ETL jobs. For more information, see Create an IAM Role for AWS Glue. Choose Next.
      4. In Networking, leave Skip networking information selected, and choose Next.
      5. In SSH Public Key, enter a public key generated by an SSH key generator program, such as ssh-keygen (do not use an Amazon EC2 key pair). The generated public key is imported into your development endpoint. Save the corresponding private key so you can connect to the development endpoint later using SSH. Choose Next. For more information, see ssh-keygen in Wikipedia.
    2. When the status of the AWS Glue development endpoint shows Ready, follow the steps to set up a SageMaker notebook in your development endpoint.
    3. When the notebook status shows Ready, open the notebook and upload the downloaded notebook file.
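
As referenced in step 1, if you want to create the three SSM parameters from code rather than on the Systems Manager console, a short boto3 sketch like the following does the job. The consumer key, consumer secret, and private key file path are placeholders; the URLs match the examples above.

# Sketch: create the three SSM parameters used by the notebook. The consumer key,
# consumer secret, and private key file path are placeholders.
import json
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

with open("jira_privatekey.pem") as f:
    private_key = f.read()

# Encrypted parameter holding the RSA private key
ssm.put_parameter(Name="jira_access_private_key", Value=private_key,
                  Type="SecureString", Overwrite=True)

# URLs used to perform the OAuth dance and query data
ssm.put_parameter(
    Name="jira_access_urls",
    Value=json.dumps({
        "request_token_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token",
        "access_token_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token",
        "authorize_url": "https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize",
        "data_url": "https://jiratoawss3.atlassian.net/rest/api/2/search",
    }),
    Type="String", Overwrite=True,
)

# Encrypted parameter holding the OAuth consumer details; access_token is added later
ssm.put_parameter(
    Name="jira_access_secrets",
    Value=json.dumps({"consumer_key": "your-consumer-key",
                      "consumer_secret": "your-consumer-secret"}),
    Type="SecureString", Overwrite=True,
)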

You now run the following cells.

  1. Install and import dependent modules with the following code:
    !pip install tlslite
    !pip install oauth2
    
    import urllib
    import oauth2 as oauth
    from tlslite.utils import keyfactory
    import json
    import sys
    import os
    import base64
    import boto3
    from boto3.dynamodb.conditions import Key, Attr
    import datetime
    import logging
    import pprint
    import time
    from pytz import timezone
    
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    

  2. Create an SSM client to connect parameters defined in Systems Manager (update the Region if it’s different than us-east-1):
    ssm = boto3.client("ssm", region_name='us-east-1')

  3. Define the signature class to sign the Jira REST API requests:
    class SignatureMethod_RSA_SHA1(oauth.SignatureMethod):
        name = 'RSA-SHA1'
    
        def signing_base(self, request, consumer, token):
            if not hasattr(request, 'normalized_url') or request.normalized_url is None:
                raise ValueError("Base URL for request is not set.")
    
            sig = (
                oauth.escape(request.method),
                oauth.escape(request.normalized_url),
                oauth.escape(request.get_normalized_parameters()),
            )
    
            key = '%s&' % oauth.escape(consumer.secret)
            if token:
                key += oauth.escape(token.secret)
            raw = '&'.join(sig)
            return key, raw
    
        def sign(self, request, consumer, token):
    
            key, raw = self.signing_base(request, consumer, token)
    
            # SSM support to fetch private key
            ssm_param = ssm.get_parameter(Name='jira_access_private_key', WithDecryption=True)
            jira_private_key_str = ssm_param['Parameter']['Value']
    
            privateKeyString = jira_private_key_str.strip()
    
            privatekey = keyfactory.parsePrivateKey(privateKeyString)
    
            # Used encode() to convert to bytes
            signature = privatekey.hashAndSign(raw.encode())
            return base64.b64encode(signature) 
    

  4. Get the consumer_key and consumer_secret from the SSM parameter that you defined earlier:
    jira_secrets = json.loads(ssm.get_parameter(Name='jira_access_secrets', WithDecryption=True)['Parameter']['Value'])
    jira_secrets
    
    consumer_key = jira_secrets["consumer_key"]
    consumer_key
    
    consumer_secret = jira_secrets["consumer_secret"]
    consumer_secret

  5. Define the URLs for request_token_url, access_token_url, and authorize_url:
    request_token_url = 'input_here'
    access_token_url = 'input_here'
    authorize_url = 'input_here'
    

    These URLs are defined when setting up the REST API endpoint in Jira. They include the following components:

    • request_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/request-token
    • access_token_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/access-token
    • authorize_url – https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize
  6. Generate your request token:
    # Create Consumer using consumer_key and consumer_secret
    consumer = oauth.Consumer(consumer_key, consumer_secret)
    
    # Use Consumer to create oauth client
    client = oauth.Client(consumer)
    
    # Add Signature Method to the client
    client.set_signature_method(SignatureMethod_RSA_SHA1())
    
    # Get response from request token URL using the client
    resp, content = client.request(request_token_url, "POST")
    
    # Convert the content received from previous step into a Dictionary
    request_token = dict(urllib.parse.parse_qsl(content))
    
    # request token has two components oauth_token and oauth_token_secret
    request_token
    

You only need to do this one time every five years (the default setting in Jira).

The following is an example request token value:

b'oauth_token=oFUFV5cqOuoWycnaCXYrkcioHuRw2TbV&oauth_token_secret=CzhMoEsozCV3xFZ179YQoLzRu4DYQHlR'

The following are example values after the URL parse and converted to dict:

  • {b'oauth_token': b'oFUFV5cqOuoWycnaCXYrkcioHuRw2TbV ',
  • b'oauth_token_secret': b'CzhMoEsozCV3xFZ179YQoLzRu4DYQHlR'}
  1. Manually approve the request token by opening the following URL in a browser:
    authorize_url + '?oauth_token=' + request_token[b'oauth_token'].decode()

An example value of the final authorization URL is:

  • https://jiratoawss3.atlassian.net/plugins/servlet/oauth/authorize?oauth_token=wYLlIxmcsnZTHgTy2ZpUmBakqzmqSbww.

When you go to the URL in your output, you see the following screenshot.

  1. Use an approved request token to generate an access token:
    # Create an oauth token using components of request token
    token = oauth.Token(request_token[b'oauth_token'], request_token[b'oauth_token_secret'])
    
    # Use Consumer and token to create oauth client
    client = oauth.Client(consumer, token)
    
    # Add Signature Method to the client
    client.set_signature_method(SignatureMethod_RSA_SHA1())
    
    # Get response from access token URL using the client
    access_token_resp, access_token_content = client.request(access_token_url, "POST")
    
    access_token_content

An example access token is:

b'oauth_token=Ym3UDrs1iYnLUZ1t0TkT1PinfJNN3RLj&oauth_token_secret=FYQfGjLLhbCJg3DXZFaKsE6wsURVfebN&oauth_expires_in=157680000&oauth_session_handle=BulouCOypjssDS3GzeY7Ldi30h0ERWDo&oauth_authorization_expires_in=160272000'

The following are example access token values after URL parse and converted to dict:

  • {b'oauth_token': b'Ym3UDrs1iYnLUZ1t0TkT1PinfJNN3RLj',
  • b'oauth_token_secret': b'FYQfGjLLhbCJg3DXZFaKsE6wsURVfebN',
  • b'oauth_expires_in': b'157680000',
  • b'oauth_session_handle': b'BulouCOypjssDS3GzeY7Ldi30h0ERWDo',
  • b'oauth_authorization_expires_in': b'160272000'}

oauth_authorization_expires_in states when the token expires (in seconds), which is generally 5 years.

  1. Update the access_token key in the SSM parameter jira_access_secrets with the value for access_token_content.

This access token is valid for 5 years (the expires_in key of access_token_content states when the token expires in seconds). Rotating the access key depends on your organization’s security policy and is out of scope for this post.

Using the access token, querying data from Jira, and storing data in Amazon S3

The following are the important points in this step:

  • The Jira REST API returns 50 records at a time, but gives a total record count, which is used to paginate through the result set.
  • The Jira REST API endpoint needs to be updated with JQL filters. JQL allows you to only pick changed records.
  • Data returned from the REST API endpoint is serialized to Amazon S3. The Python code batches the records from Jira pages and commits after every four pages (which is configurable) have been fetched from Jira.
  • JQL is appended to the data_url (defined earlier). When pulling data from Jira, it’s good practice to do a one-time bulk load and then get incremental loads by maintaining the last data pull date in an Amazon DynamoDB table. The following screenshot shows an example of tracking dates in DynamoDB by project. Because it’s an hourly batch, last_ingest_date is rounded up to the hour.
  • The key attributes used for constructing the JQL query and pulling data from Jira are as follows (see the sketch after this list):
    • project – Loop through the projects from DynamoDB and pull data for one project at a time.
    • updated – Last update date for Jira story or task. Data pull from Jira is based on this date.
      • For bulk loads, updated is less than or equal to the batch run date, rounded up to the hour.
      • For incremental loads, updated is greater than or equal to last_ingest_date from DynamoDB, and less than or equal to the batch run date, rounded up to the hour.
      • To use date in JQL, we need to parse the date before we use it.
    • startAt – Jira generally paginates the results every 50 records. This attribute is used to loop through the complete data. For example, if a project has 500 records and page size is 50 records, this attribute is incremented by page size in every iteration, and it takes 10 iterations to get complete data.
    • maxResults – This is the page size set up in Jira (maximum number of records Jira returns in every API call).
  • Use the provided notebook to perform the OAuth dance. The sample code pulls data from Jira based on the approach you specified earlier. The purpose of this code is to accelerate implementing data ingestion from Jira.
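
The following is a rough sketch of how these JQL attributes come together for a single project’s incremental pull. It reuses the consumer, the SignatureMethod_RSA_SHA1 class, and the parsed access token dict (access_token here) from the notebook; the project key, dates, and page size are illustrative values rather than ones taken from the sample code.

# Illustrative only: build a JQL query for one project's incremental load and
# page through the results 50 records at a time. The consumer, the
# SignatureMethod_RSA_SHA1 class, and access_token come from the notebook;
# the project key and dates are placeholders.
import json
import urllib.parse
import oauth2 as oauth

data_url = "https://jiratoawss3.atlassian.net/rest/api/2/search"

token = oauth.Token(access_token[b'oauth_token'], access_token[b'oauth_token_secret'])
client = oauth.Client(consumer, token)
client.set_signature_method(SignatureMethod_RSA_SHA1())

project_key = "DEMO"
last_ingest_date = "2020-12-01 10:00"   # from the DynamoDB tracking table
batch_run_date = "2020-12-01 11:00"     # current batch run, rounded up to the hour

jql = (f'project = "{project_key}" '
       f'AND updated >= "{last_ingest_date}" '
       f'AND updated <= "{batch_run_date}" ORDER BY updated ASC')

start_at, max_results, issues = 0, 50, []
while True:
    url = (f"{data_url}?jql={urllib.parse.quote(jql)}"
           f"&startAt={start_at}&maxResults={max_results}")
    resp, content = client.request(url, "GET")
    page = json.loads(content)
    issues.extend(page.get("issues", []))
    start_at += max_results
    if start_at >= page.get("total", 0):
        break
# issues now holds the changed records, ready to be serialized to Amazon S3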

Cleaning up

To avoid incurring future charges, delete the resources set up as part of this post:

  • AWS Glue Development Endpoint
  • DynamoDB table
  • Systems Manager parameters
  • S3 bucket

Next steps

To extend the usability scope of Jira data in S3 buckets, you can crawl the location to create AWS Glue Data Catalog database tables. Registering the locations with AWS Lake Formation helps simplify permission management and allows you to implement fine-grained access control. You can also use Amazon Athena, Amazon Redshift, Amazon SageMaker, and Amazon QuickSight for data analysis, ML, and reporting services.

Conclusion

This post aims to simplify and accelerate the steps to ingest Jira data into Amazon S3. The solution includes Jira configuration, performing the three-legged OAuth dance, JQL-based attributes for data selection, and Python-based data extraction into Amazon S3.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Glue forum.


About the Authors

Vishwa Gupta is a Data and ML Engineer with AWS Professional Services Intelligence Practice. He helps customers implement big data and analytics platform and solutions. Outside of work, he enjoys spending time with family, traveling, and playing badminton.

Sreeram Thoom is a Data Architect at Amazon Web Services.

Building a scalable streaming data processor with Amazon Kinesis Data Streams on AWS Fargate

Post Syndicated from Florian Mair original https://aws.amazon.com/blogs/big-data/building-a-scalable-streaming-data-processor-with-amazon-kinesis-data-streams-on-aws-fargate/

Data is ubiquitous in businesses today, and the volume and speed of incoming data are constantly increasing. To derive insights from data, it’s essential to deliver it to a data lake or a data store and analyze it. Real-time or near-real-time data delivery can be cost prohibitive; therefore, an efficient architecture is key for processing, and it becomes even more essential with growing data volume and velocity.

In this post, we show you how to build a scalable producer and consumer application for Amazon Kinesis Data Streams running on AWS Fargate. Kinesis Data Streams is a fully managed and scalable data stream that enables you to ingest, buffer, and process data in real time. AWS Fargate is a serverless compute engine for containers that works with AWS container orchestration services like Amazon Elastic Container Service (Amazon ECS), which allows us to easily run, scale, and secure containerized applications.

This solution also uses the Amazon Kinesis Producer Library (KPL) and Amazon Kinesis Client Library (KCL) to ingest data into the stream and to process it. KPL helps you optimize shard utilization in your data stream by specifying settings for aggregation and batching as data is being produced into your data stream. KCL helps you write robust and scalable consumers that can keep up with fluctuating data volumes being sent to your data stream.

The sample code for this post is available in a GitHub repo, which also includes an AWS CloudFormation template to get you started.

What is data streaming?

Before we look into the details of data streaming architectures, let’s get started with a brief overview of data streaming. Streaming data is data that is generated continuously by a large number of sources that transmit the data records simultaneously in small packages. You can use data streaming for many use cases, such as log processing, clickstream analysis, device geo-location, social media data processing, and financial trading.

A data streaming application consists of two layers: the storage layer and the processing layer. As stream storage, AWS offers the managed services Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), but you can also run stream storages like Apache Kafka or Apache Flume on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR. The processing layer consumes the data from the storage layer and runs computations on that data. This could be an Apache Flink application running fully managed on Amazon Kinesis Data Analytics for Apache Flink, an application running a stream processing framework like Apache Spark Streaming or Apache Storm, or a custom application using the Kinesis API or the KCL. For this post, we use Kinesis Data Streams as the storage layer and the containerized KCL application on AWS Fargate as the processing layer.

Streaming data processing architecture

This section gives a brief introduction to the solution’s architecture, as shown in the following diagram.

The architecture consists of four components:

  • Producer group (data ingestion)
  • Stream storage
  • Consumer group (stream processing)
  • Kinesis Data Streams auto scaling

Data ingestion

For ingesting data into the data stream, you use the KPL, which aggregates, compresses, and batches data records to make the ingestion more efficient. In this architecture, the KPL increased the per-shard throughput up to 100 times, compared to ingesting the records with the PutRecord API (more on this in the Monitoring your stream and applications section). This is because the records are smaller than 1 KB each and the example code uses the KPL to buffer and send a collection of records in one HTTP request.

The record buffering can consume enough memory to crash the producer application; therefore, we recommend handling back-pressure. A sample for handling back-pressure is available in the KPL GitHub repo.

Not every use case is suited for using the KPL for ingestion. Due to batching and aggregation, the KPL has to buffer records, and therefore introduces some additional per-record latency. For a large number of small producers (such as mobile applications), you should use the PutRecords API to batch records or implement a proxy that handles aggregation and batching.

In this post, you set up a simple HTTP endpoint that receives data records and processes them using the KPL. The producer application runs in a Docker container, which is orchestrated by Amazon ECS on AWS Fargate. A target tracking scaling policy manages the number of parallel running data ingestion containers. It adjusts the number of running containers so you maintain an average CPU utilization of 65%.

Stream storage: Kinesis Data Streams

As mentioned earlier, you can run a variety of streaming platforms on AWS. However, for the data processor in this post, you use Kinesis Data Streams. Kinesis Data Streams is a data store where the data is held for 24 hours by default and is configurable for retention of up to 1 year. Kinesis Data Streams is designed to be highly available and redundant by storing data across three Availability Zones in the specified Region.

The stream consists of one or more shards, which are uniquely identified sequences of data records in a stream. One shard supports a maximum read rate of 2 MB/s (up to five read transactions per second) and a maximum write rate of 1 MB/s (up to 1,000 records per second). Consumers with Dedicated Throughput (Enhanced Fan-Out) support up to 2 MB/s of data egress per consumer and shard.

Each record written to Kinesis Data Streams has a partition key, which is used to group data by shard. In this example, the data stream starts with five shards. You use random generated partition keys for the records because records don’t have to be in a specific shard. Kinesis Data Streams assigns a sequence number to each data record, which is unique within the partition key. Sequence numbers generally increase over time so you can identify which record was written to the stream before or after another.
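
As a quick illustration of partition keys outside the KPL path, the following minimal boto3 sketch writes a single record with a randomly generated partition key. The stream name is a placeholder for the stream created by the CloudFormation template, and this is not the KPL-based producer used in the solution.

# Minimal sketch: put one record into the stream with a random partition key.
# The stream name is a placeholder.
import json
import uuid
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

response = kinesis.put_record(
    StreamName="my-data-stream",
    Data=json.dumps({"data": "This is a testing record"}).encode("utf-8"),
    PartitionKey=str(uuid.uuid4()),  # random key spreads records across shards
)
print(response["ShardId"], response["SequenceNumber"])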

Stream processing: KCL application on AWS Fargate

This post shows you how to use custom consumers—specifically, enhanced fan-out consumers—using the KCL. Enhanced fan-out consumers have a dedicated throughput of 2 MB/s and use a push model instead of pull to get data. Records are pushed to the consumer from the Kinesis Data Streams shards using HTTP/2 Server Push, which also reduces the latency for record processing. If you have more than one instance of a consumer, each instance has a 2 MB/s fan-out pipe to each shard independent from any other consumers. You can use enhanced fan-out consumers with the AWS SDK or the KCL.

For the producer application, this example uses the KPL, which aggregates and batches records. For the consumer to be able to process these records, the application needs to deaggregate the records. To do this, you can use the KCL or the Kinesis Producer Library Deaggregation Modules for AWS Lambda (support for Java, Node.js, Python, and Go). The KCL is a Java library but also supports other languages via a MultiLangDaemon. The MultiLangDaemon uses STDIN and STDOUT to communicate with the record processor, so be aware of logging limitations. For this sample application, you use enhanced fan-out consumers with the KCL for Python 2.0.1.

Due to the STDOUT limitation, the record processor logs data records to a file that is written to the container logs and published to Amazon CloudWatch. If you create your own record processor, make sure it handles exceptions, otherwise records may be skipped.
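
To show what such a record processor looks like, the following is a stripped-down sketch based on the v2 interface of the amazon_kclpy package; it is not the full sample from the repo, and the log file path mirrors the one used by this solution. It logs each record to a file instead of STDOUT (which the MultiLangDaemon owns) and wraps processing in a try/except so a bad record doesn’t kill the processor.

#!/usr/bin/env python
# Sketch of a KCL for Python record processor (amazon_kclpy v2 interface).
# It logs records to a file and checkpoints after each successfully processed batch.
import logging

from amazon_kclpy import kcl
from amazon_kclpy.v2 import processor

logging.basicConfig(filename="app/logs/record_processor.log", level=logging.INFO)
logger = logging.getLogger(__name__)

class RecordProcessor(processor.RecordProcessorBase):
    def initialize(self, initialize_input):
        logger.info("Initialized for shard %s", initialize_input.shard_id)

    def process_records(self, process_records_input):
        try:
            for record in process_records_input.records:
                data = record.binary_data.decode("utf-8")
                logger.info("partition_key=%s data=%s", record.partition_key, data)
            # Checkpoint so the KCL records progress in its DynamoDB table
            process_records_input.checkpointer.checkpoint()
        except Exception:
            # Log and continue; unhandled exceptions here can cause skipped records
            logger.exception("Error while processing a batch of records")

    def shutdown(self, shutdown_input):
        if shutdown_input.reason == "TERMINATE":
            shutdown_input.checkpointer.checkpoint()

    def shutdown_requested(self, shutdown_requested_input):
        shutdown_requested_input.checkpointer.checkpoint()

if __name__ == "__main__":
    kcl.KCLProcess(RecordProcessor()).run()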

The KCL creates an Amazon DynamoDB table to keep track of consumer progress. For example, if your stream has four shards and you have one producer instance, your instance runs a separate record processor for each shard. If the consumer scales to two instances, the KCL rebalances the record processor and runs two record processors on each instance. For more information, see Using the Kinesis Client Library.

A target tracking scaling policy manages the number of parallel running data processor containers. It adjusts the number of running containers to maintain an average CPU utilization of 65%.

Container configuration

The base layer of the container is Amazon Linux 2 with Python 3 and Java 8. Although you use KCL for Python, you need Java because the record processor communicates with the MultiLangDaemon of the KCL.

During the Docker image build, the Python library for the KCL (version 2.0.1 of amazon_kclpy) is installed, and the sample application (release 2.0.1) from the KCL for Python GitHub repo is cloned. This allows you to use helper tools (samples/amazon_kclpy_helper.py) so you can focus on developing the record processor. The KCL is configured via a properties file (record_processor.properties).

For logging, you have to distinguish between logging of the MultiLangDaemon and the record processor. The logging configuration for the MultiLangDaemon is specified in logback.xml, whereas the record processor has its own logger. The record processor logs to a file and not to STDOUT, because the MultiLangDaemon uses STDOUT for communication, therefore the Daemon would throw an unrecognized messages error.

Logs written to a file (app/logs/record_processor.log) are attached to container logs by a subprocess that runs in the container entry point script (run.sh). The start script also runs set_properties.py, which uses environment variables to set the AWS Region, stream name, and application name dynamically. If you also want to change other properties, you can extend this script.

The container gets its permissions (such as to read from Kinesis Data Streams and write to DynamoDB) by assuming the role ECSTaskConsumerRole01. This sample deployment uses 2 vCPU and 4 GB memory to run the container.

Kinesis capacity management

When changes in the rate of data flow occur, you may have to increase or decrease the capacity. With Kinesis Data Streams, you can have one or more hot shards as a result of unevenly distributed partition keys, very similar to a hot key in a database. This means that a certain shard receives more traffic than others, and if it’s overloaded, it produces a ProvisionedThroughputExceededException (enable detailed monitoring to see that metric on shard level).

You need to split these hot shards to increase throughput, and merge cold shards to increase efficiency. For this post, you use random partition keys (and therefore random shard assignment) for the records, so we don’t dive deeper into splitting and merging specific shards. Instead, we show how to increase and decrease throughput capacity for the whole stream. For more information about scaling on a shard level, see Strategies for Resharding.

You can build your own scaling application using the UpdateShardCount, SplitShard, and MergeShards APIs, or use the custom resource scaling solution described in Scale Amazon Kinesis Data Streams with AWS Application Auto Scaling or the Amazon Kinesis Scaling Utils. The Application Auto Scaling solution is an event-driven scaling architecture based on CloudWatch alarms, whereas the Scaling Utils is a Docker container that constantly monitors your data stream. Application Auto Scaling manages the number of shards for scaling, whereas the Kinesis Scaling Utils additionally handles shard keyspace allocations, hot shard splitting, and cold shard merging. For this solution, you use the Kinesis Scaling Utils and deploy it on Amazon ECS. You can also deploy it on AWS Elastic Beanstalk as a container or on an Apache Tomcat platform.
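If you only need to resize the whole stream, UpdateShardCount with uniform scaling is the simplest of these APIs. The following sketch doubles a four-shard stream; the stream name and target count are placeholders.

import boto3

kinesis = boto3.client("kinesis")

# Uniform scaling splits or merges shards evenly across the key space.
kinesis.update_shard_count(
    StreamName="data-processor-stream",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)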

Prerequisites

For this walkthrough, you must have an AWS account.

Solution overview

In this post, we walk through the following steps:

  1. Deploying the CloudFormation template.
  2. Sending records to Kinesis Data Streams.
  3. Monitoring your stream and applications.

Deploying the CloudFormation template

Deploy the CloudFormation stack by choosing Launch Stack:

The template launches in the US East (N. Virginia) Region by default. To launch it in a different Region, use the Region selector in the console navigation bar. The following Regions are supported:

  • US East (Ohio)
  • US West (N. California)
  • US West (Oregon)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Europe (Frankfurt)
  • Europe (Ireland)

Alternatively, you can download the CloudFormation template and deploy it manually. When asked to provide an IPv4 CIDR range, enter the CIDR range that can send records to your application. You can change it later on by adapting the security groups inbound rule for the Application Load Balancer.

Sending records to Kinesis Data Streams

You have several options to send records to Kinesis Data Streams. You can do it from the CLI or any API client that can send REST requests, or use a load testing solution like Distributed Load Testing on AWS or Artillery. With load testing, additional charges for requests occur; as a guideline, 10,000 requests per second for 10 minutes generate an AWS bill of less than $5.00. To make a POST request via curl, run the following command and replace ALB_ENDPOINT with the DNS record of your Application Load Balancer. You can find it on the CloudFormation stack's Outputs tab. Ensure the request body contains a JSON element named "data"; otherwise, the application can't process the record.

curl --location --request POST '<ALB_ENDPOINT>' --header 'Content-Type: application/json' --data-raw '{"data":"This is a testing record"}'
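If you prefer Python over curl, the same request can be sent with the requests library. The endpoint below is a placeholder for the Application Load Balancer DNS name from the stack's Outputs tab.

import requests

# Replace with the Application Load Balancer DNS name from the stack outputs.
alb_endpoint = "http://<ALB_ENDPOINT>"

response = requests.post(
    alb_endpoint,
    json={"data": "This is a testing record"},  # the "data" element is required
    timeout=5,
)
print(response.status_code, response.text)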

Your Application Load Balancer is the entry point for your data records, so all traffic has to pass through it. Application Load Balancers automatically scale to the appropriate size based on traffic by adding or removing different sized load balancer nodes.

Monitoring your stream and applications

The CloudFormation template creates a CloudWatch dashboard. You can find it on the CloudWatch console or by choosing the link on the stack’s Outputs tab on the CloudFormation console. The following screenshot shows the dashboard.

This dashboard shows metrics for the producer, consumer, and stream. The metric Consumer Behind Latest gives you the offset between current time and when the last record was written to the stream. An increase in this metric means that your consumer application can’t keep up with the rate records are ingested. For more information, see Consumer Record Processing Falling Behind.

The dashboard also shows you the average CPU utilization for the consumer and producer applications, the number of PutRecords API calls to ingest data into Kinesis Data Streams, and how many user records are ingested.

Without the KPL, you would see one PutRecord call per user record, but in our architecture, you should see a significantly higher number of user records than PutRecords calls. The ratio between UserRecords and PutRecords operations strongly depends on the KPL configuration parameters. For example, if you increase the value of RecordMaxBufferedTime, data records are buffered longer at the producer and more records can be aggregated, but the ingestion latency increases.

All three applications (including the Kinesis Data Streams scaler) publish logs to their respective log groups (for example, ecs/kinesis-data-processor-producer) in CloudWatch. You can check either the CloudWatch logs of the scaling application or the data stream metrics to see the scaling behavior of Kinesis Data Streams.

Cleaning up

To avoid additional cost, ensure that the provisioned resources are decommissioned. To do that, delete the images in the Amazon Elastic Container Registry (Amazon ECR) repository, the CloudFormation stack, and any remaining resources that the CloudFormation stack didn’t automatically delete. Additionally, delete the DynamoDB table DataProcessorConsumer, which the KCL created.
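You can do all of this on the respective consoles; as a scripted alternative, the following boto3 sketch covers the same steps. The ECR repository and stack names are placeholders for the names used in your deployment; the DynamoDB table name comes from the KCL application name.

import boto3

ecr = boto3.client("ecr")
cloudformation = boto3.client("cloudformation")
dynamodb = boto3.client("dynamodb")

REPOSITORY = "kinesis-data-processor"  # placeholder repository name
STACK = "kinesis-data-processor"       # placeholder stack name

# Delete the images first so the repository can be removed with the stack.
image_ids = ecr.list_images(repositoryName=REPOSITORY)["imageIds"]
if image_ids:
    ecr.batch_delete_image(repositoryName=REPOSITORY, imageIds=image_ids)

cloudformation.delete_stack(StackName=STACK)

# The KCL created this table outside of CloudFormation, so delete it separately.
dynamodb.delete_table(TableName="DataProcessorConsumer")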

Conclusion

In this post, you saw how to run the KCL for Python on AWS Fargate to consume data from Kinesis Data Streams. The post also showed you how to scale the data production layer (KPL), data storage layer (Kinesis Data Streams), and the stream processing layer (KCL). You can build your own data streaming solution by deploying the sample code from the GitHub repo. To get started with Kinesis Data Streams, see Getting Started with Amazon Kinesis Data Streams.


About the Author

Florian Mair is a Solutions Architect at AWS. He is a technologist who helps customers in Germany succeed and innovate by solving business challenges using AWS Cloud services. Besides working as a Solutions Architect, Florian is a passionate mountaineer and has climbed some of the highest mountains across Europe.

Amazon QuickSight: 2020 in review

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-2020-in-review/

As 2020 draws to a close, we’ve put together this post to walk you through all that’s changed in Amazon QuickSight in 2020. For your reading convenience, this post is broken up into the following sections:

  • Embedded Analytics at scale
  • Faster insights with Q & Machine Learning (ML)
  • Business Intelligence (BI) with QuickSight
    • Build Rich, Interactive Dashboards
    • Scale your data with SPICE
    • Centralized governance, security, and administration
  • QuickSight learning resources

Embedded Analytics at scale

Embedded analytics has been a key focus area for both independent software vendors (ISVs) and enterprises in the past year. Thousands of customers have chosen to launch embedded dashboards in QuickSight for both their internal and external end-users, including the NFL, Blackboard, Comcast, Panasonic Avionics, EHE Health, and more, with the ability to scale from tens of users to tens of thousands without any server provisioning or management. Embedding analytics within existing end-user workflows has meant better engagement in apps and portals, and less fatigue for users because they don't need to browse multiple apps for insights.

Dashboard embedded within the website of a fictional company

In 2020, we launched namespaces in QuickSight, adding native support for multi-tenancy, an important consideration for developers and ISVs looking for in-app analytics. Together with embedded QuickSight authoring (dashboard creation capabilities), namespaces allow developers and ISVs to use QuickSight to provide self-service data exploration and dashboard creation options to specific users of an application or portal. For example, ISVs can empower power users to build their own dashboards and share them with other end-users in the same organization, which provides feature differentiation for the ISV's app as well as new revenue opportunities. We also updated the QuickSight JavaScript SDK to allow you to pass parameters into an embedded dashboard, switch tabs, change languages, and listen to events within an embedded dashboard.

As developers, ISVs, and enterprises roll out analytics with QuickSight to hundreds of thousands of users, we also saw the need for a pricing model that scales with growth. The QuickSight per-user model for dashboard readers caps charges at $5 per reader per month, a low per-user price point for BI use cases. However, as dashboards are rolled out to hundreds of thousands of users, we find that average session consumption per user is usually low (less than 10 sessions a month), but aggregate session usage across users within the account is high.

To ensure scalable pricing for growth, we launched a new session capacity pricing model that allows you to commit to session usage in bulk and benefit from increased per-session discounts with larger commitments. With session capacity pricing, sessions are simply charged in 30-minute blocks of usage starting with first access. Unlike traditional options for capacity pricing, which require a server with annual commitments, session capacity pricing allows you to get started easily with a $250 per month starter option. After starting with the monthly option, you can move to one of the annual session capacity tiers as your session usage increases, ensuring that costs scale with growing usage. Annual session capacities allow sessions to be consumed across the year, providing flexibility in ramping up on production traffic and balancing session needs across busy or lean periods of the year (for example, the first or last week of the month may be busy, while holidays may be slow or busy depending on the nature of the business). For more details on the new pricing options, see Amazon QuickSight Pricing.

With this new model in place, we now also support embedding QuickSight dashboards in scenarios where provisioning and managing end-user reader identities might not be convenient (or even possible). Examples of these situations include embedded dashboards for sharing public statistics within a city or county, dashboards of company-wide statistics available to all users to be embedded on an internal corporate portal, or embedded dashboards with the same data intended for tens of thousands of users per customer, department, or job location or site within an app.
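In these scenarios, the application backend requests an embed URL for an anonymous reader and passes it to the frontend. The following is a minimal boto3 sketch; the account ID, dashboard ID, and namespace are placeholders.

import boto3

quicksight = boto3.client("quicksight")

# Request an embed URL for an anonymous (unprovisioned) reader.
response = quicksight.get_dashboard_embed_url(
    AwsAccountId="111122223333",              # placeholder account ID
    DashboardId="city-statistics-dashboard",  # placeholder dashboard ID
    IdentityType="ANONYMOUS",
    Namespace="default",
    SessionLifetimeInMinutes=60,
)

# The frontend loads this URL with the QuickSight JavaScript SDK.
embed_url = response["EmbedUrl"]
print(embed_url)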

To make it easier to get started with QuickSight embedded analytics, we also have a new embedded analytics developer portal. Learn more about how to embed, view code samples, see capabilities of the JavaScript SDK, and interact with embedded dashboards as well as the authoring experience without having to sign up or code!

Faster insights with Q and ML

While embedding dashboards provides a way to present users with key statistics and insights that are curated ahead of time, end-users often want to dig in further and ask detailed questions. While embedded authoring capabilities allow for such detailed exploration, it is often power users or analysts who use those features. For most regular end-users, it is either the app developer (in the case of embedded apps) or the BI team (in the case of BI dashboards) that provides answers to such detailed questions. Both app developers and BI teams are often backlogged and take time to respond. As a result, this workflow can be tedious, time consuming, and disruptive to both the decision-makers and the data teams. This prompted us to develop Q, the machine learning (ML)-powered natural language query capability in QuickSight. With Q, business users of QuickSight can ask questions in everyday business language and get answers in seconds. For more information on Amazon QuickSight Q, see the product page and our blog post New – Amazon QuickSight Q answers natural-language questions about business data. Q is now available in preview; sign up now.

Separately, we have also continued our efforts to integrate AWS advances in ML into QuickSight to provide easy-to-use ML capabilities for business analysts and end-users. The most popular feature of our ML-Insights suite has been natural language narratives in QuickSight, which use natural language generation to provide key insights that you can customize and add to dashboards. We added the ability to add images and links to these insights (and conditionally show them as needed), allowing you to use them as a versatile component of your dashboards that presents up-to-date insights in simple business language for easy consumption by end-users.

We also made the Amazon SageMaker integration generally available this year. With this integration, you can perform ML-powered inferences on your datasets with just a few clicks. These include predicting the likelihood of customer churn, scoring leads to prioritize sales activity, assessing credit risk for loan applications, and more. With a native integration, data scientists and data engineering teams no longer have to do the heavy lifting of writing code for ETL jobs, consuming inference APIs, storing output in a queryable source, and finally importing to QuickSight. For more information, see Visualizing Amazon SageMaker machine learning predictions with Amazon QuickSight. For customers who want to understand anomalies in their data better without setting up or managing SageMaker, we continue to enhance anomaly detection in QuickSight. Anomaly detection runs against your data in a database, data warehouse, data lake, or the QuickSight SPICE in-memory store, and alerts you to anomalous data points. For more information, see Setting Up ML-Powered Anomaly Detection for Outlier Analysis.

Business Intelligence (BI) with QuickSight

With a serverless architecture, QuickSight allows customers to get started with rich, interactive BI dashboards in no time. Customers such as Capital One, Best Western, Rio Tinto, and the NFL have taken the stage at re:Invent previously to talk about how a serverless model has allowed them to launch to tens of thousands of users without the typical infrastructure planning and monitoring work. Additionally, the completely web-based dashboard authoring experience in QuickSight means analysts don't need to download any clients to create and publish dashboards. Administrators also don't spend weeks of time and effort on software updates, because QuickSight is fully managed, with updates twice a month. Overall, this translates to an easy setup and rollout of QuickSight as a business intelligence (BI) solution for thousands of users.

Build Rich, Interactive Dashboards

We introduced a significant number of enhancements around dashboards and the dashboard creation experience in 2020. We introduced six new chart types: filled (choropleth) map, histogram, funnel chart, stacked area chart, waterfall chart, and boxplot, bringing the total number of supported chart types to 28. QuickSight now also supports reference lines, based either on a value from the chart itself or on an external calculated field, which allow authors to convey additional information to the reader. QuickSight charts now also support custom sorting using a field not present in the visual, allowing custom ordering of fields based on business context. Through the narrative component available as part of ML-Insights, you can also add relevant text, images, and links to dashboards, allowing for rich visual layouts.

Tables and pivot tables are among the most popular visual types among business users, and we’ve introduced a number of enhancements to these, including text wrap for headers, text alignment for content and headers, customization of header and total text, representing null data, and more. You can also now export tables and pivot tables to Excel, in addition to the existing CSV option.

On the filtering front, we've added support for on-sheet filter controls. You can create these controls with a single click, and also scope them to work across multiple datasets used on the dashboard sheet. Available controls include drop-downs, date/time pickers, slider controls, date range controls, and relative date controls, providing a breadth of options for dashboards. You can set these controls to cascade so that one filters the other and only shows relevant values to the end-user.

We also followed up on the filter actions launched last year by adding the ability to cascade these interactive options for users. This means you can choose a state from a visual on a dashboard, which filters the entire dashboard by the selected state, and then drill down further into the counties within that state shown on another visual. Lastly, we introduced dashboard persistence for readers, allowing them to pick up a dashboard where they left it: filter control and sheet selections are retained for each user based on where they last left off.

Authors of dashboards continue to see new calculations available, with support added for:

  • Minimums and maximums for date fields
  • Modulo operations
  • First and last values of metrics and dimensional values
  • Discrete and continuous percentile

Themes for dashboards now also support a selection of fonts, allowing you to pick the font that aligns well with your design language and corporate standards.

Lastly, QuickSight dashboards now support an optimized scaling mode, which ensures that a dashboard is optimized for the most common screen size expected, yet scales well when shown on a projector or a larger screen, or when emailed to end-users.

We also continue to invest in our mobile apps (iOS and Android), which allow you to access your dashboards on the go, and interact with filters, actions, drill downs, and more.


Scale your data with SPICE

One of the key benefits of choosing QuickSight for your BI needs is QuickSight's SPICE in-memory data store, which provides fast, interactive access to data for both authors and readers, and automatically scales to meet high concurrency needs. Each dataset in SPICE can now be up to 250 million rows in size (or 500 GB), up from 100 million rows previously. SPICE data can be ingested from supported database or SaaS sources via the API or on a schedule, with email notifications available on failure.

For relational data sources and Athena, QuickSight supports both SPICE and direct query options, giving you the choice depending on your use case and performance needs. Choosing SPICE with summarized sets of data allows you to reduce traffic on your backend data stores while also automatically scaling to peaks in usage (such as at the end of the month) without lifting a finger. At the same time, direct query is useful when data may change frequently, or you need access to historical data residing in a data warehouse for occasional analysis. Dashboards in QuickSight can support a combination of both SPICE and direct query datasets, with cross visual filters and actions allowing fluid interactions.

Authors and data admins in QuickSight can combine data across multiple data sources (e.g., flat file, Amazon Redshift, Snowflake, Amazon RDS) and bring the resultant data into SPICE for dashboarding and analysis. In 2020, we added support for joining datasets using custom SQL statements with tables from another data source, other custom SQL datasets, or flat files. For more information, see Joining Data.

Centralized governance, security, and administration

With many organizations moving operations almost completely remote, governance, security, and administration have been key areas of focus this year. Within QuickSight, all operations are logged to AWS CloudTrail, providing auditability, alerting, and traceability as users make changes and access data or assets in QuickSight. Unlike many BI options in the market, QuickSight doesn't follow a disjoint client/server model, but instead offers centralized control over data, assets, and user capabilities. We launched two key features that further emphasize this: folders and user permissions.

Folders in QuickSight come in two types: personal folders that authors can use to organize content for themselves, and shared folders that admins can create and share with authors and readers (at the user or group level), with the ability to delegate permissions to authors. Shared folders allow strong permissions to be enforced across the organization and help control user access to data and sharing. User permissions customization allows admins to restrict user permissions for actions such as sharing assets, creating folders, or downloading data as CSV. The combination of CloudTrail logging, shared folders, and user-level permissions management allows admins to create strong governance around a QuickSight account.

A key element in a governed analytics setup is enforcing security controls and common definitions in datasets used across the organization. We introduced column-level security as a native option in QuickSight, easily configurable from the UI or APIs. This feature complements the existing row-level security feature. Together, they give authors control over who can access what data within their visualizations. QuickSight now also allows data owners to create datasets that provide pre-built calculations, column-level metadata, and other customizations that make it easier for authors consuming the dataset to understand the data and build on it.
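As a sketch of the API route, a column-level security rule is a list of principals paired with the columns they may see, supplied through the ColumnLevelPermissionRules parameter when creating or updating a dataset; the account ID, dataset ID, user ARN, and column names below are placeholders.

import boto3

quicksight = boto3.client("quicksight")

# Shape of a column-level security rule: these principals may see these columns.
# Passed as ColumnLevelPermissionRules when creating or updating a dataset.
column_rules = [
    {
        "Principals": [
            "arn:aws:quicksight:us-east-1:111122223333:user/default/finance-analyst"
        ],
        "ColumnNames": ["revenue", "cost"],
    }
]

# Inspect the rules currently applied to an existing dataset.
response = quicksight.describe_data_set(
    AwsAccountId="111122223333",
    DataSetId="sales-dataset",
)
print(response["DataSet"].get("ColumnLevelPermissionRules"))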

QuickSight is the only BI tool to provide native AWS Identity and Access Management (IAM) permissions-based control over Amazon Simple Storage Service (Amazon S3), Amazon Athena, and other AWS native data sources. QuickSight has AWS PrivateLink support for secure data connectivity to databases in your VPC, data warehouses such as Amazon Redshift or Snowflake, or big data options such as Presto. QuickSight now also supports AWS Lake Formation permissions for data accessed via Athena and has added native connectivity to Amazon Timestream and Amazon Elasticsearch Service (Amazon ES). QuickSight also supports Oracle databases, either Amazon Relational Database Service (Amazon RDS) or running on Amazon Elastic Compute Cloud (Amazon EC2) or on premises.

On the account administration and management front, we launched APIs (including analysis and theme APIs) that enable admins to create a centralized view of assets within the account, and offer customizations that allow ISVs to provide a more integrated experience for embedded analytics users, including a default organization-specific theme for authors.

Admins will also appreciate QuickSight's support for a variety of authentication options, including Active Directory (using AWS AD Connector acting as a proxy to the on-premises AD) or federated IAM users via an identity provider such as Azure AD, Okta, Ping, or others. We refined the end-user experience for the latter by introducing options for the user to authenticate from the QuickSight login page (SP-initiated login), as opposed to navigating to the identity provider every time. This enables a more business user-friendly experience. Customers wanting to turn off non-federated users within their accounts can also do so by logging a support request.

Finally, we continue to expand to make QuickSight accessible to more users and use cases, with expansion to the India (Mumbai) and AWS GovCloud (US-West) Regions, as well as support for five new European languages: Danish, Dutch, Finnish, Norwegian, and Swedish, making QuickSight available in 12 AWS Regions and 15 languages.

QuickSight learning resources

New to QuickSight? Use our 2-month free trial (four authors free for 2 months) to take a quick test drive. You will be asked to sign up for AWS if you don’t have an AWS account. If you already have an account from your company or organization, navigate to the AWS Management Console and choose QuickSight to get started. To learn more about embedding QuickSight, use our embedded developer portal and check out our API documentation.

If you’re an existing QuickSight user, or have used QuickSight in the past, check out our YouTube channel, or dive into one of the following workshops to learn more about your area of interest:

  • Build Advanced Analytics and Dashboards with Amazon QuickSight Part 1, Part 2 – If you’re new to QuickSight or looking to build advanced calculations in QuickSight, this workshop is for you. It provides step-by-step instructions to grow your dashboard building skills from basic to advanced level. You learn how to set up QuickSight in your AWS account, import and prepare data, and build advanced visuals to analyze data in a meaningful way.
  • Embed Amazon QuickSight Dashboards – In this virtual, instructor-led workshop, our Global QuickSight Solutions Architect walks you through setting up a reusable embedding framework. We also include a CloudFormation template to quickly spin up this framework.
  • Administration on QuickSight – This virtual, instructor-led workshop is designed with a real-world ISV use case in mind. The ISV pipeline for data curation, data analysis, and dashboard publishing is addressed with distinct end-user personas. We also discuss how development, test, and production environments can be managed and operationalized by admins in a single or multiple QuickSight accounts.
  • Administration Level-Up – This virtual, instructor-led workshop is designed with a real-world ISV use case in mind. Learning objectives include automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation.
  • Cost Intelligence Dashboard Setup – Existing AWS admins can check out this video of how to use QuickSight to quickly set up dashboards of their AWS Cost and Usage Report data and provide access to business users within the organization.

Looking ahead

To see the full list of 2020 launches, see What’s New in Amazon QuickSight or subscribe to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs. You can also check out the repository of QuickSight blogs in the AWS Big Data Blog.

We have a packed roadmap for 2021, and continue to focus on enabling you with insights from all your data that you can share with your users, while not having to worry about operations and servers. Thank you for your support.

We wish you all the very best in the New Year!


About the Authors

Jose Kunnackal John is a principal product manager for Amazon QuickSight.

Sahitya Pandiri is a technical program manager with Amazon Web Services.