Tag Archives: Analytics

Scale read and write workloads with Amazon Redshift

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/scale-read-and-write-workloads-with-amazon-redshift/

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehouse that enables you to analyze large datasets using standard SQL. The concurrency scaling feature in Amazon Redshift automatically adds and removes capacity by adding concurrency scaling to handle demands from thousands of concurrent users, thereby providing consistent SLAs for unpredictable and spiky workloads such as BI reports, dashboards, and other analytics workloads.

Until now, concurrency scaling only supported auto scaling for read queries; write queries had to run on the main cluster. Now, we are extending concurrency scaling to support auto scaling for common write queries including COPY, INSERT, UPDATE, and DELETE. This is available on Amazon Redshift RA3 provisioned instance types in the Regions where concurrency scaling is available. Amazon Redshift serverless comes with built in dynamic auto scaling capability for read workload scaling.

In this post, we discuss how to enable concurrency scaling to offer consistent SLAs for concurrent workloads such as data loads, ETL (extract, transform, and load), and data processing with reduced queue times.

Concurrency scaling overview

With concurrency scaling, Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency scaling resources are added to your Amazon Redshift cluster transparently in seconds, as concurrency increases, to serve sudden spikes in concurrent requests with fast performance without wait time. When the workload demand subsides, Amazon Redshift automatically shuts down concurrency scaling resources to save you cost.

The following diagram shows how concurrency scaling works at a high level.

The workflow contains the following steps:

  1. All queries go to the main cluster.
  2. When queries in the designated workload management (WLM) queue begin queuing, Amazon Redshift automatically routes eligible queries to the new clusters, enabling concurrency scaling.
  3. Amazon Redshift automatically spins up a new cluster, processes waiting queries, and shuts down the concurrency scaling cluster when no longer needed.

Enable Amazon Redshift concurrency scaling

You can manage concurrency scaling at the WLM queue level, where you set concurrency scaling policies for specific queues. When concurrency scaling is enabled for a queue, eligible write and read queries are sent to concurrency scaling clusters without having to wait for resources to free up on the main Amazon Redshift cluster. Amazon Redshift handles spinning up concurrency scaling clusters, routing of the queries to the transient clusters, and relinquishing the concurrency clusters.

You can enable concurrency scaling on both automatic and manual WLM.

You first need to determine which parameter group your cluster is. To do so, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose your cluster.
  3. On the Properties tab, note the parameter group associated to the cluster.
    Now you can configure your WLM parameters.
  4. Under Configurations in the navigation pane, choose Workload management.
  5. Choose the parameter group associated to the cluster.If you’re using the default parameter group default.redshift-1.0, you need to create a custom parameter group and assign that to the cluster. The default parameter group has preset values for each of its parameters, and it can’t be modified.
  6. On the Parameters tab, you can choose between 1–10 max_concurrency_scaling_clusters.This is the max number of concurrent Amazon Redshift clusters you can have running at the same time. Ten is the soft limit; this limit can be increased by submitting a service limit increase request with a support case.
  7. On the Workload management tab, choose auto mode for the concurrency scaling cluster.

Example use cases

In this section, we use three use cases to help you understand how concurrency scaling for read and write heavy workloads can seamlessly scale to improve workload performance SLAs.

We used a 3 TB Cloud DW benchmark dataset. The test included a total of 103 concurrent queries, with each run using a separate database connection. The 103 queries constituted 60 queries from the 99 TPC-DS queries and 43 write queries, with a mix of copy, insert, update and delete statements. We used RA3.4xlarge 5 compute nodes.

The following scenarios showcase how concurrency scaling for reads and writes can seamlessly auto scale and positively impact a heavy concurrent mixed workload:

  • All queries triggered concurrently with concurrency scaling turned off
  • All queries triggered concurrently with concurrency scaling cluster limit set to 5 clusters
  • All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

Scenario 1: All queries triggered concurrently with concurrency scaling turned off

In this benchmark test, all queries completed in 299 minutes. The following are the test details.

The Amazon Redshift query optimizer turned the 103 queries into 257 sub-queries for better performance in this run. Amazon Redshift continuous to learn from operational statistics to optimize your workload.

The following screenshot shows how Amazon Redshift auto WLM mode chose to run 16 queries concurrently while queuing the rest. Because concurrency scaling is turned off, no additional clusters are spun up and the queries continue to wait for running queries to complete before they can be processed. Notice the number of queries queued stayed at a higher number for a long period of time and eventually lowered as only a few queries could concurrently run.

No additional concurrent clusters spun up during the window of the workload, as seen in the following screenshot, requiring the primary cluster to process all the queries.

Scenario 2: All queries triggered concurrently with concurrency scaling cluster max limit set to 5 clusters

In this test, all queries completed in 49 minutes.

The following screenshot depicts significant queuing. Within seconds, five additional Amazon Redshift clusters are spun up into ready state, allowing 53 queries to run simultaneously. This number can change in your cluster based on the query types. Notice the number of queries queued starts lowering as more queries are completed using the five additional clusters.

Over time, the concurrency scaling clusters start to wind down progressively to 0 as the queries no longer waited.

Scenario 3: All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

In this test, all queries completed in 28 minutes.

The following screenshot depicts significant queuing. Within seconds, 10 additional Amazon Redshift clusters are spun up into ready state, allowing multiple queries to run simultaneously. This number can change in your cluster based on the query types. Notice the number of queries queued starts lowering as more queries are completed using the five additional clusters.

Over time, the concurrency scaling clusters start to wind down progressively to 0 as the queries no longer waited.

Test results review

The following table summarizes our test results.

. Test Scenario 1 Test Scenario 2 Test Scenario 3
Total Workload Completion Time 299 Minutes 49 Minutes 28 Minutes

The test results reveal how concurrency scaling for a mixed workload of reads and writes lowered the total workload completion time from 299 minutes to 28 minutes, which is more than 10 times an improvement in SLAs while being cost effective by only paying for the additional clusters when scaling is necessary.

Monitor concurrency scaling

One method to monitor concurrency scaling is via system views. To monitor which queries benefitted from concurrency scaling, you can use concurrency_scaling_status from stl_query. Concurrency scaling of 1 indicates that the query ran on a concurrency scaling cluster. To monitor concurrency scaling usage, you can use the SVCS_CONCURRENCY_SCALING_USAGE system view.

The Amazon CloudWatch metrics ConcurrencyScalingActiveClusters and ConcurrencyScalingSeconds enable you to set up monitoring of concurrency scaling usage. For more information, refer to Monitoring Amazon Redshift using CloudWatch metrics.

Configure usage limit

With every 24 hours used of the main Amazon Redshift cluster, you accrue 1 hour of concurrency scaling credit. This free credit can be used by both read and write queries. For any usage that exceeds the accrued free usage credits, you’re billed on a per-second basis based on the on-demand rate of your Amazon Redshift cluster. You can apply cost controls for concurrency scaling at the cluster level. You can choose to create multiple queues for ETL, Dashboard, and adhoc workload. With this you can choose to turn on concurrency scaling for selective queues.

As shown in the following screenshot, you can choose a time period (daily, weekly, or monthly) and specify the desired usage limit. You can then choose an action option (Alert, Log to system table, or Disable feature). For more details on how to set cost controls for concurrency scaling, refer to Manage and control your cost with Amazon Redshift Concurrency Scaling and Spectrum.

Summary

In this post, we showed how you can enable concurrency scaling to help you meet the SLAs for both read and write workloads by seamlessly scaling out to the maximum number of clusters you configured, thereby increasing your cluster throughput while controlling your costs. Concurrency scaling with read and write capability can enable you to handle a number of scenarios, such as sudden increases in the volume of data in your data pipeline, backfill operations, ad hoc reporting, and month end processing. It’s now time to put this learning into action and begin optimizing your Redshift cluster(s) for both read and write throughput!


About the Authors

Harsha Tadiparthi is a specialist Principal Solutions Architect, Analytics at AWS. He enjoys solving complex customer problems in databases and analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Harshida Patel is a Specialist Principal Solutions Architect, Analytics with AWS.

Ramu Ponugumati is a Sr. Technical Account Manager, specialist in Analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing tennis, and gardening.

The AWS Customer Data Platform: overview and architecture

Post Syndicated from Larry Bell original https://aws.amazon.com/blogs/architecture/aws-customer-data-platform-overview-and-architecture/

The deprecation of digital consumer identifiers, such as third-party cookies and mobile advertising IDs, and the rapid growth of data from expanding consumer touchpoints, has created challenges in identifying, managing, and reaching customers in digital channels. Organizations must rethink their strategies for collecting and storing customer data. Customer Data Platforms (CDPs) collect, aggregate, and organize customer data sources, and create individual centralized customer profiles to better manage and understand customers.

Independent Software Vendors (ISVs) in the advertising and marketing industry vertical can aid many companies in achieving these goals. The ISVs can help organizations with the heavy lifting required to build, secure, govern, and maintain near real-time, high volume CDPs. However, building these types of vendor solutions, that support a large number of customers, data volumes, and use cases, is a complex undertaking with often unforeseen challenges.

This post examines the logical architecture of the CDP to provide guidance to help reduce complexity, increase agility, improve operational excellence, and optimize cost. Although the material can benefit advanced marketers evaluating building a CDP, the intent of this post is to provide guidance to those ISVs that are facing these challenges for multiple clients.

Marketing CDP logical architecture

The principal challenge of the CDP architecture is to integrate data from many disparate sources and types. Envision a data lake-centric approach based on a layered architecture where the sources of customer data flow through six logical layers: Ingestion; Processing; Storage; Unified Governance (and Security); Cataloging; and Consumption.

A layered, component-oriented architecture promotes separation of concerns, decoupling of tasks, and the flexibility required to build each component consistent with best practices. These components provide the agility necessary to quickly integrate new data sources and support new analytics or product capabilities. The components are depicted in the image below in the conceptual logical model of the marketing CDP and are then described in this post.

Figure 1: Marketing CDP logical architecture

Figure 1: Marketing CDP logical architecture

CDP components

Logical ingestion layer

The ingestion layer is responsible for collecting data across various customer touchpoints. It provides the ability to connect with internal and external data sources. It can ingest batch, near real-time, and real-time data into the storage layer. This layer aggregates data from multiple source systems and therefore elevates the marketing CDP as the primary repository of marketing and advertising data across an organization. Sources can include cloud or on-premise data sources, streaming data, file stores, third-party Software as a Service (SaaS) connectors and APIs. Separating this component into three layers reduces complexity while providing agility to the process.

Logical storage layer

A scalable, flexible, resilient, and reliable storage layer is critical to the value proposition of a marketing CDP. It consists of three distinct storage areas:

  • Raw Zone – Contains ingested data in its original, immutable format, which can be used to source additional attributes in the future. It can also be used to restore data in certain disaster recovery scenarios. This layer acts as an immutable record of what has happened/been observed historically so that we can use that immutable data to generate a source of fact.
  • Clean Zone – Contains the first transformation of raw data, including conversions to an efficient data format such as Parquet or Avro, as well as basic data quality validations. This layer also acts as an ad-hoc layer to develop answers to unknown question in reasonable time frames so that they can be migrated to the curated zone.
  • Curated Zone – Contains data, organized by subject area, that is ready for consumption by users and applications For a marketing CDP, this includes identity resolution, data enrichment, customer segmentation, and aggregation.

Automated data archival can be configured individually for each layer, and aligned to compliance requirements set by the organization. Access to these layers is controlled at a granular level to ensure a secure and collaborative data exchange and exploration.

Logical cataloging layer

The cataloging layer provides a centralized governance control, including mechanisms for data access control, versioning, and metadata exploration. It provides the ability to track the schema and the partitioning of datasets. This layer makes the datasets discoverable. The Governance capabilities of the Catalog ensure standardization for audit purposes.

Logical processing layer

This layer is responsible for transforming data into a consumable state by applying business rules for data validation, identity resolution, segmentation, normalization, profile aggregation, and machine learning (ML) processing. This layer comprises custom application logic. The compute resources for this layer are designed to scale independently from storage to handle large data volumes; support schema-on-read, support partitioned data and diverse data formats; and orchestrate event-based data processing pipelines.

Logical consumption layer

The consumption layer is responsible for providing scalable tools to gain insights from the vast amount of data in the marketing CDP.

  • Analytics layer – Enables consumption by all user personas through several purpose-built analytics tools that support analysis methods, including ad-hoc SQL queries, batch analytics, business intelligence (BI) dashboards and ML-based insights. Components in this layer should support schema-on-read, data partitioning, and a variety of formats.
  • Data collaboration layer – Consists of data clean rooms where organizations can aggregate customer data from different marketing channels or lines of business, and combine it with first-party data to gain insights while enforcing security, anonymization, and compliance controls.
  • Activation layer – This layer integrates customer profiles with the organization. It can also integrate with third-party SaaS providers in the advertising and marketing industry, and is capable of enriching data sets for consumption.

Logical security and governance layer

The Security and Governance layer is responsible for providing mechanisms for access control, encryption, auditing, and data privacy. CDP platforms must securely organize and control the flow of customer event and attribute data. The CDP must manage data, regardless of ingestion method, to unify that data to unique customer profiles, centralizing audience segmentation, and forwarding data to your purpose-built data stores.

Privacy regulations, which often vary by region or country, make it necessary to focus on collecting only the vital data for your marketing efforts. The CDP must align to a standards-based security process. There must be procedures in place to audit data collection, follow least privilege data access, and avoid data silos.

A marketing CDP must include the following security and governance aspects:

  • Encryption at rest – Data must be persisted in encrypted format to protect it from unauthorized access.
  • Encryption in transit – To protect data in transit,  encryption protocols such as TLS and certificates to create a secure HTTPS connection to make API requests.
  • Key management – Keys must be managed securely because they grant access to data.
  • Secrets management – Secrets, such as application passwords and login credentials, must be protected from unintended access.
  • Fine-grained access controls – Control data access to only those users that have the right to see the data.
  • Data archival – Users need to take advantage of storage tiers and data lifecycle policies, which automatically move data to lower cost tiers over time, based on expected access patterns.
  • Auditing – It is critical to monitor and record all activity within the environment with the goal of being able to analyze activity down to individual API call level.
  • Data masking – It is important to allow users the ability to automatically detect and optionally mask, substitute, or encrypt/decrypt Personally Identifiable Information (PII). This helps outputs of the CDP to comply with such standards as HIPAA and GDPR.
  • Compliance programs – Compliance frameworks such as SOC2, GDPR, CCPA, and others can be attested by tying together governance-focused, audit-friendly service features with applicable compliance or audit standards.

Conclusion: Using the CDP to better manage customers

In this post, we reviewed a logical CDP data architecture that addresses several complexities at scale, using a decoupled, component-driven architecture. The Data Analytics Lens can provide further guidance when designing, deploying, and architecting analytics solution workloads. In addition, ISVs should consider a serverless model for implementation, which helps optimize cost and scalability while reducing the required maintenance on the system.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 3

Post Syndicated from Jon Roberts original https://aws.amazon.com/blogs/big-data/part-3-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

In this third post of a multi-part series, we explore some of the edge cases in migrating a large data warehouse from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and how to handle these challenges. Challenges include how best to use virtual partitioning, edge cases for numeric and character fields, and arrays.

You can check out the first post of this series for guidance on planning, running, and validating the migration. You can also check out the second post for best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Unbounded character data type

Greenplum supports creating columns as text and varchar without specifying the length of the field. This works without an issue in Greenplum but doesn’t work well in migrating to Amazon Redshift. Amazon Redshift stores data in columnar format and gets better compression when using shorter column lengths. Therefore, the Amazon Redshift best practice is to use the smallest character length possible.

AWS SCT will convert these unbounded fields as large objects (LOBs) instead of treating the columns as character fields with a specified length. LOBs are implemented differently in each database product on the market, but in general, a LOB is not stored with the rest of the table data. Instead, there is a pointer to the location of the data. When the LOB is queried, the database reconstitutes the data automatically for you, but this typically requires more resources.

Amazon Redshift doesn’t support LOBs, so AWS SCT resolves this by loading the data into Amazon Simple Storage Service (Amazon S3) and in the column, it stores the S3 URL. When you need to retrieve this data, you have to query the table, get the S3 URL, and then fetch the data from Amazon S3. This isn’t ideal because most of the time, the actual maximum length of the field doesn’t require it to be treated as a LOB, and storing the data remotely means it will take much longer to fetch the data for queries.

The current resolution is to calculate the maximum length of these columns and update the Greenplum tables before converting to Amazon Redshift with AWS SCT.

Note that in a future release of AWS SCT, the collection of statistics will include calculating the maximum length for each column, and the conversion of unbounded varchar and text will set the length in Amazon Redshift automatically.

The following code is an example of an unbounded character data type:

CREATE TABLE public.mytable 
(description1 text NOT NULL PRIMARY KEY, description2 varchar);
CREATE INDEX ON public.mytable (description2);

This table uses a primary key column on an unbounded text column. This needs to be converted to varchar(n), where n is the maximum length found in this column.

  1. Drop unique constraints on affected columns:
    ALTER TABLE public.mytable DROP CONSTRAINT mytable_pkey;

  2. Drop indexes on affected columns:
    DROP INDEX public.mytable_description2_idx;

  3. Calculate maximum length of affected columns:
    select coalesce(max(length(description1)), 10),
    coalesce(max(length(description2)), 10) 
    from public.mytable;

Note that in this example, the description1 and description2 columns only contain NULL values, or the table doesn’t have any data in it, or the calculated length of the columns is 10.

  1. Alter the length of the affected columns:
    ALTER TABLE public.mytable ALTER COLUMN description1 TYPE varchar(10);
    ALTER TABLE public.mytable ALTER COLUMN description2 TYPE varchar(10);

You can now proceed with using AWS SCT to convert the Greenplum schema to Amazon Redshift and avoiding using LOBs to store the column values.

GitHub help

If you have many tables to update and want an automated solution, you can use the add_varchar_lengths.sh script found in the GitHub repo to fix all of the unbounded varchar and text columns in a given schema in Greenplum. The script calculates the appropriate maximum length and then alters the Greenplum tables so the varchar data type is bounded by a length.

Please note that the script also will drop any constraints or indexes on the affected columns.

Empty character data

Greenplum and Amazon Redshift support an empty string value in a field that is different from NULL. The behavior is the same between the two databases. However, AWS SCT defaults to convert empty strings to NULL. This simply needs to be disabled to avoid problems.

  1. In AWS SCT, open your project, choose Settings, Project settings, and Data migration.
  2. Scroll to the bottom and find Use empty as null value.
  3. Deselect this so that AWS SCT doesn’t convert empty strings to NULL.

NaN and Infinity numeric data type

Greenplum supports NaN and Infinity in a numeric field to represent an undefined calculation result and infinity. NaN is very uncommon because when using aggregate functions on a column with a NaN row, the result will also be NaN. Infinity is also uncommon and not useful when aggregating data. However, you may encounter these values in a Greenplum database.

Amazon Redshift doesn’t support NaN and Infinity, and AWS SCT doesn’t check for this in your data. If you do encounter this when using AWS SCT, the task will fail with a numeric conversion error.

To resolve this, it’s suggested to use NULL instead of NaN and Infinity. This allows you to aggregate data and get results other than NaN and, importantly, allow you to convert the Greenplum data to Amazon Redshift.

The following code is an example NaN numeric value:

CREATE TABLE public.mytable2 (id int NOT NULL, amount numeric NOT NULL);
INSERT INTO public.mytable2 VALUES (1, 10), (2, 'NaN'), (3, 20);
  1. Drop the NOT NULL constraint:
    ALTER TABLE public.mytable2 ALTER COLUMN amount DROP NOT NULL;

  2. Update the table:
    UPDATE public.mytable2 SET amount = NULL where amount = 'NaN';

You can now proceed with using AWS SCT to migrate the Greenplum data to Amazon Redshift.

Note that in a future release of AWS SCT, there will be an option to convert NaN and Infinity to NULL so that you won’t have to update your Greenplum data to migrate to Amazon Redshift.

Virtual partitioning on GP_SEGMENT_ID

For large tables, it’s recommended to use virtual partitioning to extract data from Greenplum. Without virtual partitioning, AWS SCT will run a single query to unload data from Greenplum. For example:

SELECT * FROM store_sales;

If this table is very large, it will take a long time to extract the data because this is a single process querying the data. With virtual partitioning, multiple queries are run in parallel so that the extraction of data is completed faster. It also makes it easier to recover if there is an issue with the task.

Virtual partitioning is very flexible, but a simple way to do this in Amazon Redshift is to utilize the Greenplum hidden column gp_segment_id. This column identifies which segment in Greenplum has the data, and each segment should have an equal number of rows. Therefore, creating partitions for each gp_segment_id is an easy way to implement virtual partitioning.

If you’re not familiar with the term segment, it’s similar to an Amazon Redshift slice.

For example:

SELECT * FROM store_sales WHERE gp_segment_id = 0;
SELECT * FROM store_sales WHERE gp_segment_id = 1;
SELECT * FROM store_sales WHERE gp_segment_id = 2;
...
  1. First, determine the number of segments in Greenplum:
    SELECT count(*) 
    FROM gp_segment_configuration 
    WHERE content >= 0 AND preferred_role = 'p';

Now you can configure AWS SCT.

  1. In AWS SCT, go to Data Migration view (other) and choose (right-click) a large table.
  2. Scroll down to Add virtual partitioning.
  3. For the partition type, choose Auto Split and change the column name to GP_SEGMENT_ID.
  4. Use 0 for Start value, the number of segments found in Step 1 as End value, and Interval of 1.

When you create a local task to load this table, the task will have a sub-task for each gp_segment_id value.

Note that in a future release of AWS SCT, there will be an option to automatically virtually partition tables based on GP_SEGMENT_ID. This option will also retrieve the number of segments automatically.

Arrays

Greenplum supports arrays such as bigint[] that are unbounded. Typically, arrays are kept relatively small in Greenplum because arrays consume more memory in Greenplum than using an alternative strategy. However, it’s possible to have a very large array in Greenplum that isn’t supported by Amazon Redshift.

AWS SCT converts a Greenplum array to varchar(65535), but if the converted array is longer than 65,535 characters, then the load will fail.

The following code is an example of a large array:

CREATE TABLE public.sales 
(sales_id int NOT NULL,
customer_id int NOT NULL,
sales_item_ids bigint[]) 
DISTRIBUTED BY (sales_id);

INSERT INTO public.sales values (1, 100, '{1,2,3}'), (2, 100, '{1,2,3}');

In this example, the sales items are stored in an array for each sales_id. If you encounter an error while loading that the length is too long to load this data into Amazon Redshift with AWS SCT, then this is the solution. It’s also a more efficient pattern to store data in both Greenplum and Amazon Redshift!

  1. Create a new sales table that has all columns from the existing sales table, but exclude the array column:
    CREATE TABLE public.sales_new 
    (sales_id int NOT NULL,
    customer_id int NOT NULL) 
    DISTRIBUTED BY (sales_id);

  2. Populate the new sales table with the existing data except for the array column:
    INSERT INTO public.sales_new (sales_id, customer_id) 
    SELECT sales_id, customer_id FROM public.sales;

We create a new table that is a cross-reference of sales IDs with the sales items. Instead of having a single row for this association, now there will be a row for each relationship.

  1. Create a new sales item table:
    CREATE TABLE public.sales_items 
    (sales_id int NOT NULL, 
    sales_item_id bigint NOT NULL) 
    DISTRIBUTED BY (sales_id);

  2. To unnest the array, create a row for each array element:
    INSERT INTO public.sales_items 
    (sales_id, sales_item_id) 
    SELECT sales_id, unnest(sales_item_ids) 
    FROM public.sales;

  3. Rename the sales tables:
    ALTER TABLE public.sales RENAME TO sales_old;
    ALTER TABLE public.sales_new RENAME TO sales;

In AWS SCT, refresh the tables and migrate the revised sales and the new sales_items table.

The following are some example queries before and after.

Before:

SELECT s.sales_id, unnest(s.sales_item_ids) 
FROM public.sales s 
WHERE s.sales_id = 1;

After:

SELECT s.sales_id, i.sales_item_id 
FROM public.sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.sales_id = 1;

Before:

SELECT s.sales_id 
FROM public.sales s 
WHERE s.customer_id = 100
AND 10 = ANY(s.sales_item_ids);

After:

SELECT s.sales_id
FROM public.sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.customer_id = 100
AND i.sales_item_id = 10;

VACUUM ANALYZE

Greenplum, like Amazon Redshift, supports the VACUUM command, which reclaims storage space after UPDATE and DELETE commands are run on a table. Greenplum also allows you to add the ANALYZE option to run both statements with a single command.

The following code is the Greenplum command:

VACUUM ANALYZE table;

This is not very common, but you’ll see this from time to time. If you’re just inserting data into a table, there is no need to run VACUUM, but for ease of use, sometimes developers will use VACUUM ANALYZE.

The following are the Amazon Redshift commands:

VACUUM table;
ANALYZE table;

Amazon Redshift doesn’t support adding ANALYZE to the VACUUM command, so instead, this needs to be two different statements. Also note that Amazon Redshift performs VACUUM and ANALYZE automatically for you so in most cases, you can remove these commands from your scripts entirely.

DISTINCT ON query

Greenplum supports an unusual shortcut for eliminating duplicates in a table. This feature keeps the first row for each set of rows based on the order of the data being fetched. It’s easiest to understand by looking at an example:

CREATE TABLE customer 
(customer_name varchar(100) not null, 
customer_address varchar(1000) not null, 
lastupdate timestamp not null);

INSERT INTO customer VALUES
('ACME', '123 Main St', '2022-01-01'), 
('ACME', '456 Market St', '2022-05-01'), 
('ACME', '789 Broadway', '2022-08-01');

SELECT DISTINCT ON (customer_name) customer_name, customer_address 
FROM customer 
ORDER BY customer_name, lastupdate DESC;

We get the following results:

customer_name | customer_address 
---------------+------------------
 ACME          | 789 Broadway

The solution for running this in Amazon Redshift is to use the ANSI standard row_number() analytical function, as shown in the following code:

SELECT sub.customer_name, sub.customer_address 
FROM (SELECT customer_name, customer_address, row_number() over (partition by customer_name ORDER BY lastupdate DESC) AS row_number FROM customer) AS sub 
WHERE sub.row_number = 1;

Clean up

The examples in this post create tables in Greenplum. To remove these example tables, run the following commands:

DROP TABLE IF EXISTS public.mytable;
DROP TABLE IF EXISTS public.mytable2;
DROP TABLE IF EXISTS public.sales;
DROP TABLE IF EXISTS public.sales_new;
DROP TABLE IF EXISTS public.sales_items;
DROP TABLE IF EXISTS public.customer;

Conclusion

In this post, we covered some of the edge cases when migrating Greenplum to Amazon Redshift and how to handle these challenges, including easy virtual partitioning, edge cases for numeric and character fields, and arrays. This is not an exhaustive list of migrating Greenplum to Amazon Redshift, but this series should help you navigate modernizing your data platform by moving to Amazon Redshift.

For additional details, see the Amazon Redshift Getting Started Guide and the AWS SCT User Guide.


About the Authors

Jon Roberts is a Sr. Analytics Specialist based out of Nashville, specializing in Amazon Redshift. He has over 27 years of experience working in relational databases. In his spare time, he runs.

Nelly Susanto is a Senior Database Migration Specialist of AWS Database Migration Accelerator. She has over 10 years of technical experience focusing on migrating and replicating databases along with data warehouse workloads. She is passionate about helping customers in their cloud journey.

Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible by leveraging Data & AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/build-an-aws-lake-formation-permissions-inventory-dashboard-using-aws-glue-and-amazon-quicksight/

AWS Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning (ML). Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines like Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum enforce the access controls defined in Lake Formation. It also allows principals to securely share data catalog resources across multiple AWS accounts and AWS organizations through a centralized approach.

As organizations are adopting Lake Formation for scaling their permissions, there is steady increase in the access policies established and managed within the enterprise. However, it becomes more difficult to analyze and understand the permissions for auditing. Therefore, customers are looking for a simple way to collect, analyze, and visualize permissions data so that they can inspect and validate the policies. It also enables organizations to take actions that help them with compliance requirements.

This solution offers the ability to consolidate and create a central inventory of Lake Formation permissions that are registered in the given AWS account and Region. It provides a high-level view of various permissions that Lake Formation manages and aims at answering questions like:

  • Who has select access on given table
  • Which tables have delete permission granted
  • Which databases or tables does the given principal have select access to

In this post, we walk through how to set up and collect the permissions granted on resources in a given account using the Lake Formation API. AWS Glue makes it straightforward to set up and run jobs for collecting the permission data and creating an external table on the collected data. We use Amazon QuickSight to create a permissions dashboard using an Athena data source and dataset.

Overview of solution

The following diagram illustrates the architecture of this solution.

In this solution, we walk through the following tasks:

  1. Create an AWS Glue job to collect and store permissions data, and create external tables using Boto3.
  2. Verify the external tables created using Athena.
  3. Sign up for a QuickSight Enterprise account and enable Athena access.
  4. Create a dataset using an Athena data source.
  5. Use the datasets for analysis.
  6. Publish the analyses as a QuickSight dashboard.

The collected JSON data is flattened and written into an Amazon Simple Storage Service (Amazon S3) bucket as Parquet files partitioned by account ID, date, and resource type. After the data is stored in Amazon S3, external tables are created on them and filters are added for different types of resource permissions. These datasets can be imported into SPICE, an in-memory query engine that is part of QuickSight, or queried directly from QuickSight to create analyses. Later, you can publish these analyses as a dashboard and share it with other users.

Dashboards are created for the following use cases:

  • Database permissions
  • Table permissions
  • Principal permissions

Prerequisites

You should have the following prerequisites:

  • An S3 bucket to store the permissions inventory data
  • An AWS Glue database for permissions inventory metadata
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with access to the inventory AWS Glue database and S3 bucket and added as a data lake admin
  • A QuickSight account with access to Athena
  • An IAM role for QuickSight with access to the inventory AWS Glue database and S3 bucket

Set up and run the AWS Glue job

We create an AWS Glue job to collect Lake Formation permissions data for the given account and Region that is provided as job parameters, and the collected data is flattened before storage. Data is partitioned by account ID, date, and permissions type, and is stored as Parquet in an S3 bucket using Boto3. We create external tables on the data and add filters for different types of resource permissions.

To create the AWS Glue job, complete the following steps:

  1. Download the Python script file to local.
  2. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs.
  3. Under Create job, select Python Shell script editor.
  4. For Options, select Upload and edit an existing script.
  5. For File upload, choose Choose file.
  6. Choose the downloaded file (lf-permissions-inventory.py).
  7. Choose Create.

GlueJob

  1. After the job is created, enter a name for the job (for this post, lf-inventory-builder) and choose Save.

Glue Job save

  1. Choose the Job details tab.
  2. For Name, enter a name for the job.
  3. For IAM Role¸ choose an IAM role that has access to the inventory S3 bucket and inventory schema and registered as data lake admin.
  4. For Type, choose Python Shell.
  5. For Python version, choose Python 3.9.

Glue Job Details

  1. You can leave other property values at their default.
  2. Under Advanced properties¸ configure the following job parameters and values:
    1. catalog-id: with the value as the current AWS account ID whose permissions data are collected.
    2. databasename: with the value as the AWS Glue database where the inventory-related schema objects are created.
    3. region: with the value as the current Region where the job is configured and whose permissions data is collected.
    4. s3bucket: with the value as the S3 bucket where the collected permissions data is written.
    5. createtable: with the value yes, which enables external table creation on the data.

Job Parameters

  1. Choose Save to save the job settings.

Glue Job Save

  1. Choose Run to start the job.

When the job is complete, the run status changes to Succeeded. You can view the log messages in Amazon CloudWatch Logs.

Job Run

Permissions data is collected and stored in the S3 bucket (under lfpermissions-data) that you provided in the job parameters.

S3 Structure

The following external tables are created on the permissions data and can be queried using Athena:

  • lfpermissions – A summary of resource permissions
  • lfpermissionswithgrant – A summary of grantable resource permissions

For both tables, the schema structure is the same and the lftype column indicates what type of permissions the row applies to.

Athena Table Schema

Verify the tables using Athena

You can use Athena to verify the data using the following queries.

For more information, refer to Running SQL queries using Amazon Athena

  • List the database permissions:
Select * from lfpermissions where lftype=’DATABASE’
  • List the table permissions:
Select * from lfpermissions where lftype= ‘TABLE’
  • List the data lake permissions:
Select * from lfpermissions where lftype= ‘DATA_LOCATION’
  • List the grantable database permissions:
Select * from lfpermissionswithgrant where lftype=’DATABASE’
  • List the grantable table permissions:
Select * from lfpermissionswithgrant where lftype= ‘TABLE’
  • List grantable data lake permissions:
Select * from lfpermissionswithgrant where lftype= ‘DATA_LOCATION’

As the next step, we create a QuickSight dashboard with three sheets, each focused on different sets of permissions (database, table, principal) to slice and dice the data.

Sign up for a QuickSight account

If you haven’t signed up for QuickSight, complete the following steps:

  1. Sign in to the AWS Management Console as Admin, search for QuickSight and choose Sign up for QuickSight.

QuickSight signup

  1. For Edition, select Enterprise.
  2. Choose Continue.
  3. For Authentication method, select Use IAM federated identities & QuickSight-managed users.
  4. Under QuickSight Region, choose the same Region as your inventory S3 bucket.
  5. Under Account info, enter a QuickSight account name and email address for notification.

QuickSight Form

  1. In the Quick access to AWS services section, for IAM Role, select Use QuickSight-managed role (default).
  2. Allow access to IAM, Athena, and Amazon S3.
  3. Specify the S3 bucket that contains the permissions data.
  4. Choose Finish to complete the signup process.

QuickSight configuration

Note: If the inventory bucket and database is managed by Lake Formation, grant database and table access to the created QuickSight IAM role. For instructions, refer to Granting and revoking permissions on Data Catalog resources.

Configure your dataset in QuickSight

QuickSight is configured with an Athena data source the same Region as the S3 bucket. To set up your dataset, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

Quicksight DataSet

  1. Choose Athena as your data source.

QuickSight Datasource

  1. Enter LF_DASHBOARD_DS as the name of your data source.
  2. Choose Create data source.
  3. For Catalog, leave it as AwsDataCatalog.
  4. For Database, choose database name provided as parameter to the Job.
  5. For Tables, select lfpermissions.
  6. Choose Select.

QuickSight Catalog Info

  1. Select Directly query your data and choose Visualize to take you to the analysis.

Quicksight data mode

Create analyses

We create three sheets for our dashboard to view different levels of permissions.

Sheet 1: Database permission view

To view database permissions, complete the following steps:

  1. On the QuickSight console, choose the plus sign to create a new sheet.
  2. Choose Add, then choose Add title.

QuickSight Title

  1. Name the sheet Database Permissions.
  2. Repeat steps (5-7) to add the following parameters:
    • catalogid
    • databasename
    • permission
    • tablename
  3. On the Add menu, choose Add parameter.
  4. Enter a name for the parameter.
  5. Leave the other values as default and choose Create.
  6. Choose Insights in the navigation pane, then choose Add control.

QuickSight Control

  1. Add a control for each parameter:
    1. For each parameter, for Style¸ choose List, and for Values, select Link to a dataset field.
      QuickSight Dependency
    2. Provide additional information for each parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id
databasename DatabaseName lfpermissions databasename
permission Permission lfpermissions permission
  1. Add a control dependency and for Database, choose the options menu and choose Edit.

QuickSight Dependency

  1. Under Format control, choose Control options.
  2. Change the relevant values, choose AccountID, and choose Update.
  3. Similarly, under Permission control, choose Control options.
  4. Change the relevant values, choose AccountID, and choose Update.

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and databasename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.

QuickSight Database View1

  1. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  2. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  3. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  4. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  5. Choose Actions in the navigation pane.
  6. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  4. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on databasename the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Database and Permission drop-down menus are populated based on the relevant attributes and changes dynamically.

QuickSight Database View2

Sheet 2: Table permission view

Now that we have created the database permissions sheet, we can add a table permissions sheet.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Table Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for each parameter:
    1. For each parameter, for Style¸ choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for each parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id
databasename DatabaseName lfpermissions databasename
permission Permission lfpermissions permission
tablename TableName lfpermissions tablename

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id, databasename, and tablename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  6. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  8. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  9. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  10. Choose Actions in the navigation pane.
  11. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check principal.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  4. Add a filter on catalog_id the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  7. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Databasename, Tablename, and Permission drop-down menus are populated based on the relevant attributes.

QuickSight Table Permissions

Sheet 3: Principal permission view

Now we add a third sheet for principal permissions.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Principal Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for the catalogid parameter:
    1. For Style¸ choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for the parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id

We create four visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and principal into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop databasename into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  6. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Choose Actions in the navigation pane.
  8. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check Databasename.
  9. For the second visual, choose Visualize and choose table as the visual type.
  10. Drag and drop databasename into Group by.
  11. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  12. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  13. For the third visual, choose Visualize and choose pivot table as the visual type.
  14. Drag and drop catalog_id and principal into Rows.
  15. Drag and drop permission into Column.
  16. Drag and drop tablename into Values and change the aggregation to Count distinct.
  17. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  18. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  19. Choose Actions in the navigation pane.
  20. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and Check Tablename.
  21. For the final visual, choose Visualize and choose table as the visual type.
  22. Drag and drop tablename into Group by.
  23. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  24. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.

The following screenshot shows our sheet.

QuickSight Prinicipal View

Create a dashboard

Now that the analysis is ready, you can publish it as a dashboard and share it with other users. For instructions, refer to the tutorial Create an Amazon QuickSight dashboard.

Clean up

To clean up the resources created in this post, complete the following steps:

  1. Delete the AWS Glue job lf-inventory-builder.
  2. Delete the data stored under the bucket provided as the value of the s3bucket job parameter.
  3. Drop the external tables created under the schema provided as the value of the databasename job parameter.
  4. If you signed up for QuickSight to follow along with this post, you can delete the account.
  5. For an existing QuickSight account, delete the following resources:
    1. lfpermissions dataset
    2. lfpermissions analysis
    3. lfpermissions dashboard

Conclusion

In this post, we provided a design and implementation steps for a solution to collect Lake Formation permissions in a given Region of an account and consolidate them for analysis. We also walked through the steps to create a dashboard using Amazon QuickSight. You can utilize other QuickSight visuals to create more sophisticated dashboards based on your requirements.

You can also expand this solution to consolidate permissions for a multi-account setup. You can use a shared bucket across organizations and accounts and configure an AWS Glue job in each account or organization to write their permission data. With this solution, you can maintain a unified dashboard view of all the Lake Formation permissions within your organization, thereby providing a central audit mechanism to comply with business requirements.

Thanks for reading this post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.


About the Author

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

New! Security Analytics provides a comprehensive view across all your traffic

Post Syndicated from Zhiyuan Zheng original https://blog.cloudflare.com/security-analytics/

New! Security Analytics provides a comprehensive view across all your traffic

New! Security Analytics provides a comprehensive view across all your traffic

An application proxying traffic through Cloudflare benefits from a wide range of easy to use security features including WAF, Bot Management and DDoS mitigation. To understand if traffic has been blocked by Cloudflare we have built a powerful Security Events dashboard that allows you to examine any mitigation events. Application owners often wonder though what happened to the rest of their traffic. Did they block all traffic that was detected as malicious?

Today, along with our announcement of the WAF Attack Score, we are also launching our new Security Analytics.

Security Analytics gives you a security lens across all of your HTTP traffic, not only mitigated requests, allowing you to focus on what matters most: traffic deemed malicious but potentially not mitigated.

Detect then mitigate

Imagine you just onboarded your application to Cloudflare and without any additional effort, each HTTP request is analyzed by the Cloudflare network. Analytics are therefore enriched with attack analysis, bot analysis and any other security signal provided by Cloudflare.

Right away, without any risk of causing false positives, you can view the entirety of your traffic to explore what is happening, when and where.

This allows you to dive straight into analyzing the results of these signals, shortening the time taken to deploy active blocking mitigations and boosting your confidence in making decisions.

New! Security Analytics provides a comprehensive view across all your traffic

We are calling this approach “detect then mitigate” and we have already received very positive feedback from early access customers.

In fact, Cloudflare’s Bot Management has been using this model for the past two years. We constantly hear feedback from our customers that with greater visibility, they have a high confidence in our bot scoring solution. To further support this new way of securing your web applications and bringing together all our intelligent signals, we have designed and developed the new Security Analytics which starts bringing signals from the WAF and other security products to follow this model.

New! Security Analytics provides a comprehensive view across all your traffic

New Security Analytics

Built on top of the success of our analytics experiences, the new Security Analytics employs existing components such as top statistics, in-context quick filters, with a new page layout allowing for rapid exploration and validation. Following sections will break down this new page layout forming a high level workflow.

The key difference between Security Analytics and Security Events, is that the former is based on HTTP requests which covers visibility of your entire site’s traffic, while Security Events uses a different dataset that visualizes whenever there is a match with any active security rule.

Define a focus

The new Security Analytics visualizes the dataset of sampled HTTP requests based on your entire application, same as bots analytics. When validating the “detect then mitigate” model with selected customers, a common behavior observed is to use the top N statistics to quickly narrow down to either obvious anomalies or certain parts of the application. Based on this insight, the page starts with selected top N statistics covering both request sources and request destinations, allowing expanding to view all the statistics available. Questions like “How well is my application admin’s area protected?” lands at one or two quick filter clicks in this area.

New! Security Analytics provides a comprehensive view across all your traffic

After a preliminary focus is defined, the core of the interface is dedicated to plotting trends over time. The time series chart has proven to be a powerful tool to help spot traffic anomalies, also allowing plotting based on different criteria. Whenever there is a spike, it is likely an attack or attack attempt has happened.

As mentioned above, different from Security Events, the dataset used in this page is HTTP requests which includes both mitigated and not mitigated requests. By mitigated requests here, we mean “any HTTP request that had a ‘terminating’ action applied by the Cloudflare platform”. The rest of the requests that have not been mitigated are either served by Cloudflare’s cache or reaching the origin. In the case such as a spike in not mitigated requests but flat in mitigated requests, an assumption could be that there was an attack that did not match any active WAF rule. In this example, you can one click to filter on not mitigated requests right in the chart which will update all the data visualized on this page supporting further investigations.

In addition to the default plotting of not mitigated and mitigated requests, you can also choose to plot trends of either attack analysis or bot analysis allowing you to spot anomalies for attack or bot behaviors.

New! Security Analytics provides a comprehensive view across all your traffic

Zoom in with analysis signals

One of the most loved and trusted analysis signals by our customers is the bot score. With the latest addition of WAF Attack Score and content scanning, we are bringing them together into one analytics page, helping you further zoom into your traffic based on some of these signals. The combination of these signals enables you to find answers to scenarios not possible until now:

  • Attack requests made by (definite) automated sources
  • Likely attack requests made by humans
  • Content uploaded with/without malicious content made by bots

Once a scenario is filtered on, the data visualization of the entire page including the top N statistics, HTTP requests trend and sampled log will be updated, allowing you to spot any anomalies among either one of the top N statistics or the time based HTTP requests trend.

New! Security Analytics provides a comprehensive view across all your traffic

Review sampled logs

After zooming into a specific part of your traffic that may be an anomaly, sampled logs provide a detailed view to verify your finding per HTTP request. This is a crucial step in a security study workflow backed by the high engagement rate when examining the usage data of such logs viewed in Security Events. While we are adding more data into each log entry, the expanded log view becomes less readable over time. We have therefore redesigned the expanded view, starting with how Cloudflare responded to a request, followed by our analysis signals, lastly the key components of the raw request itself. By reviewing these details, you validate your hypothesis of an anomaly, and if any mitigation action is required.

New! Security Analytics provides a comprehensive view across all your traffic

Handy insights to get started

When testing the prototype of this analytics dashboard internally, we learnt that the power of flexibility yields the learning curve upwards. To help you get started mastering the flexibility, a handy insights panel is designed. These insights are crafted to highlight specific perspectives into your total traffic. By a simple click on any one of the insights, a preset of filters is applied zooming directly onto the portion of your traffic that you are interested in. From here, you can review the sampled logs or further fine tune any of the applied filters. This approach has been proven with further internal studies of a highly efficient workflow that in many cases will be your starting point of using this dashboard.

New! Security Analytics provides a comprehensive view across all your traffic

How can I get it?

The new Security Analytics is being gradually rolled out to all Enterprise customers who have purchased the new Application Security Core or Advanced Bundles. We plan to roll this out to all other customers in the near future. This new view will be alongside the existing Security Events dashboard.

New! Security Analytics provides a comprehensive view across all your traffic

What’s next

We are still at an early stage moving towards the “detect then mitigate” model, empowering you with greater visibility and intelligence to better protect your web applications. While we are working on enabling more detection capabilities, please share your thoughts and feedback with us to help us improve the experience. If you want to get access sooner, reach out to your account team to get started!

How Blueshift integrated their customer data environment with Amazon Redshift to unify and activate customer data for marketing

Post Syndicated from Vijay Chitoor original https://aws.amazon.com/blogs/big-data/how-blueshift-integrated-their-customer-data-environment-with-amazon-redshift-to-unify-and-activate-customer-data-for-marketing/

This post was co-written with Vijay Chitoor, Co-Founder & CEO, and Mehul Shah, Co-Founder and CTO from the Blueshift team, as the lead authors.

Blueshift is a San Francisco-based startup that helps marketers deliver exceptional customer experiences on every channel, delivering relevant personalized marketing. Blueshift’s SmartHub Customer Data Platform (CDP) empowers marketing teams to activate their first-party customer data to drive 1:1 personalization on owned (email, mobile) and paid (Google, Facebook, and so on) website and customer (CX) channels.

In this post, Blueshift’s founding team discuss how they used Amazon Redshift Data API to integrate data from their customer’s Amazon Redshift data warehouse with their CDP environment to help marketers activate their enterprise data and drive growth for their businesses.

Business need

In today’s omnichannel world, marketing teams at modern enterprises are being tasked with engaging customers on multiple channels. To successfully deliver intelligent customer engagement, marketers need to operate with a 360 degree view of their customers that takes into account various types of data, including customer behavior, demographics, consent and preferences, transactions, data from human assisted and digital interactions, and more. However, unifying this data and making it actionable for marketers is often a herculean task. Now, for the first time, with the integration of Blueshift with Amazon Redshift, companies can use more data than ever for intelligent cross-channel engagement.

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics infrastructure. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API.

Blueshift aims at empowering business users to unlock data in such data warehouses and activate audiences with personalized journeys for segmentation, 1:1 messaging, website, mobile, and paid media use cases. Moreover, Blueshift can help combine this data in Amazon Redshift data warehouses with real-time website and mobile data for real-time profiles and activation, enabling this data to be used by marketers in these businesses.

Although the data in Amazon Redshift is incredibly powerful, marketers are unable to use it in it’s original form for customer engagement for a variety of reasons. Firstly, querying the data requires knowledge of query languages like SQL, which marketers aren’t necessarily adept at. Furthermore, marketers need to combine the data in the warehouse with additional sources of data that are critical for customer engagement, including real-time events (for example, a website page viewed by a customer), as well as channel-level permissions and identity.

With the new integration, Blueshift customers can ingest multidimensional data tables from Amazon Redshift (for example, a customer table, transactions table, and product catalog table) into Blueshift to build a single customer view that is accessible by marketers. The bi-directional integration also ensures that predictive data attributes computed in Blueshift, as well as campaign engagement data from Blueshift, are written back into Amazon Redshift tables, enabling technology and analytics teams to have a comprehensive view of the data.

In this post, we describe how Blueshift integrates with Amazon Redshift. We highlight the bi-directional integration with data flowing from a customer’s Amazon Redshift data warehouse to Blueshift’s CDP environment and vice versa. These mechanisms are facilitated through the use of the Redshift Data API.

Solution overview

The integration between the two environments is achieved through a connector. We discuss the connector’s core components in this section. Blueshift uses a hybrid approach using Redshift S3 UNLOAD, Redshift S3 COPY, and the Redshift Data API to simplify the integration between Blueshift and Amazon Redshift, thereby facilitating the data needs to empower marketing teams. The following flow diagram shows the overview of the solution.

Blueshift uses container technology to ingest and process data. The data ingestion and egress containers are scaled up and down depending on the amount of data being processed. One of the key design tenets was to simplify the design by not having to manage connections or active connection pools. The Redshift Data API supports a HTTP-based SQL interface without the need for actively managing connections. As depicted in the process flow, the Redshift Data API lets you access data from Amazon Redshift with various types of traditional, cloud-native, containerized, serverless web service-based applications and event-driven applications. The Blueshift application includes a mix of programming languages, including Ruby (for the customer-facing dashboard), Go (for container workloads), and Python (for data science workloads). The Redshift Data API supports bindings in Python, Go, Java, Node.js, PHP, Ruby, and C++, which makes it simple for developer teams to integrate quickly.

With the Redshift Data API integration in place in Blueshift’s application, IT users from Blueshift customers can set up and validate the data connection, and subsequently Blueshift’s business users (marketers) can seamlessly extract value from data by developing insights and putting those insights into action for the customer data housed in AWS Redshift seamlessly. Therefore, the process developed by Blueshift using the Redshift Data API significantly lowers the barrier for entry for new users without needing data warehousing experience or ongoing IT dependencies for the business user.

The solution architecture depicted in the following figure shows how the various components of the CDP environment and Amazon Redshift integrate to provide the the end-to-end solution.

shows how the various components of the CDP environment and Amazon Redshift integrate to provide the the end-to-end solution

Prerequisites

In this section, we describe the requirements of the integration solution between the two infrastructures. A typical data implementation with customers involves data from Amazon Redshift ingesting into the Blueshift CDP environment. This ingestion mechanism must accommodate different data types, such as the following:

  • Customer CRM data (user identifiers and various CRM fields). A typical range for data volume to be supported for this data type is 50–500 GB ingested once initially.
  • Real-time behavior or events data (for example, playing or pausing a movie).
  • Transactions data, such as subscription purchases. Typical data volumes ingested daily for events and transactions are in the 500 GB – 1 TB range.
  • Catalog content (for example, a list of shows or movies for discovery), which is typically about 1 GB in size ingested daily.

The integration also needs to support Blueshift’s CDP platform environment to export data to Amazon Redshift. This includes data such as campaign activities like emails being viewed, which can run into tens of TB, and segment or user exports to support a list of users that are part of a segment definition, typically 50–500 GB exported daily.

Integrate Amazon Redshift with data applications

Amazon Redshift provides several ways to quickly integrate data applications.

For the initial data loads, Blueshift uses Redshift S3 UNLOAD to dump Amazon Redshift data into Amazon Simple Storage Service (Amazon S3). Blueshift natively uses Amazon S3 as a persistent object store and supports bulk ingestion and export from Amazon S3. Data loads from Amazon S3 are ingested in parallel and cut down on data load times, enabling Blueshift clients to quickly onboard.

For incremental data ingestion, Blueshift data import jobs track the last time an import was run, and import new rows of data that have been added or updated since the previous import ran. Blueshift stays in sync with changes (updates or inserts) to the Amazon Redshift data warehouse using the Redshift Data API. Blueshift uses the last_updated_at column in Amazon Redshift tables to determine new or updated rows and subsequently ingest those using the Redshift Data API. Blueshift’s data integration cron job syncs data in near-real time using the Redshift Data API by polling for updates on a regular cadence (for example, every 10 minutes, hourly, or daily). The cadence can be tuned depending on data freshness requirements.

The following table summarizes the integration types.

Integration type Integration mechanism Advantage
Initial data ingestion from Amazon Redshift to Blueshift Redshift S3 UNLOAD command Initial data is exported from Amazon Redshift via Amazon S3 to allow faster parallel loads into Blueshift using the Amazon Redshift UNLOAD command.
Incremental data ingestion from Amazon Redshift to Blueshift Redshift Data API Incremental data changes are synchronized using the Redshift Data API in near-real time.
Data export from Blueshift to Amazon Redshift Redshift S3 COPY command Blueshift natively stores campaign activity and segment data in Amazon S3, which is loaded into Amazon Redshift using the Redshift S3 COPY command.

Redshift supports numerous out-of-the-box mechanisms to provide data access. Blueshift was able to cut down the data onboarding time for clients by using a hybrid approach of integrating with Amazon Redshift with Redshift S3 UNLOAD, the Redshift Data API, and Redshift S3 COPY. Blueshift is able to cut down the initial data load time, as well as be updated in near-real time with changes in Amazon Redshift and vice versa.

Conclusion

In this post, we showed how Blueshift integrated with the Redshift Data API to ingest customer data. This integration was seamless and demonstrated how straightforward the Redshift Data API makes integration with external applications, such as Blueshift’s CDP environment for marketing, with Amazon Redshift. The outlined use cases in this post are just a few examples of how to use the Redshift Data API to simplify interactions between users and Amazon Redshift clusters.

Now go build and integrate Amazon Redshift with Blueshift.


About the authors

Vijay Chittoor is the CEO & co-founder of Blueshift. Vijay has a wealth of experience in AI, marketing technology and e-commerce domains. Vijay was previously the co-founder & CEO of Mertado (acquired by Groupon to become Groupon Goods), and an early team member at Kosmix (acquired by Walmart to become @WalmartLabs). A former consultant with McKinsey & Co., Vijay is a graduate of Harvard Business School’s MBA Program. He also holds Bachelor’s and Master’s degrees in Electrical Engineering from the Indian Institute of Technology, Bombay.

Mehul Shah is co-Founder & CTO at Blueshift. Previously, he was a co-founder & CTO at Mertado, which was acquired by Groupon to become Groupon Goods. Mehul was an early employee at Kosmix that was acquired by Walmart to become @WalmartLabs. Mehul is a Y Combinator alumni and a graduate of University of Southern California. Mehul is a co-inventor of 12+ patents, and coaches a middle school robotics team.

Manohar Vellala is a Senior Solutions Architect at AWS working with digital native customers on their cloud native journey. He is based in San Francisco Bay Area and is passionate about helping customers build modern applications that can take the full advantage of the cloud. Prior to AWS he worked at H2O.ai where he helped customers build ML models. His interests are Storage, Data Analytics and AI/ML.

Prashant Tyagi joined AWS in September 2020, where he now manages the solutions architecture team focused on enabling digital native businesses. Prashant worked previously at ThermoFisher Scientific, and GE Digital where he held roles as Sr. Director for their Digital Transformation initiatives. Prashant has enabled digital transformation for customers in the Life Sciences and other industry verticals. He has experience in IoT, Data Lakes and AI/ML technical domains. He lives in the bay area in California.

How dynamic data masking support in Amazon Redshift helps achieve data privacy and compliance

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/how-dynamic-data-masking-support-in-amazon-redshift-helps-achieve-data-privacy-and-compliance/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift is the most widely used cloud data warehouse.

Dynamic data masking (DDM) support (preview) in Amazon Redshift enables you to simplify the process of protecting sensitive data in your Amazon Redshift data warehouse. You can now use DDM to protect data based on your job role or permission rights and level of data sensitivity through a SQL interface. DDM support (preview) in Amazon Redshift enables you to hide, obfuscate, or pseudonymize column values within the tables in your data warehouse without incurring additional storage costs. It is configurable to allow you to define consistent, format-preserving, and irreversible masked data values.

DDM support (preview) in Amazon Redshift provides a native feature to support your need to mask data for regulatory or compliance requirements, or to increase internal privacy standards. Compared to static data masking where underlying data at rest gets permanently replaced or redacted, DDM support (preview) in Amazon Redshift enables you to temporarily manipulate the display of sensitive data in transit at query time based on user privilege, leaving the original data at rest intact. You control access to data through masking policies that apply custom obfuscation rules to a given user or role. That way, you can respond to changing privacy requirements without altering the underlying data or editing SQL queries.

With DDM support (preview) in Amazon Redshift, you can do the following:

  • Define masking policies that apply custom obfuscation policies (for example, masking policies to handle credit card, PII entries, HIPAA or GDPR needs, and more)
  • Transform the data at query time to apply masking policies
  • Attach masking policies to roles or users
  • Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
  • Implement cell-level masking by using conditional columns when creating your masking policy
  • Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)

Here’s what our customers have to say on DDM support(private beta) in Amazon Redshift:

“Baffle delivers data-centric protection for enterprises via a data security platform that is transparent to applications and unique to data security. Our mission is to seamlessly weave data security into every data pipeline. Previously, to apply data masking to an Amazon Redshift data source, we had to stage the data in an Amazon S3 bucket. Now, by utilizing the Amazon Redshift Dynamic Data Masking capability, our customers can protect sensitive data throughout the analytics pipeline, from secure ingestion to responsible consumption reducing the risk of breaches.”

-Ameesh Divatia, CEO & co-founder of Baffle

“EnergyAustralia is a leading Australian energy retailer and generator, with a mission to lead the clean energy transition for customers in a way that is reliable, affordable and sustainable for all. We enable all corners of our business with Data & Analytics capabilities that are used to optimize business processes and enhance our customers’ experience. Keeping our customers’ data safe is a top priority across our teams. In the past, this involved multiple layers of custom built security policies that could make it cumbersome for analysts to find the data they require. The new AWS dynamic data masking feature will significantly simplify our security processes so we continue to keep customer data safe, while also reducing the administrative overhead.”

-William Robson, Data Solutions Design Lead, EnergyAustralia

Use case

For our use case, a retail company wants to control how they show credit card numbers to users based on their privilege. They also don’t want to duplicate the data for this purpose. They have the following requirements:

  • Users from Customer Service should be able to view the first six digits and the last four digits of the credit card for customer verification
  • Users from Fraud Prevention should be able to view the raw credit card number only if it’s flagged as fraud
  • Users from Auditing should be able to view the raw credit card number
  • All other users should not be able to view the credit card number

Solution overview

The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:

  • A data masking policy that defines the data obfuscation rules
  • Roles with different access levels depending on the business case
  • The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution

The following diagram illustrates how DDM support (preview) in Amazon Redshift policies works with roles and users for our retail use case.

For a user with multiple roles, the masking policy with the highest attachment priority is used. For example, in the following example, Ken is part of the Public and FrdPrvnt role. Because the FrdPrvnt role has a higher attachment priority, card_number_conditional_mask will be applied.

Prerequisites

To implement this solution, you need to complete the following prerequisites:

  1. Have an AWS account.
  2. Have an Amazon Redshift cluster provisioned with DDM support (preview) or a serverless workgroup with DDM support (preview).
    1. Navigate to the provisioned or serverless Amazon Redshift console and choose Create preview cluster.
    2. In the create cluster wizard, choose the preview track.
  3. Have Superuser privilege, or the sys:secadmin role on the Amazon Redshift data warehouse created in step 2.

Preparing the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 in Explorer.
    If you’re familiar with SQL Notebooks, you can download the Jupyter notebook for the demonstration, and import it to quickly get started.
  2. Create the table and populate contents.
  3. Create users.
    -- 1- Create the credit cards table
    CREATE TABLE credit_cards (
    customer_id INT,
    is_fraud BOOLEAN,
    credit_card TEXT
    );
    -- 2- Populate the table with sample values
    INSERT INTO credit_cards
    VALUES
    (100,'n', '453299ABCDEF4842'),
    (100,'y', '471600ABCDEF5888'),
    (102,'n', '524311ABCDEF2649'),
    (102,'y', '601172ABCDEF4675'),
    (102,'n', '601137ABCDEF9710'),
    (103,'n', '373611ABCDEF6352')
    ;
    --run GRANT to grant SELECT permission on the table
    GRANT SELECT ON credit_cards TO PUBLIC;
    --create four users
    CREATE USER Kate WITH PASSWORD '1234Test!';
    CREATE USER Ken  WITH PASSWORD '1234Test!';
    CREATE USER Bob  WITH PASSWORD '1234Test!';
    CREATE USER Jane WITH PASSWORD '1234Test!';

Implement the solution

To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:

  1. Create user roles and grant different users to different roles:
    -- 1. Create User Roles
    CREATE ROLE cust_srvc_role;
    CREATE ROLE frdprvnt_role;
    CREATE ROLE auditor_role;
    -- note that public role exist by default.
    
    -- Grant Roles to Users
    GRANT ROLE cust_srvc_role to Kate;
    GRANT ROLE frdprvnt_role  to Ken;
    GRANT ROLE auditor_role   to Bob;
    -- note that regualr_user is attached to public role by default.

  2. Create masking policies:
    -- 2. Create Masking policies
    
    -- 2.1 create a masking policy that fully masks the credit card number
    CREATE MASKING POLICY Mask_CC_Full
    WITH (credit_card VARCHAR(256))
    USING ('XXXXXXXXXXXXXXXX');
    
    --2.2- Create a scalar SQL user-defined function(UDF) that partially obfuscates credit card number, only showing the first 6 digits and the last 4 digits
    CREATE FUNCTION REDACT_CREDIT_CARD (text)
      returns text
    immutable
    as $$
      select left($1,6)||'XXXXXX'||right($1,4)
    $$ language sql;
    
    
    --2.3- create a masking policy that applies the REDACT_CREDIT_CARD function
    CREATE MASKING POLICY Mask_CC_Partial
    WITH (credit_card VARCHAR(256))
    USING (REDACT_CREDIT_CARD(credit_card));
    
    -- 2.4- create a masking policy that will display raw credit card number only if it is flagged for fraud 
    CREATE MASKING POLICY Mask_CC_Conditional
    WITH (is_fraud BOOLEAN, credit_card VARCHAR(256))
    USING (CASE WHEN is_fraud 
                     THEN credit_card 
                     ELSE Null 
           END);
    
    -- 2.5- Create masking policy that will show raw credit card number.
    CREATE MASKING POLICY Mask_CC_Raw
    WITH (credit_card varchar(256))
    USING (credit_card);

  3. Attach the masking policies on the table or column to the user or role:
    -- 3. ATTACHING MASKING POLICY
    -- 3.1- make the Mask_CC_Full the default policy for all users
    --    all users will see this masking policy unless a higher priority masking policy is attached to them or their role
    
    ATTACH MASKING POLICY Mask_CC_Full
    ON credit_cards(credit_card)
    TO PUBLIC;
    
    -- 3.2- attach Mask_CC_Partial to the cust_srvc_role role
    --users with the cust_srvc_role role can see partial credit card information
    ATTACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    TO ROLE cust_srvc_role
    PRIORITY 10;
    
    -- 3.3- Attach Mask_CC_Conditional masking policy to frdprvnt_role role
    --    users with frdprvnt_role role can only see raw credit card if it is fraud
    ATTACH MASKING POLICY Mask_CC_Conditional
    ON credit_cards(credit_card)
    USING (is_fraud, credit_card)
    TO ROLE frdprvnt_role
    PRIORITY 20;
    
    -- 3.4- Attach Mask_CC_Raw masking policy to auditor_role role
    --    users with auditor_role role can see raw credit card numbers
    ATTACH MASKING POLICY Mask_CC_Raw
    ON credit_cards(credit_card)
    TO ROLE auditor_role
    PRIORITY 30;

Test the solution

Let’s confirm that the masking policies are created and attached.

  1. Check that the masking policies are created with the following code:
    -- 1.1- Confirm the masking policies are created
    SELECT * FROM svv_masking_policy;

  2. Check that the masking policies are attached:
    -- 1.2- Verify attached masking policy on table/column to user/role.
    SELECT * FROM svv_attached_masking_policy;

    Now we can test that different users can see the same data masked differently based on their roles.

  3. Test that the Customer Service agents can only view the first six digits and the last four digits of the credit card number:
    -- 1- Confirm that customer service agent can only view the first 6 digits and the last 4 digits of the credit card number
    SET SESSION AUTHORIZATION Kate;
    SELECT * FROM credit_cards;

  4. Test that the Fraud Prevention users can only view the raw credit card number when it’s flagged as fraud:
    -- 2- Confirm that Fraud Prevention users can only view fraudulent credit card number
    SET SESSION AUTHORIZATION Ken;
    SELECT * FROM credit_cards;

  5. Test that Auditor users can view the raw credit card number:
    -- 3- Confirm the auditor can view RAW credit card number
    SET SESSION AUTHORIZATION Bob;
    SELECT * FROM credit_cards;

  6. Test that general users can’t view any digits of the credit card number:
    -- 4- Confirm that regular users can not view any digit of the credit card number
    SET SESSION AUTHORIZATION Jane;
    SELECT * FROM credit_cards;

Modify the masking policy

To modify an existing masking policy, you must detach it from the role first and then drop and recreate it.

In our use case, the business changed direction and decided that Customer Service agents should only be allowed to view the last four digits of the credit card number.

  1. Detach and drop the policy:
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
    --detach masking policy from the credit_cards table
    DETACH MASKING POLICY Mask_CC_Partial
    ON                    credit_cards(credit_card)
    FROM ROLE             cust_srvc_role;
    -- Drop the masking policy
    DROP MASKING POLICY Mask_CC_Partial;
    -- Drop the function used in masking
    DROP FUNCTION REDACT_CREDIT_CARD (TEXT);

  2. Recreate the policy and reattach the policy on the table or column to the intended user or role.Note that this time we created a scalar Python UDF. It’s possible to create a SQL, Python, and Lambda UDF based on your use case.
    -- Re-create the policy and re-attach it to role
    
    -- Create a user-defined function that partially obfuscates credit card number, only showing the last 4 digits
    CREATE FUNCTION REDACT_CREDIT_CARD (credit_card TEXT) RETURNS TEXT IMMUTABLE AS $$
        import re
        regexp = re.compile("^([0-9A-F]{6})[0-9A-F]{5,6}([0-9A-F]{4})")
        match = regexp.search(credit_card)
        if match != None:
            last = match.group(2)
        else:
            last = "0000"
        return "XXXXXXXXXXXX{}".format(last)
    $$ LANGUAGE plpythonu;
    
    --Create a masking policy that applies the REDACT_CREDIT_CARD function
    CREATE MASKING POLICY Mask_CC_Partial
    WITH (credit_card VARCHAR(256))
    USING (REDACT_CREDIT_CARD(credit_card));
    
    -- attach Mask_CC_Partial to the cust_srvc_role role
    -- users with the cust_srvc_role role can see partial credit card information
    ATTACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    TO ROLE cust_srvc_role
    PRIORITY 10;

  3. Test that Customer Service agents can only view the last four digits of the credit card number:
    -- Confirm that customer service agent can only view the last 4 digits of the credit card number
    SET SESSION AUTHORIZATION Kate;
    SELECT * FROM credit_cards;

Clean up

When you’re done with the solution, clean up your resources:

  1. Detach the masking policies from the table:
    -- Cleanup
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
    
    --1.	Detach the masking policies from table
    DETACH MASKING POLICY Mask_CC_Full
    ON credit_cards(credit_card)
    FROM PUBLIC;
    DETACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    FROM ROLE cust_srvc_role;
    DETACH MASKING POLICY Mask_CC_Conditional
    ON credit_cards(credit_card)
    FROM ROLE frdprvnt_role;
    DETACH MASKING POLICY Mask_CC_Raw
    ON credit_cards(credit_card)
    FROM ROLE auditor_role;

  2. Drop the masking policies:
    -- 2.	Drop the masking policies 
    DROP MASKING POLICY Mask_CC_Full;
    DROP MASKING POLICY Mask_CC_Partial;
    DROP MASKING POLICY Mask_CC_Conditional;
    DROP MASKING POLICY Mask_CC_Raw;

  3. Revoke and drop each user and role:
    -- 3.	Revoke/Drop - role/user 
    REVOKE ROLE cust_srvc_role from Kate;
    REVOKE ROLE frdprvnt_role  from Ken;
    REVOKE ROLE auditor_role   from Bob;
    
    DROP ROLE cust_srvc_role;
    DROP ROLE frdprvnt_role;
    DROP ROLE auditor_role;
    
    DROP USER Kate;
    DROP USER Ken;
    DROP USER Bob;
    DROP USER Jane;

  4. Drop the function and table:
    -- 4.	Drop function and table 
    DROP FUNCTION REDACT_CREDIT_CARD (credit_card TEXT);
    DROP TABLE credit_cards;

Considerations and best practices

Consider the following:

  • Always create a default policy attached to the public user. If you create a new user, they will always have a minimum policy attached. It will enforce the intended security posture.
  • Remember that DDM policies in Amazon Redshift always follow invoker permissions convention, not definer (for more information, refer to Security and privileges for stored procedures ). That being said, the masking policies are applicable based on the user or role running it.
  • For best performance, create the masking functions using a scalar SQL UDF, if possible. The performance of scalar UDFs typically goes by the order of SQL to Python to Lambda, in that order. Generally, SQL UDF outperforms Python UDFs and the latter outperforms scalar Lambda UDFs.
  • DDM policies in Amazon Redshift are applied ahead of any predicate or join operations. For example, if you’re running a join on a masked column (per your access policy) to an unmasked column, the join will lead to a mismatch. That’s an expected behavior.
  • Always detach a masking policy from all users or roles before dropping it.
  • As of this writing, the solution has the following limitations:
    • You can apply a mask policy on tables and columns and attach it to a user or role, but groups are not supported.
    • You can’t create a mask policy on views, materialized views, and external tables.
    • The DDM support (preview) in Amazon Redshift is available in following regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

Performance benchmarks

Based on various tests performed on TPC-H datasets, we’ve found built-in functions to be more performant as compared to functions created externally using scalar Python or Lambda UDFs.

Expand the solution

You can take this solution further and set up a masking policy that restricts SSN and email address access as follows:

  • Customer Service agents accessing pre-built dashboards may only view the last four digits of SSNs and complete email addresses for correspondence
  • Analysts cannot view SSNs or email addresses
  • Auditing services may access raw values for SSNs as well as email addresses

For more information, refer to Use DDM support (preview) in Amazon Redshift for E-mail & SSN Masking.

Conclusion

In this post, we discussed how to use DDM support (preview) in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support (preview) in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply a conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.

We encourage you to create your own user defined functions for various use-cases and accomplish desired security posture using dynamic data masking support in Amazon Redshift.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, TX. He has more than 16 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with the utmost security and data governance.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based on Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Variyam Ramesh is a Senior Analytics Specialist Solutions Architect at AWS based in Charlotte, NC. He is an accomplished technology leader helping customers conceptualize, develop, and deliver innovative analytic solutions.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

James Moore is a Technical Lead at Amazon Redshift focused on SQL features and security. His work over the last 10 years has spanned distributed systems, machine learning, and databases. He is passionate about building scalable software that enables customers to solve real-world problems.

Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c6i-m6i-i4i-r6i-and-r6id-instances-to-improve-cost-performance-for-spark-workloads-by-6-33/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over two times performance improvements over open-source Apache Spark and Presto, so that your applications run faster and at lower cost.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as C6i, M6i, I4i, R6i, and R6id, which use the third-generation Intel Xeon scalable processors. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33% over previous generation instances.

In this post, we describe how we estimated the cost-performance benefit from using Amazon EMR with these new instances compared to using equivalent previous generation instances.

Amazon EMR runtime performance improvements with EC2 I4i instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with five node clusters of I4i instances with data in Amazon Simple Storage Service (Amazon S3), and compared it to equivalent sized I3 instances. We measured performance improvements using the total query runtime and geometric mean of query runtime across the TPC-DS 3 TB benchmark queries.

Our results showed between 36.41–44.39% improvement in total query runtime performance on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters, and between 36–45.2% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour (on-demand) and multiplied it by the total query runtime. Note that I4i 32XL instances were not benchmarked because I3 instances don’t have the 32 XL size available. We observed between 22.56–33.1% reduced instance hour cost on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters to run the TPC-DS benchmark queries. All TPC-DS queries ran faster on I4i instance clusters compared to I3 instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent I3 and I4i instance EMR clusters.

Instance Size 16 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5
Total query runtime on I3 (seconds) 4752.15457 4506.43694 7110.03042 11853.40336 21333.05743
Total query runtime on I4I (seconds) 2642.77407 2812.05517 4415.0023 7537.52779 12981.20251
Total query runtime improvement with I4I 44.39% 37.60% 37.90% 36.41% 39.15%
Geometric mean query runtime on I3 (sec) 34.99551 29.14821 41.53093 60.8069 95.46128
Geometric mean query runtime on I4I (sec) 19.17906 18.65311 25.66263 38.13503 56.95073
Geometric mean query runtime improvement with I4I 45.20% 36.01% 38.21% 37.29% 40.34%
EC2 I3 instance price ($ per hour) $4.990 $2.496 $1.248 $0.624 $0.312
EMR I3 instance price ($ per hour) $0.270 $0.270 $0.270 $0.156 $0.078
(EC2 + EMR) I3 instance price ($ per hour) $5.260 $2.766 $1.518 $0.780 $0.390
Cost of running on I3 ($ per instance) $6.943 $3.462 $2.998 $2.568 $2.311
EC2 I4I instance price ($ per hour) $5.491 $2.746 $1.373 $0.686 $0.343
EMR I4I price ($ per hour per instance) $1.373 $0.687 $0.343 $0.172 $0.086
(EC2 + EMR) I4I instance price ($ per hour) $6.864 $3.433 $1.716 $0.858 $0.429
Cost of running on I4I ($ per instance) $5.039 $2.681 $2.105 $1.795 $1.546
Total cost reduction with I4I including performance improvement -27.43% -22.56% -29.79% -30.09% -33.10%

The following graph shows per query improvements we observed on I4i 2XL instances with EMR Runtime for Spark on Amazon EMR version 6.8 compared to equivalent I3 2XL instances for the TPC-DS 3 TB benchmark.

Amazon EMR runtime performance improvements with EC2 M6i instances

M6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent M5 instances. Our test results showed between 13.45–29.52% improvement in total query runtime for seven different instance sizes within the instance family, and between 7.98–25.37% improvement in geometric mean. On cost comparison, we observed 7.98–25.37% reduced instance hour cost on M6i instance EMR clusters compared to M5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6i and M5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on M5 (seconds) 4027.58043 3782.10766 3348.05362 3516.4308 5621.22532 10075.45109 17278.15146
Total query runtime on M6I (seconds) 3106.43834 2665.70607 2714.69862 3043.5975 4195.02715 8226.88301 14515.50394
Total query runtime improvement with M6I 22.87% 29.52% 18.92% 13.45% 25.37% 18.35% 15.99%
Geometric mean query runtime M5 (sec) 30.45437 28.5207 23.95314 23.55958 32.95975 49.43178 75.95984
Geometric mean query runtime M6I (sec) 23.76853 19.21783 19.16869 19.9574 24.23012 39.09965 60.79494
Geometric mean query runtime improvement with M6I 21.95% 32.62% 19.97% 15.29% 26.49% 20.90% 19.96%
EC2 M5 instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M5 instance price ($ per hour) $0.27 $0.27 $0.27 $0.27 $0.19 $0.10 $0.05
(EC2 + EMR) M5 instance price ($ per hour) $4.88 $3.34 $2.57 $1.81 $0.96 $0.48 $0.24
Cost of running on M5 ($ per instance) $5.46 $3.51 $2.39 $1.76 $1.50 $1.34 $1.15
EC2 M6I instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M6I price ($ per hour per instance) $1.15 $0.77 $0.58 $0.38 $0.19 $0.10 $0.05
(EC2 + EMR) M6I instance price ($ per hour) $5.76 $3.84 $2.88 $1.92 $0.96 $0.48 $0.24
Cost of running on M6I ($ per instance) $4.97 $2.84 $2.17 $1.62 $1.12 $1.10 $0.97
Total cost reduction with M6I including performance improvement -8.92% -19.02% -9.28% -7.98% -25.37% -18.35% -15.99%

Amazon EMR runtime performance improvements with EC2 R6i instances

R6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.25–32.23% improvement in total query runtime for six different instance sizes within the instance family, and between 16.12–36.5% improvement in geometric mean. R5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 5.48–23.5% reduced instance hour cost on R6i instance EMR clusters compared to R5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6i and R5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2XL
Number of core instances in EMR cluster 5 5 5 5 5 5
Total query runtime on R5 (seconds) 4024.4737 3715.74432 3552.97298 3535.69879 5379.73168 9121.41532
Total query runtime on R6I (seconds) 2865.83169 2518.24192 2513.4849 3031.71973 4544.44854 6977.9508
Total query runtime improvement with R6I 28.79% 32.23% 29.26% 14.25% 15.53% 23.50%
Geometric mean query runtime R5 (sec) 30.59066 28.30849 25.30903 23.85511 32.33391 47.28424
Geometric mean query runtime R6I (sec) 21.87897 17.97587 17.54117 20.00918 26.6277 34.52817
Geometric mean query runtime improvement with R6I 28.48% 36.50% 30.69% 16.12% 17.65% 26.98%
EC2 R5 instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R5 instance price ($ per hour) $0.2700 $0.2700 $0.2700 $0.2700 $0.2520 $0.1260
(EC2 + EMR) R5 instance price ($ per hour) $6.3180 $4.3020 $3.2940 $2.2860 $1.2600 $0.6300
Cost of running on R5 ($ per instance) $7.0630 $4.4403 $3.2510 $2.2452 $1.8829 $1.5962
EC2 R6I instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R6I price ($ per hour per instance) $1.5120 $1.0080 $0.7560 $0.5040 $0.2520 $0.1260
(EC2 + EMR) R6I instance price ($ per hour) $7.5600 $5.0400 $3.7800 $2.5200 $1.2600 $0.6300
Cost of running on R6I ($ per instance) $6.0182 $3.5255 $2.6392 $2.1222 $1.5906 $1.2211
Total cost reduction with R6I including performance improvement -14.79% -20.60% -18.82% -5.48% -15.53% -23.50%

Amazon EMR runtime performance improvements with EC2 C6i instances

C6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent C5 instances. Our test results showed between 16.9–58.22% improvement in total query runtime for four different instance sizes within the instance family, and between 20.25–59.59% improvement in geometric mean. Only C6i 24, 12, 4, and 2xlarge sizes were benchmarked because C5 doesn’t have 32, 16 and 8 xlarge sizes. C5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 16.75–50.07% reduced instance hour cost on C6i instance EMR clusters compared to C5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent C6i and C5 instance EMR clusters.

Instance Size * 24 XL 12 XL 4 XL 2 XL
Number of core instances in EMR cluster 5 5 5 5
Total query runtime on C5 (seconds) 3435.59808 2900.84981 5945.12879 10173.00757
Total query runtime on C6I (seconds) 2711.16147 2471.86778 5195.30093 8787.43422
Total query runtime improvement with C6I 21.09% 14.79% 12.61% 13.62%
Geometric mean query runtime C5 (sec) 25.67058 20.06539 31.76582 46.78632
Geometric mean query runtime C6I (sec) 20.4458 17.14133 26.92196 39.32622
Geometric mean query runtime improvement with C6I 20.35% 14.57% 15.25% 15.95%
EC2 C5 instance price ($ per hour) $4.080 $2.040 $0.680 $0.340
EMR C5 instance price ($ per hour) $0.270 $0.270 $0.170 $0.085
(EC2 + EMR) C5 instance price ($ per hour) $4.35000 $2.31000 $0.85000 $0.42500
Cost of running on C5 ($ per instance) $4.15135 $1.86138 $1.40371 $1.20098
EC2 C6I instance price ($ per hour) $4.0800 $2.0400 $0.6800 $0.3400
EMR C6I price ($ per hour per instance) $1.02000 $0.51000 $0.17000 $0.08500
(EC2 + EMR) C6I instance price ($ per hour) $5.10000 $2.55000 $0.85000 $0.42500
Cost of running on C6I ($ per instance) $3.84081 $1.75091 $1.22667 $1.03741
Total cost reduction with C6I including performance improvement -7.48% -5.93% -12.61% -13.62%

Amazon EMR runtime performance improvements with EC2 R6id instances

R6id instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5D instances. Our test results showed between 11.8–28.7% improvement in total query runtime for five different instance sizes within the instance family, and between 15.1–32.0% improvement in geometric mean. R6ID 32 XL instances were not benchmarked because R5D instances don’t have these sizes available. On cost comparison, we observed 6.8–11.5% reduced instance hour cost on R6ID instance EMR clusters compared to R5D EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6id and R5d instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on R5D (seconds) 4054.4492975042 3691.7569385583 3598.6869168064 3532.7398928104 5397.5330161574 9281.2627059927 16862.8766838096
Total query runtime on R6ID (seconds) 2992.1198446983 2633.7131630720 2632.3186613402 2729.8860537867 4583.1040980373 7921.9960917943 14867.5391541445
Total query runtime improvement with R6ID 26.20% 28.66% 26.85% 22.73% 15.09% 14.65% 11.83%
Geometric mean query runtime R5D (sec) 31.0238156851 28.1432927726 25.7532157307 24.0596427675 32.5800246829 48.2306670294 76.6771994376
Geometric mean query runtime R6ID (sec) 22.8681174894 19.1282742957 18.6161830746 18.0498249257 25.9500918360 39.6580341258 65.0947323858
Geometric mean query runtime improvement with R6ID 26.29% 32.03% 27.71% 24.98% 20.35% 17.77% 15.11%
EC2 R5D instance price ($ per hour) $6.912000 $4.608000 $3.456000 $2.304000 $1.152000 $0.576000 $0.288000
EMR R5D instance price ($ per hour) $0.270000 $0.270000 $0.270000 $0.270000 $0.270000 $0.144000 $0.072000
(EC2 + EMR) R5D instance price ($ per hour) $7.182000 $4.878000 $3.726000 $2.574000 $1.422000 $0.720000 $0.360000
Cost of running on R5D ($ per instance) $8.088626 $5.002331 $3.724641 $2.525909 $2.132026 $1.856253 $1.686288
EC2 R6ID instance price ($ per hour) $7.257600 $4.838400 $3.628800 $2.419200 $1.209600 $0.604800 $0.302400
EMR R6ID price ($ per hour per instance) $1.814400 $1.209600 $0.907200 $0.604800 $0.302400 $0.151200 $0.075600
(EC2 + EMR) R6ID instance price ($ per hour) $9.072000 $6.048000 $4.536000 $3.024000 $1.512000 $0.756000 $0.378000
Cost of running on R6ID ($ per instance) $7.540142 $4.424638 $3.316722 $2.293104 $1.924904 $1.663619 $1.561092
Total cost reduction with R6ID including performance improvement -6.78% -11.55% -10.95% -9.22% -9.71% -10.38% -7.42%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with the following fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C6i, M6i, I4i, R6i, and R6id, instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33%.


About the authors

AI MSAl MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoy playing video games.

Perform multi-cloud analytics using Amazon QuickSight, Amazon Athena Federated Query, and Microsoft Azure Synapse

Post Syndicated from Harish Rajagopalan original https://aws.amazon.com/blogs/big-data/perform-multi-cloud-analytics-using-amazon-quicksight-amazon-athena-federated-query-and-microsoft-azure-synapse/

In this post, we show how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data that is stored in Microsoft Azure Synapse databases.

Organizations today use data stores that are best suited for the applications they build. Additionally, they may also continue to use some of their legacy data stores as they modernize and migrate to the cloud. These disparate data stores might be spread across on-premises data centers and different cloud providers. This presents a challenge for analysts to be able to access, visualize, and derive insights from the disparate data stores.

QuickSight is a fast, cloud-powered business analytics service that enables employees within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data on their devices anytime. Amazon Athena is a serverless interactive query service that provides full ANSI SQL support to query a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet, that are stored on Amazon Simple Storage Service (Amazon S3). For data that isn’t stored on Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store it in Amazon S3.

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of Athena’s query engine. In this post, we use the Athena connector for Azure Synapse analytics that enables Athena to run SQL queries on your Azure Synapse databases using JDBC.

Solution overview

Consider the following reference architecture for visualizing data from Azure Synapse Analytics.

In order to explain this architecture, let’s walk through a sample use case of analyzing fitness data of users. Our sample dataset contains users’ fitness information like age, height, and weight, and daily run stats like miles, calories, average heart rate, and average speed, along with hours of sleep.

We run queries on this dataset to derive insights using visualizations in QuickSight. With QuickSight, you can create trends of daily miles run, keep track of the average heart rate over a period of time, and detect anomalies, if any. You can also track your daily sleep patterns and compare how rest impacts your daily activities. The out-of-the-box insights feature gives vital weekly insights that can help you be on top of your fitness goals. The following screenshot shows sample rows of our dataset stored in Azure Synapse.


Prerequisites

Make sure you have the following prerequisites:

  • An AWS account set up with QuickSight enabled. If you don’t have a QuickSight account, you can sign up for one. You can access the QuickSight free trial as part of the AWS Free Tier option.
  • An Azure account with data pre-loaded in Synapse. We use a sample fitness dataset in this post. We used a data generator to generate this data.
  • A virtual private connection (VPN) between AWS and Azure.

Note that the AWS resources for the steps in this post need to be in the same Region.

Configure a Lambda connector

To configure your Lambda connector, complete the following steps:

  1. Load the data.
    In the Azure account, the sample data for fitness devices is stored and accessed in an Azure Synapse Analytics workspace using a dedicated SQL pool table. The firewall settings for Synapse should allow for access to a VPC through a VPN. You can use your data or tables that you need to connect QuickSight to in this step.
  2. On the Amazon S3 console, create a spillover bucket and note the name to use in a later step.
    This bucket is used for storing the spillover data for the Synapse connector.
  3. On the AWS Serverless Application Repository console, choose Available applications in the navigation pane.
  4. On the Public applications tab, search for synapse and choose AthenaSynapseConnector with the AWS verified author tag.
  5. Create the Lambda function with the following configuration:
    1. For Name, enter AthenaSynapseConnector.
    2. For SecretNamePrefix, enter AthenaJdbcFederation.
    3. For SpillBucket, enter the name of the S3 bucket you created.
    4. For DefaultConnectionString, enter the JDBC connection string from the Azure SQL pool connection strings property.
    5. For LambdaFunctionName, enter a function name.
    6. For SecurityGroupIds and SubnetIds, enter the security group and subnet for your VPC (this is needed for the template to run successfully).
    7. Leave the remaining values as their default.
  6. Choose Deploy.
  7. After the function is deployed successfully, navigate to the athena_hybrid_azure function.
  8. On the Configurations tab, choose Environment variables in the navigation pane.
  9. Add the key azure_synapse_demo_connection_string with the same value as the default key (the JDBC connection string from the Azure SQL pool connection strings property).

    For this post, we removed the VPC configuration.
  10. Choose VPC in the navigation pane and choose None to remove the VPC configuration.
    Now you’re ready to configure the data source.
  11. On the Athena console, choose Data sources in the navigation pane.
  12. Choose Create data source.
  13. Choose Microsoft Azure Synapse as your data source.
  14. Choose Next.
  15. Create a data source with the following parameters:
    1. For Data source name, enter azure_synapse_demo.
    2. For Connection details, choose the Lambda function athena_hybrid_azure.
  16. Choose Next.

Create a dataset on QuickSight to read the data from Azure Synapse

Now that the configuration on the Athena side is complete, let’s configure QuickSight.

  1. On the QuickSight console, on the user name menu, choose Manage QuickSight.
  2. Choose Security & permissions in the navigation pane.
  3. Under QuickSight access to AWS services, choose Manage.
  4. Choose Amazon Athena and in the pop-up permissions box, choose Next.
  5. On the S3 Bucket tab, select the spill bucket you created earlier.
  6. On the Lambda tab, select the athena_hybrid_azure function.
  7. Choose Finish.
  8. If the QuickSight access to AWS services window appears, choose Save.
  9. Choose the QuickSight icon on the top left and choose New dataset.
  10. Choose Athena as the data source.
  11. For Data source name, enter a name.
  12. Check the Athena workgroup settings where the Athena data source was created.
  13. Choose Create data source.
  14. Choose the catalog azure_synapse_demo and the database dbo.
  15. Choose Edit/Preview data.
  16. Change the query mode to SPICE.
  17. Choose Publish & Visualize.
  18. Create an analysis in QuickSight.
  19. Publish a QuickSight dashboard.

If you’re new to QuickSight or looking to build stunning dashboards, this workshop provides step-by-step instructions to grow your dashboard building skills from basic to advanced level. The following screenshot is an example dashboard to give you some inspiration.

Clean up

To avoid ongoing charges, complete the following steps:

  1. Delete the S3 bucket created for the Athena spill data.
  2. Delete the Athena data source.
  3. On the AWS CloudFormation console, select the stack you created for AthenaSynapseConnector and choose Delete.
    This will delete the created resources, such as the Lambda function. Check the stack’s Events tab to track the progress of the deletion, and wait for the stack status to change to DELETE_COMPLETE.
  4. Delete the QuickSight datasets.
  5. Delete the QuickSight analysis.
  6. Delete your QuickSight subscription and close the account.

Conclusion

In this post, we showed you how to overcome the challenges of connecting to and analyzing data in other clouds by using AWS analytics services to connect to Azure Synapse Analytics with Athena Federated Query and QuickSight. We also showed you how to visualize and derive insights from the fitness data using QuickSight. With QuickSight and Athena Federated Query, organizations can now access additional data sources beyond those already supported natively by QuickSight. If you have data in sources other than Amazon S3, you can use Athena Federated Query to analyze the data in place or build pipelines that extract and store data in Amazon S3.

For more information and resources for QuickSight and Athena, visit Analytics on AWS.


About the authors

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

Salim Khan is a Specialist Solutions Architect for Amazon QuickSight. Salim has over 16 years of experience implementing enterprise business intelligence (BI) solutions. Prior to AWS, Salim worked as a BI consultant catering to industry verticals like Automotive, Healthcare, Entertainment, Consumer, Publishing and Financial Services. He has delivered business intelligence, data warehousing, data integration and master data management solutions across enterprises.

Sriram Vasantha is a Senior Solutions Architect at AWS in Central US helping customers innovate on the cloud. Sriram focuses on application and data modernization, DevSecOps, and digital transformation. In his spare time, Sriram enjoys playing different musical instruments like Piano, Organ, and Guitar.

Adarsha Nagappasetty is a Senior Solutions Architect at Amazon Web Services. Adarsha works with enterprise customers in Central US and helps them with their cloud journey. In his spare time, Adarsha enjoys spending time outdoors with his family!

Explore your data lake using Amazon Athena for Apache Spark

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/explore-your-data-lake-using-amazon-athena-for-apache-spark/

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

Solution overview

We showcase reading and exploring CSV and Parquet datasets to perform interactive analysis using Athena for Apache Spark and the expressive power of Python. We also perform visual analysis using the pre-installed Python libraries. For running this analysis, we use the built-in notebook editor in Athena.

For the purpose of this post, we use the NOAA Global Surface Summary of Day public dataset from the Registry of Open Data on AWS, which consists of daily weather summaries from various NOAA weather stations. The dataset is primarily in plain text CSV format. We have transformed the entire and subsets of the CSV dataset into Parquet format for our demo.

Before running the demo, we want to introduce the following concepts related to Athena for Spark:

  • Sessions – When you open a notebook in Athena, a new session is started for it automatically. Sessions keep track of the variables and state of notebooks.
  • Calculations – Running a cell in a notebook means running a calculation in the current session. As long as a session is running, calculations use and modify the state that is maintained for the notebook.

For more details, refer to Session and Calculations.

Prerequisites

For this demo, you need the following prerequisites:

  • An AWS account with access to the AWS Management Console
  • Athena permissions on the workgroup DemoAthenaSparkWorkgroup, which you create as part of this demo
  • AWS Identity and Access Management (IAM) permissions to create, read, and update the IAM role and policies created as part of the demo
  • Amazon S3 permissions to create an S3 bucket and read the bucket location

The following policy grants these permissions. Attach it to the IAM role or user you use to sign in to the console. Make sure to provide your AWS account ID and the Region in which you’re running the demo.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "athena:*",
            "Resource": "arn:aws:athena:<REGION>:<ACCOUNT_ID>:workgroup/DemoAthenaSparkWorkgroup"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:GetRole",
                "iam:ListAttachedRolePolicies",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy",
                "iam:ListRolePolicies",
                "iam:GetRolePolicy",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<ACCOUNT_ID>:role/service-role/AWSAthenaSparkExecutionRole-*",
                "arn:aws:iam::<ACCOUNT_ID>:policy/service-role/AWSAthenaSparkRolePolicy-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<ACCOUNT_ID>-<REGION>-athena-results-bucket-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListPolicies",
                "iam:ListRoles",
                "athena:ListWorkGroups",
                "athena:ListEngineVersions"
            ],
            "Resource": "*"
        }
    ]
}

Create your Athena workgroup

We create a new Athena workgroup with Spark as the engine. Complete the following steps:

  1. On the Athena console, choose Workgroups in the navigation pane.
  2. Choose Create workgroup.
  3. For Workgroup name, enter DemoAthenaSparkWorkgroup.
    Make sure to enter the exact name because the preceding IAM permissions are scoped down for the workgroup with this name.
  4. For Analytics engine, choose Apache Spark.
  5. For Additional configurations, select Use defaults.
    The defaults include the creation of an IAM role with the required permissions to run Spark calculations on Athena and an S3 bucket to store calculation results. It also sets the notebook (which we create later) encryption key management to an AWS Key Management Service (AWS KMS) key owned by Athena.
  6. Optionally, add tags to your workgroup.
  7. Choose Create workgroup.

Modify the IAM role

Creating the workgroup creates a new IAM role. Choose the newly created workgroup, then the value under Role ARN to be redirected to the IAM console.

Add the following permission as an inline policy to the IAM role created earlier. This allows the role to read the S3 datasets. For instructions, refer to the section To embed an inline policy for a user or role (console) in Adding IAM identity permissions (console).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/*",
                "arn:aws:s3:::noaa-gsod-pds/2022/*",
                "arn:aws:s3:::noaa-gsod-pds",
                "arn:aws:s3:::athena-examples-us-east-1"
            ]
        }
    ]
}

Set up your notebook

To run the analysis on Spark on Athena, we need a notebook. Complete the following steps to create one:

  1. On the Athena console, choose Notebook Editor.
  2. Choose the newly created workgroup DemoAthenaSparkWorkgroup on the drop-down menu.
  3. Choose Create Notebook.
  4. Provide a notebook name, for example AthenaSparkBlog.
  5. Keep the default session parameters.
  6. Choose Create.

Your notebook should now be loaded, which means you can start running Spark code. You should see the following screenshot.

Explore the dataset

Now that we have workgroup and notebook created, let’s start exploring the NOAA Global Surface Summary of Day dataset. The datasets used in this post are stored in the following locations:

  • CSV data for year 2022s3://noaa-gsod-pds/2022/
  • Parquet data for year 2021s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2021/
  • Parquet data for year 2020s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/
  • Entire dataset in Parquet format (until October 2022)s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/

In the rest of this post, we show PySpark code snippets. Copy the code and enter it in the notebook’s cell. Press Shift+Enter to run the code as a calculation. Alternatively, you can choose Run. Add more cells to run subsequent code snippets.

We start by reading the CSV dataset for the year 2022 and print its schema to understand the columns contained in the dataset. Run the following code in the notebook cell:

year_22_csv = spark.read.option("header","true").csv(f"s3://noaa-gsod-pds/2022/")
year_22_csv.printSchema()

We get the following output.

We were able to submit the preceding code as a calculation instantly using the notebook.

Let’s continue exploring the dataset. Looking at the columns in the schema, we’re interested in previewing the data for the following attributes in 2022:

  • TEMP – Mean temperature
  • WDSP – Mean wind speed
  • GUST – Maximum wind gust
  • MAX – Maximum temperature
  • MIN – Minimum temperature
  • Name – Station name

Run the following code:

year_22_csv.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show()

We get the following output.

Now we have an idea of what the dataset looks like. Next, let’s perform some analysis to find the maximum recorded temperature for the Seattle-Tacoma Airport in 2022. Run the following code:

from pyspark.sql.functions import max

year_22_csv.filter("NAME == 'SEATTLE TACOMA AIRPORT, WA US'").agg(max("MAX").alias("max_temp_yr_2022")).show()

We get the following output.

Next, we want to find the maximum recorded temperature for each month of 2022. For this, we use the Spark SQL feature of Athena. First, we need to create a temporary view on the year_22_csv data frame. Run the following code:

year_22_csv.createOrReplaceTempView("y22view")

To run our Spark SQL query, we use %%sql magic:

%%sql
select month(to_date(date,'yyyy-MM-dd')) as month_yr_22,max(MAX) as max_temp 
from y22view where NAME == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1

We get the following output.

The output of the preceding query produces the month in numeric form. To make it more readable, let’s convert the month numbers into month names. For this, we use a user-defined function (UDF) and register it to use in the Spark SQL queries for the rest of the notebook session. Run the following code to create and register the UDF:

import calendar

# UDF to convert month number to month name
spark.udf.register("month_name_udf",lambda x: calendar.month_name[int(x)])

We rerun the query to find the maximum recorded temperature for each month of 2022 but with the month_name_udf UDF we just created. Also, this time we sort the results based on the maximum temperature value. See the following code:

%%sql
select month_name_udf(month(to_date(date,'yyyy-MM-dd'))) as month_yr_22,
max(MAX) as max_temp
from y22view where NAME == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1 order by 2 desc

The following output shows the month names.

Until now, we have run interactive explorations for the year 2022 of the NOAA Global Surface Summary of Day dataset. Let’s say we want to compare the temperature values with the previous 2 years. We compare the maximum temperature across 2020, 2021, and 2022. As a reminder, the dataset for 2022 is in CSV format and for 2020 and 2021, the datasets are in Parquet format.

To continue with the analysis, we read the 2020 and 2021 Parquet datasets into the data frame and create temporary views on the respective data frames. Run the following code:

#Read the dataset
year_20_pq = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/")
year_21_pq = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2021/")

#Create temporary views
year_20_pq.createOrReplaceTempView("y20view")
year_21_pq.createOrReplaceTempView("y21view")

#Preview the datasets
print('Preview for year 2020:')
year_20_pq.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show(1)
print('Preview for year 2021:')
year_21_pq.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show(1)

We get the following output.

To compare the recorded maximum temperature for each month in 2020, 2021, and 2022, we perform a join operation on the three views created so far from their respective data frames. Also, we reuse the month_name_udf UDF to convert month number to month name. Run the following code:

%%sql
select month_name_udf(month(to_date(y21.DATE,'yyyy-MM-dd'))) as month, max(y20.max) as max_temp_2020, 
max(y21.max) as max_temp_2021, max(y22.max) as max_temp_2022 \
from y20view y20 inner join y21view y21 inner join y22view y22 \
on month(to_date(y20.DATE,'yyyy-MM-dd')) = month(to_date(y21.DATE,'yyyy-MM-dd'))
and month(to_date(y21.DATE,'yyyy-MM-dd')) = month(to_date(y22.DATE,'yyyy-MM-dd')) \
where y20.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y21.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y22.NAME == 'SEATTLE TACOMA AIRPORT, WA US' \
group by 1

We get the following output.

So far, we’ve read CSV and Parquet datasets, run analysis on the individual datasets, and performed join and aggregation operations on them to derive insights instantly in an interactive mode. Next, we show how you can use the pre-installed libraries like Seaborn, Matplotlib, and Pandas for Spark on Athena to generate a visual analysis. For the full list of preinstalled Python libraries, refer to List of preinstalled Python libraries.

We plot a visual analysis to compare the recorded maximum temperature values for each month in 2020, 2021, and 2022. Run the following code, which creates a Spark data frame from the SQL query, converts it into a Pandas data frame, and uses Seaborn and Matplotlib for plotting:

import seaborn as sns
import matplotlib.pyplot as plt

y20_21_22=spark.sql("select month(to_date(y21.DATE,'yyyy-MM-dd')) as month, max(y20.max) as max_temp_yr_2020, \
max(y21.max) as max_temp_yr_2021, max(y22.max) as max_temp_yr_2022 \
from y20view y20 inner join y21view y21 inner join y22view y22 \
on month(to_date(y20.DATE,'yyyy-MM-dd')) = month(to_date(y21.DATE,'yyyy-MM-dd')) \
and month(to_date(y21.DATE,'yyyy-MM-dd')) = month(to_date(y22.DATE,'yyyy-MM-dd')) \
where y20.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y21.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y22.NAME == 'SEATTLE TACOMA AIRPORT, WA US' \
group by 1 order by 1")

#convert to pandas dataframe
y20_21_22=y20_21_22.toPandas()

#change datatypes to float for plotting
y20_21_22['max_temp_yr_2020']= y20_21_22['max_temp_yr_2020'].astype(float)
y20_21_22['max_temp_yr_2021']= y20_21_22['max_temp_yr_2021'].astype(float)
y20_21_22['max_temp_yr_2022']= y20_21_22['max_temp_yr_2022'].astype(float)

# Unpivot dataframe from wide to long format for plotting
y20_21_22=y20_21_22.melt('month',var_name='max_temperature', \
             value_name='temperature')

plt.clf()

sns.catplot(data=y20_21_22,x='month',y='temperature', hue='max_temperature', \
            sort=False, kind='point', height=4, aspect=1.5)
%matplot plt

The following graph shows our output.

Next, we plot a heatmap showing the maximum temperature trend for each month across all the years in the dataset. For this, we have converted the entire CSV dataset (until October 2022) into Parquet format and stored it in s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/.

Run the following code to plot the heatmap:

noaa = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/")
noaa.createOrReplaceTempView("noaaview")

#query to find maximum temperature for each month from year 1973 to 2022
year_hist=spark.sql("select month(to_date(date,'yyyy-MM-dd')) as month, \
year(to_date(date,'yyyy-MM-dd')) as year,  cast(max(temp) as float) as temp \
from noaaview where NAME == 'SEATTLE TACOMA AIRPORT, WA US' group by 1,2") 

# convert spark dataframe to pandas
year_hist=year_hist.toPandas()
year_hist=year_hist.pivot("month","year","temp")

plt.clf()
grid_kws = {"height_ratios": (0.9, .05), "hspace": .5}
f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws)

sns.heatmap(year_hist, ax=ax, cbar_ax=cbar_ax, cmap="RdYlBu_r", \
            cbar_kws={"orientation": "horizontal"})
%matplot plt

We get the following output.

From the potting, we can see the trend has been almost similar across the years, where the temperature rises during summer months and lowers as winter approaches in the Seattle-Tacoma Airport area. You can continue exploring the datasets further, running more analyses and plotting more visuals to get the feel of the interactive and instant-on experience Athena for Apache Spark offers.

Clean up resources

When you’re done with the demo, make sure to delete the S3 bucket you created to store the workgroup calculations to avoid storage costs. Also, you can delete the workgroup, which deletes the notebook as well.

Conclusion

In this post, we saw how you can use the interactive and serverless experience of Athena for Spark as the engine to run calculations instantly. You just need to create a workgroup and notebook to start running the Spark code. We explored datasets stored in different formats in an S3 data lake and ran interactive analyses to derive various insights. Also, we ran visual analyses by plotting charts using the preinstalled libraries. To learn more about Spark on Athena, refer to Using Apache Spark in Amazon Athena.


About the Authors

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Raj Devnath is a Sr. Product Manager at AWS working on Amazon Athena. He is passionate about building products customers love and helping customers extract value from their data. His background is in delivering solutions for multiple end markets, such as finance, retail, smart buildings, home automation, and data communication systems.

Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Amazon Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another.

Some customers share their data with 50–100 data warehouses in different accounts and do a lot of cross-sharing, making it difficult to track who is accessing what data. They have to navigate to an individual account’s Amazon Redshift console to retrieve the access information. Also, many customers have their data lake on Amazon Simple Storage Service (Amazon S3), which is shared within and across various business units. As the organization grows and democratizes the data, administrators want the ability to manage the datashare centrally for governance and auditing, and to enforce fine-grained access control.

Working backward from customer ask, we are announcing the preview of the following new feature: Amazon Redshift data sharing integration with AWS Lake Formation, which enables Amazon Redshift customers to centrally manage access to their Amazon Redshift datashares using Lake Formation.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns, and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles.

Customers are using the data mesh approach, which provides a mechanism to share data across business units. Customers are also using a modern data architecture to share data from data lake stores and Amazon Redshift purpose-built data stores across business units. Lake Formation provides the ability to enforce data governance within and across business units, which enables secure data access and sharing, easy data discovery, and centralized audit for data access.

United Airlines is in the business of connecting people and uniting the world.

“As a data-driven enterprise, United is trying to create a unified data and analytics experience for our analytics community that will innovate and build modern data-driven applications. We believe we can achieve this by building a purpose-built data mesh architecture using a variety of AWS services like Athena, Aurora, Amazon Redshift, and Lake Formation to simplify management and governance around granular data access and collaboration.”

-Ashok Srinivas, Director of ML Engineering and Sarang Bapat, Director of Data Engineering.

In this post, we show how to centrally manage access and permissions for Amazon Redshift data sharing with Lake Formation.

Solution overview

In this solution, we demonstrate how integration of Amazon Redshift data sharing with Lake Formation for data governance can help you build data domains, and how you can use the data mesh approach to bring data domains together to enable data sharing and federation across business units. The following diagram illustrates our solution architecture.

solution architecture

The data mesh is a decentralized, domain-oriented architecture that emphasizes separating data producers from data consumers via a centralized, federated Data Catalog. Typically, the producers and consumers run within their own account. The details of these data mesh characteristics are as follows:

  • Data producers – Data producers own their data products and are responsible for building their data, maintaining its accuracy, and keeping their data product up to date. They determine what datasets can be published for consumption and share their datasets by registering them with the centralized data catalog in a central governance account. You might have a producer steward or administrator persona for managing the data products with the central governance steward or administrators team.
  • Central governance account – Lake Formation enables fine-grained access management on the shared dataset. The centralized Data Catalog offers consumers the ability to quickly find shared datasets, allows administrators to centrally manage access permissions on shared datasets, and provides security teams the ability to audit and track data product usage across business units.
  • Data consumers – The data consumer obtains access to shared resources from the central governance account. These resources are available inside the consumer’s AWS Glue Data Catalog, allowing fine-grained access on the database and table that can be managed by the consumer’s data stewards and administrators.

The following steps provide an overview of how Amazon Redshift data sharing can be governed and managed by Lake Formation in the central governance pattern of a data mesh architecture:

  1. In the producer account, data objects are created and maintained in the Amazon Redshift producer cluster. A data warehouse admin creates the Amazon Redshift datashare and adds datasets (tables, views) to the share.
  2. The data warehouse admin grants and authorizes access on the datashare to the central governance account’s Data Catalog.
  3. In the central governance account, the data lake admin accepts the datashare and creates the AWS Glue database that points to the Amazon Redshift datashare so that Lake Formation can manage it.
  4. The data lake admin shares the AWS Glue database and tables to the consumer account using Lake Formation cross-account sharing.
  5. In the consumer account, the data lake admin accepts the resource share invitation via AWS Resource Access Manager (AWS RAM) and can view the database listed in the account.
  6. The data lake admin defines the fine-grained access control and grants permissions on databases and tables to IAM users (for this post, consumer1 and consumer2) in the account.
  7. In the Amazon Redshift cluster, the data warehouse admin creates an Amazon Redshift database that points to the Glue database and authorizes usage on the created Amazon Redshift database to the IAM users.
  8. The data analyst as an IAM user can now use their preferred tools like the Amazon Redshift query editor to access the dataset based on the Lake Formation fine-grained permissions.

We use the following account setup for our example in this post:

  • Producer account – 123456789012
  • Central account – 112233445566
  • Consumer account – 665544332211

Prerequisites

Create the Amazon Redshift data share and add datasets

In the data producer account, create an Amazon Redshift cluster using the RA3 node type with encryption enabled. Complete the following steps:

  1. On the Amazon Redshift console, create a cluster subnet group.

For more information, refer to Managing cluster subnet groups using the console.

  1. Choose Create cluster.
  2. For Cluster identifier, provide the cluster name of your choice.
  3. For Preview track, choose preview_2022.

  1. For Node type, choose one of the RA3 node types.

This feature is only supported on the RA3 node type.

  1. For Number of nodes, enter the number of nodes that you need for your cluster.
  2. Under Database configurations, choose the admin user name and admin user password.
  3. Under Cluster permissions, you can select the IAM role and set it as the default.

For more information about the default IAM role, refer to Creating an IAM role as default for Amazon Redshift.

cluster permissions

  1. Turn on the Use defaults option next to Additional configurations to modify the default settings.
  2. Under Network and security, specify the following:
    1. For Virtual private cloud (VPC), choose the VPC you would like to deploy the cluster in.
    2. For VPC security groups, either leave as default or add the security groups of your choice.
    3. For Cluster subnet group, choose the cluster subnet group you created.

additional configurations

  1. Under Database configuration, in the Encryption section, select Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

Encryption is disabled by default.

  1. For Choose an AWS KMS key, you can either choose an existing AWS Key Management Service (AWS KMS) key, or choose Create an AWS KMS key to create a new key.

For more information, refer to Creating keys.

database configurations

  1. Choose Create cluster.
  2. For this post, create tables and load data into the producer Amazon Redshift cluster using the following script.

Authorize the datashare

Install or update the latest AWS Command Line Interface (AWS CLI) version to run the AWS CLI to authorize the datashare. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps in the central governance account to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases.
  4. Deselect Use only IAM access control for new tables in new databases.
  5. Choose Version 2 for Cross account version settings.
  6. Choose Save.

data catalog settings

Set up an IAM user as a data lake administrator

If you’re using an existing data lake administrator user or role add the following managed policies, if not attached and skip the below setup steps:

AWSGlueServiceRole
AmazonRedshiftFullAccess

Otherwise, to set up an IAM user as a data lake administrator, complete the following steps:

  1. On the IAM console, choose Users in the navigation pane.
  2. Select the IAM user who you want to designate as the data lake administrator.
  3. Choose Add an inline policy on the Permissions tab.
  4. Replace <AccountID> with your own account ID and add the following policy:
{
    "Version": "2012-10-17",
    "Statement": [ {
        "Condition": {"StringEquals": {
            "iam:AWSServiceName":"lakeformation.amazonaws.com"}},
            "Action":"iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Effect": "Allow"},
            {"Action": ["iam:PutRolePolicy"],
            "Resource": "arn:aws:iam::<AccountID>:role/aws-service role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess",
            "Effect": "Allow"
     },{
                  "Effect": "Allow",
                  "Action": [
                    "ram:AcceptResourceShareInvitation",
                    "ram:RejectResourceShareInvitation",
                    "ec2:DescribeAvailabilityZones",
                    "ram:EnableSharingWithAwsOrganization"
                  ],
                  "Resource": "*"
                }]
}
  1. Provide a policy name.
  2. Review and save your settings.
  3. Choose Add permissions, and choose Attach existing policies directly.
  4. Add the following policies:
    1. AWSLakeFormationCrossAccountManager
    2. AWSGlueConsoleFullAccess
    3. AWSGlueServiceRole
    4. AWSLakeFormationDataAdmin
    5. AWSCloudShellFullAccess
    6. AmazonRedshiftFullAccess
  5. Choose Next: Review and add permissions.

Data consumer account setup

In the consumer account, follow the steps mentioned previously in the central governance account to set up Lake Formation and a data lake administrator.

  1. In the data consumer account, create an Amazon Redshift cluster using the RA3 node type with encryption (refer to the steps demonstrated to create an Amazon Redshift cluster in the producer account).
  2. Choose Launch stack to deploy an AWS CloudFormation template to create two IAM users with policies.

launch stack

The stack creates the following users under the data analyst persona:

  • consumer1
  • consumer2
  1. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  2. Capture the ConsoleIAMLoginURL and LFUsersCredentials values.

createiamusers

  1. Choose the LFUsersCredentials value to navigate to the AWS Secrets Manager console.
  2. In the Secret value section, choose Retrieve secret value.

secret value

  1. Capture the secret value for the password.

Both consumer1 and consumer2 need to use this same password to log in to the AWS Management Console.

secret value

Configure an Amazon Redshift datashare using Lake Formation

Producer account

Create a datashare using the console

Complete the following steps to create an Amazon Redshift datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, choose the cluster to create the datashare.
  2. On the cluster details page, navigate to the Datashares tab.
  3. Under Datashares created in my namespace, choose Connect to database.

connect to database

  1. Choose Create datashare.

create datashare

  1. For Datashare type, choose Datashare.
  2. For Datashare name, enter the name (for this post, demotahoeds).
  3. For Database name, choose the database from where to add datashare objects (for this post, dev).
  4. For Publicly accessible, choose Turn off (or choose Turn on to share the datashare with clusters that are publicly accessible).

datashare information

  1. Under DataShare objects, choose Add to add the schema to the datashare (in this post, the public schema).
  2. Under Tables and views, choose Add to add the tables and views to the datashare (for this post, we add the table customer and view customer_view).

datashare objects

  1. Under Data consumers, choose Publish to AWS Data Catalog.
  2. For Publish to the following accounts, choose Other AWS accounts.
  3. Provide the AWS account ID of the consumer account. For this post, we provide the AWS account ID of the Lake Formation central governance account.
  4. To share within the same account, choose Local account.
  5. Choose Create datashare.

data consumers

  1. After the datashare is created, you can verify by going back to the Datashares tab and entering the datashare name in the search bar under Datashares created in my namespace.
  2. Choose the datashare name to view its details.
  3. Under Data consumers, you will see the consumer status of the consumer data catalog account as Pending Authorization.

data consumers

  1. Choose the checkbox against the consumer data catalog which will enable the Authorize option.

authorize

  1. Click Authorize to authorize the datashare access to the consumer account data catalog, consumer status will change to Authorized.

authorized

Create a datashare using a SQL command

Complete the following steps to create a datashare in data producer account 1 and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create Connection.
  3. For Authentication, choose Temporary credentials.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  1. For Database, enter a database name (for this post, dev).
  2. For Database user, enter the user authorized to access the database (for this post, awsuser).
  3. Choose Save to connect to the database.

Connecting to an Amazon Redshift database

  1. Run the following SQL commands to create the datashare and add the data objects to be shared:
create datashare demotahoeds;
ALTER DATASHARE demotahoeds ADD SCHEMA PUBLIC;
ALTER DATASHARE demotahoeds ADD TABLE customer;
ALTER DATASHARE demotahoeds ADD TABLE customer_view;
  1. Run the following SQL command to share the producer datashare to the central governance account:
GRANT USAGE ON DATASHARE demotahoeds TO ACCOUNT '<central-aws-account-id>' via DATA CATALOG

Run the following SQL command

  1. You can verify the datashare created and objects shared by running the following SQL command:
DESC DATASHARE demotahoeds

DESC DATASHARE demotahoeds

  1. Run the following command using the AWS CLI to authorize the datashare to the central data catalog so that Lake Formation can manage them:
aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-identifier DataCatalog/<central-aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/XXXXXXXXXXXX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

You can verify the datashare status on the console by following the steps outlined in the previous section.

Central catalog account

The data lake admin accepts and registers the datashare with Lake Formation in the central governance account and creates a database for the same. Complete the following steps:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Under Data catalog in the navigation pane, choose Data sharing and view the Amazon Redshift datashare invitations on the Configuration tab.
  4. Select the datashare and choose Review Invitation.

AWS Lake Formation data sharing

A window pops up with the details of the invitation.

  1. Choose Accept to register the Amazon Redshift datashare to the AWS Glue Data Catalog.

accept reject invitation

  1. Provide a name for the AWS Glue database and choose Skip to Review and create.

Skip to Review and create

  1. Review the content and choose Create database.

create database

After the AWS Glue database is created on the Amazon Redshift datashare, you can view them under Shared Databases.

Shared Databases.

You can also use the AWS CLI to register the datashare and create the database. Use the following commands:

  1. Describe the Amazon Redshift datashare that is shared with the central account:
aws redshift describe-data-shares
  1. Accept and associate the Amazon Redshift datashare to Data Catalog:
aws redshift associate-data-share-consumer \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-arn arn:aws:glue:us-east-1:<central-aws-account-id>:catalog

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:123456789012:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [
        {
            "ConsumerIdentifier": "arn:aws:glue:us-east-1:112233445566:catalog",
            "Status": "ACTIVE",
            "ConsumerRegion": "us-east-1",
            "CreatedDate": "2022-11-09T23:25:22.378000+00:00",
            "StatusChangeDate": "2022-11-09T23:25:22.378000+00:00"
        }
    ]
}
  1. Register the Amazon Redshift datashare in Lake Formation:
aws lakeformation register-resource \
--resource-arn arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds
  1. Create the AWS Glue database that points to the accepted Amazon Redshift datashare:
aws glue create-database --region <central-catalog-region> --cli-input-json '{
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": {
        "Name": "demotahoedb",
        "FederatedDatabase": {
            "Identifier": "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds",
            "ConnectionName": "aws:redshift"
        }
    }
}'

Now the data lake administrator of the central governance account can view and share access on both the database and tables to the data consumer account using the Lake Formation cross-account sharing feature.

Grant datashare access to the data consumer

To grant the data consumer account permissions on the shared AWS Glue database, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the data consumer account ID (for this post, 665544332211).
  5. Under LF_Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. Select Describe for both Database permissions and Grantable permissions.
  8. Choose Grant to apply the permissions.

grant data permissions

To grant the data consumer account permissions on tables, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the consumer account (for this post, we use 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. For Tables, choose All tables.
  8. Select Describe and Select for both Table permissions and Grantable permissions.
  9. Choose Grant to apply the changes.

grant the data consumer account permissions on tables

Consumer account

The consumer admin will receive the shared resources from the central governance account and delegate access to other users in the consumer account as shown in the following table.

IAM User Object Access Object Type Access Level
consumer1 public.customer Table All
consumer2 public.customer_view View specific columns: c_customer_id, c_birth_country, cd_gender, cd_marital_status, cd_education_status

In the data consumer account, follow these steps to accept the resources shared with the account:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Sign in to the AWS RAM console.
  4. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations. You will receive 2 invitations.

Resource shares to view the pending invitations

  1. Choose the pending invitations and accept the resource share.

Choose the pending invitation and accept the resource share

  1. On the Lake formation console, under Data catalog in the navigation pane, choose Databases to view the cross-account shared database.

choose Databases to view the cross-account shared database

Grant access to the data analyst and IAM users using Lake Formation

Now the data lake admin in the data consumer account can delegate permissions on the shared database and tables to users in the consumer account.

Grant database permissions to consumer1 and consumer2

To grant the IAM users consumer1 and consumer2 database permissions, follow these steps:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.

choose Grant database

  1. Under Principals, select IAM users and roles.
  2. Choose the IAM users consumer1 and consumer2.
  3. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  4. Select Describe for Database permissions.
  5. Choose Grant to apply the permissions.

Choose Grant to apply the permissions

Grant table permissions to consumer1

To grant the IAM user consumer1 permissions on table public.customer, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose IAM user consumer1.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer.
  7. Select Describe and Select for Table permissions.
  8. Choose Grant to apply the permissions.

Grant table permissions to consumer1

Grant column permissions to consumer2

To grant the IAM user consumer2 permissions on non-sensitive columns in public.customer_view, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user consumer2.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer_view.

Grant column permissions to consumer2

  1. Select Select for Table permissions.
  2. Under Data Permissions, select Column-based access.
  3. Select Include columns and choose the non-sensitive columns (c_customer_id, c_birth_country, cd_gender, cd_marital_status, and cd_education_status).
  4. Choose Grant to apply the permissions.

table permissions

Consume the datashare from the data consumer account in the Amazon Redshift cluster

In the Amazon Redshift consumer data warehouse, log in as the admin user using Query Editor V2 and complete the following steps:

  1. Create the Amazon Redshift database from the shared catalog database using the following SQL command:
CREATE DATABASE demotahoedb FROM ARN 'arn:aws:glue:<producer-region>:<producer-aws-account-id>:database/demotahoedb' WITH DATA CATALOG SCHEMA demotahoedb ;
  1. Run the following SQL commands to create and grant usage on the Amazon Redshift database to the IAM users consumer1 and consumer2:
CREATE USER IAM:consumer1 password disable;
CREATE USER IAM:consumer2  password disable;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer1;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer2;

In order to use a federated identity to enforce Lake Formation permissions, follow the next steps to configure Query Editor v2.

  1. Choose the settings icon in the bottom left corner of the Query Editor v2, then choose Account settings.

identity to enforce Lake Formation permissions

  1. Under Connection settings, select Authenticate with IAM credentials.
  2. Choose Save.

Authenticate with IAM credentials

Query the shared datasets as a consumer user

To validate that the IAM user consumer1 has datashare access from Amazon Redshift, perform the following steps:

  1. Sign in to the console as IAM user consumer1.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane.
  3. To connect to the consumer cluster, choose the consumer cluster in the tree-view pane.
  4. When prompted, for Authentication, select Temporary credentials using your IAM identity.
  5. For Database, enter the database name (for this post, dev).
  6. The user name will be mapped to your current IAM identity (for this post, consumer1).
  7. Choose Save.

edit connection for redshift

  1. Once you’re connected to the database, you can validate the current logged-in user with the following SQL command:
select current_user;

  1. To find the federated databases created on the consumer account, run the following SQL command:
SHOW DATABASES FROM DATA CATALOG [ACCOUNT '<id1>', '<id2>'] [LIKE 'expression'];

federated databases created on the consumer account

  1. To validate permissions for consumer1, run the following SQL command:
select * from demotahoedb.public.customer limit 10;

As shown in the following screenshot, consumer1 is able to successfully access the datashare customer object.

Now let’s validate that consumer2 doesn’t have access to the datashare tables “public.customer” on the same consumer cluster.

  1. Log out of the console and sign in as IAM user consumer2.
  2. Follow the same steps to connect to the database using the query editor.
  3. Once connected, run the same query:
select * from demotahoedb.public.customer limit 10;

The user consumer2 should get a permission denied error, as in the following screenshot.

should get a permission denied error

Let’s validate the column-level access permissions of consumer2 on public.customer_view view.

  1. Connect to Query Editor v2 as consumer2 and run the following SQL command:
select c_customer_id,c_birth_country,cd_gender,cd_marital_status from demotahoedb.public.customer_view limit 10;

In the following screenshot, you can see consumer2 is only able to access columns as granted by Lake Formation.

access columns as granted by Lake Formation

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. Using Amazon Redshift data sharing with Lake Formation for data governance helps build the data mesh architecture, enabling data sharing and federation across business units with fine-grained access control.

Special thanks to everyone who contributed to launch Amazon Redshift data sharing with AWS Lake Formation:

Debu Panda, Michael Chess, Vlad Ponomarenko, Ting Yan, Erol Murtezaoglu, Sharda Khubchandani, Rui Bi

References


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is a Analytics Specialist Principal Solutions Architect, with AWS.

Ranjan Burman is a Analytics Specialist Solutions Architect, with AWS.

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus around providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks with joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Steve Mitchell is a Senior Solution Architect with a passion for analytics and data mesh. He enjoys working closely with customers as they transition to a modern data architecture.

Data: The genesis for modern invention

Post Syndicated from Swami Sivasubramanian original https://aws.amazon.com/blogs/big-data/data-the-genesis-for-modern-invention/

It only takes one groundbreaking invention—one iconic idea that solves a widespread pain point for customers—to create or transform an industry forever. From the invention of the telegraph, to the discovery of GPS, to the earliest cloud computing services, history is filled with examples of these “eureka” moments that continue to have long-lasting impacts on the way we do business today.

Cognitive scientists John Kounios and Mark Beeman demonstrated that great inventors don’t simply stumble upon their epiphanies; in reality, an idea is preceded by a collection of life experiences, educational knowledge, or even past failures the human brain processes and assimilates over time. Their ideas are preceded by a collection of data points.

When we apply this concept to organizations, and the vast amount of data being produced on a daily basis, we realize there’s an incredible opportunity to ingest, store, process, analyze, and visualize data to create the next big thing.

Today—more than ever before—data is the genesis for modern invention. But to produce new ideas with our data, we need to build dynamic, end-to-end data strategies that lead to new customer experiences as the final output. Some of the biggest brands in the world like Formula 1, Toyota, and Georgia-Pacific are already leveraging AWS to do just that.

This week at AWS re:Invent 2022, I shared several key learnings we’ve collected after working with these brands and more than 1.5 million customers who are using AWS to build their data strategies.

I also revealed several new services and innovations for our customers. Here are a few highlights.

You need a comprehensive set of services to get the job done

Creating a data lake to perform analytics and machine learning (ML) is not an end-to-end data strategy. Your needs will inevitably grow and change over time, which is why we believe every customer should have access to a wide variety of tools based on data types, personas, and their specific use cases.

And our data supports this, with 94% of the top 1,000 AWS customers using more than 10 of our databases and analytics services. A one-size-fits-all approach just doesn’t work in the long run.

You need a comprehensive set of services that enable you to store and query data in your databases, data lakes, and data warehouses; services that help you act on your data with analytics, business intelligence, and machine learning; and services that help you catalog and govern your data across your organization.

You should also have access to services that support a variety of data types for your future use cases, whether you’re working with financial data, clinical data, or retail data. Many of our customers are also using their data to create machine learning models, but some data types are still too cumbersome to work with and prepare for ML.

For example, geospatial data, which supports use cases like self-driving cars, urban planning, or even crop yield in agricultural farms, can be incredibly difficult to access, prepare, and visualize for ML. That’s why this week we announced new capabilities for Amazon SageMaker that make it easier for data scientists to work with geospatial data.

Performance and security are paramount

Performance and security continue to be critical components of our customers’ data strategies.

You’ll need to perform at scale across your data warehouses, databases, and data lakes, or when you want to quickly analyze and visualize your data. We’ve built our business on high-performing services like Amazon Aurora, Amazon DynamoDB, and Amazon Redshift, and this week, we announced several new capabilities to continue building on our performance innovations to date.

For our serverless, interactive query service, Amazon Athena, we announced a new integration with Apache Spark that enables you to spin up Spark workloads up to 75 times faster than other serverless Spark offerings. We also introduced a new feature called Elastic Clusters within our fully managed document database, Amazon DocumentDB, that enables customers to easily scale out or shard their data across multiple database instances.

To help our customers protect their data from potential compromises, we announced Amazon GuardDuty RDS Protection to intelligently detect potential threats for their data stored in Aurora, as well as a new open-source project that allows developers to safely use PostgreSQL extensions in their core databases without worrying about unintended security impacts.

Connecting data is critical for deeper insights

To get the most of your data, you need to combine data silos for deeper insights. However, connecting data across siloes typically requires complex extract, transform, and load (ETL) pipelines, which means creating a manual integration every time you want to ask a different question of your data or build a different ML model. This isn’t fast enough to keep up with the speed that businesses need to move today.

Zero ETL is the future. And we’ve been making strides in this zero-ETL future for several years by deepening integrations between our services. But this week, we’re getting closer to a zero-ETL future by announcing Aurora now supports zero-ETL integration with Amazon Redshift to bring transactional data in Aurora and the analytical capabilities in Amazon Redshift together.

We also announced a new auto-copy feature from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift that removes the need for you to build and manage ETL pipelines whenever you want to use your data for analytics. And we’re not stopping here. With AWS, you can now connect to hundreds of data sources, from software as a service (SaaS) applications to on-premises data stores.

We’ll continue to build no zero-ETL capabilities into our services to help our customers easily analyze all their data, no matter where it resides.

Data governance unleashes innovation

Governance was historically used as a defensive measure, which meant locking down data in silos. But in reality, the right governance strategy helps you move and innovate faster with guardrails that give the right people access to your data, when and where they need it.

In addition to fine-grained access controls within AWS Lake Formation, this week we’re making it easier for customers to govern access and privileges within more of our data services with new capabilities announced in Amazon Redshift and Amazon SageMaker.

Our customers also told us they want an end-to-end strategy that enables them to govern their data across the entire data journey. That’s why this week we announced Amazon DataZone, a new data management service that helps you catalog, discover, analyze, share, and govern data across the organization.

When you properly manage secure access to your data, it can flow to the right places and connect the dots across siloed teams and departments.

Build with AWS

With the introduction of these new services and features this week, as well as our comprehensive set of data services, it’s important to remember that support is available as you build your end-to-end data strategy. In fact, we have an entire team at AWS, as well as an extensive network of partners to help our customers build data foundations that will meet their needs now—and well into the future.

For more information about re:Invent 2022, please visit our event page.


About the Author

Swami Sivasubramanian is the Vice President of AWS Data and Machine Learning.

Log analytics the easy way with Amazon OpenSearch Serverless

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/log-analytics-the-easy-way-with-amazon-opensearch-serverless/

We recently announced the preview release of Amazon OpenSearch Serverless, a new serverless option for Amazon OpenSearch Service, which makes it easy for you to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

OpenSearch Serverless supports two primary use cases:

  • Log analytics that focuses on analyzing large volumes of semi-structured, machine-generated time series data for operational, security, and user behavior insights
  • Full-text search that powers customer applications in their internal networks (content management systems, legal documents) and internet-facing applications such as ecommerce website catalog search and content search

This post focuses on building a simple log analytics pipeline with OpenSearch Serverless.

Solution overview

In the following sections, we walk through the steps to create and access a collection in OpenSearch Serverless, and demonstrate how to configure two different data ingestion pipelines to index data into the collection.

Create a collection

To get started with OpenSearch Serverless, you first create a collection. A collection in OpenSearch Serverless is a logical grouping of one or more indexes that represent an analytics workload.

The following graphic gives a quick navigation for creating a collection. Alternatively, refer to this blog post to learn more about how to create and configure a collection in OpenSearch Serverless.

Access the collection

You can use the AWS Identity and Access Management (IAM) credentials with a secret key and access key ID for your IAM users and roles to access your collection programmatically. Alternatively, you can set up SAML authentication for accessing the OpenSearch Dashboards. Note that SAML authentication is only available to access OpenSearch Dashboards; you require IAM credentials to perform any operations using the AWS Command Line Interface (AWS CLI), API, and OpenSearch clients for indexing and searching data. In this post, we use IAM credentials to access the collections.

Create a data ingestion pipeline

OpenSearch Serverless supports the same ingestion pipelines as the open-source OpenSearch and managed clusters. These clients include applications like Logstash and Amazon Kinesis Data Firehose, and language clients like Java Script, Python, Go, Java, and more. For more details on all the ingestion pipelines and supported clients, refer to ingesting data into OpenSearch Serverless collections.

Using Logstash

The open-source version of Logstash (Logstash OSS) provides a convenient way to use the bulk API to upload data into your collections. OpenSearch Serverless supports the logstash-output-opensearch output plugin, which supports IAM credentials for data access control. In this post, we show how to use the file input plugin to send data from your command line console to an OpenSearch Serverless collection. Complete the following steps:

  1. Download the logstash-oss-with-opensearch-output-plugin file (this example uses the distro for macos-x64; for other distros, refer to the artifacts):
    wget https://artifacts.opensearch.org/logstash/logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz

  2. Extract the downloaded tarball:
    tar -zxvf logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz
    cd logstash-8.4.0/

  3. Update the logstash-output-opensearch plugin to the latest version:
    ./bin/logstash-plugin update logstash-output-opensearch

    The OpenSearch output plugin for OpenSearch Serverless uses IAM credentials to authenticate. In this example, we show how to use the file input plugin to read data from a file and ingest into an OpenSearch Serverless collection.

  4. Create a log file with the following sample data and name it sample.log:
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:25.855Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.155Z","traffic":"heavy","weather_category":"Cloudy","weather":"Heavy Fog"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.255Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.556Z","traffic":"heavy","weather_category":"Cloudy","weather":"Heavy Fog"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.756Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}

  5. Create a new file and add the following content, and save the file as logstash-output-opensearch.conf after providing the information about your file path, host, Region, access key, and secret access key:
    input {
       file {
         path => "<path/to/your/sample.log>"
         start_position => "beginning"
       }
    }
    output {
        opensearch {
            ecs_compatibility => disabled
            index => "logstash-sample"
            hosts => "<HOST>:443"
            auth_type => {
                type => 'aws_iam'
                aws_access_key_id => '<AWS_ACCESS_KEY_ID>'
                aws_secret_access_key => '<AWS_SECRET_ACCESS_KEY>'
                region => '<REGION>'
                service_name => 'aoss'
                }
            legacy_template => false
            default_server_major_version => 2
        }
    }

  6. Use the following command to start Logstash with the config file created in the previous step. This creates an index called logstash-sample and ingests the document added under the sample.log file:
    ./bin/logstash -f <path/to/your/config/file>

  7. Search using OpenSearch Dashboards by running the following query:
    GET logstash-sample/_search
    {
      "query": {
        "match_all": {}
      },
      "track_total_hits" : true
    }

In this step, you used a file input plugin from Logstash to send data to OpenSearch Serverless. You can replace the input plugin with any other plugin supported by Logstash, such as Amazon Simple Storage Service (Amazon S3), stdin, tcp, or others, to send data to the OpenSearch Serverless collection.

Using a Python client

OpenSearch provides high-level clients for several popular programming languages, which you can use to integrate with your application. With OpenSearch Serverless, you can continue to use your existing OpenSearch client to load and query your data in collections.

In this section, we show how to use the opensearch-py client for Python to establish a secure connection with your OpenSearch Serverless collection, create an index, send sample logs, and analyze those log data using OpenSearch Dashboards. In this example, we use a sample event generated from fleets carrying goods and packages. This data contains pertinent fields such as source, destination, weather, speed, and traffic. The following is a sample record:

"_source" : {
    "deviceId" : 2823605996,
    "fleetRegNo" : "IRV82MBYQ1",
    "carrier" : "AOS Van Lines",
    "temperature" : 14,
    "tripId" : 6741375582,
    "originODC" : "ODC Las Vegas",
    "originCountry" : "United States",
    "originCity" : "Las Vegas",
    "destinationCity" : "San Jose",
    "@timestamp" : "2022-11-17T17:11:25.855Z",
    "traffic" : "heavy",
    "weather" : "Cloudy"
    ...
    ...
}

To set up the Python client for OpenSearch, you must have the following prerequisites:

  • Python3 installed on your local machine or the server from where you are running this code
  • Package Installer for Python (PIP) installed
  • The AWS CLI configured; we use it to store the secret key and access key for credentials

Complete the following steps to set up the Python client:

  1. Add the OpenSearch Python client to your project and use Python’s virtual environment to set up the required packages:
    mkdir python-sample
    cd python-sample
    python3 -m venv .env
    source .env/bin/activate
    .env/bin/python3 -m pip install opensearch-py
    .env/bin/python3 -m pip install requests_aws4auth
    .env/bin/python3 -m pip install boto3
    .env/bin/python3 -m pip install geopy

  2. Save your frequently used configuration settings and credentials in files that are maintained by the AWS CLI (see Quick configuration with aws configure) by using the following commands and providing your access key, secret key, and Region:
    aws configure

  3. The following sample code uses the opensearch-py client for Python to establish a secure connection to the specified OpenSearch Serverless collection and index a sample document to index time series. You must provide values for region and host. Note that you must use aoss as the service name for OpenSearch Service. Copy the code and save in a file as sample_python.py:
    from opensearchpy import OpenSearch, RequestsHttpConnection
    from requests_aws4auth import AWS4Auth
    import boto3
    
    host = '<host>' # OpenSearch Serverless collection endpoint
    region = '<region>' # e.g. us-west-2
    
    service = 'aoss'
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service,
    session_token=credentials.token)
    
    # Create an OpenSearch client
    client = OpenSearch(
        hosts = [{'host': host, 'port': 443}],
        http_auth = awsauth,
        use_ssl = True,
        verify_certs = True,
        connection_class = RequestsHttpConnection
    )
    # Specify index name
    index_name = 'octank-iot-logs-2022-11-19'
    
    # Prepare a document to index 
    document = {
        "deviceId" : 2823605996,
        "fleetRegNo" : "IRV82MBYQ1",
        "carrier" : "AOS Van Lines",
        "temperature" : 14,
        "tripId" : 6741375582,
        "originODC" : "ODC Las Vegas",
        "originCountry" : "United States",
        "originCity" : "Las Vegas",
        "destinationCity" : "San Jose",
        "@timestamp" : "2022-11-19T17:11:25.855Z",
        "traffic" : "heavy",
        "weather" : "Cloudy"
    }
    
    # Index Documents
    response = client.index(
        index = index_name,
        body = document
    )
    
    print('\n Document indexed with response:')
    print(response)
    
    
    # Search for the Documents
    q = 'heavy'
    query = {
        'size': 5,
            'query': {
            'multi_match': {
            'query': q,
            'fields': ['traffic']
            }
        }
    }
    
    response = client.search(
    body = query,
    index = index_name
    )
    print('\nSearch results:')
    print(response)

  4. Run the sample code:
    python3 sample_python.py

  5. On the OpenSearch Service console, select your collection.
  6. On OpenSearch Dashboards, choose Dev Tools.
  7. Run the following search query to retrieve documents:
    GET octank-iot-logs-*/_search
    {
      "query": {
        "match_all": {}
      }
    }

After you have ingested the data, you can use OpenSearch Dashboards to visualize your data. In the following example, we analyze data visually to gain insights on various dimensions such as average fuel consumed by a specific fleet, traffic conditions, distance traveled, and average mileage by the fleet.

Conclusion

In this post, you created a log analytics pipeline using OpenSearch Serverless, a new serverless option for OpenSearch Service. With OpenSearch Serverless, you can focus on building your application without having to worry about provisioning, tuning, and scaling the underlying infrastructure. OpenSearch Serverless supports the same ingestion pipelines and high-level clients as the open-source OpenSearch project. You can easily get started using the familiar OpenSearch indexing and query APIs to load and search your data and use OpenSearch Dashboards to visualize that data.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search applications. Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and build a similar log analytics pipeline that was discussed in this post.


About the authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Pavani Baddepudi is a senior product manager working in search services at AWS. Her interests include distributed systems, networking, and security.

New for Amazon Redshift – Simplify Data Ingestion and Make Your Data Warehouse More Secure and Reliable

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-simplify-data-ingestion-and-make-your-data-warehouse-more-secure-and-reliable/

When we talk with customers, we hear that they want to be able to harness insights from data in order to make timely, impactful, and actionable business decisions. A common pattern with data-driven organizations is that they have many different data sources they need to ingest into their analytics systems. This requires them to build manual data pipelines spanning across their operational databases, data lakes, streaming data, and data within their warehouse. As a consequence of this complex setup, it can take data engineers weeks or even months to build data ingestion pipelines. These data pipelines are costly, and the delays can lead to missed business opportunities. Additionally, data warehouses are increasingly becoming mission critical systems that require high availability, reliability, and security.

Amazon Redshift is a fully managed petabyte-scale data warehouse used by tens of thousands of customers to easily, quickly, securely, and cost-effectively analyze all their data at any scale. This year at re:Invent, Amazon Redshift has announced a number of features to help you simplify data ingestion and get to insights easily and quickly, within a secure, reliable environment.

In this blog, I introduce some of these new features that fit into two main categories:

  • Simplify data ingestion
    • Amazon Redshift now supports auto-copy from Amazon S3 (available in preview). With this new capability, Amazon Redshift automatically loads the files that arrive in an Amazon Simple Storage Service (Amazon S3) location that you specify into your data warehouse. The files can use any of the formats supported by the Amazon Redshift copy command, such as CSV, JSON, Parquet, and Avro. In this way, you don’t need to manually or repeatedly run copy procedures. Amazon Redshift automates file ingestion and takes care of data-loading steps under the hood.
    • With Amazon Aurora zero-ETL integration with Amazon Redshift, you can use Amazon Redshift for near real-time analytics and machine learning on petabytes of transactional data stored on Amazon Aurora MySQL databases (available in limited preview). With this capability, you can choose the Amazon Aurora databases containing the data you want to analyze with Amazon Redshift. Data is then replicated into your data warehouse within seconds after transactional data is written into Amazon Aurora, eliminating the need to build and maintain complex data pipelines. You can replicate data from multiple Amazon Aurora databases into the same Amazon Redshift instance to run analytics across multiple applications. With near real-time access to transactional data, you can leverage Amazon Redshift’s analytics and capabilities, such as built-in machine learning (ML), materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data.
    • With the general availability of Amazon Redshift Streaming Ingestion, you can now natively ingest hundreds of megabytes of data per second from Amazon Kinesis Data Streams and Amazon MSK into an Amazon Redshift materialized view and query it in seconds. Learn more in this post.
  • Make your data warehouse more secure and reliable
    • You can now improve the availability of your data warehouse by choosing multiple Availability Zone (AZ) deployments. Multi-AZ deployments for your Amazon Redshift clusters are available in preview and reduce recovery times to seconds through automatic recovery. In this way, you can build solutions that are more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework.
    • With dynamic data masking (available in preview), you can protect sensitive information stored in your data warehouse and ensure that only the relevant data is accessible by users based on their roles. You can limit how much identifiable data is visible to users using multiple levels of policies so different users and groups can have different levels of data access without having to create multiple copies of data. Dynamic data masking complements other granular access control capabilities in Amazon Redshift including row-level and column-level security and role-based access controls. In this way, Dynamic Data Masking helps you meet requirements for GDPR, CCPA, and other privacy regulations.
    • Amazon Redshift now supports central access controls for data sharing with AWS Lake Formation (available in public preview). You can now use Lake Formation to simplify governance of data shared from Amazon Redshift and centrally manage granular access across all data-sharing consumers.

There have been other interesting news for Amazon Redshift at re:Invent you might have already heard about:

  • The general availability of Amazon Redshift integration for Apache Spark makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, opening up the data warehouse for a broader set of AWS analytics and machine learning solutions.
  • AWS Backup now supports Amazon Redshift. AWS Backup allows you to define a central backup policy to manage data protection of your applications and can also protect your Amazon Redshift clusters. In this way, you have a consistent experience when managing data protection across all supported services.

Availability and Pricing
Multi-AZ deployments, central access control for data sharing with AWS Lake Formation, auto-copy from Amazon S3, and dynamic data masking are available in preview in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

There is no additional cost for using auto-copy from Amazon S3 and near real-time analytics on transactional data. There is no extra charge for dynamic data masking and central access control for data sharing. For more information, see Amazon Redshift pricing.

These new capabilities take you one step further in analyzing all your data across data sources with simple data ingestion capabilities, while improving the security and reliability of your data warehouse.

Danilo

New — Amazon Athena for Apache Spark

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-amazon-athena-for-apache-spark/

When Jeff Barr first announced Amazon Athena in 2016, it changed my perspective on interacting with data. With Amazon Athena, I can interact with my data in just a few steps—starting from creating a table in Athena, loading data using connectors, and querying using the ANSI SQL standard.

Over time, various industries, such as financial services, healthcare, and retail, have needed to run more complex analyses for a variety of formats and sizes of data. To facilitate complex data analysis, organizations adopted Apache Spark. Apache Spark is a popular, open-source, distributed processing system designed to run fast analytics workloads for data of any size.

However, building the infrastructure to run Apache Spark for interactive applications is not easy. Customers need to provision, configure, and maintain the infrastructure on top of the applications. Not to mention performing optimal tuning resources to avoid slow application starts and suffering from idle costs.

Introducing Amazon Athena for Apache Spark
Today, I’m pleased to announce Amazon Athena for Apache Spark. With this feature, we can run Apache Spark workloads, use Jupyter Notebook as the interface to perform data processing on Athena, and programmatically interact with Spark applications using Athena APIs. We can start Apache Spark in under a second without having to manually provision the infrastructure.

Here’s a quick preview:

Quick preview of Amazon Athena for Apache Spark

How It Works
Since Amazon Athena for Apache Spark runs serverless, this benefits customers in performing interactive data exploration to gain insights without the need to provision and maintain resources to run Apache Spark. With this feature, customers can now build Apache Spark applications using the notebook experience directly from the Athena console or programmatically using APIs.

The following figure explains how this feature works:

How Amazon Athena for Apache Spark works

On the Athena console, you can now run notebooks and run Spark applications with Python using Jupyter notebooks. In this Jupyter notebook, customers can query data from various sources and perform multiple calculations and data visualizations using Spark applications without context switching.

Amazon Athena integrates with AWS Glue Data Catalog, which helps customers to work with any data source in AWS Glue Data Catalog, including data in Amazon S3. This opens possibilities for customers in building applications to analyze and visualize data to explore data to prepare data sets for machine learning pipelines.

As I demonstrated in the demo preview section, the initialization for the workgroup running the Apache Spark engine takes under a second to run resources for interactive workloads. To make this possible, Amazon Athena for Apache Spark uses Firecracker, a lightweight micro-virtual machine, which allows for instant startup time and eliminates the need to maintain warm pools of resources. This benefits customers who want to perform interactive data exploration to get insights without having to prepare resources to run Apache Spark.

Get Started with Amazon Athena for Apache Spark
Let’s see how we can use Amazon Athena for Apache Spark. In this post, I will explain step-by-step how to get started with this feature.

The first step is to create a workgroup. In the context of Athena, a workgroup helps us to separate workloads between users and applications.

To create a workgroup, from the Athena dashboard, select Create Workgroup.

Select Create Workgroup

On the next page, I give the name and description for this workgroup.

Creating a workgroup

On the same page, I can choose Apache Spark as the engine for Athena. In addition, I also need to specify a service role with appropriate permissions to be used inside a Jupyter notebook. Then, I check Turn on example notebook, which makes it easy for me to get started with Apache Spark inside Athena. I also have the option to encrypt Jupyter notebooks managed by Athena or use the key I have configured in AWS Key Management Service (AWS KMS).

After that, I need to define an Amazon Simple Storage Service (Amazon S3) bucket to store calculation results from the Jupyter notebook. Once I’m sure of all the configurations for this workgroup, I just have to select Create workgroup.

Configure Calculation Results Settings

Now, I can see the workgroup already created in Athena.

Select newly created workgroup

To see the details of this workgroup, I can select the link from the workgroup. Since I also checked the Turn on example notebook when creating this workgroup, I have a Jupyter notebook to help me get started. Amazon Athena also provides flexibility for me to import existing notebooks that I can upload from my laptop with Import file or create new notebooks from scratch by selecting Create notebook.

Example notebook is available in the workgroup

When I select the Jupyter notebook example, I can start building my Apache Spark application.

When I run a Jupyter notebook, it automatically creates a session in the workgroup. Subsequently, each time I run a calculation inside the Jupyter notebook, all results will be recorded in the session. This way, Athena provides me with full information to review each calculation by selecting Calculation ID, which took me to the Calculation details page. Here, I can review the Code and also Results for the calculation.

Review code and results of a calculation

In the session, I can adjust the Coordinator size and Executor size, with 1 data processing unit (DPU) by default. A DPU consists of 4 vCPU and 16 GB of RAM. Changing to a larger DPU allows me to process tasks faster if I have complex calculations.

Configuring session parameters

Programmatic API Access
In addition to using the Athena console, I can also use programmatic access to interact with the Spark application inside Athena. For example, I can create a workgroup with the create-work-group command, start a notebook with create-notebook, and run a notebook session with start-session.

Using programmatic access is useful when I need to execute commands such as building reports or computing data without having to open the Jupyter notebook.

With my Jupyter notebook that I’ve created before, I can start a session by running the following command with the AWS CLI:

$> aws athena start-session \
    --work-group <WORKGROUP_NAME>\
    --engine-configuration '{"CoordinatorDpuSize": 1, "MaxConcurrentDpus":20, "DefaultExecutorDpuSize": 1, "AdditionalConfigs":{"NotebookId":"<NOTEBOOK_ID>"}}'
    --notebook-version "Jupyter 1"
    --description "Starting session from CLI"

{
    "SessionId":"<SESSION_ID>",
    "State":"CREATED"
}

Then, I can run a calculation using the start-calculation-execution API.

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>"
    --description "Demo"
    --code-block "print(5+6)"

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}

In addition to using code inline, with the --code-block flag, I can also pass input from a Python file using the following command:

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>"
    --description "Demo"
    --code-block file://<PYTHON FILE>

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}

Pricing and Availability
Amazon Athena for Apache Spark is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). To use this feature, you are charged based on the amount of compute usage defined by the data processing unit or DPU per hour. For more information see our pricing page here.

To get started with this feature, see Amazon Athena for Apache Spark to learn more from the documentation, understand the pricing, and follow the step-by-step walkthrough.

Happy building,

Donnie

New for Amazon Redshift – General Availability of Streaming Ingestion for Kinesis Data Streams and Managed Streaming for Apache Kafka

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-general-availability-of-streaming-ingestion-for-kinesis-data-streams-and-managed-streaming-for-apache-kafka/

Ten years ago, just a few months after I joined AWS, Amazon Redshift was launched. Over the years, many features have been added to improve performance and make it easier to use. Amazon Redshift now allows you to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes. More recently, Amazon Redshift Serverless became generally available to make it easier to run and scale analytics without having to manage your data warehouse infrastructure.

To process data as quickly as possible from real-time applications, customers are adopting streaming engines like Amazon Kinesis and Amazon Managed Streaming for Apache Kafka. Previously, to load streaming data into your Amazon Redshift database, you’d have to configure a process to stage data in Amazon Simple Storage Service (Amazon S3) before loading. Doing so would introduce a latency of one minute or more, depending on the volume of data.

Today, I am happy to share the general availability of Amazon Redshift Streaming Ingestion. With this new capability, Amazon Redshift can natively ingest hundreds of megabytes of data per second from Amazon Kinesis Data Streams and Amazon MSK into an Amazon Redshift materialized view and query it in seconds.

Architecture diagram.

Streaming ingestion benefits from the ability to optimize query performance with materialized views and allows the use of Amazon Redshift more efficiently for operational analytics and as the data source for real-time dashboards. Another interesting use case for streaming ingestion is analyzing real-time data from gamers to optimize their gaming experience. This new integration also makes it easier to implement analytics for IoT devices, clickstream analysis, application monitoring, fraud detection, and live leaderboards.

Let’s see how this works in practice.

Configuring Amazon Redshift Streaming Ingestion
Apart from managing permissions, Amazon Redshift streaming ingestion can be configured entirely with SQL within Amazon Redshift. This is especially useful for business users who lack access to the AWS Management Console or the expertise to configure integrations between AWS services.

You can set up streaming ingestion in three steps:

  1. Create or update an AWS Identity and Access Management (IAM) role to allow access to the streaming platform you use (Kinesis Data Streams or Amazon MSK). Note that the IAM role should have a trust policy that allows Amazon Redshift to assume the role.
  2. Create an external schema to connect to the streaming service.
  3. Create a materialized view that references the streaming object (Kinesis data stream or Kafka topic) in the external schemas.

After that, you can query the materialized view to use the data from the stream in your analytics workloads. Streaming ingestion works with Amazon Redshift provisioned clusters and with the new serverless option. To maximize simplicity, I am going to use Amazon Redshift Serverless in this walkthrough.

To prepare my environment, I need a Kinesis data stream. In the Kinesis console, I choose Data streams in the navigation pane and then Create data stream. For the Data stream name, I use my-input-stream and then leave all other options set to their default value. After a few seconds, the Kinesis data stream is ready. Note that by default I am using on-demand capacity mode. In a development or test environment, you can choose provisioned capacity mode with one shard to optimize costs.

Now, I create an IAM role to give Amazon Redshift access to the my-input-stream Kinesis data streams. In the IAM console, I create a role with this policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStreamSummary",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:DescribeStream"
            ],
            "Resource": "arn:aws:kinesis:*:123412341234:stream/my-input-stream"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:ListStreams",
                "kinesis:ListShards"
            ],
            "Resource": "*"
        }
    ]
}

To allow Amazon Redshift to assume the role, I use the following trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

In the Amazon Redshift console, I choose Redshift serverless from the navigation pane and create a new workgroup and namespace, similar to what I did in this blog post. When I create the namespace, in the Permissions section, I choose Associate IAM roles from the dropdown menu. Then, I select the role I just created. Note that the role is visible in this selection only if the trust policy allows Amazon Redshift to assume it. After that, I complete the creation of the namespace using the default options. After a few minutes, the serverless database is ready for use.

In the Amazon Redshift console, I choose Query editor v2 in the navigation pane. I connect to the new serverless database by choosing it from the list of resources. Now, I can use SQL to configure streaming ingestion. First, I create an external schema that maps to the streaming service. Because I am going to use simulated IoT data as an example, I call the external schema sensors.

CREATE EXTERNAL SCHEMA sensors
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123412341234:role/redshift-streaming-ingestion';

To access the data in the stream, I create a materialized view that selects data from the stream. In general, materialized views contain a precomputed result set based on the result of a query. In this case, the query is reading from the stream, and Amazon Redshift is the consumer of the stream.

Because streaming data is going to be ingested as JSON data, I have two options:

  1. Leave all the JSON data in a single column and use Amazon Redshift capabilities to query semi-structured data.
  2. Extract JSON properties into their own separate columns.

Let’s see the pros and cons of both options.

The approximate_arrival_timestamp, partition_key, shard_id, and sequence_number columns in the SELECT statement are provided by Kinesis Data Streams. The record from the stream is in the kinesis_data column. The refresh_time column is provided by Amazon Redshift.

To leave the JSON data in a single column of the sensor_data materialized view, I use the JSON_PARSE function:

CREATE MATERIALIZED VIEW sensor_data AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           shard_id,
           sequence_number,
           refresh_time,
           JSON_PARSE(kinesis_data, 'utf-8') as payload    
      FROM sensors."my-input-stream";
CREATE MATERIALIZED VIEW sensor_data AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
partition_key,
shard_id,
sequence_number,
refresh_time,
JSON_PARSE(kinesis_data) as payload 
FROM sensors."my-input-stream";

Because I used the AUTO REFRESH YES parameter, the content of the materialized view is automatically refreshed when there is new data in the stream.

To extract the JSON properties into separate columns of the sensor_data_extract materialized view, I use the JSON_EXTRACT_PATH_TEXT function:

CREATE MATERIALIZED VIEW sensor_data_extract AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           shard_id,
           sequence_number,
           refresh_time,
           JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'),'sensor_id')::VARCHAR(8) as sensor_id,
           JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'),'current_temperature')::DECIMAL(10,2) as current_temperature,
           JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'),'status')::VARCHAR(8) as status,
           JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'),'event_time')::CHARACTER(26) as event_time
      FROM sensors."my-input-stream";

Loading Data into the Kinesis Data Stream
To put data in the my-input-stream Kinesis Data Stream, I use the following random_data_generator.py Python script simulating data from IoT sensors:

import datetime
import json
import random
import boto3

STREAM_NAME = "my-input-stream"


def get_random_data():
    current_temperature = round(10 + random.random() * 170, 2)
    if current_temperature > 160:
        status = "ERROR"
    elif current_temperature > 140 or random.randrange(1, 100) > 80:
        status = random.choice(["WARNING","ERROR"])
    else:
        status = "OK"
    return {
        'sensor_id': random.randrange(1, 100),
        'current_temperature': current_temperature,
        'status': status,
        'event_time': datetime.datetime.now().isoformat()
    }


def send_data(stream_name, kinesis_client):
    while True:
        data = get_random_data()
        partition_key = str(data["sensor_id"])
        print(data)
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(data),
            PartitionKey=partition_key)


if __name__ == '__main__':
    kinesis_client = boto3.client('kinesis')
    send_data(STREAM_NAME, kinesis_client)

I start the script and see the records that are being put in the stream. They use a JSON syntax and contain random data.

$ python3 random_data_generator.py
{'sensor_id': 66, 'current_temperature': 69.67, 'status': 'OK', 'event_time': '2022-11-20T18:31:30.693395'}
{'sensor_id': 45, 'current_temperature': 122.57, 'status': 'OK', 'event_time': '2022-11-20T18:31:31.486649'}
{'sensor_id': 15, 'current_temperature': 101.64, 'status': 'OK', 'event_time': '2022-11-20T18:31:31.671593'}
...

Querying Streaming Data from Amazon Redshift
To compare the two materialized views, I select the first ten rows from each of them:

  • In the sensor_data materialized view, the JSON data in the stream is in the payload column. I can use Amazon Redshift JSON functions to access data stored in JSON format.Console screenshot.
  • In the sensor_data_extract materialized view, the JSON data in the stream has been extracted into different columns: sensor_id, current_temperature, status, and event_time.Console screenshot.

Now I can use the data in these views in my analytics workloads together with the data in my data warehouse, my operational databases, and my data lake. I can use the data in these views together with Redshift ML to train a machine learning model or use predictive analytics. Because materialized views support incremental updates, the data in these views can be efficiently used as a data source for dashboards, for example, using Amazon Redshift as a data source for Amazon Managed Grafana.

Availability and Pricing
Amazon Redshift streaming ingestion for Kinesis Data Streams and Managed Streaming for Apache Kafka is generally available today in all commercial AWS Regions.

There are no additional costs for using Amazon Redshift streaming ingestion. For more information, see Amazon Redshift pricing.

It’s never been easier to use low-latency streaming data in your data warehouse and in your data lake. Let us know what you build with this new capability!

Danilo

Preview: Amazon Security Lake – A Purpose-Built Customer-Owned Data Lake Service

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/preview-amazon-security-lake-a-purpose-built-customer-owned-data-lake-service/

To identify potential security threats and vulnerabilities, customers should enable logging across their various resources and centralize these logs for easy access and use within analytics tools. Some of these data sources include logs from on-premises infrastructure, firewalls, and endpoint security solutions, and when utilizing the cloud, services such as Amazon Route 53, AWS CloudTrail, and Amazon Virtual Private Cloud (Amazon VPC).

The Amazon Simple Storage Service (Amazon S3) and AWS Lake Formation simplify the creation and management of a data lake on AWS. But, some customers’ security teams still struggle to define and implement security domain–specific aspects, such as data normalization, which requires them to analyze each log source’s structure and fields, define schemas and mappings, and pull in data enrichment such as threat intelligence.

Today we are announcing the preview release of Amazon Security Lake, a purpose-built service that automatically centralizes an organization’s security data from cloud and on-premises sources into a purpose-built data lake stored in your account. Amazon Security Lake automates the central management of security data, normalizing from integrated AWS services and third-party services and managing the lifecycle of data with customizable retention and also automates storage tiering.

Here are the key features of Amazon Security Lake:

  • Variety of supported log and event sources – During the preview, Amazon Security Lake automatically collects logs for AWS CloudTrail, Amazon VPC, Amazon Route 53, Amazon S3, and AWS Lambda, as well as security findings via AWS Security Hub for AWS Config, AWS Firewall Manager, Amazon GuardDuty, AWS Health Dashboard, AWS IAM Access Analyzer, Amazon Inspector, Amazon Macie, and AWS Systems Manager Patch Manager. Additionally, over 50 sources of third-party security findings can be sent to Amazon Security Lake. Security Partners are also directly sending data in a standard schema called the Open Cybersecurity Schema Framework (OCSF) format to Amazon Security Lake, such as Cisco Security, CrowdStrike, Palo Alto Networks, and more.
  • Data transformation and normalization – Security Lake automatically partitions and converts incoming log data to a storage and query-efficient Apache Parquet and OCSF format, making the data broadly and immediately usable for security analytics without the need for post-processing. Security Lake supports integrations with analytics partners such as IBM, Splunk, Sumo Logic, and more to address a variety of security use cases such as threat detection, investigation, and incident response.
  • Customizable data access levels – You can configure the level of subscribers consuming data stored in the Security Lake, such as specific data sources for data access to all new objects or directly querying data stored. You can also specify a rollup Region that the Security Lake is available in and multiple AWS accounts across your AWS Organizations. This can help you comply with data residency compliance requirements.

By reducing the operational overhead of security data management, you can make it easier to gather more security signals from across your organization and analyze that data to improve the protection of your data, applications, and workloads.

Configure Your Security Lake for Collection Data
To get started with Amazon Security Lake, choose Get started in the AWS console. You can enable log and event sources for all Regions and all accounts.

You can select log and event sources such as CloudTrail logs, VPC flow logs, and Route53 resolver logs into your data lake. Select Regions will contribute their data to your data lake with the Amazon S3-managed encryption that Amazon S3 will create and manage all encryption keys, as well as the specific AWS accounts in your organizations.

Next, you can select rollup and contributing Regions. All aggregated data from contributing Regions reside in the rollup Region. You can create multiple rollup Regions, which can help you comply with data residency compliance requirements. Optionally, you can define the Amazon S3 storage classes and the retention period you want the data to transition from the standard Amazon S3 storage classes used in Security Lake.

After initial configuration, choose Sources in the left pane of the console if you can add or remove log sources in your Regions or account.

You can also collect data from custom sources, such as Bind DNS logs, endpoint telemetry logs, on-premise Netflow logs, and so on. Before adding a custom source, you need to create AWS IAM role to grant permissions for AWS Glue.

To create a custom data source, choose Create custom source in the left menu of Custom sources.

It requires you to enter the IAM role Amazon Resource Names (ARNs) to write data to Security Lake and invoke AWS Glue on your behalf. Then, you can provide details about your custom source.

For efficient data processing and querying, objects from your custom sources should be partitioned by AWS Region, AWS account, year, month, day, and hour with a Parquet-formatted object.

Consume Your Data from Security Lake
Now you can create a subscriber, a service that consumes logs and events from Security Lake. To add or see your subscribers, choose Subscribers in the left pane of the console.

The Security Lake supports two types of subscriber data access methods:

  • Data access (Amazon S3) – Subscribers are notified of new objects for a source as the data is written to your Security Lake S3 bucket. You can choose to notify subscribers of new objects with an Amazon Simple Queue Service (Amazon SQS) queue or through messaging to an HTTPS endpoint provided by the subscriber. This type is useful to ingest selected data in your analytics application—good for use cases that require frequent access to data.
  • Query access (Lake Formation) – Subscribers can consume data by directly querying AWS Lake Formation tables in your S3 bucket through services like Amazon Athena. This type is useful to provide on-demand query access to data without the need to pre-ingest anything and for use cases that require infrequent access or on large volume sources too expensive to ingest upfront or retain in analytics tools.

When you add a subscriber, you can choose Amazon S3 to create data access for the subscriber. If you select the default method of notification, you can receive the following object notification message in either an HTTPS endpoint or Amazon SQS.

{
  "source": "aws.s3",
  "time": "2021-11-12T00:00:00Z",
  "region": "ca-central-1",
  "resources": [
    "arn:aws:s3:::example-bucket"
  ],
  "detail": {
    "bucket": {
      "name": "example-bucket"
    },
    "object": {
      "key": "example-key",
      "size": 5,
      "etag": "b57f9512698f4b09e608f4f2a65852e5"
    },
    "request-id": "N4N7GDK58NMKJ12R",
    "requester": "123456789012"
  }
}

Subscribers with query access can directly query data that is stored in Security Lake by using services like Amazon Athena and other services that can read from AWS Lake Formation. The following are sample queries of CloudTrail data.

SELECT 
      time, 
      api.service.name, 
      api.operation, 
      api.response.error, 
      api.response.message, 
      src_endpoint.ip 
    FROM ${athena_db}.${athena_table}
    WHERE eventHour BETWEEN '${query_start_time}' and '${query_end_time}' 
      AND api.response.error in (
        'Client.UnauthorizedOperation',
        'Client.InvalidPermission.NotFound',
        'Client.OperationNotPermitted',
        'AccessDenied')
    ORDER BY time desc
    LIMIT 25

Subscribers only have access to source data in the AWS Region that you’ve selected when you create the subscriber. To give a subscriber access to data from multiple Regions, you can set the Region where you create your subscriber as a rollup Region.

Third-Party Integrations
For supported third-party integrations, there are a number of sources as well as subscribing services integrated with Amazon Security Lake.

Amazon Security Lake supports third-party sources providing OCSF security data, including Barracuda Networks, Cisco, Cribl, CrowdStrike, CyberArk, Lacework, Laminar, Netscout, Netskope, Okta, Orca, Palo Alto Networks, Ping Identity, SecurityScorecard, Tanium, The Falco Project, Trend Micro, Vectra AI, VMware, Wiz, and Zscaler.

You can also use third-party security, automation, and analytics tools supporting Security Lake, including Datadog, IBM, Rapid7, Securonix, SentinelOne, Splunk, Sumo Logic, and Trellix. There are also service partners such as Accenture, Atos, Deloitte, DXC, Kyndryl, PWC, Rackspace, and Wipro that can work with you and Amazon Security Lake.

Join the Preview
The preview release of Amazon Security Lake is now available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions.

To learn more, see the Amazon Security Lake page and Amazon Security Lake User Guide. We want to hear more feedback during the preview. Please send feedback in AWS re:Post and through your usual AWS support contacts.

Channy

New analytical questions available in Amazon QuickSight Q: “Why” and “Forecast”

Post Syndicated from Shannon Kalisky original https://aws.amazon.com/blogs/big-data/new-analytical-questions-available-in-amazon-quicksight-q-why-and-forecast/

Amazon QuickSight Q uses machine learning (ML) to enable any user to ask questions about business data in natural language and receive accurate answers with relevant visualizations in seconds. Today, Amazon QuickSight announces support for two new question types that simplify and scale complex analytical tasks using natural language: “forecast” and “why.”

In this post, we explore each of these new question types with examples of how to use them.

Prerequisites

The features explored in this post are part of QuickSight Q. If you’re an existing QuickSight user, be sure that the Q add-on is enabled. For steps on how to do this, see Getting Started with Amazon QuickSight Q.

Forecasting questions

Customers often ask how they can forecast future business performance. This is a useful tool to understand if things are proceeding well, or if some action may be needed to get back on track. Forecasting uses historic data to project metrics into the future.

Creating forecasts is often the job of analysts or data scientists. However, the new forecasting question type in Q enables non-analyst users to predict future trajectories for up to three measures simultaneously. Rather than learning formulas or parameter settings, you can get a forecast by entering forecast into the language bar, followed by up to three metrics that you want to see predictions for. This natural language approach is an easy and intuitive way for managers and others who depend on data to get a sense of what’s likely to happen if things don’t change.

Although the experience of creating a forecast in Q is simple, under the hood is a proven and robust forecasting algorithm called Random Cut Forest (RCF). For more information, see How RCF is applied to generate forecasts.

How to ask a forecasting question

To ask a forecasting question, start the question with the word forecast or the phrase Show me a forecast. The minimum information needed to create a forecast is one of these two question starters, plus the measure you want to forecast. For example, Forecast sales is enough to generate a forecast, as shown in the following screenshot.

Forecasting in Q also supports filters. Filters are applied by adding information to the question. The following example shows using a filter in a forecast statement.

Q allows you to forecast up to three numeric measures in a single question. The following example shows a forecast of sales, profit, and quantity.

If the data you have is dense, it can cause the forecast to be crowded into the right side of the visual. Adjusting the time granularity to a coarser step, such as going from weekly granularity to monthly, will help make the visual easier to read. To do this, simply specify the desired time granularity in the question. The following example shows a different view of the previous example grouped by month instead of week.

Note that at this release forecasting in Q doesn’t support dimensional group-by functionality. Dimensional group-bys split the forecast by the different values in a categorical field, for example: Show me a forecast of sales by region.

Why questions

“Why” is one of the most fundamental questions people ask. For many organizations, understanding why is the key to delighting customers, driving innovation, and outmaneuvering the competition. However, manually analyzing a body of data to discover contributing changes is difficult, time-consuming, and requires special analytics skill.

The new why question type enables business users to instantly get insights previously only accessible to trained analysts. Business users need to understand what contributed to changes in their data, so they can make decisions about what action to take. Why questions are easy to ask and natural to think of, so business users can quickly pinpoint insights they need to know.

When you ask a why question in Q, you trigger an on-the-fly contribution analysis that will automatically identify the key drivers of change for the measure you asked about and quantify which value from each driver contributed the most to that change. This gives you an idea of the relative influence each value had to the measure.

How to ask a why question

A why question needs three things:

  • To start with the word “why.”
  • A numeric measure, such as sales, enrollment numbers, profit, price, and so on.
  • A date or time span, such as last quarter, January 2022, or last month. Note that at this release the time span should be complete, but asking about ongoing spans such as “this week” or “this year” or specifying the current month will not yet work.

Why questions often start from seeing something that sparks our curiosity. For example, if I were an administrator reviewing student enrollment and I saw the following visual, I would naturally wonder “Why did enrollment drop in 2021?”

Now we can ask just that, as shown in the following screenshot.

The why answer identifies up to four key drivers (shown in the blue ovals on the left side of the answer), which get unpacked into contribution narratives (center of the answer) that describe the specific value from the key driver that played the biggest role. On the right side of the answer is a quick-view KPI that summarizes the change in the key driver value. Note that you may need to mouse over and scroll in the Q answer pane to see all the drivers.

Refining why questions

In the why answer displayed in the previous example, enrollment dropped more in the fall semester, which is why it appears as a top contributor to the drop in enrollment. To drill into the factors that influenced the drop in fall enrollment, you can ask more precise questions. In this case, adding in the fall to the end of the question focuses the analysis on just the fall semester.

Focusing on fall brings more specific metrics, and reveals gender is an additional key driver specific to that semester.

You can explore additional drivers by choosing the driver and changing to a different field. This can be a helpful way understand the impact of another variable or to avoid redundancy if the data structure led Q to recommend two very similar or overlapping dimensions.

In the following example, we can change State to Student Classification to explore if the drop in enrollment disproportionately impacted any particular student group, such as freshman or graduate students.

In the following result, enrollment from juniors (third-year students) was much lower than it was in 2020, and represents a large portion of the drop in enrollment.

Conclusion

With why and forecasting questions, business users can dig deeper to understand the contributing factors of metric changes or model potential growth. These new question types are available at no additional cost for all Q customers.

The examples used in post utilize the sample QuickSight topics that come included with your QuickSight subscription. For forecasting, we used the Software Sales sample topic, and for why questions, we used the Student Enrollment sample topic. To try the questions on your own, activate the applicable sample topic.


About the author

Shannon Kalisky is a Senior Product Manager – Technical that covers natural language question patterns and model robustness for Amazon QuickSight Q.

New – Amazon Redshift Integration with Apache Spark

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/

Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Spark application developers working in Amazon EMR, Amazon SageMaker, and AWS Glue often use third-party Apache Spark connectors that allow them to read and write the data with Amazon Redshift. These third-party connectors are not regularly maintained, supported, or tested with various versions of Spark for production.

Today we are announcing the general availability of Amazon Redshift integration for Apache Spark, which makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, enabling customers to open up the data warehouse for a broader set of AWS analytics and machine learning (ML) solutions.

With Amazon Redshift integration for Apache Spark, you can get started in seconds and effortlessly build Apache Spark applications in a variety of languages, such as Java, Scala, and Python.

Your applications can read from and write to your Amazon Redshift data warehouse without compromising on the performance of the applications or transactional consistency of the data, as well as performance improvements with pushdown optimizations.

Amazon Redshift integration for Apache Spark builds on an existing open source connector project and enhances it for performance and security, helping customers gain up to 10x faster application performance. We thank the original contributors on the project who collaborated with us to make this happen. As we make further enhancements we will continue to contribute back into the open source project.

Getting Started with Spark Connector for Amazon Redshift
To get started, you can go to AWS analytics and ML services, use data frame or Spark SQL code in a Spark job or Notebook to connect to the Amazon Redshift data warehouse, and start running queries in seconds.

In this launch, Amazon EMR 6.9, EMR Serverless, and AWS Glue 4.0 come with the pre-packaged connector and JDBC driver, and you can just start writing code. EMR 6.9 provides a sample notebook, and EMR Serverless provides a sample Spark Job too.

First, you should set AWS Identity and Access Management (AWS IAM) authentication between Redshift and Spark, between Amazon Simple Storage Service (Amazon S3) and Spark, and between Redshift and Amazon S3. The following diagram describes the authentication between Amazon S3, Redshift, the Spark driver, and Spark executors.

For more information, see Identity and access management in Amazon Redshift in the AWS documentation.

Amazon EMR
If you already have an Amazon Redshift data warehouse and the data available, you can create the database user and provide the right level of grants to the database user. To use this with Amazon EMR, you need to upgrade to the latest version of the Amazon EMR 6.9 that has the packaged spark-redshift connector. Select the emr-6.9.0 release when you create an EMR cluster on Amazon EC2.

You can use EMR Serverless to create your Spark application using the emr-6.9.0 release to run your workload.

EMR Studio also provides an example Jupyter Notebook configured to connect to an Amazon Redshift Serverless endpoint leveraging sample data that you can use to get started quickly.

Here is a Scalar example to build your applications both with Spark Dataframe and Spark SQL. Use IAM-based credentials for connecting to Redshift and use IAM role for unloading and loading data from S3.

// Create the JDBC connection URL and define the Redshift context
val jdbcURL = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
val rsOptions = Map (
  "url" -> jdbcURL,
  "tempdir" -> tempS3Dir, 
  "aws_iam_role" -> roleARN,
  )
// Reference the sales table from Redshift 
val sales_df = spark
  .read 
  .format("io.github.spark_redshift_community.spark.redshift") 
  .options(rsOptions) 
  .option("dbtable", "sales") 
  .load() 
sales_df.createOrReplaceTempView("sales") 
// Reference the date table from Redshift using Data Frame 
sales_df.join(date_df, sales_df("dateid") === date_df("dateid"))
  .where(col("caldate") === "2008-01-05")
  .groupBy().sum("qtysold")
  .select(col("sum(qtysold)"))
  .show() 

If Amazon Redshift and Amazon EMR are in different VPCs, you have to configure VPC peering or enable cross-VPC access. Assuming both Amazon Redshift and Amazon EMR are in the same virtual private cloud (VPC), you can create a Spark job or Notebook and connect to the Amazon Redshift data warehouse and write Spark code to use the Amazon Redshift connector.

To learn more, see Use Spark on Amazon Redshift with a connector in the AWS documentation.

AWS Glue
When you use AWS Glue 4.0, the spark-redshift connector is available both as a source and target. In Glue Studio, you can use a visual ETL job to read or write to a Redshift data warehouse simply by selecting a Redshift connection to use within a built-in Redshift source or target node.

The Redshift connection contains Redshift connection details along with the credentials needed to access Redshift with the proper permissions.

To get started, choose Jobs in the left menu of the Glue Studio console. Using either of the Visual modes, you can easily add and edit a source or target node and define a range of transformations on the data without writing any code.

Choose Create and you can easily add and edit a source, target node, and the transform node in the job diagram. At this time, you will choose Amazon Redshift as Source and Target.

Once completed, the Glue job can be executed on Glue for the Apache Spark engine, which will automatically use the latest spark-redshift connector.

The following Python script shows an example job to read and write to Redshift with dynamicframe using the spark-redshift connector.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("================ DynamicFrame Read ===============")
url = "jdbc:redshift://<RedshiftEndpoint>:<Port>/dev"
read_options = {
    "url": url,
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "include_column_list": "false"
}

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=read_options
) 

print("================ DynamicFrame Write ===============")

write_options = {
    "url": url,
    "dbtable": dbtable,
    "user": "awsuser",
    "password": "Password1",
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "DbUser": "awsuser"
}

print("================ dyf write result: check redshift table ===============")
redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=redshift_read,
    connection_type="redshift",
    connection_options=write_options
)

When you set up your job detail, you can only use the Glue 4.0 – Supports spark 3.3 Python 3 version for this integration.

To learn more, see Creating ETL jobs with AWS Glue Studio and Using connectors and connections with AWS Glue Studio in the AWS documentation.

Gaining the Best Performance
In the Amazon Redshift integration for Apache Spark, the Spark connector automatically applies predicate and query pushdown to optimize for performance. You can gain performance improvement by using the default Parquet format for the connector used for unloading with this integration.

As the following sample code shows, the Spark connector will turn the supported function into a SQL query and run the query in Amazon Redshift.

import sqlContext.implicits._val
sample= sqlContext.read
.format("io.github.spark_redshift_community.spark.redshift")
.option("url",jdbcURL )
.option("tempdir", tempS3Dir)
.option("unload_s3_format", "PARQUET")
.option("dbtable", "event")
.load()

// Create temporary views for data frames created earlier so they can be accessed via Spark SQL
sales_df.createOrReplaceTempView("sales")
date_df.createOrReplaceTempView("date")
// Show the total sales on a given date using Spark SQL API
spark.sql(
"""SELECT sum(qtysold)
| FROM sales, date
| WHERE sales.dateid = date.dateid
| AND caldate = '2008-01-05'""".stripMargin).show()

Amazon Redshift integration for Apache Spark adds pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from the Redshift data warehouse to the consuming Spark application, thereby improving performance.

Available Now
The Amazon Redshift integration for Apache Spark is now available in all Regions that support Amazon EMR 6.9, AWS Glue 4.0, and Amazon Redshift. You can start using the feature directly from EMR 6.9 and Glue Studio 4.0 with the new Spark 3.3.0 version.

Give it a try, and please send us feedback either in the AWS re:Post for Amazon Redshift or through your usual AWS support contacts.

Channy