Tag Archives: Amazon Athena

Improving Retail Forecast Accuracy with Machine Learning

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/improving-retail-forecast-accuracy-with-machine-learning/

The global retail market continues to grow larger and the influx of consumer data increases daily. The rise in volume, variety, and velocity of data poses challenges with demand forecasting and inventory planning. Outdated systems generate inaccurate demand forecasts, which creates multiple challenges for retailers: they face overstocking and lost sales, and often have to rely on increased levels of safety stock to avoid stockouts.

A recent McKinsey study indicates that AI-based forecasting improves forecasting accuracy by 10–20 percent. This translates to revenue increases of 2–3 percent. An accurate forecasting system can also help determine ideal inventory levels and better predict the impact of sales promotions. It provides a single view of demand across all channels and a better customer experience overall.

In this blog post, we will show you how to build a reliable retail forecasting system. We will use Amazon Forecast, and an AWS-vetted solution called Improving Forecast Accuracy with Machine Learning. This is an AWS Solutions Implementation that automatically produces forecasts and generates visualization dashboards. This solution can be extended to use cases across a variety of industries.

Improving Forecast Accuracy solution architecture

This post will illustrate a retail environment that has an SAP S/4HANA system for overall enterprise resource planning (ERP). We will show a forecasting solution based on Amazon Forecast to predict demand across product categories. The environment also has a unified platform for customer experience provided by SAP Customer Activity Repository (CAR). Replenishment processes are driven by SAP Forecasting and Replenishment (F&R), and SAP Fiori apps are used to manage forecasts.

The solution is divided into four parts: Data extraction and preparation, Forecasting and monitoring, Data visualization, and Forecast import and utilization in SAP.

Figure 1. Notional architecture for improving forecasting accuracy solution and SAP integration

Data extraction and preparation

Historical demand data such as sales, web traffic, inventory numbers, and resource demand are extracted from SAP and uploaded to Amazon Simple Storage Service (S3). There are multiple ways to extract data from an SAP system into AWS. As part of this architecture, we will use operational data provisioning (ODP) extraction. ODP acts as a data source for OData services, enabling REST-based integrations with external applications. The ODP-Based Data Extraction via OData document details this approach. The steps involved are:

  1. Create a data source using transaction RSO2 and allow change data capture for the data to be extracted
  2. Create an OData service using transaction SEGW
  3. Create a Data model for ODP extraction, which refers to the defined data source, then register the service
  4. Initiate the service from SAP gateway client
  5. In the AWS Management Console, create an AWS Lambda function to extract data and upload to S3. Check out the sample extractor code using Python, referenced in the blog Building data lakes with SAP on AWS
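
The following is a minimal sketch of what such an extractor Lambda function could look like, assuming the ODP-based OData service returns JSON; the endpoint, credentials, bucket, and object key shown are placeholders supplied through environment variables, not the actual SAP service names.

# Minimal extractor sketch: call the ODP-based OData service and stage the
# result in the forecast input bucket. All names below are assumptions.
import base64
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")

ODATA_URL = os.environ["ODATA_URL"]        # e.g. https://<sap-host>/sap/opu/odata/sap/<service>/<entity>?$format=json
SAP_USER = os.environ["SAP_USER"]
SAP_PASSWORD = os.environ["SAP_PASSWORD"]  # prefer AWS Secrets Manager for production use
BUCKET = os.environ["FORECAST_INPUT_BUCKET"]

def handler(event, context):
    # Call the OData service with basic authentication.
    request = urllib.request.Request(ODATA_URL)
    token = base64.b64encode(f"{SAP_USER}:{SAP_PASSWORD}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read())

    # OData V2 responses wrap the records in d.results.
    records = payload["d"]["results"]
    s3.put_object(
        Bucket=BUCKET,
        Key="raw/sap/demand-history.json",
        Body=json.dumps(records).encode("utf-8"),
    )
    return {"records": len(records)}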

Related data that can potentially affect demand levels can be uploaded to Amazon S3. These could include seasonal events, promotions, and item prices. Additional item metadata, such as product descriptions, color, brand, and size, may also be uploaded. Amazon Forecast provides built-in related time series data for holidays and weather. These three components together form the forecast inputs.

Forecasting and monitoring

An S3 event notification will be initiated when new datasets are uploaded to the input bucket. This, in turn, starts an AWS Step Functions state machine. The state machine combines a series of AWS Lambda functions that build, train, and deploy machine learning models in Amazon Forecast. All AWS Step Functions logs are sent to Amazon CloudWatch. Administrators will be notified of the results of the AWS Step Functions runs through Amazon Simple Notification Service (SNS).
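
As an illustrative sketch (the solution wires this up for you), a Lambda handler that reacts to the S3 event notification and starts an execution could look like the following; the state machine ARN is a placeholder read from an environment variable.

# Illustrative sketch: start the forecast state machine when a new dataset
# lands in the input bucket. The state machine ARN is a placeholder.
import json
import os

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["FORECAST_STATE_MACHINE_ARN"]

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Each dataset upload kicks off a build, train, and forecast run.
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({"bucket": bucket, "dataset_key": key}),
        )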

An AWS Glue job combines raw forecast input data, metadata, predictor backtest exports, and forecast exports into an aggregated view of forecasts in an S3 bucket. This view is then translated to the format expected by the External Forecast import interface. Amazon Athena can be used to query the forecast output in S3 using standard SQL queries.

Data visualization

Amazon QuickSight analyses can be created on a per-forecast basis. This provides users with forecast output visualization across hierarchies and categories of forecasted items. It also displays item-level accuracy metrics. Dashboards can be created from these analyses and shared within the organization. Additionally, data scientists and developers can prepare and process data, and evaluate Forecast outputs using an Amazon SageMaker Notebook Instance.
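
For example, a data scientist could load a forecast export into a notebook for evaluation roughly as follows; the bucket, key, and column names (item_id, p50) are assumptions based on typical Amazon Forecast export files.

# Sketch: load a forecast export from S3 into pandas for evaluation.
# Bucket, key, and column names are placeholders.
import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

obj = s3.get_object(
    Bucket="forecast-output-bucket",
    Key="exports/retail-demand/forecast_part0.csv",
)
forecast = pd.read_csv(io.BytesIO(obj["Body"].read()))

# Summarize the p50 (median) prediction per item as a starting point for
# comparing against actuals.
print(forecast.groupby("item_id")["p50"].describe())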

Forecast import and utilization in SAP

Amazon Forecast outputs located in Amazon S3 will be imported into the Unified Demand Forecast (UDF) module within the SAP Customer Activity Repository (CAR). You can read here about how to import external forecasts. An AWS Lambda function will be initiated when aggregated forecasts are uploaded to the S3 bucket. The Lambda function performs a remote function call (RFC) to the SAP system through the official SAP JCo Library. The SAP RFC credentials and connection information may be stored securely inside AWS Secrets Manager and read on demand to establish connectivity.
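
A minimal sketch of the credential-retrieval part of that Lambda function follows, assuming the secret is stored as a JSON document; the secret name and field names are placeholders, and the actual RFC call through the SAP connector library is omitted.

# Sketch: read SAP RFC connection details from AWS Secrets Manager on demand.
# Secret name and field names are assumptions; the RFC call itself is omitted.
import json
import os

import boto3

secrets = boto3.client("secretsmanager")

def get_sap_connection():
    secret_id = os.environ.get("SAP_SECRET_ID", "sap/car/rfc")  # placeholder
    response = secrets.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {
        "ashost": secret["host"],
        "sysnr": secret["system_number"],
        "client": secret["client"],
        "user": secret["user"],
        "passwd": secret["password"],
    }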

Once imported, forecast values from the solution can be retrieved by SAP Forecasting and Replenishment (F&R). They will be consumed as an input to replenishment processes, which consist of requirements calculation and requirement quantity optimization. SAP F&R calculates requirements based on the forecast, the current stock, and the open purchase orders. The requirement quantity then may be improved in accordance with optimization settings defined in SAP F&R.

Additionally, you have the flexibility to adjust the system forecast as required by the demand situation or analyze forecasts via the respective SAP Fiori apps.

Sample use case: AnyCompany Stores, Inc.

To illustrate how beneficial this solution can be for retail organizations, let’s consider AnyCompany Stores, Inc. This is a hypothetical customer and leader in the retail industry with 985 stores across the United States. They struggle with issues stemming from their existing forecasting implementation. That implementation only understands certain categories and does not factor in the entire product portfolio. Additionally, it is limited to available demand history and does not consider related information that may affect forecasts. AnyCompany Stores is looking to improve their demand forecasting system.

Using Improving Forecast Accuracy with Machine Learning, AnyCompany Stores can easily generate AI-based forecasts at appropriate quantiles to address sensitivities associated with respective product categories. This mitigates inconsistent inventory buys, overstocks, out-of-stocks, and margin erosion. The solution also considers all relevant related data in addition to the historical demand data. This ensures that generated forecasts are accurate for each product category.

The generated forecasts may be used to complement existing forecasting and replenishment processes. With an improved forecasting solution, AnyCompany Stores will be able to meet demand, while holding less inventory and improving customer experience. This also helps ensure that potential demand spikes are accurately captured, so staples will always be in stock. Additionally, the company will not overstock expensive items with short shelf lives that are likely to spoil.

Conclusion

In this post, we explored how to implement an accurate retail forecasting solution using a ready-to-deploy AWS Solution. We use generated forecasts to drive inventory replenishment optimization and improve customer experience. The solution can be extended to inventory, workforce, capacity, and financial planning.

We showcase one of the ways in which Improving Forecast Accuracy with Machine Learning may be extended for a use case in the retail industry. If your organization would like to improve business outcomes with the power of forecasting, explore customizing this solution to fit your unique needs.

Further reading:

Integrating Redaction of FinServ Data into a Machine Learning Pipeline

Post Syndicated from Ravikant Gupta original https://aws.amazon.com/blogs/architecture/integrating-redaction-of-finserv-data-into-a-machine-learning-pipeline/

Financial companies process hundreds of thousands of documents every day. These include loan and mortgage statements that contain large amounts of confidential customer information.

Data privacy requires that sensitive data be redacted to protect the customer and the institution. Redacting digital and physical documents is time-consuming and labor-intensive. The accidental or inadvertent release of personal information can be devastating for the customer and the institution. Having automated processes in place reduces the likelihood of a data breach.

In this post, we discuss how to automatically redact personally identifiable information (PII) data fields from your financial services (FinServ) data through machine learning (ML) capabilities of Amazon Comprehend and Amazon Athena. This will ensure you comply with federal regulations and meet customer expectations.

Protecting data and complying with regulations

Protecting PII is crucial to complying with regulations like the California Consumer Privacy Act (CCPA), Europe’s General Data Protection Regulation (GDPR), and Payment Card Industry’s data security standards (PCI DSS).

In Figure 1, we show how structured and unstructured sensitive data stored in AWS data stores can be redacted before it is made available to data engineers and data scientists for feature engineering and building ML models, in compliance with your organization's data security policies.

Figure 1. How to redact confidential information in your ML pipeline

Architecture walkthrough

This section explains each step presented in Figure 1 and the AWS services used:

  1. By using services like AWS DataSync, AWS Storage Gateway, and AWS Transfer Family, data can be ingested into AWS using batch or streaming patterns. This data lands in an Amazon Simple Storage Service (Amazon S3) bucket, which we call “raw data” in Figure 1.
  2. To detect if the raw data bucket has any sensitive data, use Amazon Macie. Macie is a fully managed data security and data privacy service that uses ML and pattern matching to discover and protect your sensitive data in AWS. When Macie discovers sensitive data, you can configure it to tag the objects with an Amazon S3 object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline. Refer to the Use Macie to discover sensitive data as part of automated data pipelines blog post for detailed instructions on building such a pipeline.
  3. This tagged data lands in a “scanned data” bucket, where we use Amazon Comprehend, a natural language processing (NLP) service that uses ML to uncover information in unstructured data. Amazon Comprehend works on unstructured text documents and redacts sensitive fields like credit card numbers, dates of birth, social security numbers, passport numbers, and more. Refer to the Detecting and redacting PII using Amazon Comprehend blog post for step-by-step instructions on building such a capability (a minimal redaction sketch follows this list).
  4. If your pipeline requires redaction for specific use cases only, you can use the information in Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3 to redact sensitive data. Using this operation, an AWS Lambda function will intercept each GET request. It will redact data as necessary before it goes back to the requestor. This allows you to keep one copy of all the data and redact the data as it is requested for a specific workload. For further details, refer to the Amazon S3 Object Lambda Access Point to redact personally identifiable information (PII) from documents developer guide.
  5. When you want to join multiple datasets from different data sources, use an Athena federated query. Using user-defined functions (UDFs) with Athena federated query will help you redact data in Amazon S3 or from other data sources such as an online transaction store like Amazon Relational Database Service (Amazon RDS), a data warehouse solution like Amazon Redshift, or a NoSQL store like Amazon DocumentDB. Athena supports UDFs, which enable you to write custom functions and invoke them in SQL queries. UDFs allow you to perform custom processing such as redacting sensitive data, compressing and decompressing data, or applying customized decryption. To read more about how to set this up, refer to the Redacting sensitive information with user-defined functions in Amazon Athena blog post.
  6. Redacted data lands in another S3 bucket that is now ready for any ML pipeline consumption.
  7. Use AWS Glue DataBrew to prepare the data without writing any code. You can choose reusable recipes from over 250 pre-built transformations to automate data preparation tasks with jobs that can be scheduled based on your requirements.
  8. Amazon SageMaker Data Wrangler then performs feature engineering on the curated data from the data preparation step. SageMaker Data Wrangler offers over 300 pre-configured data transformations, such as convert column type, one-hot encoding, impute missing data with mean or median, rescale columns, and date/time embedding, so you can transform your data into formats that can be effectively used for models without writing a single line of code.
  9. The output of the SageMaker Data Wrangler job is stored in Amazon SageMaker Feature Store, a purpose-built repository where you can store and access features to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.
  10. Use ML features in SageMaker notebooks or SageMaker Studio for ML training on your redacted data. A SageMaker notebook instance is an ML compute instance running the Jupyter Notebook app. Amazon SageMaker Studio is a web-based, integrated development environment for ML that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio is integrated with SageMaker Data Wrangler.
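
As referenced in step 3, the following is a minimal sketch of redacting PII from a text document with Amazon Comprehend; the bucket and key names are placeholders, and the example assumes each document fits within the DetectPiiEntities text size limit.

# Sketch: redact PII detected by Amazon Comprehend in a text object.
# Bucket and key names are placeholders.
import boto3

comprehend = boto3.client("comprehend")
s3 = boto3.client("s3")

def redact_object(bucket="scanned-data-bucket", key="statements/loan-001.txt"):
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # DetectPiiEntities returns the type and character offsets of each entity.
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]

    # Replace each detected span with its entity type, working backwards so
    # that earlier offsets remain valid.
    redacted = text
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        redacted = (
            redacted[: entity["BeginOffset"]]
            + f"[{entity['Type']}]"
            + redacted[entity["EndOffset"]:]
        )

    s3.put_object(Bucket="redacted-data-bucket", Key=key, Body=redacted.encode("utf-8"))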

Conclusion

Federal regulations require that financial institutions protect customer data. To achieve this, redact sensitive fields in your data.

In this post, we showed you how to use AWS services to meet these requirements with Amazon Comprehend and Amazon Athena. These services allow data engineers and data scientists in your organization to safely consume this data for machine learning pipelines.

Analyze Fraud Transactions using Amazon Fraud Detector and Amazon Athena

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/analyze-fraud-transactions-using-amazon-fraud-detector-and-amazon-athena/

Organizations with online businesses have to be on guard constantly for fraudulent activity, such as fake accounts or payments made with stolen credit cards. One way they try to identify fraudsters is by using fraud detection applications. Some of these applications use machine learning (ML).

A common challenge with ML is the need for a large, labeled dataset to create ML models to detect fraud. You will also need the skill set and infrastructure to build, train, deploy, and scale your ML model.

In this post, I discuss how to perform fraud detection on a batch of many events using Amazon Fraud Detector. Amazon Fraud Detector is a fully managed service that can identify potentially fraudulent online activities. These can be situations such as the creation of fake accounts or online payment fraud. Unlike general-purpose ML packages, Amazon Fraud Detector is designed specifically to detect fraud. You can analyze fraud transaction prediction results by using Amazon Athena and Amazon QuickSight. I will explain how to review fraud using Amazon Fraud Detector and Amazon SageMaker built-in algorithms.

Batch fraud prediction use cases

You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. You may want to generate fraud predictions for a batch of events. These might be payment fraud, account takeover or compromise, and free tier misuse while performing an offline proof of concept. You can also use batch predictions to evaluate the risk of events on an hourly, daily, or weekly basis, depending upon your business need.

Batch fraud insights using Amazon Fraud Detector

Organizations such as ecommerce companies and credit card companies use ML to detect fraud. Some of the most common types of fraud include email account compromise (personal or business), new account fraud, and non-payment or non-delivery (which includes compromised card numbers).

Amazon Fraud Detector automates the time-consuming and expensive steps to build, train, and deploy an ML model for fraud detection. Amazon Fraud Detector customizes each model it creates to your dataset, making the accuracy of models higher than current one-size-fits-all ML solutions. And because you pay only for what you use, you can avoid large upfront expenses.

If you want to analyze fraud transactions after the fact, you can perform batch fraud predictions using Amazon Fraud Detector. Then you can store fraud prediction results in an Amazon S3 bucket. Amazon Athena helps you analyze the fraud prediction results. You can create fraud prediction visualization dashboards using Amazon QuickSight.

The following diagram illustrates how to perform fraud predictions for a batch of events and analyze them using Amazon Athena.

Figure 1. Example architecture for analyzing fraud transactions using Amazon Fraud Detector and Amazon Athena

The architecture flow follows these general steps:

  1. Create and publish a detector. First create and publish a detector using Amazon Fraud Detector. It should contain your fraud prediction model and rules. For additional details, see Get started (console).
  2. Create an input Amazon S3 bucket and upload your CSV file. Prepare a CSV file that contains the events you want to evaluate. Then upload your CSV file into the input S3 bucket. In this file, include a column for each variable in the event type associated with your detector. In addition, include columns for EVENT_ID, ENTITY_ID, EVENT_TIMESTAMP, ENTITY_TYPE. Refer to Amazon Fraud Detector batch input and output files for more details. Read Create a variable for additional information on Amazon Fraud Detector variable data types and formatting.
  3. Create an output Amazon S3 bucket. Create an output Amazon S3 bucket to store your Amazon Fraud Detector prediction results.
  4. Perform a batch prediction. You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. For more information, read about Batch predictions (a sample API call follows this list).
  5. Review your prediction results. Review your results in the CSV file that is generated and stored in the Amazon S3 output bucket.
  6. Analyze your fraud prediction results.
    • After creating a Data Catalog by using AWS Glue, you can use Amazon Athena to analyze your fraud prediction results with standard SQL.
    • You can develop user-friendly dashboards to analyze fraud prediction results using Amazon QuickSight by creating new datasets with Amazon Athena as your data source.
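
As a companion to step 4, the following is a minimal sketch of starting a batch prediction job with the AWS SDK for Python (Boto3) rather than the console; the job ID, S3 paths, event type, detector, and IAM role ARN are all placeholders for your own resources.

# Sketch: start and monitor a batch prediction job. All identifiers, paths,
# and the role ARN below are placeholders.
import boto3

frauddetector = boto3.client("frauddetector")

frauddetector.create_batch_prediction_job(
    jobId="transactions-batch-001",
    inputPath="s3://my-fraud-input-bucket/events.csv",
    outputPath="s3://my-fraud-output-bucket/",
    eventTypeName="online_transaction",
    detectorName="transaction_fraud_detector",
    detectorVersion="1",
    iamRoleArn="arn:aws:iam::123456789012:role/FraudDetectorBatchRole",
)

# Poll the job status before analyzing the results with Athena.
jobs = frauddetector.get_batch_prediction_jobs(jobId="transactions-batch-001")
print(jobs["batchPredictions"][0]["status"])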

Fraud detection using Amazon SageMaker

The Amazon Web Services (AWS) Solutions Implementation, Fraud Detection Using Machine Learning, enables you to run automated transaction processing. This can be on an example dataset or your own dataset. The included ML model detects potentially fraudulent activity and flags that activity for review. The following diagram presents the architecture you can automatically deploy using the solution’s implementation guide and accompanying AWS CloudFormation template.

SageMaker provides several built-in machine learning algorithms that you can use for a variety of problem types. This solution leverages the built-in Random Cut Forest algorithm for unsupervised learning and the built-in XGBoost algorithm for supervised learning. In the SageMaker Developer Guide, you can see how Random Cut Forest and XGBoost algorithms work.

Figure 2. Fraud detection using machine learning architecture on AWS

This architecture can be segmented into three phases.

  1. Develop a fraud prediction machine learning model. The AWS CloudFormation template deploys an example dataset of credit card transactions contained in an Amazon Simple Storage Service (Amazon S3) bucket. Different ML models are then trained on this dataset from an Amazon SageMaker notebook instance.
  2. Perform fraud prediction. The solution also deploys an AWS Lambda function that processes transactions from the example dataset. It invokes the two SageMaker endpoints that assign anomaly scores and classification scores to incoming data points. An Amazon API Gateway REST API initiates predictions using signed HTTP requests. An Amazon Kinesis Data Firehose delivery stream loads the processed transactions into another Amazon S3 bucket for storage. The solution also provides an example of how to invoke the prediction REST API as part of the Amazon SageMaker notebook. A minimal sketch of calling the SageMaker endpoints directly follows this list.
  3. Analyze fraud transactions. Once the transactions have been loaded into Amazon S3, you can use analytics tools and services for visualization, reporting, ad-hoc queries, and more detailed analysis.
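
As referenced in phase 2, the following is a minimal sketch of how a processing function could call the two SageMaker endpoints to obtain an anomaly score and a classification score; the endpoint names and the CSV payload format are assumptions, not the solution’s actual resource names.

# Sketch: score a transaction against the anomaly-detection and classification
# endpoints. Endpoint names and payload format are assumptions.
import boto3

runtime = boto3.client("sagemaker-runtime")

def score_transaction(csv_features: str):
    anomaly_score = runtime.invoke_endpoint(
        EndpointName="fraud-detection-rcf",  # Random Cut Forest endpoint
        ContentType="text/csv",
        Body=csv_features,
    )["Body"].read().decode("utf-8")

    fraud_score = runtime.invoke_endpoint(
        EndpointName="fraud-detection-xgb",  # XGBoost endpoint
        ContentType="text/csv",
        Body=csv_features,
    )["Body"].read().decode("utf-8")

    return anomaly_score, fraud_score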

By default, the solution is configured to process transactions from the example dataset. To use your own dataset, you must modify the solution. For more information, see Customization.

Conclusion

In this post, we showed you how to analyze fraud transactions using Amazon Fraud Detector and Amazon Athena. You can build fraud insights using Amazon Fraud Detector and Amazon SageMaker built-in algorithms Random Cut Forest and XGBoost. With the information in this post, you can build your own fraud insights models on AWS. You’ll be able to detect fraud faster. Finally, you’ll be able to solve a variety of fraud types. These can be new account fraud, online transaction fraud, and fake reviews, among others.

Read more and get started on building fraud detection models on AWS.

Using Cloud Fitness Functions to Drive Evolutionary Architecture

Post Syndicated from Hauke Juhls original https://aws.amazon.com/blogs/architecture/using-cloud-fitness-functions-to-drive-evolutionary-architecture/

“It is not the strongest of the species that survives, nor the most intelligent. It is the one that is most adaptable to change.” – often attributed to Charles Darwin

One common strategy for businesses that operate in dynamic market conditions (and thus need to continuously correct their course) is to aim for smaller, independent development teams. Microservices and two-pizza teams at Amazon are prominent examples of this strategy. But having smaller units is not the only success factor: to reduce organizational bottlenecks and make high-quality decisions quickly, these two-pizza teams need to be autonomous in most of their decision making.

Architects can no longer rely on static upfront design to meet the change rate required to be successful in such an environment.

This blog shows enterprise architects a mechanism to align decentralized architectural decision making with overall architecture goals.

Gathering data from your fitness functions

“Evolutionary architecture” was coined by Neal Ford and his colleagues from AWS Partner ThoughtWorks in their work on Building Evolutionary Architectures. It is defined as “supporting guided, incremental change as a first principle across multiple dimensions.”

Fitness functions help you obtain the necessary data to allow for the planned evolution of your architecture. They set measurable values to assess how close your solution is to achieving your set goals.

Fitness functions can and should be adapted as the architecture evolves to guide a desired change process. This provides architects with a tool to guide their teams while maintaining team autonomy.

Example of a regression fitness function in action

You’ve identified shorter time-to-market as a key non-functional requirement. You want to lower the risk of regressions and rollbacks after deployments, so you and your team write automated test cases. To ensure that you have a good set of test cases in place, you measure test coverage, which is the percentage of code that is tested automatically. This steers the team toward writing tests to mitigate the risk of regressions, so they have fewer rollbacks and a shorter time to market.

Fitness functions like this work best when they’re as automated as possible. But how do you acquire the necessary data points to use this mechanism outside of software architecture? We’ll show you how in the following sections.

AWS Cloud services with built-in fitness functions

AWS Cloud services are highly standardized, fully automated via API operations, and are built with observability in mind. This allows you to generate measurements for fitness functions automatically for areas such as availability, responsiveness, and security.

To start building your evolutionary architecture with fitness functions, use something that can be easily measured. AWS has services that can be used as inputs to fitness functions, including:

  • Amazon CloudWatch aggregates logs and metrics to check for availability, responsiveness, and reliability fitness functions.
  • AWS Security Hub provides a comprehensive view of your security alerts and security posture across your AWS accounts. Security architects could, for example, define a fitness function that requires the number of critical and high findings to be zero. Teams would then be guided toward reducing the number of these findings, resulting in better security (see the sketch after this list).
  • AWS Cost Explorer ensures your costs stay in line with value generated.
  • AWS Well-Architected Tool evaluates teams’ architectures in a consistent and repeatable way. The number of items acts as your fitness function, which can be queried using the API. To improve your architecture based on the results, review the Establishing Feedback Loops Based on the AWS Well-Architected Framework Review blog post.
  • Amazon SageMaker Model Monitor continuously monitors the quality of SageMaker machine learning models in production. Detecting deviations early allows you to take corrective actions like retraining models, auditing upstream systems, or fixing quality issues.
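
As a sketch of the Security Hub example above, the following counts active critical and high findings so the result can be tracked against a target of zero; the filter values are assumptions about how you choose to scope the findings.

# Sketch: a fitness function that counts active critical and high findings.
# The target value for this fitness function is zero.
import boto3

securityhub = boto3.client("securityhub")

def count_critical_and_high_findings() -> int:
    filters = {
        "SeverityLabel": [
            {"Value": "CRITICAL", "Comparison": "EQUALS"},
            {"Value": "HIGH", "Comparison": "EQUALS"},
        ],
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    }
    total = 0
    for page in securityhub.get_paginator("get_findings").paginate(Filters=filters):
        total += len(page["Findings"])
    return total

print(count_critical_and_high_findings())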

Using the observability that the cloud provides

Fitness functions can be derived by evaluating the AWS account activity such as configuration changes. AWS CloudTrail is useful for this. It records account activity and service events from most AWS services, which can then be analyzed with Amazon Athena.

Figure 1. Fitness functions provide feedback to engineers via metrics

Example of a cloud fitness function in action

In this example, we implement a fitness function that monitors the operability of your system.

You have had certain outages due to manual tasks in operations, and you have anecdotal evidence that engineers are spending time on manual work during application rollouts. To improve operations, you want to reduce manual interactions via the shell in favor of automation. First, you prevent direct secure shell (SSH) access by blocking SSH traffic via the managed AWS Config rule restricted-ssh. Second, you make use of AWS Systems Manager Session Manager, which provides a secure and auditable way to access Amazon Elastic Compute Cloud (Amazon EC2) instances.

By counting the logged API events in CloudTrail, you can measure the number of shell sessions, as shown in this sample Athena query:

SELECT count(*),
       DATE(from_iso8601_timestamp(eventTime)),
       userIdentity.type,
       eventSource,
       eventName
FROM "cloudtrail_logs_partition_projection"
WHERE readonly = 'false'
  AND eventsource = 'ssm.amazonaws.com'
  AND eventname in ('StartSession',
                    'ResumeSession',
                    'TerminateSession')
GROUP BY DATE(from_iso8601_timestamp(eventTime)),
         userIdentity.type,
         eventSource,
         eventName
ORDER BY DATE(from_iso8601_timestamp(eventTime)) DESC

The number of shell sessions now acts as a fitness function to improve operational excellence through operations as code. Coincidentally, the fitness function you defined also rewards teams moving to serverless compute services such as AWS Fargate or AWS Lambda.
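
To make the trend visible to teams, you could run the query on a schedule and publish the count as an Amazon CloudWatch metric, as in the following sketch; the database name, query results location, and metric namespace are placeholders, and the polling is simplified (add error handling for failed queries).

# Sketch: run the shell-session query for the current day and publish the
# count as a CloudWatch metric. Database, output location, and namespace are
# placeholders; polling is simplified.
import time

import boto3

athena = boto3.client("athena")
cloudwatch = boto3.client("cloudwatch")

QUERY = """
SELECT count(*) AS sessions
FROM "cloudtrail_logs_partition_projection"
WHERE readonly = 'false'
  AND eventsource = 'ssm.amazonaws.com'
  AND eventname IN ('StartSession', 'ResumeSession', 'TerminateSession')
  AND DATE(from_iso8601_timestamp(eventTime)) = current_date
"""

execution_id = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)["QueryExecutionId"]

while athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
sessions = int(rows[1]["Data"][0]["VarCharValue"])  # row 0 holds the column header

cloudwatch.put_metric_data(
    Namespace="FitnessFunctions",
    MetricData=[{"MetricName": "ShellSessions", "Value": sessions, "Unit": "Count"}],
)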

Fitness through exercising

Similar to people, your architecture’s fitness can be improved by exercising. It does not take much equipment, but you need to take the first step. To get started, we encourage you to think of the desired outcomes for your architecture that you can measure (and thus guide) through fitness functions. The following lessons learned will help you focus your goals:

  • Requirements and business goals may differ per domain. Thus, your fitness functions might differ. Work closely with your teams when defining fitness functions.
  • Start by taking something that can be easily measured and communicated as a goal.
  • Focus on a positive trendline rather than absolute values.
  • Make sure you and your teams are using the same metrics and the same way to measure them. We have seen examples where central governance departments had access to data the individual teams did not, leading to frustration on all sides.
  • Ensure that your architecture goals fit well into the current context and time horizon.
  • Continuously re-visit the fitness functions to ensure that they evolve with the changing business goals.

Conclusion

Fitness functions help architects focus on building. Once established, teams can use the data points from fitness functions to make decisions and work towards a common and measurable goal. The architects in turn can use the data points they get from fitness functions to confirm their hypothesis of the current state of the architecture. Get started building your fitness functions today by:

  • Gathering the most important system quality attributes.
  • Beginning with approximately three meaningful fitness functions relying on the API operations available.
  • Building a dashboard that shows progress over time, sharing it with your teams, and relying on this data in your daily work.

The three most important AWS WAF rate-based rules

Post Syndicated from Artem Lovan original https://aws.amazon.com/blogs/security/three-most-important-aws-waf-rate-based-rules/

In this post, we explain what the three most important AWS WAF rate-based rules are for proactively protecting your web applications against common HTTP flood events, and how to implement these rules. We share what the Shield Response Team (SRT) has learned from helping customers respond to HTTP floods and show how all AWS WAF customers can benefit from these learnings.

When you have business-critical applications that are internet-facing, you need to protect them from risks such as distributed denial of service (DDoS) attacks. AWS Shield Advanced is a managed DDoS protection service that safeguards applications that are running behind Amazon Web Services (AWS) internet-facing resources. The backend origin of your application can exist anywhere, including on premises, and Shield Advanced can protect it. Shield Advanced provides DDoS protection for Layers 3–7. It also includes 24/7 access to the SRT to help you quickly respond to sophisticated unauthorized activity scenarios that might be unique to your application. To learn more about which resource types you can associate with AWS WAF, see AWS WAF.

Increasingly, the SRT has been assisting customers in protecting against Layer 7 HTTP flood occurrences that negatively impact application availability or performance by overloading the application with an unusually high number of HTTP requests. In many cases, these malicious events can be automatically mitigated by using AWS WAF. In addition, AWS WAF has an easy-to-configure native rate-based rule capability, which detects source IP addresses that make large numbers of HTTP requests within a 5-minute time span, and automatically blocks requests from the offending source IP until the rate of requests falls below a set threshold. In this post, we show how you can pull insights from the AWS WAF logs to determine what your rate-based rule threshold should be.

The top three most important AWS WAF rate-based rules are:

  • A blanket rate-based rule to protect your application from large HTTP floods.
  • A rate-based rule to protect specific URIs at more restrictive rates than the blanket rate-based rule.
  • A rate-based rule to protect your application against known malicious source IPs.

Solution overview

AWS WAF is a web application firewall that helps protect your web applications against common web exploits that might affect availability, compromise security, or consume excessive resources. AWS WAF gives you control over which web traffic reaches your applications. If you already know the request rates for your application, you have all the necessary information to start creating your AWS WAF rate-based rules. To learn more about how to create rules, see Creating a rule and adding conditions. However, if you don’t have this data and want to learn how to get started, this solution helps you determine appropriate rates for your applications, and how to create AWS WAF rate-based rules.

Figure 1 shows how incoming request information is captured so that the operations team can use it to determine rate-based rules.

Figure 1: The workflow to collect and query logs and apply rate-based rules

Let’s go through the flow to better understand what’s happening at each step:

  1. An application user makes requests to the application.
  2. AWS WAF captures information about the incoming requests and sends this to Amazon Kinesis Data Firehose.
  3. Kinesis Data Firehose delivers the logs to an Amazon Simple Storage Service (Amazon S3) bucket, where they will be stored.
  4. The operations team uses Amazon Athena to analyze the logs with SQL queries.
  5. Athena queries the logs in the S3 bucket and shows the query results.
  6. The operations team uses the query results to determine the appropriate AWS WAF rate-based rule.

The three rate-based rules in detail

Each of these rules helps protect web applications from unauthorized activity, and each focuses on a specific aspect of protection. The rules complement each other, so when they’re combined, they can offer greater help in protecting your web application. We’ll look at each of the rules to understand what they do.

Blanket rate-based rule

A blanket rate-based rule is designed to prevent any single source IP address from negatively impacting the availability of a website. For example, if the threshold for the rate-based rule is set to 2,000, the rule will block all IPs that are making more than 2,000 requests in a rolling 5-minute period. This is the most basic rate-based rule, and one of the most valuable for AWS WAF customers to implement. The SRT often helps customers who are actively under a DDoS attack to quickly implement this rule. In past experiences with HTTP flood cases, if this rule were proactively in place, the customer would have been protected and wouldn’t have needed to reach out to the SRT for assistance. The blanket rate-based rule would have automatically blocked the attempt without any human intervention.

URI-specific rate-based rule

Some application URI endpoints typically receive a high request volume, but for others it would be unusual and suspicious to see a high request count. For example, multiple requests in a 5-minute period to an application’s login page is suspicious and indicates a potential brute force or credential-stuffing attack against the application. A URI-specific rule can prevent a single source IP address from connecting to the login page as few as 100 times per 5-minute period, while still allowing a much higher request volume to the rest of the application. Some applications naturally have computationally expensive URIs that, when called, require considerably more resources to process the request. An example of this could be a database query or search function. If a bad actor targets these computationally expensive URIs, this can quickly lead to application performance or availability issues. If you assign a URI-specific rate-based rule to these portions of your site, you can configure a much lower threshold than the blanket rate-based rule. It’s beyond the scope of this blog post, but some customers use Application Load Balancer access logs and the target_processing_time information to determine precisely which portions of the site are the slowest to respond and might represent a computationally expensive call. These customers then put additional rate-based rule protections on calls that are made to these URIs.

IP reputation rate-based rule

Many of the DDoS events the SRT assists customers with include HTTP floods that originate from known malicious source IPs. The AWS WAF Security Automations solution provides AWS WAF customers with a subscription to four open-source threat intelligence lists. Rate-based rules with low thresholds can be applied to requests coming from these suspect sources. Some customers feel comfortable completely blocking web requests from these IPs, but at the very least, requests from these IPs should be rate-limited to protect the application from these well-known malicious sources.

It’s also common to see HTTP floods originate from IP addresses within certain countries. You can use AWS WAF geographical matching rules to assign lower rate-based rule thresholds to requests that originate from certain countries, or countries that don’t contain your web application’s primary user base. For example, suppose your application primarily serves users in the United States. In that case, it could be beneficial to create a rate-based rule with a low threshold for requests that come from any country other than the United States. HTTP floods are also commonly seen originating from IP addresses classified as cloud hosting provider IPs. You can use AWS WAF’s “HostingProviderIPList” Managed Rule to label these requests and then assign a lower rate-based rule threshold to them as well.

Prerequisites

Before you implement the solution, verify that:

  • AWS WAF is deployed in your AWS account and is associated with an Amazon CloudFront distribution or an Application Load Balancer.
  • Your AWS WAF default action is set to Block. When you create and configure a web ACL, you set the web ACL default action, which determines how AWS WAF handles web requests that don’t match any rules in the web ACL. To learn more about default action for a web ACL, see Deciding on the default action for a web ACL.
  • AWS WAF logging is configured and logs are being stored in an S3 bucket.

    Note: You can follow these instructions to configure delivery of AWS WAF logs to your S3 bucket, and you can also use AWS Firewall Manager to configure centralized AWS WAF logging in a multi-account environment.

Set up Athena to analyze AWS WAF logs

Amazon Athena is an interactive query service that you can use to analyze data in Amazon S3 by using standard SQL. For this solution, you’ll use Athena to connect to the S3 bucket where AWS WAF logs are stored and query the AWS WAF logs. The first step is to open the Athena console and create a database.

Note: The Athena database and table creation is a one-time configuration process. You can then come back and run the queries and see the query results based on your latest AWS WAF log data.

To create an Athena database, you’ll use a data definition language (DDL) statement. Paste the following query in the Athena query editor, replacing values as described here:

  • Replace <your-bucket-name> with the S3 bucket name that holds your AWS WAF logs.
  • For <bucket-prefix-if-exist>, if AWS WAF logs are stored in an S3 bucket prefix, replace with your prefix name. Otherwise, remove this part from the query, including the slash “/” at the end.
CREATE DATABASE IF NOT EXISTS wafrulesdb
  COMMENT 'AWS WAF logs'
  LOCATION 's3://<your-bucket-name>/<bucket-prefix-if-exist>/';

Choose Run query to run the query and create the database. Successful completion will be indicated by the query result, as shown below.

Results
Query successful. 

Next, you’ll create a table inside the database. Paste the following query in the Athena query editor, replacing values as described here:

  • Replace <your-bucket-name> with the S3 bucket name that holds your AWS WAF logs.
  • For <bucket-prefix-if-exist>, if AWS WAF logs are stored in an S3 bucket prefix, replace with your prefix name. Otherwise, you can remove this part from the query, including the slash “/” at the end.
  • For has_encrypted_data, if your AWS WAF log data is encrypted at rest, change the value to true, otherwise false is the correct value.
CREATE EXTERNAL TABLE IF NOT EXISTS wafrulesdb.waftable (
  `terminatingRuleId` string,
  `httpSourceName` string,
  `action` string,
  `httpSourceId` string,
  `terminatingRuleType` string,
  `webaclId` string,
  `timestamp` float,
  `formatVersion` int,
  `ruleGroupList` array<string>,
  `httpRequest` struct<`headers`:array<struct<name:string,value:string>>,clientIp:string,args:string,requestId:string,httpVersion:string,httpMethod:string,country:string,uri:string>,
  `rateBasedRuleList` string,
  `nonTerminatingMatchingRules` string,
  `terminatingRuleMatchDetails` string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://<your-bucket-name>/<bucket-prefix-if-exist>/'
TBLPROPERTIES ('has_encrypted_data'='false');

Run the query in the Athena console. After the query completes, Athena registers the waftable table, which makes the data in it available for queries.

Run SQL queries to identify rate-based rule thresholds

Now that you have a table in Athena, know where the data is located, and have the correct schema, you can run SQL queries for each of the rate-based rules and see the query results.

Blanket rate-based rule for all application endpoints

You’ll start with a SQL query that identifies the blanket rule. The critical factor in determining the blanket rule is to run the query against AWS WAF logs data that represents a healthy high request volume. The following query defines a time window of 6 hours in the evening, expressed as 2020-12-01 16:00:00 and 2020-12-01 22:00:00. Time windows can span a few hours or several days; however, this time window must be a good representation of your traffic volume, which you will use as the basis to identify the threshold. For example, if your application is busier during certain periods, you should evaluate the log data for that time. In the example shown here, we limit the query results to the top 100 IPs in our SQL queries. You can adjust the limit to your needs by updating the LIMIT value.

SELECT
  httprequest.clientip,
  COUNT(*) AS "count"
FROM wafrulesdb.waftable
WHERE from_unixtime(timestamp/1000) BETWEEN TIMESTAMP '2020-12-01 16:00:00' AND TIMESTAMP '2020-12-01 22:00:00'
GROUP BY httprequest.clientip, FLOOR("timestamp"/(1000*60*5))
ORDER BY count DESC
LIMIT 100; 

Update the time window to your needs and run the query in the Athena console. The results will show the top requesting IPs in any 5-minute period between two dates, as illustrated in Figure 2.

Figure 2: The top requesting IP in any 5-minute period between dates

You can visualize the results data to see a holistic view of the request count per IP. The chart in Figure 3 illustrates the SQL query results.

Figure 3: Chart: Top requesting IP in any 5-minute period between dates

The results are sorted by showing the IPs with the highest request volume for every 5-minute period. This means that the same IP could appear multiple times, if most of the requests were made within that 5-minute interval. In our example, looking at the result, an excellent first blanket rule would limit the request volume to about 7,000 requests within a 5-minute time period. You can either create the AWS WAF rule by using the following JSON and the JSON rule editor, or by using the AWS WAF visual rule editor and following these instructions. If you’re using the following JSON, make sure to replace the Limit value with the value that you identified by running the SQL query earlier.

{
  "Name": "BlanketRule",
  "Priority": 2,
  "Action": {
    "Block": {}
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "BlanketRule"
  },
  "Statement": {
    "RateBasedStatement": {
      "Limit": 7000,
      "AggregateKeyType": "IP"
    }
  }
}

Sometimes a client connects to an application through an HTTP proxy or a content delivery network (CDN), which obscures the client origin IP. It’s important to identify the client IP instead of the one from the proxy or CDN, because blocking source IPs can cause a wider unwanted impact. You can use many tools to help you identify whether the source IP might be a CDN. In this case, you would need to query and filter on the X-Forwarded-For, True-Client-IP, or other custom headers. CDN providers typically publish which headers they add to the requests, but X-Forwarded-For and True-Client-IP are common. The following query shows how you can reference these headers, illustrating with the X-Forwarded-For header, to write rate-based rules. You can replace X-Forwarded-For with the header you expect to hold the client IP.

SELECT
  header.value,
  COUNT(*) AS "count"
FROM wafrulesdb.waftable, UNNEST(httprequest.headers) as t(header)
WHERE
    from_unixtime(timestamp/1000) BETWEEN TIMESTAMP '2020-12-01 16:00:00' AND TIMESTAMP '2020-12-01 22:00:00'
  AND
    header.name = 'X-Forwarded-For'
GROUP BY header.value, FLOOR("timestamp"/(1000*60*5))
ORDER BY count DESC
LIMIT 100;

URI-based rule for specific application endpoints

Suppose that you want to further limit requests to the login page on your website. To do this, you could add the following string match condition to a rate-based rule:

  • The part of the request to filter on is URI
  • The Match Type is Starts with
  • A Value to match is /login (this needs to be whatever identifies the login page in the URI portion of the web request)

Next you have to identify what is a typical request volume to the /login URI for the application. The following SQL query does exactly that.

SELECT
  httprequest.clientip,
  httprequest.uri,
  COUNT(*) AS "count"
FROM wafrulesdb.waftable
WHERE 
  from_unixtime(timestamp/1000) BETWEEN TIMESTAMP '2020-12-01 16:00:00' AND TIMESTAMP '2020-12-01 22:00:00'
AND
  httprequest.uri = '/login'
GROUP BY httprequest.clientip, httprequest.uri, FLOOR("timestamp"/(1000*60*5))
ORDER BY count DESC
LIMIT 100;

Replace the time window 2020-12-01 16:00:00 and 2020-12-01 22:00:00 and the httprequest.uri value, if applicable, and run the query in the Athena console. The results show the highest requesting IP and /login URI for every 5-minute period between dates, as illustrated in Figure 4.

Figure 4: The highest requesting IP and /login URI for every 5-minute period between dates

Figure 5 illustrates a chart based on the query results for the highest requesting IP and /login URI for every 5-minute period between dates.

Figure 5: Chart: The highest requesting IP and /login URI for every 5-minute period between dates

Based on the SQL query results, you would specify a rate limit of 150 requests per 5 minutes. Adding this rate-based rule to a web ACL will limit requests to your login page per IP address without affecting the rest of your site. Once again, you can either create the AWS WAF rule by using the following JSON and the JSON rule editor, or by using the AWS WAF visual rule editor and following these instructions. If you’re using the following JSON, make sure to replace the Limit value with the value that you identified by running the SQL query earlier.

{
  "Name": "UriBasedRule",
  "Priority": 1,
  "Action": {
    "Block": {}
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "UriBasedRule"
  },
  "Statement": {
    "RateBasedStatement": {
      "Limit": 150,
      "AggregateKeyType": "IP",
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "FieldToMatch": {
            "UriPath": {}
          },
          "PositionalConstraint": "STARTS_WITH",
          "SearchString": "/login",
          "TextTransformations": [
            {
              "Type": "NONE",
              "Priority": 0
            }
          ]
        }
      }
    }
  }
}

AWS WAF rules with a lower value for Priority are evaluated before rules with a higher value. For the AWS WAF rules to work as expected (first evaluating the more specific rule—the URI-based rule, and only after that, the more general blanket rule) you have to set the AWS WAF rule priority. You can do that by updating the JSON and setting the Priority value to 1 for the blanket rule and 0 for the URI-based rule, or by using the AWS WAF visual rule editor. The expected AWS WAF rule priority should be as illustrated in Figure 6.

Figure 6: AWS WAF rules with priority for UriBasedRule

If you want to know the request volume across all application URIs, the following SQL will accomplish that.

SELECT
  httprequest.clientip,
  httprequest.uri,
  COUNT(*) AS "count"
FROM wafrulesdb.waftable
WHERE from_unixtime(timestamp/1000) BETWEEN TIMESTAMP '2020-12-01 16:00:00' AND TIMESTAMP '2020-12-01 22:00:00'
GROUP BY httprequest.clientip, httprequest.uri, FLOOR("timestamp"/(1000*60*5))
ORDER BY count DESC
LIMIT 100;

Figure 7 shows a chart of what the SQL query results might look like.

Figure 7: The highest requesting IP and URI for every 5-minute period between dates

IP reputation rule groups to block bots or other threats

You can use IP reputation rules to block requests based on their source. AWS WAF offers a wide selection of managed rule groups, and Amazon IP reputation list is the one that will help to reduce your exposure to bot traffic or exploitation attempts.

To add the Amazon IP reputation list rule to your web ACL

  1. Open the AWS WAF console and navigate to the managed rule groups view.

    Figure 8: The managed rule group view in AWS WAF

  2. Expand AWS managed rule groups, and for Amazon IP reputation list, choose Add to web ACL.

    Figure 9: Add the Amazon IP reputation list to the web ACL

  3. Scroll to the bottom of the page and choose Add rule.
  4. At this point, you should see the Set rule priority view. Move up the Amazon managed rule so that it has the highest priority. If a request originates from a bot, you want to deny the request as early as possible, and you achieve exactly that by assigning the highest priority to the Amazon IP reputation list rule. Your final AWS WAF rules order should be as shown in Figure 10.

    Figure 10: Final AWS WAF rules ordered by priority

Considerations for rate-based rules

It’s important to note that the more specific AWS WAF rules should have a higher priority, because you want these rules to limit the request volume first. In our example, the rules strategy is first based on a specific URI, and then on a blanket rule that limits requests across the whole application.

The rate-based rules that we discussed here provide a solid foundation to help you protect your internet-facing applications from common basic HTTP request floods. However, the solution in this blog post shouldn’t be seen as a one-time setup but rather as an iterative activity.

You should determine a healthy time frame to rerun Amazon Athena queries to identify a new rate-based rule that aligns with the application’s growth and increasing request volume. Reviewing the rate-based rules on an iterative basis and incorporating the review into your existing processes, such as your software development life cycle, is a great way to schedule the review process. Each AWS WAF rule can publish Amazon CloudWatch metrics, which can be used to trigger alerts before thresholds are crossed. You can use alerts to create tickets for operations teams based on thresholds you set. This prompts your operations teams to review the situation and determine whether it’s a DDoS attack being thwarted or legitimate traffic being dropped.
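
As a sketch of such an alert, the following creates a CloudWatch alarm on the BlockedRequests metric of the blanket rate-based rule so that operations teams are notified when the rule starts blocking; the web ACL name, rule name, Region, and SNS topic ARN are placeholders.

# Sketch: alarm when the blanket rate-based rule begins blocking requests so
# operations can review the traffic. Names, Region, and topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="BlanketRule-blocking-requests",
    Namespace="AWS/WAFV2",
    MetricName="BlockedRequests",
    Dimensions=[
        {"Name": "WebACL", "Value": "my-web-acl"},
        {"Name": "Rule", "Value": "BlanketRule"},
        {"Name": "Region", "Value": "us-east-1"},
    ],
    Statistic="Sum",
    Period=300,  # matches the 5-minute rate-based rule window
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:waf-operations-alerts"],
)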

After you define your request threshold, add a buffer to allow for growth. Rate-based rules should have a reasonable buffer to account for near-future application growth. For instance, when an Athena query result shows a request volume of 500 requests, a rate-based rule with a limit of 1,000 requests gives a buffer for an additional 500 requests to account for application growth.

Summary

In this post, we introduced you to the top three most important AWS WAF rate-based rules to protect your web applications from common HTTP flood events. We also covered how to implement these rate-based rules and determined an appropriate request threshold for your application by using AWS WAF logs and Amazon Athena queries. To learn more about best practices that help you protect your websites and web applications against various attack vectors by using AWS WAF, see our whitepaper, Guidelines for Implementing AWS WAF.

You can learn more about AWS WAF in other AWS WAF–related Security Blog posts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS WAF forum or contact AWS Support.


Author

Artem Lovan

Artem is a Senior Solutions Architect based in New York. He helps customers architect and optimize applications on AWS. He has been involved in IT at many levels, including infrastructure, networking, security, DevOps, and software development.

Author

Jesse Lepich

Jesse is a Senior Security Solutions Architect at AWS based in Lake St. Louis, Missouri, focused on helping customers implement native AWS security services. Outside of cloud security, his interests include relaxing with family, barefoot waterskiing, snowboarding/snow skiing, surfing, boating/sailing, and mountain climbing.

Creating dashboards quickly on Microsoft Power BI using Amazon Athena

Post Syndicated from Armando Segnini original https://aws.amazon.com/blogs/big-data/creating-dashboards-quickly-on-microsoft-power-bi-using-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze data in a data lake using standard SQL. One of the key elements of Athena is that you only pay for the queries you run. This is an attractive feature because there is no hardware to set up, manage, or maintain.

You can query Athena with SQL or by using data visualization tools such as Amazon QuickSight, Microsoft Power BI, Tableau, or other third-party options. QuickSight is a cloud-native business intelligence (BI) service that you can use to visually analyze data and share interactive dashboards with all users in your organization. QuickSight is fully managed and serverless, requires no client downloads for dashboard creation, and has a pay-per-session pricing model that allows you to pay for dashboard consumption with a maximum charge of $5.00 per reader per month. The combination of QuickSight and Athena allows you to rapidly deploy dashboards and BI to tens of thousands of users, while only paying for actual usage, and not worrying about server deployment or management.

Microsoft Power BI similarly allows you to analyze your data. Previously, creating dashboards with Microsoft Power BI and Athena required you to download all the data locally to your computer. This takes time and can fail due to memory or network bandwidth constraints.

You can now create Microsoft Power BI dashboards and leverage the power of Athena through our out-of-the-box connector for Power BI. The connector is more scalable as it supports Power BI’s DirectQuery mode in which complete, raw data sets are not downloaded to your workstation. While you create or interact with a visualization, Microsoft Power BI works with Athena to dynamically query the underlying data source so you’re always viewing current data.

This post provides step-by-step guidance on how to use the Athena connector for Power BI to query, visualize, and share data with Power BI.

Solution overview

To create Microsoft Power BI dashboards using Athena as a data source, you start by designing a dashboard in Microsoft Power BI Desktop with the help of the Athena data source connector for Power BI and the Athena ODBC driver. When you finish creating your dashboard, you publish it to the Microsoft Power BI Service. To see your data on Microsoft Power BI Service, you need to install the Microsoft Power BI on-premises data gateway in your AWS account—it works like a bridge between Microsoft Power BI Service and Athena. Finally, you configure Athena as a new data source in Microsoft Power BI Service.

To authenticate with Athena, you use an instance profile role because it simplifies the configuration, but you can also use any of the other authentication options that the Athena ODBC driver provides.

The following diagram illustrates the solution architecture.

Walkthrough overview

For this post, we step through a use case using the data from the New York City Taxi Records dataset from 2015. The data is already stored in Apache Parquet format and is partitioned. For more information about optimizing your Athena queries, see Top 10 Performance Tuning Tips for Amazon Athena.

You deploy an AWS CloudFormation stack with all the infrastructure required to deploy two Amazon Elastic Compute Cloud (Amazon EC2) instances in a private subnet in an Amazon Virtual Private Cloud (Amazon VPC): one instance is used for Microsoft Power BI Desktop, and the other is used for the Microsoft Power BI on-premises data gateway. This stack uses t3.2xlarge instances because they meet the recommended minimum hardware requirements. You can increase or decrease the EC2 instance type depending on the performance of the gateway.

Additionally, the CloudFormation template creates an AWS Glue table that gives you access to the dataset. It creates an AWS Lambda function as an AWS CloudFormation custom resource that updates all the partitions in the AWS Glue table.
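
The custom resource code itself ships with the stack, but conceptually the partition refresh can be as simple as running MSCK REPAIR TABLE through Athena (assuming Hive-style partition folders). The following minimal Python (boto3) sketch shows that idea; the database and table names (nyctaxi.records) follow the walkthrough, the query output location is an assumption, and a real CloudFormation custom resource must also signal success or failure back to CloudFormation.

import boto3

athena = boto3.client("athena")

def load_all_partitions(output_location="s3://my-athena-query-results/"):
    """Ask Athena to discover and register every partition of the taxi table."""
    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE records",
        QueryExecutionContext={"Database": "nyctaxi"},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]

def handler(event, context):
    # A production custom resource would wrap this in try/except and send a
    # cfnresponse.SUCCESS or cfnresponse.FAILED signal so the stack does not hang.
    query_id = load_all_partitions()
    print(f"Started partition repair, query id: {query_id}")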

Then, you use AWS Systems Manager Session Manager (see Starting a session (Systems Manager console)) and any remote desktop client to configure the instances and to create your dashboard by following these steps:

  1. Deploy the CloudFormation stack by choosing Launch stack:
  2. On the Amazon EC2 instance that has the tag PowerBiDesktop, install and configure the Simba Athena ODBC driver and Microsoft Power BI Desktop.
  3. Create your dashboard on Microsoft Power BI Desktop and publish it.
  4. On the Amazon EC2 instance that has the tag PowerBiGateway, install and configure the Simba Athena ODBC driver and Microsoft Power BI on-premises data gateway.
  5. Open Microsoft Power BI and configure your Athena data source.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Logging in to your Microsoft Power BI Desktop instance

To reduce the attack surface of a bastion host, the operational burden of managing bastion hosts, and the additional costs incurred, Session Manager allows you to securely connect to your EC2 instances without the need to run and operate your own bastion hosts or run SSH on your EC2 instances. For more information, see New – Port Forwarding Using AWS System Manager Session Manager.

Connect to the Microsoft Power BI Desktop instance with Session Manager. You need to run the following commands depending on the OS of your local machine. It can take a few minutes after deployment for your instance to be available.

For Linux and Mac OS, enter the following code:

# find the instance ID based on Tag Name
INSTANCE_ID=$(aws ec2 describe-instances \
--filter "Name=tag:aws:cloudformation:logical-id,Values=PowerBiDesktop" \
--query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" \
--output text)
# create the port forwarding tunnel
aws ssm start-session --target $INSTANCE_ID \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["3389"],"localPortNumber":["8889"]}'

For Windows, enter the following code:

# find the instance ID based on Tag Name
for /f %i in ('aws ec2 describe-instances --filter "Name=tag:aws:cloudformation:logical-id,Values=PowerBiDesktop" --query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" --output text --region eu-west-1') do set INSTANCE_ID=%i
# create the port forwarding tunnel
aws ssm start-session --target %INSTANCE_ID% --document-name AWS-StartPortForwardingSession --parameters "portNumber"=["3389"],"localPortNumber"=["8889"]

Open your remote desktop application and connect to the Microsoft Power BI Desktop EC2 instance. You need the following information:

Installing and configuring Microsoft Power BI Desktop

To install and configure Microsoft Power BI Desktop, complete the following steps:

  1. Download and install the latest Athena ODBC driver for Windows 64-bit.
  2. Right-click the .ps1 script located on the desktop and choose Run with PowerShell.

This creates a new data source on Windows called taxiconnection.

  1. Download and install the Microsoft Power BI Desktop.
  2. Open the Microsoft Power BI Desktop application.

Creating an Athena connection on Microsoft Power BI Desktop

To create your Athena connection, complete the following steps:

  1. Open Microsoft Power BI Desktop.
  2. Choose Get Data and More.
  3. Search for and select Amazon Athena.
  4. For Data Source Name (DSN), enter taxiconnection.
  5. Choose DirectQuery.

If you choose Import mode, you can’t create the dashboard because Microsoft Power BI Desktop tries to download all data locally on your computer, and the dataset never finishes loading.

  1. Choose OK.
  2. Choose Use Data Source Configuration.
  3. Choose Connect.
  4. In the AwsDataCatalog folder, navigate to the nyctaxi folder.
  5. Choose the records table.
  6. Choose Load.

Creating your dashboard on Microsoft Power BI Desktop and publishing it

You can create a dashboard to show the number of transactions by month and type in descending order. You can then publish the structure of this report to make it available on Microsoft Power BI.

  1. In the Visualizations pane, choose the Stacked bar chart visual.
  2. In the Fields pane, drag the month field to the Axis section in the Visualizations pane.
  3. Drag the type field to the Legend section in the Visualizations pane.
  4. Drag the pickup_datetime field to the Value section in the Visualizations pane.

The following screenshot shows your visualization.

  1. Choose Publish.

Because this is a new report, you’re prompted to save it before you can publish it.

  1. Give your report a name (such as taxireport), and choose Save.
  2. Sign in to be able to publish your report.
  3. Choose a destination (such as My workspace).
  4. In the Success window, choose Got it.

After this last step, the report structure is published on Microsoft Power BI. However, if you try to see the report, there isn’t any data on it because the data isn’t published with the report. You need to install the Microsoft Power BI on-premises data gateway to be able to pull the data.

Logging in to your Microsoft Power BI on-premises data gateway instance

As you did with the Microsoft Power BI Desktop instance, you log in to the Microsoft Power BI Gateway instance, using the tags to get its instance ID and a different local port.

For Linux and Mac OS, enter the following code:

# find the instance ID based on Tag Name
INSTANCE_ID=$(aws ec2 describe-instances \
--filter "Name=tag:aws:cloudformation:logical-id,Values=PowerBiGateway" \
--query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" \
--output text)
# create the port forwarding tunnel
aws ssm start-session --target $INSTANCE_ID \
--document-name AWS-StartPortForwardingSession \
--parameters '{"portNumber":["3389"],"localPortNumber":["8899"]}'

For Windows, enter the following code:

# find the instance ID based on Tag Name
for /f %i in ('aws ec2 describe-instances --filter "Name=tag:aws:cloudformation:logical-id,Values=PowerBiGateway" --query "Reservations[].Instances[?State.Name == 'running'].InstanceId[]" --output text --region eu-west-1') do set INSTANCE_ID=%i
# create the port forwarding tunnel
aws ssm start-session --target %INSTANCE_ID% --document-name AWS-StartPortForwardingSession --parameters "portNumber"=["3389"],"localPortNumber"=["8899"]

Open your remote desktop application and connect to the Microsoft Power BI Gateway Amazon EC2 instance with the following information:

Installing and configuring Microsoft Power BI on-premises data gateway

To set up your on-premises data gateway, complete the following steps:

  1. Download and install the latest Athena ODBC driver for Windows 64-bit.
  2. Right-click the ConfigureODBC.ps1 script located on the desktop and choose Run with PowerShell.

This creates a new data source on Windows called taxiconnection.

  1. Download the Microsoft Power BI on-premises data gateway standard mode and launch the installer. You might need to update to the latest available .NET version before starting the installation.
  2. For your gateway, choose On-premises data gateway (recommended).
  3. Accept the default values and choose Install.
  4. When the installer asks you to sign in, enter the email address associated with the admin account for the Microsoft Power BI Pro tenant.
  5. Choose Sign in.
  6. If asked to register a new gateway or migrate, restore, or take over an existing gateway, choose Register a new gateway.
  7. Give your gateway a name and provide a recovery key.
  8. Choose Configure.

You should see a green checkmark indicating the gateway is online and ready to be used.

Opening Microsoft Power BI and configuring your Athena data source

To configure your data source, complete the following steps:

  1. Open Microsoft Power BI in your browser.
  2. Choose the Settings icon.
  3. Choose Manage gateways.
  4. Find the gateway cluster you just created.
  5. Hover your mouse over your gateway name and choose the … icon.
  6. Choose Open menu.
  7. Choose Add data source.
  8. For the data source name, enter taxiconnection.
  9. For the data source type, choose Amazon Athena.
  10. For the second data source name, enter taxiconnection.
  11. On the Authentication Method, choose Anonymous, then choose Add.

Seeing your report on Microsoft Power BI

To view your report, complete the following steps:

  1. Choose the workspace where you saved your report.
  2. On the Datasets + dataflows tab, locate the dataset that has the same name as your report (for example, taxireport) and choose the … icon.
  3. Choose Settings.
  4. Choose Discover Data Sources.
  5. Expand the Gateway Connection section.
  6. Choose your gateway.
  7. For Maps to, choose taxiconnection.
  8. Choose Apply.
  9. Return to the workspace where you saved your report.
  10. On the Content tab, choose your report (taxireport).

You can now see your report online using the most recent data.

Cleaning up

To avoid incurring future charges, delete the CloudFormation stack and the S3 bucket that you deployed as part of this post.

Conclusion

This post presented how to connect to Athena from Microsoft Power BI using the out-of-the-box data source connector and access data using DirectQuery mode. The first part of the post described the architecture components and how to successfully create a dashboard using the NYC taxi dataset. The stack deployed uses only one EC2 instance for the Microsoft Power BI on-premises data gateway, but in production, you should consider creating a high-availability cluster of gateway installations, ideally in different Availability Zones. The second part of this post deployed a demo environment and walked you through the steps to configure Microsoft Power BI with Athena to share your insights. For native access to your data in AWS without any downloads or servers, be sure to also check out Amazon QuickSight.


About the Authors

Armando Segnini is a Data Architect with AWS Professional Services. He spends his time building scalable big data and analytics solutions for AWS Enterprise and Strategic customers. Armando also loves to travel with his family all around the world and take pictures of the places he visits.


Xavier Naunay is a Data Architect with AWS Professional Services. He is part of the AWS ProServe team, helping enterprise customers solve complex problems using AWS services. In his free time, he is either traveling or learning about technology and other cultures.

Query a Teradata database using Amazon Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-a-teradata-database-using-amazon-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Teradata as your transactional data store, you may need to join the data in your data lake with Teradata in the cloud, Teradata running on Amazon Elastic Compute Cloud (Amazon EC2), or with an on-premises Teradata database, for example to build a dashboard or create consolidated reporting.

In these use cases, the Amazon Athena Federated Query feature allows you to seamlessly access the data from your Teradata database without having to move the data to your S3 data lake. This removes the overhead of managing such jobs.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Teradata database running on premises.

For this post, we will be using the Teradata Athena Federated Query connector developed by Trianz. The runtime includes a Teradata instance on premises. Your Teradata instance can be in the cloud, on Amazon EC2, or on premises. You can deploy the Trianz Teradata Athena Federated Query connector from the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Athena). The following diagram depicts how Athena Federated Query works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into the Lambda function’s memory, the connector spills the data to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Teradata instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Teradata instance.
  4. Run federated queries with Athena.

Prerequisite

Before you start this walkthrough, make sure your Teradata database is up and running.

Create a secret for the Teradata instance

Our first step is to create a secret for the Teradata instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Set the credentials as key-value pairs (username, password) for your Teradata instance.

  1. For Secret name, enter a name for your secret. Use the prefix TeradataAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.
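
If you prefer to script this step, the following Python (boto3) sketch creates an equivalent secret. The username and password values are placeholders; use your own Teradata credentials.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# Placeholder credentials for the Teradata instance; replace with your own values.
secretsmanager.create_secret(
    Name="TeradataAFQ",  # prefix used in this post so the connector can find the secret
    Description="Credentials for the Teradata instance used by Athena Federated Query",
    SecretString=json.dumps({"username": "testUser", "password": "testPassword"}),
)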

Set up your S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we create athena-accelerator/teradata.

Configure Athena federation with the Teradata instance

To configure Athena federation with Teradata instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. Select Show apps that create custom IAM roles or resource policies.
  3. In the search field, enter TrianzTeradataAthenaJDBC.
  4. Choose the application.

  1. For SecretNamePrefix, enter TeradataAFQ.
  2. For SpillBucket, enter athena-accelerator/teradata.
  3. For JDBCConnectorConfig, use the format teradata://jdbc:teradata://hostname/user=testUser&password=testPassword.
  4. For DisableSpillEncryption, enter false.
  5. For LambdaFunctionName, enter teradataconnector.
  6. For SecurityGroupID, enter the security group ID where the Teradata instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, enter the subnets where the Teradata instance is running, separated by commas.

Make sure the subnet is in a VPC and has a NAT gateway and an internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, Amazon CloudTrail, Secrets Manager, Lambda, and Athena. For more information about Athena IAM access, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your queries using lambda:teradataconnector to run against tables in the Teradata database. teradataconnector is the name of the Lambda function that you specified for LambdaFunctionName when you deployed the connector in the previous section.

lambda:teradataconnector references a data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot shows the query that joins the dataset between Teradata and the S3 data lake.
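
The screenshot isn’t reproduced here, but such a join typically has the following shape. In this Python (boto3) sketch, the Teradata schema and table (sales.orders), the data lake table (datalake.customer), the column names, and the query result location are illustrative assumptions, not objects created by this post.

import boto3

athena = boto3.client("athena")

# Illustrative join between a table exposed through the teradataconnector Lambda
# function and a table in the AWS Glue Data Catalog backed by the S3 data lake.
QUERY = """
SELECT c.customer_id, c.customer_name, o.order_id, o.order_total
FROM "lambda:teradataconnector".sales.orders o
JOIN datalake.customer c
  ON o.customer_id = c.customer_id
LIMIT 10
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},  # assumption
)
print("Query execution ID:", response["QueryExecutionId"])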

Key performance best practices

If you’re considering Athena Federated Query with Teradata, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Teradata database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Teradata database to Athena (which could lead to query timeouts or slow performance), you may consider moving data from Teradata to your S3 data lake.
  • The star schema is a commonly used data model in Teradata. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Teradata. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Teradata database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Teradata database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Teradata. Now you don’t need to wait for all the data in your Teradata data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practices outlined in the post to help minimize the data transferred from Teradata for better performance. When queries are well written for Athena Federated Query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query an Apache Hudi dataset in an Amazon S3 data lake with Amazon Athena part 1: Read-optimized queries

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/part-1-query-an-apache-hudi-dataset-in-an-amazon-s3-data-lake-with-amazon-athena-part-1-read-optimized-queries/

On July 16, 2021, Amazon Athena upgraded its Apache Hudi integration with new features and support for Hudi’s latest 0.8.0 release. Hudi is an open-source storage management framework that provides incremental data processing primitives for Hadoop-compatible data lakes. This upgraded integration adds the latest community improvements to Hudi along with important new features including snapshot queries, which provide near real-time views of table data, and reading bootstrapped tables which provide efficient migration of existing table data.

In this series of posts on Athena and Hudi, we will provide a short overview of key Hudi capabilities along with detailed procedures for using read-optimized queries, snapshot queries, and bootstrapped tables.

Overview

With Apache Hudi, you can perform record-level inserts, updates, and deletes on Amazon S3, allowing you to comply with data privacy laws, consume real-time streams and change data captures, reinstate late-arriving data, and track history and rollbacks in an open, vendor neutral format. Apache Hudi uses Apache Parquet and Apache Avro storage formats for data storage, and includes built-in integrations with Apache Spark, Apache Hive, and Apache Presto, which enables you to query Apache Hudi datasets using the same tools that you use today with near-real-time access to fresh data.

An Apache Hudi dataset can be one of the following table types:

  • Copy on Write (CoW) – Data is stored in columnar format (Parquet), and each update creates a new version of the base file on a write commit. A CoW table type typically lends itself to read-heavy workloads on data that changes less frequently.
  • Merge on Read (MoR) – Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files. A MoR table type is typically suited for write-heavy or change-heavy workloads with fewer reads.

Apache Hudi provides three logical views for accessing data:

  • Read-optimized – Provides the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
  • Incremental – Provides a change stream between two actions out of a CoW dataset to feed downstream jobs and extract, transform, load (ETL) workflows
  • Real-time – Provides the latest committed data from a MoR table by merging the columnar and row-based files inline

As of this writing, Athena supports read-optimized and real-time views.

Using read-optimized queries

In this post, you will use Athena to query an Apache Hudi read-optimized view on data residing in Amazon S3. The walkthrough includes the following high-level steps:

  1. Store raw data in an S3 data lake.
  2. Transform the raw data to Apache Hudi CoW and MoR tables using Apache Spark on Amazon EMR.
  3. Query and analyze the tables on Amazon S3 with Athena on a read-optimized view.
  4. Perform an update to a row in the Apache Hudi dataset.
  5. Query and analyze the updated dataset using Athena.

Architecture

The following diagram illustrates our solution architecture.

In this architecture, you have high-velocity weather data stored in an S3 data lake. This raw dataset is processed on Amazon EMR and stored in an Apache Hudi dataset in Amazon S3 for further analysis by Athena. If the data is updated, Apache Hudi performs an update on the existing record, and these updates are reflected in the results fetched by the Athena query.

Let’s build this architecture.

Prerequisites

Before getting started, we set up our resources. For this post, we use the us-east-1 Region.

  1. Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair. For instructions, see Create a key pair using Amazon EC2.
  2. Create an S3 bucket for storing the raw weather data (for this post, we call it weather-raw-bucket).
  3. Create two folders in the S3 bucket: parquet_file and delta_parquet.
  4. Download all the data files, Apache Scala scripts (data_insertion_cow_delta_script, data_insertion_cow_script, data_insertion_mor_delta_script, and data_insertion_mor_script), and Athena DDL code (athena_weather_hudi_cow.sql and athena_weather_hudi_mor.sql) from the GitHub repo.
  5. Upload the weather_oct_2020.parquet file to weather-raw-bucket/parquet_file.
  6. Upload the file weather_delta.parquet to weather-raw-bucket/delta_parquet. We update an existing weather record from a relative_humidity of 81 to 50 and a temperature of 6.4 to 10.
  7. Create another S3 bucket for storing the Apache Hudi dataset. For this post, we create a bucket with a corresponding subfolder named athena-hudi-bucket/hudi_weather.
  8. Deploy the EMR cluster using the provided AWS CloudFormation template:
  9. Enter a name for your stack.
  10. Choose a pre-created key pair name.

This is required to connect to the EMR cluster nodes. For more information, see Connect to the Master Node Using SSH.

  1. Accept all the defaults and choose Next.
  2. Acknowledge that AWS CloudFormation might create AWS Identity and Access Management (IAM) resources.
  3. Choose Create stack.

Use Apache Hudi with Amazon EMR

When the cluster is ready, you can use the provided key pair to SSH into the primary node.

  1. Use the following bash command to load the spark-shell to work with Apache Hudi:
    spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

  2. On the spark-shell, run the following Scala code in the script data_insertion_cow_script to import weather data from the S3 data lake to an Apache Hudi dataset using the CoW storage type:
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    
    //Set up various input values as variables
    val inputDataPath = "s3://weather-raw-bucket/parquet_file/"
    val hudiTableName = "weather_hudi_cow"
    val hudiTablePath = "s3://athena-hudi-bucket/hudi_weather/" + hudiTableName
    
    // Set up our Hudi Data Source Options
    val hudiOptions = Map[String,String](
        DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "timestamp", // record key field; a Map keeps only one value per option, so set a single field here
        DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "date", 
        HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
        DataSourceWriteOptions.OPERATION_OPT_KEY ->
            DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
        DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp", 
        DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
        DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
        DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "date", 
        DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
            classOf[MultiPartKeysValueExtractor].getName)
    
    // Read data from S3 and create a DataFrame with Partition and Record Key
    val inputDF = spark.read.format("parquet").load(inputDataPath)
    
    // Write data into the Hudi dataset
    inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath)

Replace the S3 bucket path for inputDataPath and hudiTablePath in the preceding code with your S3 bucket.

For more information about DataSourceWriteOptions, see Work with a Hudi Dataset.

  1. In the spark-shell, count the total number of records in the Apache Hudi dataset:
    scala> inputDF.count()
    res1: Long = 1000

  2. Repeat the same step for creating an MoR table using data_insertion_mor_script (the default is COPY_ON_WRITE).
  3. Run the spark.sql("show tables").show() query to list three tables: one for the CoW table and two tables, suffixed _rt and _ro, for the MoR table.

The following screenshot shows our output.

Let’s check the processed Apache Hudi dataset in the S3 data lake.

  1. On the Amazon S3 console, confirm the subfolders weather_hudi_cow and weather_hudi_mor are in athena-hudi-bucket.
  1. Navigate to the weather_hudi_cow subfolder to see the Apache Hudi dataset that is partitioned using the date key—one for each date in our dataset.
  2. On the Athena console, create a hudi_athena_test database using the following command:
    create database hudi_athena_test;

You use this database to create all your tables.

  1. Create an Athena table using the athena_weather_hudi_cow.sql script:
    CREATE EXTERNAL TABLE weather_partition_cow(
      `_hoodie_commit_time` string,
      `_hoodie_commit_seqno` string,
      `_hoodie_record_key` string,
      `_hoodie_partition_path` string,
      `_hoodie_file_name` string,
      `city_id` string,
      `timestamp` string,
      `relative_humidity` decimal(3,1),
      `temperature` decimal(3,1),
      `absolute_humidity` decimal(5,4)
      )
      PARTITIONED BY ( 
      `date` string)
    ROW FORMAT SERDE
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
      'org.apache.hudi.hadoop.HoodieParquetInputFormat'
    OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://athena-hudi-bucket/hudi_weather/weather_hudi_cow'
    

Replace the S3 bucket path in the preceding code with your S3 bucket (Hudi table path) in LOCATION.

  1. Add partitions to the table by running the following query from the athena_weather_hudi_cow.sql script on the Athena console:
    ALTER TABLE weather_partition_cow ADD
    PARTITION (date = '2020-10-01') LOCATION 's3://athena-hudi-bucket/hudi_weather/weather_hudi_cow/2020-10-01/'
    PARTITION (date = '2020-10-02') LOCATION 's3://athena-hudi-bucket/hudi_weather/weather_hudi_cow/2020-10-02/'
    PARTITION (date = '2020-10-03') LOCATION 's3://athena-hudi-bucket/hudi_weather/weather_hudi_cow/2020-10-03/'
    PARTITION (date = '2020-10-04') LOCATION 's3://athena-hudi-bucket/hudi_weather/weather_hudi_cow/2020-10-04/';

Replace the S3 bucket path in the preceding code with your S3 bucket (Hudi table path) in LOCATION.

  1. Confirm the total number of records in the Apache Hudi dataset with the following query:
    SELECT count(*) FROM "hudi_athena_test"."weather_partition_cow";

It should return a single row with a count of 1,000.

Now let’s check the record that we want to update.

  1. Run the following query on the Athena console:
    SELECT * FROM "hudi_athena_test"."weather_partition_cow"
    where city_id ='1'
    and date ='2020-10-04'
    and timestamp = '2020-10-04T07:19:12Z';

The output should look like the following screenshot. Note the value of relative_humidity and temperature.

  1. Return to the Amazon EMR primary node and run the following code in the data_insertion_cow_delta_script script on the spark-shell prompt to update the data:
    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions._
    import org.apache.hudi.DataSourceWriteOptions
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.hudi.hive.MultiPartKeysValueExtractor
    
    //Set up various input values as variables
    val inputDataPath = "s3://weather-raw-bucket/delta_parquet/"
    val hudiTableName = "weather_hudi_cow"
    val hudiTablePath = "s3://athena-hudi-bucket/hudi_weather/" + hudiTableName
    
    // Set up our Hudi Data Source Options
    val hudiOptions = Map[String,String](
        DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "timestamp", // record key field; a Map keeps only one value per option, so set a single field here
        DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "date", 
        HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
        DataSourceWriteOptions.OPERATION_OPT_KEY ->
            DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
        DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp", 
        DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
        DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
        DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "date", 
        DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
            classOf[MultiPartKeysValueExtractor].getName)
    
    // Read data from S3 and create a DataFrame with Partition and Record Key
    val inputDF = spark.read.format("parquet").load(inputDataPath)
    
    // Write data into the Hudi dataset
    inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(hudiTablePath)
    

Replace the S3 bucket path for inputDataPath and hudiTablePath in the preceding code with your S3 bucket.

  1. Run the following query on the Athena console to confirm no change occurred to the total number of records:
SELECT count(*) FROM "hudi_athena_test"."weather_partition_cow";

The following screenshot shows our query results.

  1. Run the following query again on the Athena console to check for the update:
SELECT * FROM "hudi_athena_test"."weather_partition_cow"
where city_id ='1'
and date ='2020-10-04'
and timestamp = '2020-10-04T07:19:12Z'

The relative_humidity and temperature values for the relevant record are updated.

  1. Repeat similar steps for the MoR table.

Clean up the resources

You must clean up the resources you created earlier to avoid ongoing charges.

  1. On the AWS CloudFormation console, delete the stack you launched.
  2. On the Amazon S3 console, empty the buckets weather-raw-bucket and athena-hudi-bucket and delete the buckets.

Conclusion

As you have learned in this post, we used Apache Hudi support in Amazon EMR to develop a data pipeline to simplify incremental data management use cases that require record-level insert and update operations. We used Athena to read the read-optimized view of an Apache Hudi dataset in an S3 data lake.


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration and strategy. He is passionate about technology and enjoys building and experimenting in Analytics and AI/ML space.


Sameer Goel is a Solutions Architect in The Netherlands, who drives customer success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from NEU Boston, with a Data Science concentration. He enjoys building and experimenting with creative projects and applications.


Imtiaz (Taz) Sayed is the WW Tech Master for Analytics at AWS. He enjoys engaging with the community on all things data and analytics.


Query Snowflake using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-snowflake-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Snowflake as your data warehouse solution, you may need to join your data in your data lake with Snowflake. For example, you may want to build a dashboard by joining historical data in your Amazon S3 data lake and the latest data in your Snowflake data warehouse or create consolidated reporting.

In such use cases, Amazon Athena Federated Query allows you to seamlessly access the data from Snowflake without building ETL pipelines to copy or unload the data to the S3 data lake or Snowflake. This removes the overhead of creating additional extract, transform, and load (ETL) processes and shortens the development cycle.

In this post, we will walk you through a step-by-step configuration to set up Athena Federated Query using AWS Lambda to access data in a Snowflake data warehouse.

For this post, we are using the Snowflake connector for Amazon Athena developed by Trianz.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data Federation refers to the capability to query data in another data store using a single interface (Amazon Athena). The following diagram depicts how a single Amazon Athena federated query uses Lambda to query the underlying data source and parallelizes execution across many workers.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into the Lambda function’s memory, the connector spills the data to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Snowflake instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Snowflake instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have a Snowflake data warehouse up and running.

Create a secret for the Snowflake instance

Our first step is to create a secret for the Snowflake instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Snowflake instance.
  5. For Secret name, enter a name for your secret. Use the prefix snowflake so it’s easy to find.

  1. Leave the remaining fields at their defaults and choose Next.
  2. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use athena-accelerator/snowflake.

Configure Athena federation with the Snowflake instance

To configure Athena data source connector for Snowflake with your Snowflake instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzSnowflakeAthenaJDBC.

  1. For Application name, enter TrianzSnowflakeAthenaJDBC.
  2. For SecretNamePrefix, enter trianz-snowflake-athena.
  3. For SpillBucket, enter athena-accelerator/snowflake.
  4. For JDBCConnectorConfig, use the format snowflake://jdbc:snowflake://{snowflake_instance_url}/?warehouse={warehousename}&db={databasename}&schema={schemaname}&${secretname}

For example, we enter snowflake://jdbc:snowflake://trianz.snowflakecomputing.com/?warehouse=ATHENA_WH&db=ATHENA_DEV&schema=ATHENA&${trianz-snowflake-athena}.

  5. For DisableSpillEncryption, enter false.

  1. For LambdaFunctionName, enter trsnowflake.
  2. For SecurityGroupID, enter the security group ID where the Snowflake instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, enter the subnets where the Snowflake instance is running, separated by commas.

Make sure the subnet is in a VPC and has a NAT gateway and an internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Before running your federated query, be sure that you have selected Athena engine version 2. You can find the current Athena engine version for any workgroup on the Athena console.
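
If you prefer to check this programmatically rather than in the console, the following Python (boto3) sketch reads the engine version of a workgroup; the workgroup name is an assumption.

import boto3

athena = boto3.client("athena")

# "primary" is the default workgroup; replace with your own workgroup name.
workgroup = athena.get_work_group(WorkGroup="primary")["WorkGroup"]
engine = workgroup["Configuration"]["EngineVersion"]
print("Selected engine version:", engine["SelectedEngineVersion"])
print("Effective engine version:", engine["EffectiveEngineVersion"])
# If the workgroup is still on version 1, you can switch it with update_work_group.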

Run your federated queries using lambda:trsnowflake to run against tables in the Snowflake database. trsnowflake is the name of the Lambda function that you specified for LambdaFunctionName when you deployed the connector in the previous section.

lambda:trsnowflake references a data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following screenshot shows a UNION ALL query example that combines data from a table in the AWS Glue Data Catalog (backed by Amazon S3) with a table in Snowflake.
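
A UNION ALL across the two sources has roughly the following shape. In this Python (boto3) sketch, the database, table, and column names and the query result location are illustrative assumptions, not objects created by this post; only the trsnowflake function name and the ATHENA schema come from the walkthrough.

import boto3

athena = boto3.client("athena")

# Illustrative UNION ALL between a Glue Data Catalog table (S3 data lake)
# and a table exposed through the trsnowflake Lambda function.
QUERY = """
SELECT order_id, order_total, 'datalake' AS source
FROM datalake.orders_history
UNION ALL
SELECT order_id, order_total, 'snowflake' AS source
FROM "lambda:trsnowflake".athena.orders_current
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},  # assumption
)
print("Query execution ID:", response["QueryExecutionId"])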

Key performance best practices

If you’re considering Athena Federated Query with Snowflake, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Snowflake database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from Snowflake to Athena (which could lead to query timeouts or slow performance), you may consider copying data from Snowflake to your S3 data lake.
  • The Snowflake schema, which is an extension of the star schema, is used as a data model in Snowflake. In the Snowflake schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Snowflake. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Snowflake database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Snowflake database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated Query with Snowflake using Lambda. With Athena Federated Query, users can query all of their data and derive business value without building ETL pipelines to bring data from data stores such as Snowflake into the data lake.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Snowflake for better performance. When queries are well written for federation, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Query your Oracle database using Athena Federated Query and join with data in your Amazon S3 data lake

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/query-your-oracle-database-using-athena-federated-query-and-join-with-data-in-your-amazon-s3-data-lake/

If you use data lakes in Amazon Simple Storage Service (Amazon S3) and use Oracle as your transactional data store, you may need to join the data in your data lake with Oracle on Amazon Relational Database Service (Amazon RDS), Oracle running on Amazon Elastic Compute Cloud (Amazon EC2), or an on-premises Oracle database, for example to build a dashboard or create consolidated reporting.

In these use cases, Amazon Athena Federated Query allows you to seamlessly access the data in your Oracle database without having to move the data to the S3 data lake. This removes the overhead of managing such jobs.

In this post, we walk you through a step-by-step configuration to set up Athena Federated query using AWS Lambda to access data in Oracle on Amazon RDS.

For this post, we will be using the Oracle Athena Federated query connector developed by Trianz. The runtime includes Oracle XE running on Amazon EC2 and Amazon RDS. Your Oracle instance can be on Amazon RDS, Amazon EC2, or on premises. You can deploy the Trianz Oracle AFQ connector available in the AWS Serverless Application Repository.

Let’s start with discussing the solution and then detailing the steps involved.

Solution overview

Data federation is the capability to integrate data in another data store using a single interface (Amazon Athena). The following diagram depicts how Athena federation works by using Lambda to integrate with a federated data source.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines to extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

When a federated query is run, Athena identifies the parts of the query that should be routed to the data source connector and executes them with Lambda. The data source connector makes the connection to the source, runs the query, and returns the results to Athena. If the data doesn’t fit into the Lambda function’s memory, the connector spills the data to Amazon S3, where Athena accesses it later.

Athena uses data source connectors which internally use Lambda to run federated queries. Data source connectors are pre-built and can be deployed from the Athena console or from the Serverless Application Repository. Based on the user submitting the query, connectors can provide or restrict access to specific data elements.

To implement this solution, we complete the following steps:

  1. Create a secret for the Oracle instance using AWS Secrets Manager.
  2. Create an S3 bucket and subfolder for Lambda to use.
  3. Configure Athena federation with the Oracle XE instance.
  4. Run federated queries with Athena.

Prerequisites

Before getting started, make sure you have an Oracle database up and running.

Create a secret for the Oracle instance

Our first step is to create a secret for the Oracle instance with a username and password using Secrets Manager.

  1. On the Secrets Manager console, choose Secrets.
  2. Choose Store a new secret.
  3. Select Other types of secrets.
  4. Enter the credentials as key-value pairs (username, password) for your Oracle XE instance.

  1. For Secret name, enter a name for your secret. Use the prefix OracleAFQ so it’s easy to find.
  2. Leave the remaining fields at their defaults and choose Next.
  3. Complete your secret creation.

Create an S3 bucket for Lambda

On the Amazon S3 console, create a new S3 bucket and subfolder for Lambda to use. For this post, we use athena-accelerator/oracle.

Configure Athena federation with the Oracle XE instance

To configure Athena federation with your Oracle instance, complete the following steps:

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. In the search field, enter TrianzOracleAthenaJDBC.

  1. For Application name, enter TrianzOracleAthenaJDBC.
  2. For SecretNamePrefix, enter OracleAFQ_XE.
  3. For SpillBucket, enter athena-accelerator/oracle.
  4. For JDBCConnectorConfig, use the format oracle://jdbc:oracle:thin:${secretname}@//hostname:port/servicename.

For example, we enter oracle://jdbc:oracle:thin:${OracleAFQ_XE}@//12.345.67.89:1521/xe.

  1. For DisableSpillEncryption, enter false.
  2. For LambdaFunctionName, enter oracleconnector.
  3. For SecurityGroupID, enter the security group ID where the Oracle instance is deployed.

Make sure to apply valid inbound and outbound rules based on your connection.

  1. For SpillPrefix, create a folder under the S3 bucket you created and specify the name (for example, athena-spill).
  2. For Subnetids, enter the subnets where the Oracle instance is running, separated by commas.

Make sure the subnet is in a VPC and has a NAT gateway and an internet gateway attached.

  1. Select the I acknowledge check box.
  2. Choose Deploy.

Make sure that the AWS Identity and Access Management (IAM) roles have permissions to access AWS Serverless Application Repository, AWS CloudFormation, Amazon S3, Amazon CloudWatch, AWS CloudTrail, Secrets Manager, Lambda, and Athena. For more information, see Example IAM Permissions Policies to Allow Athena Federated Query.

Run federated queries with Athena

Run your federated queries using lambda:oracleconnector against tables in the Oracle database. oracleconnector is the name of the Lambda function that you specified for LambdaFunctionName when you deployed the connector in the previous section.

lambda:oracleconnector references a data source connector Lambda function using the format lambda:MyLambdaFunctionName. For more information, see Writing Federated Queries.

The following query joins the dataset between Oracle and the S3 data lake.
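
For illustration, such a join might look like the following Python (boto3) sketch. The Oracle schema, table, and column names, the data lake table, and the query result location are placeholders rather than objects from this post.

import time
import boto3

athena = boto3.client("athena")

# Illustrative join between an Oracle table exposed through the oracleconnector
# Lambda function and an S3 data lake table in the AWS Glue Data Catalog.
QUERY = """
SELECT p.product_id, p.product_name, s.sale_date, s.amount
FROM "lambda:oracleconnector".xe.products p
JOIN datalake.sales s
  ON p.product_id = s.product_id
WHERE s.sale_date >= DATE '2021-01-01'
LIMIT 10
"""

qid = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},  # assumption
)["QueryExecutionId"]

# Poll until the query finishes, then print its final state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)
print(qid, state)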

Key performance best practices

If you’re considering Athena Federated query with Oracle, we recommend the following best practices:

  • Athena Federated query works great for queries with predicate filtering because the predicates are pushed down to the Oracle database. Use filter and limited-range scans in your queries to avoid full table scans.
  • If your SQL query requires returning a large volume of data from the Oracle database to Athena (which could lead to query timeouts or slow performance), unload the large tables in your query from Oracle to your S3 data lake.
  • The star schema is a commonly used data model in Oracle. In the star schema model, unload your large fact tables into your S3 data lake and leave the dimension tables in Oracle. If large dimension tables are contributing to slow performance or query timeouts, unload those tables to your S3 data lake.
  • When you run federated queries, Athena spins up multiple Lambda functions, which causes a spike in database connections. It’s important to monitor the Oracle database WLM queue slots to ensure there is no queuing. Additionally, you can use concurrency scaling on your Oracle database cluster to benefit from concurrent connections to queue up.

Conclusion

In this post, you learned how to configure and use Athena Federated query with Oracle. Now you don’t need to wait for all the data in your Oracle data warehouse to be unloaded to Amazon S3 and maintained on a day-to-day basis to run your queries.

You can use the best practice considerations outlined in the post to help minimize the data transferred from Oracle for better performance. When queries are well written for Federated query, the performance penalties are negligible.

For more information, see the Athena User Guide and Using Amazon Athena Federated Query.


About the Author

Navnit Shukla is an AWS Specialist Solution Architect in Analytics. He is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Create a secure data lake by masking, encrypting data, and enabling fine-grained access with AWS Lake Formation

Post Syndicated from Shekar Tippur original https://aws.amazon.com/blogs/big-data/create-a-secure-data-lake-by-masking-encrypting-data-and-enabling-fine-grained-access-with-aws-lake-formation/

You can build data lakes with millions of objects on Amazon Simple Storage Service (Amazon S3) and use AWS native analytics and machine learning (ML) services to process, analyze, and extract business insights. You can use a combination of our purpose-built databases and analytics services like Amazon EMR, Amazon Elasticsearch Service (Amazon ES), and Amazon Redshift as the right tool for your specific job and benefit from optimal performance, scale, and cost.

In this post, you learn how to create a secure data lake using AWS Lake Formation for processing sensitive data. The data (simulated patient metrics) is ingested through a serverless pipeline to identify, mask, and encrypt sensitive data before storing it securely in Amazon S3. After the data has been processed and stored, you use Lake Formation to define and enforce fine-grained access permissions to provide secure access for data analysts and data scientists.
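
As a concrete illustration of what fine-grained permissions look like, the following Python (boto3) sketch grants a persona SELECT access to only selected columns of a table. The database, table, and column names and the IAM user ARN are illustrative assumptions, not the exact resources this walkthrough creates.

import boto3

lakeformation = boto3.client("lakeformation")

# Illustrative grant: the business analyst persona can SELECT only
# non-sensitive columns of the curated patient table.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/secure-lf-business-analyst"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "secure_datalake",  # placeholder database name
            "Name": "patient_curated",          # placeholder table name
            "ColumnNames": ["patient_id", "heart_rate", "body_temperature"],
        }
    },
    Permissions=["SELECT"],
)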

Target personas

The proposed solution focuses on the following personas, with each one having different level of access:

  • Cloud engineer – As the cloud infrastructure engineer, you implement the architecture but may not have access to the data itself or to define access permissions
  • secure-lf-admin – As a data lake administrator, you configure the data lake setting and assign data stewards
  • secure-lf-business-analyst – As a business analyst, you shouldn’t be able to access sensitive information
  • secure-lf-data-scientist – As a data scientist, you shouldn’t be able to access sensitive information

Solution overview

We use the following AWS services for ingesting, processing, and analyzing the data:

  • Amazon Athena is an interactive query service that can query data in Amazon S3 using standard SQL, with tables defined in an AWS Glue Data Catalog. The data can be accessed via JDBC for further processing, such as displaying in business intelligence (BI) dashboards.
  • Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, and more. The logs from AWS Glue jobs and AWS Lambda functions are saved in CloudWatch logs.
  • Amazon Comprehend is a natural language processing (NLP) service that uses ML to uncover information in unstructured data.
  • Amazon DynamoDB is a NoSQL database that delivers single-digit millisecond performance at any scale and is used to avoid processing duplicate files.
  • AWS Glue is a serverless data preparation service that makes it easy to extract, transform, and load (ETL) data. An AWS Glue job encapsulates a script that reads, processes, and writes data to a new schema. This solution uses Python 3.6 AWS Glue jobs for ETL processing.
  • AWS IoT provides the cloud services that connect your internet of things (IoT) devices to other devices and AWS Cloud services.
  • Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services.
  • AWS Lake Formation makes it easy to set up, secure, and manage your data lake. With Lake Formation, you can discover, cleanse, transform, and ingest data into your data lake from various sources; define fine-grained permissions at the database, table, or column level; and share controlled access across analytic, ML, and ETL services.
  • Amazon S3 is a scalable object storage service that hosts the raw data files and processed files in the data lake for millisecond access.

You can enhance the security of your sensitive data with the following methods:

  • Implement encryption at rest using AWS Key Management Service (AWS KMS) and customer managed encryption keys (a minimal sketch of this follows the list)
  • Instrument AWS CloudTrail and audit logging
  • Restrict access to AWS resources based on the least privilege principle
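For the first method, a minimal sketch of enforcing default bucket encryption with a customer managed KMS key might look like the following; the bucket name and key ARN are placeholders, and the CloudFormation template in this post may configure this differently:

import boto3

s3 = boto3.client("s3")

# Enforce SSE-KMS with a customer managed key as the bucket default (names are placeholders)
s3.put_bucket_encryption(
    Bucket="secure-data-lake-<ACCOUNT_ID>",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:<ACCOUNT_ID>:key/<KEY_ID>",
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)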

Architecture overview

The solution emulates diagnostic devices sending Message Queuing Telemetry Transport (MQTT) messages onto an AWS IoT Core topic. We use Kinesis Data Firehose to preprocess and stage the raw data in Amazon S3. We then use AWS Glue for ETL to further process the data by calling Amazon Comprehend to identify any sensitive information. Finally, we use Lake Formation to define fine-grained permissions that restrict access to business analysts and data scientists who use Athena to query the data.

The following diagram illustrates the architecture for our solution.

Prerequisites

To follow the deployment walkthrough, you need an AWS account. Use us-east-1 or us-west-2 as your Region.

For this post, make sure you don’t have Lake Formation enabled in your AWS account.

Stage the data

Download the zipped archive file to use for this solution and unzip the files locally. The patient.csv file contains dummy data created to help demonstrate masking, encryption, and granting fine-grained access. The send-messages.sh script randomly generates simulated diagnostic data to represent body vitals. The AWS Glue job uses the glue-script.py script to perform ETL that detects sensitive information, masks and encrypts data, and populates the curated table in the AWS Glue Data Catalog.

Create an S3 bucket called secure-datalake-scripts-<ACCOUNT_ID> via the Amazon S3 console. Upload the scripts and CSV files to this location.

Deploy your resources

For this post, we use AWS CloudFormation to create our data lake infrastructure.

  1. Choose Launch Stack:
  2. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names before deploying.

The stack takes approximately 5 minutes to complete.

The following screenshot shows the key-value outputs the stack created. We use the TestUserPassword parameter for the Lake Formation personas to sign in to the AWS Management Console.

Load the simulation data

Sign in to the AWS CloudShell console and wait for the terminal to start.

Stage the send-messages.sh script by running the Amazon S3 copy command:

aws s3 cp s3://secure-datalake-scripts-<ACCOUNT_ID>/send-messages.sh .

Run your script by using the following command:

sh send-messages.sh

The script runs for a few minutes and emits 300 messages. It sends MQTT messages to the secure_iot_device_analytics topic; the messages are filtered by IoT rules, processed by Kinesis Data Firehose, and converted to Parquet format. After a minute, data starts showing up in the raw bucket.
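For reference, each message the script emits could equivalently be published from Python with boto3; the payload fields below are illustrative and not the exact schema send-messages.sh uses:

import json
import random
import boto3

iot = boto3.client("iot-data")

# Publish one simulated vitals reading to the topic the IoT rule listens on
payload = {
    "pid": "patient-001",
    "heart_rate": random.randint(55, 110),
    "temperature": round(random.uniform(36.0, 39.0), 1),
}
iot.publish(topic="secure_iot_device_analytics", qos=1, payload=json.dumps(payload))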

Run the AWS Glue ETL pipeline

Run the AWS Glue workflow (secureGlueWorkflow) from the AWS Glue console; you can also schedule it to run using CloudWatch. The workflow takes approximately 10 minutes to complete.

The AWS Glue job that is triggered as part of the workflow (ProcessSecureData) joins the patient metadata and patient metrics data. See the following code:

# Join patient metadata and patient metrics frames on their patient ID keys
from awsglue.transforms import Join

combined_df = Join.apply(patient_metadata, patient_metrics, 'PatientId', 'pid', transformation_ctx = "combined_df")

The ensuing dataframe contains sensitive information like FirstName, LastName, DOB, Address1, Address2, and AboutYourself. AboutYourself is freeform text entered by the patient during registration. In the following code snippet, the detect_sensitive_info function calls the Amazon Comprehend API to identify personally identifiable information (PII):

from awsglue.transforms import Map

# Apply groupBy to get unique AboutYourself records
group_df = combined_df.toDF().groupBy("pid", "DOB", "FirstName", "LastName", "Address1", "Address2", "AboutYourself").count()
# Apply detect_sensitive_info to get the redacted string after masking PII data
df_with_about_yourself = Map.apply(frame = group_df, f = detect_sensitive_info)
# Apply encryption to the identified fields
df_with_about_yourself_encrypted = Map.apply(frame = group_df, f = encrypt_rows)

Amazon Comprehend returns an object that has information about the entity name and entity type. Based on your needs, you can filter the entity types that need to be masked.
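As an illustration of that filtering, a detect_sensitive_info-style function could call the Comprehend DetectPiiEntities API and overwrite the flagged spans; the entity types and masking character here are assumptions, and the actual glue-script.py logic may differ:

import boto3

comprehend = boto3.client("comprehend")

def mask_pii(text, entity_types=("NAME", "ADDRESS", "DATE_TIME")):
    # Ask Comprehend for PII entities and replace the selected spans with '*'
    entities = comprehend.detect_pii_entities(Text=text, LanguageCode="en")["Entities"]
    masked = list(text)
    for entity in entities:
        if entity["Type"] in entity_types:
            for i in range(entity["BeginOffset"], entity["EndOffset"]):
                masked[i] = "*"
    return "".join(masked)

print(mask_pii("I am John Doe and I live at 123 Main Street"))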

These fields are masked, encrypted, and written to their respective S3 buckets where fine-grained access controls are applied via Lake Formation:

  • Masked data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data/
  • Encrypted data – s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data/
  • Curated data – s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data/

Now that the tables have been defined, we review permissions using Lake Formation.

Enable Lake Formation fine-grained access

To enable fine-grained access, we first add a Lake Formation admin user (a scripted equivalent of the following console steps is sketched after the list).

  1. On the Lake Formation console, select Add other AWS users or roles.
  2. On the drop-down menu, choose secure-lf-admin.
  3. Choose Get started.
  4. In the navigation pane, choose Settings.
  5. On the Data Catalog Settings page, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
  6. Choose Save.
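If you prefer to script these settings, a boto3 equivalent of the console steps could look like this; the admin ARN is a placeholder:

import boto3

lf = boto3.client("lakeformation")

# Make secure-lf-admin a data lake administrator and stop defaulting to IAM-only access control
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["DataLakeAdmins"] = [{"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:user/secure-lf-admin"}]
settings["CreateDatabaseDefaultPermissions"] = []
settings["CreateTableDefaultPermissions"] = []
lf.put_data_lake_settings(DataLakeSettings=settings)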

Grant access to different personas

Before we grant permissions to different user personas, let’s register the S3 locations in Lake Formation so these personas can access S3 data without granting access through AWS Identity and Access Management (IAM).

  1. On the Lake Formation console, choose Register and ingest in the navigation pane.
  2. Choose Data lake locations.
  3. Choose Register location.
  4. Find and select each of the following S3 buckets and choose Register location:
    1. s3://secure-raw-bucket-<ACCOUNT_ID>/temp-raw-table
    2. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-encrypted-data
    3. s3://secure-data-lake-<ACCOUNT_ID>/secure-dl-curated-data
    4. s3://secure-data-lake-masked-<ACCOUNT_ID>/secure-dl-masked-data

We’re now ready to grant access to our different users.

Grant read-only access to all the tables to secure-lf-admin

First, we grant read-only access to all the tables for the user secure-lf-admin.

  1. Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
  2. Navigate to the AWS Lake Formation console.
  3. Under Data Catalog, choose Databases.
  4. Select the database secure-db.
  5. On the Actions drop-down menu, choose Grant.
  6. Select IAM users and roles.
  7. Choose the role secure-lf-admin.
  8. Under Policy tags or catalog resources, select Named data catalog resources.
  9. For Database, choose the database secure-db.
  10. For Tables, choose All tables.
  11. Under Permissions, select Table permissions.
  12. For Table permissions, select Super.
  13. Choose Grant.
  14. Choose the secure_dl_curated_data table.
  15. On the Actions drop-down menu, choose View permissions.
  16. Select IAMAllowedPrincipals, choose Revoke, and then choose Revoke again to confirm.

You can confirm your user permissions on the Data Permissions page.

Grant read-only access to secure-lf-business-analyst

Now we grant read-only access to certain encrypted columns to the user secure-lf-business-analyst.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select the database secure-db and choose View tables.
  3. Select the table secure_dl_encrypted_data.
  4. On the Actions drop-down menu, choose Grant.
  5. Select IAM users and roles.
  6. Choose the role secure-lf-business-analyst.
  7. Under Permissions, select Column-based permissions.
  8. Choose the following columns:
    1. count
    2. address1_encrypted
    3. firstname_encrypted
    4. address2_encrypted
    5. dob_encrypted
    6. lastname_encrypted
  9. For Grantable permissions, select Select.
  10. Choose Grant.
  11. Choose the secure_dl_encrypted_data table.
  12. On the Actions drop-down menu, choose View permissions.
  13. Select IAMAllowedPrincipals, choose Revoke, and then choose Revoke again to confirm.

You can confirm your user permissions on the Data Permissions page.
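The same column-level grant can also be issued with boto3; the principal ARN is a placeholder, and the column list mirrors the console steps above:

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on only the encrypted columns to secure-lf-business-analyst
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::<ACCOUNT_ID>:user/secure-lf-business-analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "secure-db",
            "Name": "secure_dl_encrypted_data",
            "ColumnNames": [
                "count",
                "address1_encrypted",
                "firstname_encrypted",
                "address2_encrypted",
                "dob_encrypted",
                "lastname_encrypted",
            ],
        }
    },
    Permissions=["SELECT"],
)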

Grant read-only access to secure-lf-data-scientist

Lastly, we grant read-only access to masked data to the user secure-lf-data-scientist.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Select the database secure-db and choose View tables.
  3. Select the table secure_dl_masked_data.
  4. On the Actions drop-down menu, choose Grant.
  5. Select IAM users and roles.
  6. Choose the role secure-lf-data-scientist.
  7. Under Permissions, select Table permissions.
  8. For Table permissions, select Select.
  9. Choose Grant.
  10. Under Data Catalog, choose Tables.
  11. Choose the secure_dl_masked_data table.
  12. On the Actions drop-down menu, choose View permissions.
  13. Select IAMAllowedPrincipals, choose Revoke, and then choose Revoke again to confirm.

You can confirm your user permissions on the Data Permissions page.

Query the data lake using Athena from different personas

To validate the permissions of different personas, we use Athena to query against the S3 data lake.

Make sure you set the query result location to the location created as part of the CloudFormation stack (secure-athena-query-<ACCOUNT_ID>). The following screenshot shows the location information in the Settings section on the Athena console.

You can see all the tables listed under secure-db.

  1. Sign in to the console with secure-lf-admin (use the password value for TestUserPassword from the CloudFormation stack) and make sure you’re in the same Region.
  2. Navigate to the Athena console.
  3. Run a SELECT query against the secure_dl_curated_data table.

The user secure-lf-admin should see all the columns, including the encrypted and masked ones.
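The same validation can be scripted; for example, with boto3 you could submit the query on behalf of the signed-in persona. The database name and output location below come from the stack, but treat this as a sketch:

import boto3

athena = boto3.client("athena")

# Submit the validation query; results land in the bucket created by the stack
response = athena.start_query_execution(
    QueryString='SELECT * FROM "secure-db".secure_dl_curated_data LIMIT 10',
    QueryExecutionContext={"Database": "secure-db"},
    ResultConfiguration={"OutputLocation": "s3://secure-athena-query-<ACCOUNT_ID>/"},
)
print(response["QueryExecutionId"])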

Now let’s validate the permissions of secure-lf-business-analyst user.

  1. Sign in to the console with secure-lf-business-analyst.
  2. Navigate to the Athena console.
  3. Run a SELECT query against the secure_dl_encrypted_data table.

The secure-lf-business-analyst user can only view the selected encrypted columns.

Lastly, let’s validate the permissions of secure-lf-data-scientist.

  1. Sign in to the console with secure-lf-data-scientist.
  2. Run a SELECT query against the secure_dl_masked_data table.

The secure-lf-data-scientist user can only view the selected masked columns.

If you try to run a query on different tables, such as secure_dl_curated_data, you get an error message for insufficient permissions.

Clean up

To avoid unexpected future charges, delete the CloudFormation stack.

Conclusion

In this post, we presented a potential solution for processing and storing sensitive data workloads in an S3 data lake. We demonstrated how to build a data lake on AWS to ingest, transform, aggregate, and analyze data from IoT devices in near-real time. This solution also demonstrates how you can mask and encrypt sensitive data, and use fine-grained column-level security controls with Lake Formation, which benefits those with a higher level of security needs.

Lake Formation recently announced the preview for row-level access, and you can sign up for the preview now!


About the Authors

Shekar Tippur is an AWS Partner Solutions Architect. He specializes in machine learning and analytics workloads. He has been helping partners and customers adopt best practices and discover insights from data.

 

 

Ramakant Joshi is an AWS Solution Architect, specializing in the analytics and serverless domain. He has over 20 years of software development and architecture experience, and is passionate about helping customers in their cloud journey.

 

 

Navnit Shukla is AWS Specialist Solution Architect, Analytics, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions.

Vertical Integration Strategy Powered by Amazon EventBridge

Post Syndicated from Tiago Oliveira original https://aws.amazon.com/blogs/architecture/vertical-integration-strategy-powered-by-amazon-eventbridge/

Over the past few years, midsize and large enterprises have adopted vertical integration as part of their strategy to optimize operations and profitability. Vertical integration consists of separating different stages of the production line from other related departments, such as marketing and logistics. Enterprises implement such a strategy to gain full control of their value chain: from raw material production to the assembly lines and the end consumer.

To achieve operational efficiency, enterprises must keep a level of independence between departments. However, this can lead to unstandardized operations and communication issues. Moreover, with this kind of autonomy for independent and dynamic verticals, the enterprise may lose some measure of visibility and control. As a result, it becomes challenging to generate a basic report from multiple departments. This blog post provides a high-level solution to integrate your different business verticals, using an event-driven architecture on top of Amazon EventBridge.

Event-driven architecture

Event-driven architecture is an architectural pattern to model communication between services while decoupling applications from each other. Applications scale and fail independently, and a central event bus facilitates the communication between the services in the enterprise. Instead of a particular application sending a request directly to another, it produces an event. The central event router captures it and forwards the message to the proper destinations.

For instance, when a customer places a new order on the retail website, the application sends the event to the event bus. The event bus then forwards the message to the ERP system and the fulfillment center for dispatch. In this scenario, we call the application sending the event an event publisher, and the applications receiving the event, event consumers.
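As a sketch of the publisher side, the retail website could emit such an event with the EventBridge PutEvents API; the bus name, source, and detail fields below are illustrative:

import json
import boto3

events = boto3.client("events")

# Publish a "new order" business event onto the central bus
events.put_events(
    Entries=[
        {
            "EventBusName": "enterprise-central-bus",
            "Source": "retail.website",
            "DetailType": "OrderPlaced",
            "Detail": json.dumps({"orderId": "12345", "total": 99.90, "currency": "USD"}),
        }
    ]
)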

Because all messages are going through the central event bus, there is clear independence between the applications within the enterprise. Here are some benefits:

  • Application independence occurs even if they belong to the same business workflow
  • You can plug in more event consumers to receive the same event type
  • You can add a data lake to receive all new order events from the retail website
  • You can receive all the events from the payment system and the customer relations department

This ensures you can integrate independent departments, increase overall visibility, and make sense of specific processes happening in the organization using the right tools.

Implementing event-driven architecture with Amazon EventBridge

Each vertical organically generates lifecycle events. Enterprises can use the event-driven architecture paradigm to make the information flow between the departments by asynchronously exchanging events through the event bus. This way, each department can react to events generated by other departments and initiate processes or actions depending on its business needs.

Such an approach creates a dynamic and flexible choreography between the different participants, which is unique to the enterprise. Such choreography can be followed and monitored using analytics and fine-grained event data collected on the data lake. Read Using AWS X-Ray tracing with Amazon EventBridge to learn how to debug and analyze this kind of distributed application.

Figure 1. Architecture diagram depicting enterprise vertical integration with Amazon EventBridge

Figure 1. Architecture diagram depicting enterprise vertical integration with Amazon EventBridge

In Figure 1, Amazon EventBridge works as the central event bus, the core component of this event-driven architecture. Through Amazon EventBridge, each event publisher sends or receives lifecycle events to and from all the other participants. Amazon EventBridge has an advanced routing mechanism using the concept of rules. Each rule defines up to five targets for the event arriving on the bus. Events are selected based on the event pattern. You can set up routing rules to determine where to send your data, building application architectures that react in real time to your data sources while keeping event publishers and consumers decoupled.
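To make the routing concrete, here is a hedged sketch of a rule that matches the order events from the earlier example and fans them out to two targets; the names, ARNs, and role are placeholders:

import json
import boto3

events = boto3.client("events")

# Match OrderPlaced events from the retail website
events.put_rule(
    Name="route-order-placed",
    EventBusName="enterprise-central-bus",
    EventPattern=json.dumps({"source": ["retail.website"], "detail-type": ["OrderPlaced"]}),
)

# Fan the matched events out to the ERP queue and the data lake delivery stream
events.put_targets(
    Rule="route-order-placed",
    EventBusName="enterprise-central-bus",
    Targets=[
        {"Id": "erp-queue", "Arn": "arn:aws:sqs:us-east-1:<ACCOUNT_ID>:erp-orders"},
        {
            "Id": "datalake-firehose",
            "Arn": "arn:aws:firehose:us-east-1:<ACCOUNT_ID>:deliverystream/orders-to-s3",
            "RoleArn": "arn:aws:iam::<ACCOUNT_ID>:role/eventbridge-to-firehose",
        },
    ],
)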

In addition to handling the routing and distribution of events, Amazon EventBridge can also give real-time insights into how the business runs. Using metrics automatically sent to Amazon CloudWatch, it is possible to see which kinds of events are arriving, and at which rate. You can also see how those events are distributed across the registered targets, and any failures that occur during this distribution. Every event can also be archived using the Amazon EventBridge events archiving feature.

Amazon Simple Storage Service (Amazon S3) is the backend storage, or data lake, for all the events that have ever transited the event bus. With Amazon S3, customers have a cost-efficient storage service at any scale, with 11 9’s of durability. To help customers manage and secure their data, S3 provides features such as Amazon S3 Lifecycle to optimize costs, and S3 Object Lock for the write-once-read-many (WORM) model. You can expand this data and transform it into information: using services like Amazon Athena, Amazon Redshift, and Amazon EMR, those events can be transformed, correlated, and aggregated to generate insights on the business. The Amazon S3 data lake can also be the input to a data warehouse, machine learning models, and real-time analytics. Learn more about how to use Amazon S3 as the data lake storage.

A critical feature of this solution is the initiation of complex queries on top of the data lake. Amazon API Gateway provides a single, flexible, and elastic API entry point to retrieve data from the data lake. It can also publish events directly to the event bus. For complex queries, Amazon API Gateway can be integrated with an AWS Lambda function, which coordinates the execution of standard SQL queries using Amazon Athena as the query engine. You can read about a fully functional example of such an API called athena-express.

After collecting data from multiple departments, third-party entities, and shop floors, you can use the data to derive business value using cross-organization dashboards. In this way, you can increase visibility over the different entities and make sense of the data from the distributed systems. Even though this design allows you to use your favorite BI tool, we are using Amazon QuickSight for this solution. For example, with QuickSight, you can author your interactive dashboards, which include machine learning-powered insights. Those dashboards can then connect the marketing campaigns data with the sales data. You can measure how effective those campaigns were and forecast the demand on the production lines.

Conclusion

In this blog post, we showed you how to use Amazon EventBridge as an event bus to allow event-driven architectures. This architecture pattern streamlines the adoption of vertical integration. Enterprises can decouple IT systems from each other while retaining visibility into the data they generate. Integrating those systems can happen asynchronously using a choreography approach instead of having an orchestrator as a central component. There are technical challenges to implement this kind of solution, such as maintaining consistency in distributed applications and transactions spanning multiple microservices. Refer to the saga pattern for microservices-based architecture, and how to implement it using AWS Step Functions.

With a data lake in place to collect all the data produced by IT systems, you can create BI dashboards that provide a holistic view of multiple departments. Moreover, it allows organizations to get better insights into their valuable data and explore other use cases, such as machine learning. To support the data lake creation and management, refer to AWS Lake Formation and a series of other blog posts.

To learn more about Amazon EventBridge from a hands-on perspective, refer to this EventBridge workshop.

Architecting Persona-centric Data Platform with On-premises Data Sources

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/architecting-persona-centric-data-platform-with-on-premises-data-sources/

Many organizations are moving their data from silos and aggregating it in one location. Collecting this data in a data lake enables you to perform analytics and machine learning on that data. You can store your data in purpose-built data stores, like a data warehouse, to get quick results for complex queries on structured data.

In this post, we show how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi. We will also discuss Lake House architecture on AWS, which is the next evolution from data warehouse and data lake-based solutions.

Data movement services

AWS provides a wide variety of services to bring data into a data lake.

You may want to bring on-premises data into the AWS Cloud to take advantage of AWS purpose-built analytics services, derive insights, and make timely business decisions. Apache NiFi is an open source tool that enables you to move and process data using a graphical user interface.

For this use case and solution architecture, we use Apache NiFi to ingest data into Amazon S3 and AWS purpose-built analytics services, based on user personas.

Building persona-centric data platform on AWS

When you are building a persona-centric data platform for analytics and machine learning, you must first identify your user personas. Who will be using your platform? Then choose the appropriate purpose-built analytics services. Envision a data platform analytics architecture as a stack of seven layers:

  1. User personas: Identify your user personas for data engineering, analytics, and machine learning
  2. Data ingestion layer: Bring the data into your data platform and data lineage lifecycle view, while ingesting data into your storage layer
  3. Storage layer: Store your structured and unstructured data
  4. Cataloging layer: Store your business and technical metadata about datasets from the storage layer
  5. Processing layer: Create data processing pipelines
  6. Consumption layer: Enable your user personas for purpose-built analytics
  7. Security and Governance: Protect your data across the layers

Reference architecture

The following diagram illustrates how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi.

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Architecture flow:

    1. Identify user personas: You must first identify user personas to derive insights from your data platform. Let’s start with identifying your users:
      • Enterprise data service users who would like to consume data from your data lake into their respective applications.
      • Business users who would like to create business intelligence dashboards by using your data lake datasets.
      • IT users who would like to query data from your data lake by using traditional SQL queries.
      • Data scientists who would like to run machine learning algorithms to derive recommendations.
      • Enterprise data warehouse users who would like to run complex SQL queries on your data warehouse datasets.
    2. Data ingestion layer: Apache NiFi scans the on-premises data stores and ingests the data into your data lake (Amazon S3). Apache NiFi can also transform the data in transit. It supports both Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) data transformations. Apache NiFi also supports data lineage lifecycle while ingesting data into Amazon S3.
    3. Storage layer: For your data lake storage, we recommend using Amazon S3 to build a data lake. It has unmatched 11 nines of durability and 99.99% availability. You can also create raw, transformed, and enriched storage layers depending upon your use case.
    4. Cataloging layer: AWS Lake Formation provides the central catalog to store and manage metadata for all datasets hosted in the data lake by AWS Glue Data Catalog. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena natively integrate with Lake Formation. They automate discovering and registering dataset metadata into the Lake Formation catalog.
    5. Processing layer: Amazon EMR processes your raw data and places it into a new S3 bucket. Use AWS Glue DataBrew and AWS Glue to process the data as needed.
    6. Consumption layer or persona-centric analytics: Once data is transformed:
      • AWS Lambda and Amazon API Gateway will allow you to develop data services for enterprise data service users
      • You can develop user-friendly dashboards for your business users using Amazon QuickSight
      • Use Amazon Athena to query transformed data for your IT users
      • Your data scientists can utilize AWS Glue DataBrew to clean and normalize the data and Amazon SageMaker for machine learning models
      • Your enterprise data warehouse users can use Amazon Redshift to derive business intelligence
    7. Security and governance layer: AWS IAM provides users, groups, and role-level identity, in addition to the ability to configure coarse-grained access control for resources managed by AWS services in all layers. AWS Lake Formation provides fine-grained access controls: you can grant and revoke permissions at the database, table, or column level.

Lake House architecture on AWS

The vast majority of data lakes are built on Amazon S3. At the same time, customers are leveraging purpose-built analytics stores that are optimized for specific use cases. Customers want the freedom to move data between their centralized data lakes and the surrounding purpose-built analytics stores. And they want to get insights with speed and agility in a seamless, secure, and compliant manner. We call this modern approach to analytics the Lake House architecture.

Figure 2. Lake House architecture on AWS

Figure 2. Lake House architecture on AWS

Refer to the whitepaper Derive Insights from AWS Lake House for various design patterns to derive persona-centric analytics by using the AWS Lake House approach. Check out the blog post Build a Lake House Architecture on AWS for a Lake House reference architecture on AWS.

Conclusion

In this post, we show you how to build a persona-centric data platform on AWS with a seven-layered approach. This uses Apache NiFi as a data ingestion tool and AWS purpose-built analytics services for persona-centric analytics and machine learning. We have also shown how to build persona-centric analytics by using the AWS Lake House approach.

With the information in this post, you can now build your own data platform on AWS to gain faster and deeper insights from your data. AWS provides you the broadest and deepest portfolio of purpose-built analytics and machine learning services to support your business needs.

Read more and get started on building a data platform on AWS.

Enhancing Existing Building Systems with AWS IoT Services

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/enhancing-existing-building-systems-with-aws-iot-services/

With the introduction of cloud technology and by extension the rapid emergence of Internet of Things (IoT), the barrier to entry for creating smart building solutions has never been lower. These solutions offer commercial real estate customers potential cost savings and the ability to enhance their tenants’ experience. You can differentiate your business from competitors by offering new amenities and add new sources of revenue by understanding more about your buildings’ operations.

There are several building management systems to consider in commercial buildings, such as air conditioning, fire, elevator, security, and grey/white water. Each system continues to add more features and become more automated, meaning that control mechanisms use all kinds of standards and protocols. This has led to fragmented building systems and inefficiency.

In this blog, we’ll show you how to use AWS for the Edge to bring these systems into one data path for cloud processing. You’ll learn how to use AWS IoT services to review and use this data to build smart building functions. Some common use cases include:

  • Provide building facility teams a holistic view of building status and performance, alerting them to problems sooner and helping them solve problems faster.
  • Provide a detailed record of the efficiency and usage of the building over time.
  • Use historical building data to help optimize building operations and predict maintenance needs.
  • Offer enriched tenant engagement through services like building control and personalized experiences.
  • Allow building owners to gather granular usage data from multiple buildings so they can react to changing usage patterns in a single platform.

Securely connecting building devices to AWS IoT Core

AWS IoT Core supports connections with building devices, wireless gateways, applications, and services. Devices connect to AWS IoT Core to send and receive data from AWS IoT Core services and other devices. Buildings often use different device types, and AWS IoT Core has multiple options to ingest data and enable connectivity within your building. AWS IoT Core is made up of the following components:

  • Device Gateway is the entry point for all devices. It manages your device connections and supports HTTPS and MQTT (3.1.1) protocols.
  • Message Broker is an elastic and fully managed pub/sub message broker that securely transmits messages (for example, device telemetry data) to and from all your building devices.
  • Registry is a database of all your devices and associated attributes and metadata. It allows you to group devices and services based upon attributes such as building, software version, vendor, class, floor, etc.

The architecture in Figure 1 shows how building devices can connect into AWS IoT Core. AWS IoT Core supports multiple connectivity options:

  • Native MQTT – Many building management systems and device controllers support MQTT out of the box.
  • AWS IoT Device SDK – This option supports MQTT protocol and multiple programming languages.
  • AWS IoT Greengrass – The previous options assume that devices are connected to the internet, but this isn’t always possible. AWS IoT Greengrass extends the cloud to the building’s edge. Devices can connect directly to AWS IoT Greengrass and send telemetry to AWS IoT Core.
  • AWS for the Edge partner products – There are several partner solutions, such as Ignition Edge from Inductive Automation, that offer protocol translation software to normalize in-building sensor data.
Data ingestion options from on-premises devices to AWS

Figure 1. Data ingestion options from on-premises devices to AWS
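As a minimal sketch of the AWS IoT Device SDK option, a lighting controller could publish telemetry over MQTT as follows; the endpoint, certificate paths, topic, and payload fields are placeholders:

import json
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

# Connect a lighting controller to AWS IoT Core over MQTT with X.509 certificates
client = AWSIoTMQTTClient("building-a-floor3-light-07")
client.configureEndpoint("<PREFIX>-ats.iot.us-east-1.amazonaws.com", 8883)
client.configureCredentials("AmazonRootCA1.pem", "light-07.private.key", "light-07.cert.pem")
client.connect()

# Publish one telemetry sample on a structured topic that IoT rules can filter on
client.publish(
    "buildings/building-a/floor3/lighting/telemetry",
    json.dumps({"deviceId": "light-07", "state": "on", "powerW": 11.5, "tempC": 31.2}),
    1,  # QoS 1
)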

Challenges when connecting buildings to the cloud

There are two common challenges when connecting building devices to the cloud:

  • You need a flexible platform to aggregate building device communication data
  • You need to transform the building data to a standard protocol, such as MQTT

Building data is made up of various protocols and formats. Many of these are system-specific or legacy protocols. To overcome this, we suggest processing building device data at the edge, extracting important data points/values before transforming to MQTT, and then sending the data to the cloud.

Transforming protocols can be complex because they can abstract naming and operation types. AWS IoT Greengrass and partner products such as Ignition Edge make it possible to read that data, normalize the naming, and extract useful information for device operation. Combined with AWS IoT Greengrass, this gives you a single way to validate the building device data and standardize its processing.

Using building data to develop smart building solutions

The architecture in Figure 2 shows an in-building lighting system. It is connected to AWS IoT Core and reports on devices’ status and gives users control over connected lights.

The architecture in Figure 2 has two data paths, which we’ll provide details on in the following sections, but here’s a summary:

  1. The “cold” path gathers all incoming data for batch data analysis and historical dashboarding.
  2. The “warm” bidirectional path is for faster, real-time data. It gathers devices’ current state data. This path is used by end-user applications for sending control messages, real-time reporting, or initiating alarms.
Figure 2. Architecture diagram of a building lighting system connected to AWS IoT Core

Figure 2. Architecture diagram of a building lighting system connected to AWS IoT Core

Cold data path

The cold data path gathers all lighting device telemetry data, such as power consumption, operating temperature, health data, etc. to help you understand how the lighting system is functioning.

Building devices can often deliver unstructured, inconsistent, and large volumes of data. AWS IoT Analytics helps clean up this data by applying filters, transformations, and enrichment from other data sources before storing it. By using Amazon Simple Storage Service (Amazon S3), you can analyze your data in different ways. Here we use Amazon Athena and Amazon QuickSight for building operational dashboard visualizations.

Let’s discuss a real-world example. For building lighting systems, understanding your energy consumption is important for evaluating energy and cost efficiency. Data ingested into AWS IoT Core can be stored long term in Amazon S3, making it available for historical reporting. Athena and QuickSight can quickly query this data and build visualizations that show lighting state (on or off) and energy consumption over a set period of time. You can also overlay this data with sunrise and sunset data to provide insight into whether you are using your lighting systems efficiently. For example, you could adjust the lighting schedule according to the darker winter months versus the brighter summer months.

Warm data path

In the warm data path, the AWS IoT Device Shadow service makes the device state available. Shadow updates are forwarded by an AWS IoT rule into downstream services such as AWS IoT Events, which tracks and monitors multiple devices and data points and then initiates actions based on specific events. Further, you could build APIs that interact with AWS IoT Device Shadow. In this architecture, we have used AWS AppSync and AWS Lambda to enable building controls via a tenant smartphone application.
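To illustrate the shadow interaction, an application (for example, the Lambda function behind AWS AppSync) could read and update a light's state like this; the thing name and state document are placeholders:

import json
import boto3

iot_data = boto3.client("iot-data")

# Read the last reported state of a light
shadow = json.loads(iot_data.get_thing_shadow(thingName="building-a-floor3-light-07")["payload"].read())
print(shadow["state"].get("reported"))

# Request a new desired state; the device reconciles it and reports back
iot_data.update_thing_shadow(
    thingName="building-a-floor3-light-07",
    payload=json.dumps({"state": {"desired": {"power": "off"}}}),
)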

Let’s discuss a real-world example. In an office meeting room lighting system, maintaining a certain brightness level is important for health and safety. If that space is unoccupied, you can save money by turning the lighting down or off. AWS IoT Events can take inputs from lumen sensors, lighting systems, and motorized blinds and put them into a detector model. This model calculates and prompts the best action to maintain the room’s brightness throughout the day. If the lumen level drops below a specific brightness threshold in a room, AWS IoT Events could prompt an action to maintain an optimal brightness level in the room. If an occupancy sensor is added to the room, the model can know if someone is in the room and maintain the lighting state. If that person leaves, it will turn off that lighting. The ongoing calculation of state can also evaluate the time of day or weather conditions. It would then select the most economical option for the room, such as opening the window blinds rather than turning on the lighting system.

Conclusion

In this blog, we demonstrated how to collect and aggregate the data produced by on-premises building management platforms. We discussed how augmenting this data with the AWS IoT Core platform allows for development of smart building solutions such as building automation and operational dashboarding. AWS products and services can enable your buildings to be more efficient while also providing engaging tenant experiences. For more information on how to get started, please check out our getting started with AWS IoT Core developer guide.

Building a Cloud-based OLAP Cube and ETL Architecture with AWS Managed Services

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-cloud-based-olap-cube-and-etl-architecture-with-aws-managed-services/

For decades, enterprises used online analytical processing (OLAP) workloads to answer complex questions about their business by filtering and aggregating their data. These complex queries were compute and memory-intensive. This required teams to build and maintain complex extract, transform, and load (ETL) pipelines to model and organize data, oftentimes with commercial-grade analytics tools.

In this post, we discuss building a cloud-based OLAP cube and ETL architecture that will yield faster results at lower costs without sacrificing performance by:

  • Connecting your on-premises database to the cloud for data profiling, discovery, and transformation
  • Running OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate data
  • Using AWS Glue Data Catalog, Amazon Athena, Amazon QuickSight, and Amazon SageMaker to catalog and visualize data with machine learning (ML)

Data analytics pipeline with AWS Managed Services

The proposed architecture in Figure 1 relies on AWS Managed Services. AWS Glue DataBrew is a no-code data transformation service that you can use to quickly build your transformation jobs. AWS Glue crawlers collect metadata from the transformed data and catalog it for analytics and visualization using Athena and QuickSight. SageMaker is used to build, train, and deploy ML models.

This architecture will help you get answers from your data to your users as fast as possible without needing to migrate your data to AWS. There is no coding required, so you can leverage data transformation, cataloging, analytics, and ML quickly.

Figure 1. Example architecture using AWS Managed Services

Figure 1. Example architecture using AWS Managed Services

Benefits of AWS Managed Services for data analytics

Immediate connectivity to on-premises databases

The example architecture in Figure 1 begins with an online transaction processing (OLTP) database running in your corporate data center. Figure 2 shows how you can establish a Java database connectivity (JDBC) connection from the OLTP database to DataBrew running in AWS to run OLAP workloads. DataBrew supports data sources using JDBC for common data stores such as Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.

DataBrew - JDBC connection to data source

Figure 2. DataBrew – JDBC connection to data source

Automatic data discovery

Figures 3 through 6 show how DataBrew summarizes your data for discovery. You can profile your data to understand patterns and detect anomalies. You can also run transformations called “jobs” in DataBrew without writing any code using over 250 built-in transforms.

DataBrew - dataset profiling overview

Figure 3. DataBrew – dataset profiling overview

 

DataBrew - data correlation patterns

Figure 4. DataBrew – data correlation patterns

 

DataBrew - data points distribution

Figure 5. DataBrew – data points distribution

No-code data transformation and cataloging

To run OLAP-type transactions, you can create jobs based on the transformation steps shown in Figure 6. These steps are collectively referred to as a DataBrew recipe. A recipe can be run as a job, with the output written to an Amazon Simple Storage Service (Amazon S3) bucket.

A DataBrew project user interface view with sample data and transformation functions

Figure 6. A DataBrew project user interface view with sample data and transformation functions

Scheduled DataBrew jobs act similarly to scheduled ETL pipelines in OLAP. Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.
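A scripted equivalent of such a schedule could look like the following; the schedule name, job name, and CRON expression are placeholders rather than values from this post:

import boto3

databrew = boto3.client("databrew")

# Run the recipe job every 12 hours, analogous to a scheduled ETL pipeline
databrew.create_schedule(
    Name="olap-transform-every-12h",
    JobNames=["oltp-to-olap-recipe-job"],
    CronExpression="cron(0 */12 * * ? *)",
)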

The OLAP catalog is a set of metadata that sits between the actual OLAP data stored and applications. To create a Data Catalog, you can use AWS Glue crawlers to automatically classify your data to determine the data’s format, schema, and associated properties. Figure 7 shows a crawler’s results written to the Data Catalog as metadata that helps data users find the data they need.

AWS Glue crawler metadata table output of column names and data types

Figure 7. AWS Glue crawler metadata table output of column names and data types

Data analytics without third-party software licenses

You can run analytics on your data by referring to the metadata definitions in the Data Catalog as references to the actual data in Amazon S3 using Athena. Athena is well suited for running one-time queries using standard SQL to query the transformed data directly in Amazon S3 without having to move data around. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Enterprises often supplement their OLAP workloads with separate visualization and business intelligence (BI) tools. These tools often come with their own licensing, server management, and security considerations.

You can visualize curated data using QuickSight, a scalable, serverless, embeddable, ML-powered BI service. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights, as shown in Figure 8. These dashboards can be shared with other users and embedded within your own applications.

A sample of data visualization options with Amazon QuickSight

Figure 8. A sample of data visualization options with Amazon QuickSight

Finally, you can incorporate ML workloads to OLAP workloads using SageMaker. In the past, ML workloads were often expensive, resource-intensive, and inaccessible. SageMaker provides a fully managed ML service to quickly and easily build and train ML models and directly deploy them into a production-ready hosted environment.

Conclusion

In this post, we show you how to connect your on-premises database using a JDBC connection to DataBrew for data profiling, discovery, and transformation. We looked at how you can use DataBrew recipes and jobs to run OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate any data. We also looked at AWS capabilities in data cataloging, visualization, and machine learning using Data Catalog, Athena, QuickSight, and SageMaker without having to manage any servers.

Laying the foundation to modernize an analytics workflow is critical for many enterprises that are looking to reduce the time it takes to understand their business. With AWS, you can perform enterprise-scale analytics with our portfolio of analytics services.

 

Hydrate your data lake with SaaS application data using Amazon AppFlow

Post Syndicated from Ninad Phatak original https://aws.amazon.com/blogs/big-data/hydrate-your-data-lake-with-saas-application-data-using-amazon-appflow/

Organizations today want to make data-driven decisions. The data could lie in multiple source systems, such as line of business applications, log files, connected devices, social media, and more. As organizations adopt software as a service (SaaS) applications, data becomes increasingly fragmented and trapped in different “data islands.” To make decision-making easier, organizations are building data lakes. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale, as is, without having to first structure it, and run different types of analytics, from dashboards and visualizations to big data processing, ad hoc analytics, and machine learning (ML), to guide better decisions.

AWS provides services such as AWS Glue, AWS Lake Formation, AWS Database Migration Service (AWS DMS), and many third-party solutions on AWS Marketplace to integrate data from various source systems into the Amazon Simple Storage Service (Amazon S3) data lake. If you’re using SaaS applications like Salesforce, Marketo, Slack, and ServiceNow to run your business, you may need to integrate data from these sources into your data lake. You likely also want to easily integrate these data sources without writing or managing any code. This is precisely where you can use Amazon AppFlow.

Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications like Salesforce, Marketo, Slack, and ServiceNow and AWS services like Amazon S3 and Amazon Redshift. With Amazon AppFlow, you can run data flows at nearly any scale at the frequency you choose—on a schedule, in response to a business event in real time, or on demand. You can configure data transformations such as data masking and concatenation of fields as well as validate and filter data (omitting records that don’t fit a criteria) to generate rich, ready-to-use data as part of the flow itself, without additional steps. Amazon AppFlow automatically encrypts data in motion, and optionally allows you to restrict data from flowing over the public internet for SaaS applications that are integrated with AWS PrivateLink, reducing exposure to security threats. For a complete list of all the SaaS applications that can be integrated with Amazon AppFlow, see Amazon AppFlow integrations.

In this post, we look at how to integrate data from Salesforce into a data lake and query the data via Amazon Athena. Amazon AppFlow recently announced multiple new capabilities such as availability of APIs and integration with AWS CloudFormation. We take advantage of these new capabilities and deploy the solution using a CloudFormation template.

Solution architecture

The following diagram depicts the architecture of the solution that we deploy using AWS CloudFormation.

As seen in the diagram, we use Amazon AppFlow to integrate data from Salesforce into a data lake on Amazon S3. We then use Athena to query this data with the table definitions residing in the AWS Glue Data Catalog.

Deploy the solution with AWS CloudFormation

We use AWS CloudFormation to deploy the solution components in your AWS account. Choose an AWS Region for deployment where the following services are available:

  • Amazon AppFlow
  • AWS Glue
  • Amazon S3
  • Athena

You need to meet the following prerequisites before deploying the solution:

  • Have a Salesforce account with credentials authorized to pull data using APIs.
  • If you’re deploying the stack in an account using the Lake Formation permission model, validate the following settings:
    • The AWS Identity and Access Management (IAM) user used to deploy the stack is added as a data lake administrator under Lake Formation, or the IAM user used to deploy the stack has IAM privileges to create databases in the AWS Glue Data Catalog.
    • The Data Catalog settings under Lake Formation are configured to use only IAM access control for new databases and new tables in new databases. This makes sure that all access to the newly created databases and tables in the Data Catalog are controlled solely using IAM permissions. The following screenshot shows the Data catalog settings page on the Lake Formation console, where you can set these permissions.

These Lake Formation settings are required so that all permissions to the Data Catalog objects are controlled using IAM only.

Although you need these Lake Formation settings for the CloudFormation stack to deploy properly, in a production setting we recommend you use Lake Formation to govern access to the data in the data lake. For more information about Lake Formation, see What Is AWS Lake Formation?

We now deploy the solution and the following components:

  • An Amazon AppFlow flow to integrate Salesforce account data into Amazon S3
  • An AWS Glue Data Catalog database
  • An AWS Glue crawler to crawl the data pulled into Amazon S3 so that it can be queried using Athena

First, create the Amazon AppFlow connection to your Salesforce account:
  1. On the Amazon AppFlow console, on the Connections page, choose Create connection.
  2. For Connection name, enter a name for your connection.
  3. Choose Continue.

You’re redirected to the Salesforce login page, where you enter your Salesforce account credentials.

  1. Enter the appropriate credentials and grant OAuth2 access to the Amazon AppFlow client in the next step, after which a new connector profile is set up in your AWS account.
  2. To deploy the remaining solution components, choose Launch Stack:
  3. For Stack name, enter an appropriate name for the CloudFormation stack.
  4. For Parameters, enter the name of the Salesforce connection you created.
  5. Choose Next.
  6. Follow through the CloudFormation stack creation wizard, leaving rest of the default values unchanged.
  7. On the final page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  8. Choose Create stack.
  9. Wait for the stack status to change to CREATE_COMPLETE.
  10. On the Outputs tab of the stack, record the name of the S3 bucket.

Run the flow

The CloudFormation stack has deployed a flow named SFDCAccount. Open the flow to see the configuration. The flow has been configured to do the following:

  • Pull the account object from your Salesforce account into an S3 bucket. The flow pulls certain attributes from the object in Parquet format.
  • Mask the last five digits of the phone number associated with the Salesforce account.
  • Build a validation on the Account ID field that ignores the record if the value is NULL.

Make sure that all these attributes pulled by the flow are part of your account object in Salesforce. Make any additional changes that you may want to the flow and save the flow.
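If you prefer not to use the console for subsequent runs, the flow can also be started programmatically; this sketch assumes the flow name deployed by the stack:

import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of the flow deployed by the CloudFormation stack
response = appflow.start_flow(flowName="SFDCAccount")
print(response["flowStatus"])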

  1. Run the flow by choosing Run flow.
  2. When the flow is complete, navigate to the S3 bucket created by the CloudFormation stack to confirm its contents.

The Salesforce account data is stored in Parquet format in the SFDCData/SFDCAccount/ folder in the S3 bucket.

  1. On the AWS Glue console, run the crawler AppFlowGlueCrawler.

This crawler has been created by the CloudFormation stack and is configured to crawl the S3 bucket and create a table in the appflowblogdb database in the Data Catalog.

When the crawler is complete, a table named SFDCAccount exists in the appflowblogdb database.

  1. On the Athena console, run the following query:
    Select * from appflowblogdb.SFDCAccount limit 10;

The output shows the data pulled by the Amazon AppFlow flow into the S3 bucket.

Clean up

When you’re done exploring the solution, complete the following steps to clean up the resources deployed by AWS CloudFormation:

  1. Empty the S3 bucket created by the CloudFormation stack.
  2. Delete the CloudFormation stack.

Conclusion

In this post, we saw how you can easily set up an Amazon AppFlow flow to integrate data from Salesforce into your data lake. Amazon AppFlow allows you to integrate data from many other SaaS applications into your data lake. After the data lands in Amazon S3, you can take it further for downstream processing using services like Amazon EMR and AWS Glue. You can then use the data in the data lake for multiple analytics use cases, ranging from dashboards to ad hoc analytics and ML.


About the Authors

Ninad Phatak is a Principal Data Architect at Amazon Development Center India. He specializes in data engineering and data warehousing technologies and helps customers architect their analytics use cases and platforms on AWS.

 

 

 

Vinay Kondapi is Head of product for Amazon AppFlow. He specializes in Application and data integration with SaaS products at AWS.

 

 

 

Build a data quality score card using AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight

Post Syndicated from Nitin Aggarwal original https://aws.amazon.com/blogs/big-data/build-a-data-quality-score-card-using-aws-glue-databrew-amazon-athena-and-amazon-quicksight/

Data quality plays an important role when building an extract, transform, and load (ETL) pipeline for sending data to downstream analytical applications and machine learning (ML) models. The adage “garbage in, garbage out” aptly describes why it’s important to filter out bad data before further processing. Continuously monitoring data quality and comparing it with predefined target metrics helps you comply with your governance frameworks.

In November 2020, AWS announced the general availability of AWS Glue DataBrew, a new visual data preparation tool that helps you clean and normalize data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

In this post, we walk through a solution in which we apply various business rules to determine the quality of incoming data and separate good and bad records. Furthermore, we publish a data quality score card using Amazon QuickSight and make records available for further analysis.

Use case overview

For our use case, we use a public dataset that is available for download at Synthetic Patient Records with COVID-19. It contains 100,000 synthetic patient records in CSV format. Data hosted within SyntheticMass has been generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.

When we unzip the 100k_synthea_covid19_csv.zip file, we see the following CSV files:

  • Allergies.csv
  • Careplans.csv
  • Conditions.csv
  • Devices.csv
  • Encounters.csv
  • Imaging_studies.csv
  • Immunizations.csv
  • Medications.csv
  • Observations.csv
  • Organizations.csv
  • Patients.csv
  • Payer_transitions.csv
  • Payers.csv
  • Procedures.csv
  • Providers.csv
  • Supplies.csv

We perform the data quality checks categorized by the following data quality dimensions:

  • Completeness
  • Consistency
  • Integrity

For our use case, these CSV files are maintained by your organization’s data ingestion team, which uploads the updated CSV file to Amazon Simple Storage Service (Amazon S3) every week. The good and bad records are separated through a series of data preparation steps, and the business team uses the output data to create business intelligence (BI) reports.

Architecture overview

The following architecture uses DataBrew for data preparation and building key metrics, Amazon Athena for data analysis with standard SQL, and QuickSight for building the data quality score card.

The workflow includes the following steps:

  1. The ingestion team receives CSV files in an S3 input bucket every week.
  2. The DataBrew job scheduled to run every week triggers the recipe job.
  3. DataBrew processes the input files and generates output files that contain additional fields depending on the recipe job logic.
  4. After the output data is written, we create an external table on top of it by creating and running an AWS Glue crawler.
  5. The good and bad records are separated by creating views on top of the external table.
  6. Data analysts can use Athena to analyze good and bad records.
  7. The records can also be separated directly using QuickSight calculated fields.
  8. We use QuickSight to create the data quality score card in the form of a dashboard, which fetches data through Athena.

Prerequisites

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

Additionally, create the S3 input and output buckets to capture the data, and upload the input data into the input bucket.

Create DataBrew datasets

To create a DataBrew dataset for the patient data, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for this post, Patients).
  4. For Enter your source from S3, enter the S3 path of the patients input CSV.
  5. Choose Create Dataset.

Repeat these steps to create datasets for other CSV files, such as encounters, conditions, and so on.

Create a DataBrew project

To create a DataBrew project for the patient data, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create a project.
  3. For Project name, enter a name (for this post, patients-data-quality).
  4. For Select a dataset, select My datasets.
  5. Select the patients dataset.
  6. Under Permissions, for Role name, choose an AWS Identity and Access Management (IAM) role that allows DataBrew to read from your Amazon S3 input location.

You can choose a role if you already created one, or create a new one. For more information, see Adding an IAM role with data resource permissions.

  7. Wait until the dataset is loaded (about 1–2 minutes).
  8. To add a consistency check, choose Birthdate.
  9. On the Create menu, choose Flag column.
  10. Under Create column, for Values to flag, select Custom value.
  11. For Source column, choose BIRTHDATE.
  12. For Values to flag, enter the regular expression (?:(?:18|19|20)[0-9]{2}).
  13. For Flag values as, choose Yes or no.
  14. For Destination column, enter BIRTHDATE_flagged.

The new column BIRTHDATE_FLAGGED now displays Yes for a valid four-digit year within BIRTHDATE.
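To see what the pattern matches, here is a quick illustrative check in Python; DataBrew flags a row when its value contains a match for the expression.

import re

# Matches any four-digit year from 1800 through 2099 inside the value.
pattern = re.compile(r"(?:(?:18|19|20)[0-9]{2})")

print(bool(pattern.search("1985-03-22")))  # True  -> BIRTHDATE_flagged is Yes
print(bool(pattern.search("03-22")))       # False -> BIRTHDATE_flagged is No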

  15. To create a completeness check, repeat the preceding steps to create a DRIVERS_FLAGGED column by choosing the DRIVERS column to mark missing values.
  16. To create an integrity check, choose the JOIN transformation.
  17. Choose the encounters dataset and choose Next.
  18. For Select join type, select Left join.
  19. For Join keys, choose Id for Table A and Patient for Table B.
  20. Under Column list, unselect all columns from Table B except for Patient.
  21. Choose Finish.
  22. Choose the Patient column and create another flag column PATIENTS_FLAG to mark missing values from the Patient column.

For our use case, we created three new columns to demonstrate data quality checks for data quality dimensions in scope (consistency, completeness, and integrity), but you can integrate additional transformations on the same or additional columns as needed.

  23. After you finish applying all your transformations, choose Publish on the recipe.
  24. Enter a description of the recipe version and choose Publish.

Create a DataBrew job

Now that our recipe is ready, we can create a job to run it.

  1. On the DataBrew console, choose Jobs.
  2. Choose Create a job.
  3. For Job name, enter a name (for example, patient-data-quality).

Your recipe is already linked to the job.

  4. Under Job output settings, for File type, choose your final storage format (for this post, we choose CSV).
  5. For S3 location, enter your final S3 output bucket path.
  6. For Compression, choose the compression type you want to apply (for this post, we choose None).
  7. For File output storage, select Replace output files for each job run.

We choose this option because our use case is to publish a data quality score card for every new set of data files.

  8. Under Permissions, for Role name, choose your IAM role.
  9. Choose Create and run job.
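You can also start and monitor the recipe job programmatically. The following boto3 sketch assumes the job name used in this post and simply polls until the run finishes.

import time
import boto3

databrew = boto3.client("databrew")

run_id = databrew.start_job_run(Name="patient-data-quality")["RunId"]

# Poll the run until it reaches a terminal state.
while True:
    state = databrew.describe_job_run(Name="patient-data-quality", RunId=run_id)["State"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print(f"Job run finished with state {state}")
        break
    time.sleep(30)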

Create an Athena table

If you’re familiar with Apache Hive, creating tables in Athena will feel familiar. You can create tables by writing a DDL statement in the query editor, or by using the wizard or JDBC driver. To use the query editor, enter the following DDL statement to create a table:

CREATE EXTERNAL TABLE `blog_output`(
  `id` string, 
  `birthdate` string, 
  `birthdate_flagged` string, 
  `deathdate` string, 
  `ssn` string, 
  `drivers` string, 
  `drivers_flagged` string, 
  `passport` string, 
  `prefix` string, 
  `first` string, 
  `last` string, 
  `suffix` string, 
  `maiden` string, 
  `marital` string, 
  `race` string, 
  `ethnicity` string, 
  `gender` string, 
  `birthplace` string, 
  `address` string, 
  `city` string, 
  `state` string, 
  `county` string, 
  `zip` bigint, 
  `lat` double, 
  `lon` double, 
  `healthcare_expenses` double, 
  `healthcare_coverage` double, 
  `patient` string, 
  `patient_flagged` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<your-bucket>/blog_output/';

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.
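If you want to run the same validation from code, the following boto3 sketch submits a query through the Athena API and waits for it to finish. The database and table names follow this post; the output bucket is a placeholder.

import time
import boto3

athena = boto3.client("athena")

execution_id = athena.start_query_execution(
    QueryString="SELECT * FROM blog_output LIMIT 10",
    QueryExecutionContext={"Database": "databrew_blog"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to reach a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
print(f"Query {state}; returned {len(rows) - 1} data rows (the first row is the header)")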

Create views to filter good and bad records (optional)

To create a good records view, enter the following code:

CREATE OR REPLACE VIEW good_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'Yes' AND
drivers_flagged = 'No' AND
patient_flagged = 'No'

To create a bad records view, enter the following code:

CREATE OR REPLACE VIEW bad_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'No' OR
drivers_flagged = 'Yes' OR 
patient_flagged = 'Yes'

Now you have the ability to query the good and bad records in Athena using these views.

Create a score card using QuickSight

Now let’s complete our final step of the architecture, which is creating a data quality score card through QuickSight by connecting to the Athena table.

  1. On the QuickSight console, choose Athena as your data source.
  2. For Data source name, enter a name.
  3. Choose Create data source.
  4. Choose your catalog and database.
  5. Select the table you have in Athena.
  6. Choose Select.

Now you have created a dataset.

To build the score card, you add calculated fields by editing the dataset blog_output.

  1. Locate your dataset.
  2. Choose Edit dataset.
  3. Choose Add calculated field.
  4. Add the field DQ_Flag with value ifelse({birthdate_flagged} = 'No' OR {drivers_flagged} = 'Yes' OR {patient_flagged} = 'Yes' , 'Invalid', 'Valid').

Similarly, add other calculated fields.

  5. Add the field % Birthdate Invalid Year with value countIf({birthdate_flagged}, {birthdate_flagged} = 'No')/count({birthdate_flagged}).
  6. Add the field % Drivers Missing with value countIf({drivers_flagged}, {drivers_flagged} = 'Yes')/count({drivers_flagged}).
  7. Add the field % Patients missing encounters with value countIf({patient_flagged}, {patient_flagged} = 'Yes')/count({patient_flagged}).
  8. Add the field % Bad records with the value countIf({DQ_Flag}, {DQ_Flag} = 'Invalid')/count({DQ_Flag}).

Now we create the analysis blog_output_analysis.

  1. Change the format of the calculated fields to display the Percent format.
  2. Start adding visuals by choosing Add visual on the + Add menu.

Now you can create a quick report to visualize your data quality score card, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification about the completion of the data refresh. If the QuickSight report is running an Athena query for every request, you might see a “table not found” error when data refresh is in progress. We recommend using SPICE storage to get better performance.
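If you want to trigger the SPICE refresh programmatically after each DataBrew job run, you can start an ingestion through the QuickSight API. The following is a minimal boto3 sketch; the account ID and dataset ID are placeholders (you can look up the dataset ID with list_data_sets).

import uuid
import boto3

quicksight = boto3.client("quicksight")

quicksight.create_ingestion(
    AwsAccountId="111122223333",              # placeholder account ID
    DataSetId="<your-quicksight-dataset-id>",  # placeholder dataset ID
    IngestionId=str(uuid.uuid4()),             # any unique ID for this refresh
)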

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

This post explains how to create a data quality score card using DataBrew, Athena queries, and QuickSight.

This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets. We encourage you to use various built-in transformations to get the maximum value for your project.


About the Authors

Nitin Aggarwal is a Senior Solutions Architect at AWS, where he helps digital native customers architect data analytics solutions and provides technical guidance on various AWS services. He brings more than 16 years of experience in software engineering and architecture roles for various large-scale enterprises.

 

 

 

Gaurav Sharma is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

 

 

 

Vivek Kumar is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

Effective data lakes using AWS Lake Formation, Part 1: Getting started with governed tables

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-1-effective-data-lakes-using-aws-lake-formation-part-1-getting-started-with-governed-tables/

Thousands of customers are building their data lakes on Amazon Simple Storage Service (Amazon S3). You can use AWS Lake Formation to build your data lakes easily—in a matter of days as opposed to months. However, there are still some difficult challenges to address with your data lakes:

  • Supporting streaming updates and deletes in your data lakes, for example, database replication, and supporting privacy regulations such as GDPR and CCPA
  • Achieving fine-grained secure sharing not only with table- or column-level access control, but with row-level access control
  • Optimizing the layout of various tables and files on Amazon S3 to improve analytics performance

We announced Lake Formation transactions, row-level security, and acceleration for preview at AWS re:Invent 2020. These capabilities are available via new, open, and public update and access APIs for data lakes. These APIs extend the governance capabilities of Lake Formation with row-level security, and provide transaction semantics on data lakes.

In this series of posts, we provide step-by-step instructions for using these new Lake Formation features. In this post, we focus on the first step: setting up governed tables.

Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, sign up for the preview. You need to be approved for the preview to gain access to these features.

Governed Table

The Data Catalog supports a new type of metadata table: the governed table. Governed tables are unique to Lake Formation; they are a new Amazon S3 table type that supports atomic, consistent, isolated, and durable (ACID) transactions. Lake Formation transactions simplify ETL script and workflow development, and allow multiple users to concurrently and reliably insert, delete, and modify rows across multiple governed tables. Lake Formation automatically compacts and optimizes storage of governed tables in the background to improve query performance. When you create a table, you can specify whether or not the table is governed.

Setting up resources with AWS CloudFormation

In this post, I demonstrate how you can create a new governed table using existing data on Amazon S3. We use the Amazon Customer Reviews Dataset, which is stored in a public S3 bucket as sample data. You don’t need to copy the data to your bucket or worry about Amazon S3 storage costs. You can just set up a governed table pointing to this existing public data to see how it works.

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. If you prefer setting up resources on the AWS Management Console rather than AWS CloudFormation, see the instructions in the appendix at the end of this post.

The CloudFormation template generates the resources required for this walkthrough, including the data lake admin and data analyst IAM users, the LFRegisterLocationServiceRole IAM role, and the Lake Formation database.

To create your resources, complete the following steps:

  1. Sign in to the CloudFormation console in us-east-1 Region.
  2. Choose Launch Stack:
  3. Choose Next.
  4. For DatalakeAdminUserName and DatalakeAdminUserPassword, enter the IAM user name and password for the data lake admin user.
  5. For DataAnalystUserName and DataAnalystUserPassword, enter the IAM user name and password for the data analyst user.
  6. For DatabaseName, leave as the default.
  7. Choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation can take up to 2 minutes.

Setting up a governed table

Now you can create and configure your first governed table in AWS Lake Formation.

Creating a governed table

To create your governed table, complete the following steps:

  1. Sign in to the Lake Formation console in us-east-1 Region using the DatalakeAdmin1 user.
  2. Choose Tables.
  3. Choose Create table.
  4. For Name, enter amazon_reviews_governed.
  5. For Database, enter lakeformation_tutorial_amazon_reviews.
  6. Select Enable governed data access and management.
  7. Select Enable row based permissions.

  8. For Data is located in, choose Specified path in another account.
  9. Enter the path s3://amazon-reviews-pds/parquet/.
  10. For Classification, choose PARQUET.
  11. Choose Upload Schema.
  12. Enter the following JSON array into the text box:
[
    {
        "Name": "marketplace",
        "Type": "string"
    },
    {
        "Name": "customer_id",
        "Type": "string"
    },
    {
        "Name": "review_id",
        "Type": "string"
    },
    {
        "Name": "product_id",
        "Type": "string"
    },
    {
        "Name": "product_parent",
        "Type": "string"
    },
    {
        "Name": "product_title",
        "Type": "string"
    },
    {
        "Name": "star_rating",
        "Type": "int"
    },
    {
        "Name": "helpful_votes",
        "Type": "int"
    },
    {
        "Name": "total_votes",
        "Type": "int"
    },
    {
        "Name": "vine",
        "Type": "string"
    },
    {
        "Name": "verified_purchase",
        "Type": "string"
    },
    {
        "Name": "review_headline",
        "Type": "string"
    },
    {
        "Name": "review_body",
        "Type": "string"
    },
    {
        "Name": "review_date",
        "Type": "bigint"
    },
    {
        "Name": "year",
        "Type": "int"
    }
]
  13. Choose Upload.
  14. Choose Add column.
  15. For Column name, enter product_category.
  16. For Data type, choose String.
  17. Select Partition Key.
  18. Choose Add.
  19. Choose Submit.

Now you can see that the new governed table has been created.

When you choose the table name, you can see the details of the governed table, including Governance: Enabled in this view. This confirms that it’s a Lake Formation governed table. Any existing tables that aren’t governed show Governance: Disabled.

You can also see lakeformation.aso.status: true under Table properties. It means that automatic compaction is enabled for this table. For this post, we use a read-only table and don’t utilize automatic compaction. To disable the automatic compaction, complete the following steps:

  1. Choose Edit table.
  2. Deselect Automatic compaction.
  3. Choose Save.

Currently, no data and no partitions are registered to this governed table. In the next step, we register existing S3 objects to the governed table using Lake Formation manifest APIs.

Even if you locate your data in the table location of the governed table, the data isn’t recognized yet. To make the governed table aware of the data, you need to make a Lake Formation API call, or use an AWS Glue job with Lake Formation transactions.


Configuring Lake Formation permissions

You need to grant Lake Formation permissions for your governed table. Complete the following steps:

Table-level permissions

  1. Sign in to the Lake Formation console in us-east-1 Region using the DatalakeAdmin1 user.
  2. Under Permissions, choose Data permissions.
  3. Under Data permission, choose Grant.
  4. For Database, choose lakeformation_tutorial_amazon_reviews.
  5. For Table, choose amazon_reviews_governed.
  6. For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the user DatalakeAdmin1.
  7. Select Table permissions.
  8. Under Table permissions, select Alter, Insert, Drop, Delete, Select, and Describe.
  9. Choose Grant.
  10. Under Data permission, choose Grant.
  11. For Database, choose lakeformation_tutorial_amazon_reviews.
  12. For Table, choose amazon_reviews_governed.
  13. For IAM users and roles, choose the user DataAnalyst1.
  14. Under Table permissions, select Select and Describe.
  15. Choose Grant.

Row-level permissions

  1. Under Permissions, choose Data permissions.
  2. Under Data permission, choose Grant.
  3. For Database, choose lakeformation_tutorial_amazon_reviews.
  4. For Table, choose amazon_reviews_governed.
  5. For IAM users and roles, choose the role LFRegisterLocationServiceRole-<CloudFormation stack name> and the users DatalakeAdmin1 and DataAnalyst1.
  6. Select Row-based permissions.
  7. For Filter name, enter allowAll.
  8. For Choose filter type, select Allow access to all rows.
  9. Choose Grant.

Adding table objects into the governed table

To register S3 objects to a governed table, you need to call the UpdateTableObjects API for those objects. You can call it using the AWS Command Line Interface (AWS CLI) or SDK, or through the AWS Glue ETL library (which calls the API implicitly). For this post, we use the AWS CLI to explain the behavior at the API level. If you don’t have the AWS CLI, see Installing, updating, and uninstalling the AWS CLI. You also need to install the service model file provided in the Lake Formation preview program. You need to run the following commands using the DatalakeAdmin1 user’s credentials (or an IAM role or user with sufficient permissions).
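If you prefer the SDK path mentioned above, the same register-and-commit sequence could look like the following boto3 sketch. Note the assumption here: the method names follow the Lake Formation API that later became generally available (StartTransaction, UpdateTableObjects, CommitTransaction), whereas the preview CLI used in the rest of this walkthrough calls begin-transaction. The ETag is a placeholder that you copy from the head-object output. The CLI-based walkthrough continues below.

import boto3

lf = boto3.client("lakeformation")

# Start a read/write transaction and register one S3 object in it.
transaction_id = lf.start_transaction(TransactionType="READ_AND_WRITE")["TransactionId"]

lf.update_table_objects(
    DatabaseName="lakeformation_tutorial_amazon_reviews",
    TableName="amazon_reviews_governed",
    TransactionId=transaction_id,
    WriteOperations=[{
        "AddObject": {
            "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
            "ETag": "<etag-from-head-object>",  # placeholder; copy from head-object
            "Size": 65386769,
            "PartitionValues": ["Camera"],
        }
    }],
)

# Make the registered object visible to other transactions.
lf.commit_transaction(TransactionId=transaction_id)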

First, begin a new transaction with the BeginTransaction API:

$ aws lakeformation-preview begin-transaction
{
    "TransactionId": "7e5d506a757f32252ae3402a10191b13bfd1d7aa1c26a099d4a1911241589b8f"
}

Now you can register any files in the table location. For this post, we choose the sample partition product_category=Camera from the amazon-reviews-pds bucket and pick one file under that partition. Uri, ETag, and Size are required in later steps, so copy them.

$ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Camera/
2018-04-09 15:37:05   65386769 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:06   65619234 part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:06   64564669 part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07   65148225 part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07   65227429 part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:07   65269357 part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:08   65595867 part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:08   65012056 part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:09   65137504 part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 15:37:09   64992488 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

$ aws s3api head-object --bucket amazon-reviews-pds --key parquet/product_category=Camera/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
{
    "AcceptRanges": "bytes",
    "LastModified": "Mon, 09 Apr 2018 06:37:07 GMT",
    "ContentLength": 65227429,
    "ETag": "\"980669fcf6ccf31d2d686b9cccdd45e3-8\"",
    "ContentType": "binary/octet-stream",
    "Metadata": {}
}

Create a new file named write-operations1.json and enter the following JSON: (replace Uri, ETag, and Size with the values you copied.)

[
    {
        "AddObject": {
            "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
            "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8",
            "Size": 65386769,
            "PartitionValues": [
                "Camera"
            ]
        }
    }
]

Let’s register an existing object on the bucket to the governed table by making an UpdateTableObjects API call using write-operations1.json you created. (replace <transaction-id> with the transaction id you got in begin-transaction command.)

$ aws lakeformation-preview update-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id> --write-operations file://./write-operations1.json

Note the current date and time right after making the UpdateTableObjects API call. We use this timestamp for time travel queries later.

$ date -u
Tue Feb  2 12:12:00 UTC 2021

You can verify the change before committing the transaction by making the GetTableObjects API call with the same transaction ID (replace <transaction-id> with the ID you got from the begin-transaction command):

$ aws lakeformation-preview get-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id>
{
    "Objects": [
        {
            "PartitionValues": [
                "Camera"
            ],
            "Objects": [
                {
                    "Uri": "s3://amazon-reviews-pds/parquet/product_category=Camera/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
                    "ETag": "d4c25c40f33071620fb31cf0346ed2ec-8",
                    "Size": 65386769
                }
            ]
        }
    ]
}

To make this data available for other transactions, you need to call the CommitTransaction API: (replace <transaction-id> with the transaction id you got in begin-transaction command.)

$ aws lakeformation-preview commit-transaction --transaction-id <transaction-id>

After running the preceding command, you can see the partition on the Lake Formation console.

Let’s add one more partition into this table. This time we add one file per partition, and add only two partitions as an example. For actual usage, you need to add all the files under all the partitions that you need.

Add partitions with following commands:

  1. Call the BeginTransaction API to start another Lake Formation transaction:
    $ aws lakeformation-preview begin-transaction
    {
         "TransactionId": "d70c60e859e832b312668723cf48c1b84ef9109c5dbf6e9dbe8834c481c0ec81"
    }

  2. List Amazon S3 objects located on amazon-reviews-pds bucket to choose another sample file:
    $ aws s3 ls s3://amazon-reviews-pds/parquet/product_category=Books/
    2018-04-09 15:35:58 1094842361 part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:35:59 1093295804 part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:00 1095643518 part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:00 1095218865 part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:00 1094787237 part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:33 1094302491 part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1094565655 part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1095288096 part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1092058864 part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    2018-04-09 15:36:35 1093613569 part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

  3. Call the HeadObject API against one sample file in order to copy ETag and Size
    $ aws s3api head-object --bucket amazon-reviews-pds --key parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
    {
         "AcceptRanges": "bytes",
         "LastModified": "Mon, 09 Apr 2018 06:35:58 GMT",
         "ContentLength": 1094842361,
         "ETag": "\"9805c2c9a0459ccf337e01dc727f8efc-131\"",
         "ContentType": "binary/octet-stream",
         "Metadata": {}
    }

  4. Create a new file named write-operations2.json and enter the following JSON: (Replace Uri, ETag, and Size with the values you copied.)
    [
        {
                "AddObject": {
                "Uri": "s3://amazon-reviews-pds/parquet/product_category=Books/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
                "ETag": "9805c2c9a0459ccf337e01dc727f8efc-131",
                "Size": 1094842361,
                "PartitionValues": [
                    "Books"
               ]
           }
        }
    ]

  5. Call the UpdateTableObjects API using write-operations2.json: (replace <transaction-id> with the transaction id you got in begin-transaction command.)
    $ aws lakeformation-preview update-table-objects --database-name lakeformation_tutorial_amazon_reviews --table-name amazon_reviews_governed --transaction-id <transaction-id> --write-operations file://./write-operations2.json

  6. Call the CommitTransaction API: (replace <transaction-id> with the transaction ID you got in the begin-transaction command.)

    $ aws lakeformation-preview commit-transaction --transaction-id <transaction-id>

Now the two partitions are visible on the Lake Formation console.

Querying the governed table using Amazon Athena

Now your governed table is ready! Let’s start querying the governed table using Amazon Athena. Sign in to the Athena console in us-east-1 Region using DataAnalyst1 user.

If it’s your first time running queries on Athena, you need to configure a query result location. For more information, see Specifying a Query Result Location.

To utilize Lake Formation preview features, you need to create a special workgroup named AmazonAthenaLakeFormationPreview, and join the workgroup. For more information, see Managing Workgroups.

Running a simple query

Sign in to the Athena console in us-east-1 Region using the DataAnalyst1 user. First, let’s preview 10 records stored in a governed table:

SELECT * 
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed
LIMIT 10

The following screenshot shows the query results.


Running an analytic query

Next, let’s run an analytic query with aggregation for simulating real-world use cases:

SELECT product_category, count(*) as TotalReviews, avg(star_rating) as AverageRating
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed 
GROUP BY product_category

The following screenshot shows the results. This query returned the total number of reviews and average rating per product category.


Running an analytic query with time travel

Each governed table maintains a versioned manifest of the Amazon S3 objects that it comprises. You can use previous versions of the manifest for time travel queries. Your queries against governed tables in Athena can include a timestamp to indicate that you want to discover the state of the data at a particular date and time.

To submit a time travel query in Athena, add a WHERE clause that sets the column __asOfDate to the epoch time (long integer) representation of the required date and time. Let’s run the time travel query, replacing <epoch-milliseconds> with the timestamp you noted right after making the first UpdateTableObjects call. (To retrieve the epoch milliseconds, see the tips that follow the query results.)

SELECT product_category, count(*) as TotalReviews, avg(star_rating) as AverageRating
FROM lakeformation.lakeformation_tutorial_amazon_reviews.amazon_reviews_governed
WHERE __asOfDate = <epoch-milliseconds>
GROUP BY product_category

The following screenshot shows the query results. The result only includes the record for product_category=Camera. This is because the file under product_category=Books was added after this timestamp (1612267920000 ms = 2021/02/02 12:12:00 UTC), which is the value specified in the time travel column __asOfDate.

To retrieve the epoch time in milliseconds, you can run the following commands.

The following command is for Linux (GNU date command):

$ echo $(($(date -u -d '2021/02/02 12:12:00' +%s%N)/1000000)) 
1612267920000

The following command is for OSX (BSD date command):

$ echo $(($(date -u -j -f "%Y/%m/%d %T" "2021/02/02 12:12:00" +'%s * 1000 + %-N / 1000000')))
1612267920000
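If you prefer Python, an equivalent platform-independent helper is:

from datetime import datetime, timezone

# Epoch milliseconds for 2021/02/02 12:12:00 UTC
ts = datetime(2021, 2, 2, 12, 12, 0, tzinfo=timezone.utc)
print(int(ts.timestamp() * 1000))  # 1612267920000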

Cleaning up

Now to the final step, cleaning up the resources.

  1. Delete the CloudFormation stack. The governed table you created is automatically deleted with the stack.
  2. Delete the Athena workgroup AmazonAthenaLakeFormationPreview.

Conclusion

In this blog post, we explained how to create a Lake Formation governed table with existing data in an AWS public dataset. In addition, we explained how to query against governed tables and how to run time travel queries for governed tables. With Lake Formation governed tables, you can achieve transactions, row-level security, and query acceleration. In Part 2 of this series, we show you how to create a governed table for streaming data sources and demonstrate how Lake Formation transactions work.

Lake Formation transactions, row-level security, and acceleration are currently available for preview in the US East (N. Virginia) AWS Region. To get early access to these capabilities, please sign up for the preview.


Appendix: Setting up resources via the console

When following the steps in this section, use the Region us-east-1 because as of this writing, this Lake Formation preview feature is available only in us-east-1.

Configuring IAM roles and IAM users

First, you need to set up two IAM roles, one is for AWS Glue ETL jobs, another is for the Lake Formation data lake location.

IAM policies

To create your policies, complete the following steps:

  1. On the IAM console, create a new Policy for Amazon S3.
  2. Save the policy as S3DataLakePolicy as follows:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::amazon-reviews-pds/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::amazon-reviews-pds"
                ]
            }
        ]
    }

  3. Create a new IAM policy named LFLocationPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFPreview1",
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus"
            },
            {
                "Sid": "LFPreview2",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:GetTableObjects",
                    "lakeformation:UpdateTableObjects"
                ],
                "Resource": "*"
            }
        ]
    }

    
    

  4. Create a new IAM policy named LFQueryPolicy with the following statements:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LFPreview1",
                "Effect": "Allow",
                "Action": "execute-api:Invoke",
                "Resource": "arn:aws:execute-api:*:*:*/*/POST/reportStatus"
            },
            {
                "Sid": "LFPreview2",
                "Effect": "Allow",
                "Action": [
                    "lakeformation:BeginTransaction",
                    "lakeformation:CommitTransaction",
                    "lakeformation:AbortTransaction",
                    "lakeformation:ExtendTransaction",
                    "lakeformation:PlanQuery",
                    "lakeformation:GetTableObjects",
                    "lakeformation:GetQueryState",
                    "lakeformation:GetWorkUnits",
                    "lakeformation:Execute"
                ],
                "Resource": "*"
            }
        ]
    }

IAM role for AWS Lake Formation

To create your IAM role for the Lake Formation data lake location, complete the following steps:

  1. Create a new Lake Formation role called LFRegisterLocationServiceRole with a Lake Formation trust relationship:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": [
              "lakeformation.amazonaws.com"
            ]
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    

  2. Attach the customer managed policies S3DataLakePolicy and LFLocationPolicy you created in the previous step.

This role is used to register locations with Lake Formation, which in turn vends credentials to Athena at query time.

IAM users

To create your users, complete the following steps:

  1. Create an IAM user named DatalakeAdmin.
  2. Attach the following AWS managed policies:
    1. AWSLakeFormationDataAdmin
    2. AmazonAthenaFullAccess
    3. IAMReadOnlyAccess
  3. Attach the customer managed policy LFQueryPolicy.
  4. Create an IAM user named DataAnalyst that can use Athena to query data.
  5. Attach the AWS managed policy AmazonAthenaFullAccess.
  6. Attach the customer managed policy LFQueryPolicy.

Configuring Lake Formation

If you’re new to Lake Formation, follow these steps to get started with AWS Lake Formation.

  1. On the Lake Formation console, under Permissions, choose Admins and database creators.
  2. In the Data lake administrators section, choose Grant.
  3. For IAM users and roles, choose your IAM user DatalakeAdmin.
  4. Choose Save.
  5. In the Database creators section, choose Grant.
  6. For IAM users and roles, choose the LFRegisterLocationServiceRole.
  7. Select Create Database.
  8. Choose Grant.
  9. Under Register and ingest, choose Data lake locations.
  10. Choose Register location.
  11. For Amazon S3 path, enter your Amazon S3 path to the bucket where your data is stored. This needs to be the same bucket you listed in LFLocationPolicy. Lake Formation uses this role to vend temporary Amazon S3 credentials to query services that need read/write access to the bucket and all prefixes under it.
  12. For IAM role, choose the LFRegisterLocationServiceRole.
  13. Choose Register location.
  14. Under Data catalog, choose Settings.
  15. Make sure that both check boxes for Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are deselected.
  16. Under Data catalog, choose Databases.
  17. Choose Create database.
  18. Select Database.
  19. For Name, enter lakeformation_tutorial_amazon_reviews.
  20. Choose Create database.

About the Author

Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue & Lake Formation. His passion is for implementing software artifacts for building data lakes more effectively and easily. During his spare time, he loves to spend time with his family, especially hunting bugs—not software bugs, but bugs like butterflies, pill bugs, snails, and grasshoppers.

Analyzing Freshdesk data using Amazon EventBridge and Amazon Athena

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/analyzing-freshdesk-data-using-amazon-eventbridge-and-amazon-athena/

This post is written by Shashi Shankar, Application Architect, Shared Delivery Teams

Freshdesk is an omnichannel customer service platform by Freshworks. It provides automation services to help speed up customer support processes.

The Freshworks connector to Amazon EventBridge allows real time streaming of Freshdesk events with minimal configuration and setup. This integration provides real-time insights into customer support operations without the operational overhead of provisioning and maintaining any servers.

In this blog post, I walk through a serverless approach to ingest and analyze Freshdesk data. This solution uses EventBridge, Amazon Kinesis Data Firehose, Amazon S3, and Amazon Athena. I also look at examples of customer service questions that can be answered using this approach.

The following diagram shows a high-level architecture of the proposed solution:

  1. When a Freshdesk ticket is updated or created, the Freshworks connector pushes event data to the Amazon EventBridge partner event bus.
  2. A rule on the partner event bus pushes the event data to Kinesis Data Firehose.
  3. Kinesis Data Firehose batches data before sending it to S3. An AWS Lambda function transforms the data by adding a new line to each record before sending (a minimal sketch of such a function follows this list).
  4. Kinesis Data Firehose delivers the batch of records to S3.
  5. Athena is used to query relevant data from S3 using standard SQL.
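The actual transformation function is deployed by the AWS SAM template later in this post; the following is only a minimal sketch of what such a Kinesis Data Firehose record transformation typically looks like, appending a newline to each record so the objects written to S3 are newline-delimited JSON.

import base64

def lambda_handler(event, context):
    # Kinesis Data Firehose transformation: append a newline to each record.
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8") + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}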

The walkthrough shows you how to:

  1. Add the EventBridge app to Freshdesk account.
  2. Configure a Freshworks partner event bus in EventBridge.
  3. Deploy a Kinesis Data Firehose stream, a Lambda function, and an S3 bucket.
  4. Set up a custom rule on the event bus to push data to Kinesis Data Firehose.
  5. Generate sample Freshdesk data to validate the ingestion process.
  6. Set up a table in Athena to query the S3 bucket.
  7. Query and analyze data

Pre-requisites

  • A Freshdesk account (which can be created here).
  • An AWS account.
  • AWS Serverless Application Model (AWS SAM CLI), installed and configured.

Adding the Amazon EventBridge app to a Freshdesk account

  1. Log in to your Freshdesk account and navigate to Admin, then Helpdesk Productivity, then Apps. Search for EventBridge:
  2. Choose the Amazon EventBridge icon and choose Install.
  • Enter your AWS account number in the AWS Account ID field.
  • Enter “OnTicketCreate”, “OnTicketUpdate” in the Events field.
  • Enter the AWS Region to send the Freshdesk events in the Region field. This walkthrough uses the us-east-1 Region.

Configuring a Freshworks partner event bus in EventBridge

Once the previous step is complete, a partner event source is automatically created in the EventBridge console. Copy the partner event source name to the clipboard.

  1. Clone the GitHub repo and deploy the AWS SAM template:
    git clone https://github.com/aws-samples/amazon-eventbridge-freshdesk-example.git
    cd ./amazon-eventbridge-freshdesk-example
    sam deploy --guided
  2. PartnerEventSource – Enter the partner event source name copied from the previous step.
  3. S3BucketName – Enter an S3 bucket name to store Freshdesk ticket event data.

The AWS SAM template creates an association between the partner event source and event bus:

    Type: AWS::Events::EventBus
    Properties:
      EventSourceName: !Ref PartnerEventSource
      Name: !Ref PartnerEventSource

The template creates a Kinesis Data Firehose delivery stream, Lambda function, and S3 bucket to process and store the events from Freshdesk tickets. It also adds a rule to the custom event bus with the Kinesis Data Firehose stream as the target:

  PushToFirehoseRule:
    Type: "AWS::Events::Rule"
    Properties:
      Description: Test Freshdesk Events Rule
      EventBusName: !Ref PartnerEventSource
      EventPattern:
        account: [!Ref AWS::AccountId]
      Name: freshdeskeventrule
      State: ENABLED
      Targets:
        - Arn:
            Fn::GetAtt:
              - "FirehoseDeliveryStream"
              - "Arn"
          Id: "idfreshdeskeventrule"
          RoleArn: !GetAtt EventRuleTargetIamRole.Arn

  EventRuleTargetIamRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: ""
            Effect: "Allow"
            Principal:
              Service:
                - "events.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      Policies:
        - PolicyName: Invoke_Firehose
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: "Allow"
                Action:
                  - "firehose:PutRecord"
                  - "firehose:PutRecordBatch"
                Resource:
                  - !GetAtt FirehoseDeliveryStream.Arn

Generating sample Freshdesk data to validate the ingestion process

To generate sample Freshdesk data, log in to the Freshdesk account and browse to the “Tickets” screen as shown:

Follow the steps to simulate two customer service operations:

  1. Create a ticket of type “Refund”: choose the New button and enter the details.
  2. Update an existing ticket and change the priority to “Urgent”.
  3. Within a few minutes of updating the ticket, the data is pushed via the Freshworks connector to the S3 bucket created using the AWS SAM template. To verify this, browse to the S3 bucket and see that a new object with the ticket data is created:

You can also use the S3 Select option under object actions to view the raw JSON data that is sent from the partner system. You are now ready to analyze the data using Athena.

Setting up a table in Athena to query the S3 bucket

If you are familiar with Apache Hive, creating tables in Athena will feel familiar. You can create tables by writing the DDL statement in the query editor or by using the wizard or JDBC driver. To create a table in Athena:

  1. Copy and paste the following DDL statement in the Athena query editor to create a Freshdesk’s events table. For this example, the table is created in the default database.
  2. Replace S3_Bucket_Name in the following query with the name of the S3 bucket created by deploying the previous AWS SAM template:
CREATE EXTERNAL TABLE `freshdeskevents`(
  `id` string COMMENT 'from deserializer', 
  `detail-type` string COMMENT 'from deserializer', 
  `source` string COMMENT 'from deserializer', 
  `account` string COMMENT 'from deserializer', 
  `time` string COMMENT 'from deserializer', 
  `region` string COMMENT 'from deserializer', 
  `detail` struct<ticket:struct<subject:string,description:string,is_description_truncated:boolean,description_text:string,is_description_text_truncated:boolean,due_by:string,fr_due_by:string,fr_escalated:boolean,is_escalated:boolean,fwd_emails:array<string>,reply_cc_emails:array<string>,email_config_id:string,id:int,group_id:bigint,product_id:string,company_id:string,requester_id:bigint,responder_id:bigint,status:int,priority:int,type:string,tags:array<string>,spam:boolean,source:int,tweet_id:string,cc_emails:array<string>,to_emails:string,created_at:string,updated_at:string,attachments:array<string>,custom_fields:string,changes:struct<responder_id:array<bigint>,ticket_type:array<string>,status:array<int>,status_details:array<struct<id:int,name:string>>,group_id:array<bigint>>>,requester:struct<id:bigint,name:string,email:string,mobile:string,phone:string,language:string,created_at:string>> COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'paths'='account,detail,detail-type,id,region,resources,source,time,version') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION  's3://S3_Bucket_Name/'

The table is created on the data stored in S3 and is ready to be queried. Note that table freshdeskevents points at the bucket s3://S3_Bucket_Name/. As more data is added to the bucket, the table automatically grows, providing a near-real-time data analysis experience.

Querying and analyzing data

You can use the following examples to get started with querying the Athena table.

  1. To get all the events data, run:
SELECT * FROM default.freshdeskevents  limit 10

The preceding output has a detail column containing the details related to the ticket. Tickets can be filtered on nested notations to build more insightful queries. Also, the detail-type column provides classification of tickets as new (onTicketCreate) vs updated (onTicketUpdate).

  2. To show new tickets created today with the type “Refund”:
SELECT detail.ticket.subject,detail.ticket.description_text, detail.ticket.type  FROM default.freshdeskevents
where detail.ticket.type = 'Refund' and "detail-type" = 'onTicketCreate' and date(from_iso8601_timestamp(time)) = date(current_date)
  3. All tickets with an “Urgent” priority but not assigned to an agent:
SELECT "detail-type", detail.ticket.responder_id,detail.ticket.priority, detail.ticket.subject, detail.ticket.type  FROM default.freshdeskevents
where detail.ticket.responder_id is null and detail.ticket.priority = 4

Conclusion

In this blog post, you learned how to configure a Freshworks partner event source from the Freshdesk console. Once a partner event source is configured, an AWS SAM template is deployed that creates a custom event bus by attaching the partner event source. A Kinesis Data Firehose delivery stream, a Lambda function, and an S3 bucket are used to ingest Freshdesk’s ticket event data for analysis. An EventBridge rule is configured to route the event data to the S3 bucket.

Once event data starts flowing into the S3 bucket, an Amazon Athena table is created to run queries and analyze the ticket events data. Alternative customer service data analysis use cases can be built on the architecture shown in this blog.

To learn more about other partner integrations and the native capabilities of EventBridge, visit the AWS Compute Blog.

Integrating Datadog data with AWS using Amazon AppFlow for intelligent monitoring

Post Syndicated from Gopalakrishnan Ramaswamy original https://aws.amazon.com/blogs/big-data/integrating-datadog-data-with-aws-using-amazon-appflow-for-intelligent-monitoring/

Infrastructure and operation teams are often challenged with getting a full view into their IT environments to do monitoring and troubleshooting. New monitoring technologies are needed to provide an integrated view of all components of an IT infrastructure and application system.

Datadog provides intelligent application and service monitoring by bringing together data from servers, databases, containers, and third-party services in the form of a software as a service (SaaS) offering. It provides operations and development professionals the ability to measure application and infrastructure performance, visualize metrics with the help of a unified dashboard and create alerts and notifications.

Amazon AppFlow is a fully managed service that provides integration capabilities by enabling you to transfer data between SaaS applications like Datadog, Salesforce, Marketo, and Slack and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift. It provides capabilities to transform, filter, and validate data to generate enriched and usable data in a few easy steps.

In this post, I walk you through the process of extracting log data from Datadog, using Amazon AppFlow and storing it in Amazon S3, and querying it with Amazon Athena.

Solution overview

The following diagram shows the flow of our solution.


The Datadog Agent is lightweight software that can be installed on many different platforms, either directly or as a containerized version. It collects events and metrics from hosts and sends them to Datadog. Amazon AppFlow extracts the log data from Datadog and stores it in Amazon S3, which is then queried using Athena.

To implement the solution, you complete the following steps:

  1. Install and configure the Datadog Agent.
  2. Create a new Datadog application key.
  3. Create an Amazon AppFlow connection for Datadog.
  4. Create a flow in Amazon AppFlow.
  5. Run the flow and query the data.

Prerequisites

The walkthrough requires the following:

  • An AWS account
  • A Datadog account

Installing and configuring the Datadog Agent

The Datadog Agent is lightweight software installed on your hosts. With additional setup, the Agent can report live processes, logs, and traces. The Agent needs an API key, which is used to associate the Agent’s data with your organization. Complete the following steps to install and configure the Datadog Agent:

  1. Create a Datadog account if you haven’t already.
  2. Login to your account.
  3. Under Integrations, choose APIs.
  4. Copy the API key.
  5. Download the Datadog Agent software for the selected platform.
  6. Install the Agent on the hosts using the API key you copied.

Collecting logs is disabled by default in Datadog Agent. To enable Agent log collection and configure a custom log collection, perform the following steps on your host:

  1. Update the Datadog Agent’s main configuration file (datadog.yaml) with the following code:
    logs_enabled: true

In Windows this file is in C:\ProgramData\Datadog.

  2. Create custom log collection by customizing the conf.yaml file.

For example, in Windows this file would be in the path C:\ProgramData\Datadog\conf.d\win32_event_log.d. The following code is a sample entry in the conf.yaml file that enables collection of Windows security events:

logs:
  - type: windows_event
    channel_path: Security
    source: Security
    service: windowsOS
    sourcecategory: windowsevent

Getting the Datadog application key

The application keys in conjunction with your organization’s API key give you full access to Datadog’s programmatic API. Application keys are associated with the user account that created them. The application key is used to log all requests made to the API. Get your application key with the following steps:

  1. Log in to your Datadog account.
  2. Under Integrations, choose APIs.
  3. Expand Application Keys.
  4. For Application key name, enter a name.
  5. Choose Create Application key.

Creating an Amazon AppFlow connection for Datadog

A connection defines the source or destination to use in a flow. To create a new connection for Datadog, complete the following steps:

  1. On the Amazon AppFlow console, in the navigation pane, choose Connections. 
  2. For Connectors, choose Datadog.
  3. Choose Create Connection.
  4. For API key and Application Key, enter the keys procured from the previous steps.
  5. For Connection Name, enter a name; for example, myappflowconnection.
  6. Choose Connect.


Creating a flow in Amazon AppFlow

After you create the data connection, you can create a flow that uses the connection and defines the destination, data mapping, transformation, and filters.

Creating an S3 bucket

Create an S3 bucket as your Amazon AppFlow transfer destination.

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a name for your bucket; for example, mydatadoglogbucket.
  3. Ensure that Block all public access is selected.
  4. Enable bucket versioning and encryption (optional).
  5. Choose Create bucket.
  6. Enable Amazon S3 server access logging (optional).

Configuring the flow source

After you create the Datadog agent and the S3 bucket, complete the following steps to create a flow:

  1. On the Amazon AppFlow console, in the navigation pane, choose Flows.
  2. Choose Create flow.
  3. For Flow name, enter a name for your flow; for example mydatadogflow.
  4. For Source name, choose Datadog.
  5. For Choose Datadog connection, choose the connection created earlier.
  6. For Choose Datadog object, choose Logs.


Choosing a destination

In the Destination details section, provide the following information:

  1. For Destination name, choose Amazon S3.
  2. For Bucket details, choose the name of the S3 bucket created earlier.

This step creates a folder with the flow name you specified within the bucket to store the logs.

Additional settings

You can provide additional settings for data format (JSON, CSV, Parquet), data transfer preference, filename preference, flow trigger and transfer mode. Leave all settings as default:

  • For Data format preference, choose JSON format.
  • For Data transfer preference, choose No aggregation.
  • For Filename preference, choose No timestamp.
  • For Folder structure preference, choose No timestamped folder.

Adding a flow trigger

Flows can be run on a schedule, based on an event or on demand. For this post, we choose Run on demand.

Mapping data fields

You can map manually or using a CSV file. This determines how data is transferred from source to destination. You can apply transformations like concatenation, masking, and truncation to the mappings.

  1. In the Map data fields section, for Mapping method, choose Manually map fields.
  2. For Source field name, choose Map all fields directly.
  3. Choose Next.

Validation

You can add validation to perform certain actions based on conditions on field values.

  1. In the Validations section, for Field name choose Content.
  2. For Condition, choose Values are missing or null.
  3. For Action, choose Ignore record.

Filters

Filters specify which records to transfer. You can add multiple filters with criterion. For the Datadog data source, it’s mandatory to specify filters for Date_Range and Query. The format for specifying filter query for metrics and logs are different.

  1. In the Add filters section, for Field name, choose Date_Range.
  2. For Condition, choose is between.
  3. For Criterion 1 and Criterion 2, enter start and end dates for log collection.
  4. Choose Add filter.
  5. For your second filter, for Field name, choose Query.
  6. For Condition, enter host:<yourhostname> AND service:(windowsOS OR LinuxOS).
  7. Choose Save.


The service names specified in the filter should have Datadog logs enabled (refer to the earlier step when you installed and configured the Datadog Agent).

The following are some examples of the filter Query for metrics:

  • load.1{*} by {host}
  • avg:system.cpu.idle{*}
  • avg:system.cpu.system{*}
  • avg:system.cpu.user{*}
  • avg:system.cpu.guest{*}
  • avg:system.cpu.user{host:yourhostname}

The following are some examples of the filter Query for logs:

  • service:servicename
  • host:myhostname
  • host:hostname1 AND service:(servicename1 OR servicename2) 

Running the Flow and querying the data

If a flow is based on a trigger, you can activate or deactivate it. If it’s on demand, it must be run each time data needs to be transferred. When you run the flow, the logs or metrics are pulled into files residing in Amazon S3. The data is in the form of a nested JSON in this example. Use AWS Glue and Athena to create a schema and query the log data.
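For an on-demand flow, you can also start a run programmatically. The following boto3 sketch uses the flow name created earlier in this walkthrough.

import boto3

appflow = boto3.client("appflow")

# Start one on-demand run of the flow created earlier.
response = appflow.start_flow(flowName="mydatadogflow")
print(response["flowStatus"], response.get("executionId"))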

Querying data with Athena

When the Datadog data is in AWS, there are a host of possibilities to store, process, integrate with other data sources, and perform advanced analytics. One such method is to use Athena to query the data directly from Amazon S3.

  1. On the AWS Glue console, in the navigation pane, choose Databases.
  2. Choose Add database.
  3. For Database name, enter a name such as mydatadoglogdb.
  4. Choose Create.
  5. In the navigation pane, choose Crawlers.
  6. Choose Add Crawler.
  7. For Crawler name, enter a name, such as mylogcrawler.
  8. Choose Next.
  9. For Crawler source type, select Data stores.
  10. Choose Next.
  11. In the Add a data store section, choose S3 for the data store.
  12. Enter the path to the S3 folder that has the log files; for example s3://mydatadoglogbucket/logfolder/.
  13. In the Choose an IAM role section, select Create an IAM role and provide a name.
  14. For Frequency select Run on demand.
  15. In the Configure the crawler’s output section, for Database, select the database created previously.
  16. Choose Next.
  17. Review and choose Finish.
  18. When the crawler’s status changes to Active, select it and choose Run Crawler.

When the crawler finishes running, it creates the tables and populates them with data based on the schema it infers from the JSON log files.
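If you prefer to script the crawler setup instead of clicking through the console, a minimal boto3 sketch could look like the following; the role ARN is a placeholder, and the crawler name, database, and S3 path match the ones used above.

import boto3

glue = boto3.client("glue")

# Create a crawler pointing at the Freshdesk log folder, then run it once.
glue.create_crawler(
    Name="mylogcrawler",
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-datadoglogs",  # placeholder role ARN
    DatabaseName="mydatadoglogdb",
    Targets={"S3Targets": [{"Path": "s3://mydatadoglogbucket/logfolder/"}]},
)

glue.start_crawler(Name="mylogcrawler")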

  1. On the Athena console, choose Settings.
  2. Select an S3 bucket and folder where Athena results are stored.
  3. In the Athena query window, enter the following query:
    select * 
    from mydatadoglogdb.samplelogfile
    where content.attributes.level = 'Information'
    

  4. Choose Run Query.

This sample query gets all the log entries where the level is Information. We’re traversing a nested JSON object in the Athena query simply with dot notation.

Summary

In this post, I demonstrated how we can bring Datadog data into AWS. Doing so opens a host of opportunities to use the tools available in AWS to drive advance analytics and monitoring while integrating with data from other sources.

With Amazon AppFlow, you can integrate applications in a few minutes, transfer data at massive scale, and enrich the data as it flows, using mapping, merging, masking, filtering, and validation. For more information about integrating SaaS applications and AWS, see Amazon AppFlow.


About the Author

Gopalakrishnan Ramaswamy is a Solutions Architect at AWS based out of India with extensive background in database, analytics, and machine learning. He helps customers of all sizes solve complex challenges by providing solutions using AWS products and services. Outside of work, he likes the outdoors, physical activities and spending time with friends and family.