Tag Archives: Amazon Athena

Enforce column-level authorization with Amazon QuickSight and AWS Lake Formation

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/enforce-column-level-authorization-with-amazon-quicksight-and-aws-lake-formation/

Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights and integrates seamlessly with your data lake built on Amazon Simple Storage Service (Amazon S3). QuickSight users in your organization often need access to only a subset of columns for compliance and security reasons. Without a proper way to enforce column-level security, you have to develop additional workarounds, such as views, data masking, or encryption.

QuickSight accounts can now take advantage of AWS Lake Formation column-level authorization to enforce granular-level access control for their users.

Overview of solution

In this solution, you build an end-to-end data pipeline using Lake Formation to ingest data from an Amazon Aurora MySQL database to an Amazon S3 data lake and use Lake Formation to enforce column-level access control for QuickSight users.

The following diagram illustrates the architecture of this solution.

Walkthrough overview

The detailed steps in this solution include building a data lake using Lake Formation, which uses an Aurora MySQL database as the source and Amazon S3 as the target data lake storage. You create a workflow in Lake Formation that imports a single table from the source database to the data lake. You then use Lake Formation security features to enforce column-level security for QuickSight users on the imported table. Finally, you use QuickSight to connect to this data lake and visualize only the columns to which Lake Formation has given the QuickSight user access.

To implement the solution, you complete the following steps:

  1. Prerequisites
  2. Creating a source database
  3. Importing a single table from the source database
    • Creating a connection to the data source
    • Creating and registering your S3 bucket
    • Creating a database in the Data Catalog and granting permissions
    • Creating and running the workflow
    • Granting Data Catalog permissions
  4. Enforcing column-level security in Lake Formation
  5. Creating visualizations in QuickSight

Prerequisites

For this walkthrough, you should have the following prerequisites:

Creating a source database

In this step, create an Aurora MySQL database cluster and use the DDLs in the following GitHub repo to create an HR schema with associated tables and sample data.

You should then see the schema you created using the MySQL monitor or your preferred SQL client. For this post, I used SQL Workbench. See the following screenshot.

Record the Aurora database JDBC endpoint information; you need it in subsequent steps.
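
If you prefer to script this step, the following boto3 sketch shows one way to look up the cluster endpoint and build the JDBC URL; the cluster identifier is a placeholder, so adjust it (and the schema name) to match your environment.

import boto3

rds = boto3.client("rds")

# Placeholder cluster identifier; replace with your Aurora MySQL cluster name
cluster = rds.describe_db_clusters(DBClusterIdentifier="hr-aurora-cluster")["DBClusters"][0]

# Build the JDBC URL used later when you create the AWS Glue connection
jdbc_url = "jdbc:mysql://{}:{}/HR".format(cluster["Endpoint"], cluster["Port"])
print(jdbc_url)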

Importing a single table from the source database

Before you complete the following steps, make sure you have set up Lake Formation and met the JDBC prerequisites.

The Lake Formation setup creates a datalake_user IAM user. You need to add the same user as a QuickSight user. For instructions, see Managing User Access Inside Amazon QuickSight. For Role, choose AUTHOR.
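
If you'd rather script the QuickSight registration than use the console, a minimal boto3 sketch follows; the account ID and email address are placeholders.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

# Register the existing datalake_user IAM user as a QuickSight author
quicksight.register_user(
    AwsAccountId="111122223333",                                # placeholder account ID
    Namespace="default",
    IdentityType="IAM",
    IamArn="arn:aws:iam::111122223333:user/datalake_user",
    UserRole="AUTHOR",
    Email="datalake_user@example.com",                          # placeholder email
)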

Creating a connection to the data source

After you complete the Lake Formation prerequisites, which include creating the IAM users datalake_admin and datalake_user, create a connection to your Aurora database. For instructions, see Create a Connection in AWS Glue. Provide the following information:

  • Connection name – <yourPrefix>-blog-datasource
  • Connection type – JDBC
  • Database connection parameters – JDBC URL, user name, password, VPC, subnet, and security group

Creating and registering your S3 bucket

In this step, you create an S3 bucket named <yourPrefix>-blog-datalake, which you use as the root location of your data lake. After you create the bucket, you need to register the Amazon S3 path. Lastly, grant data location permissions.
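
As an alternative to the console, you can script the registration and the data location grant; the following is a minimal boto3 sketch, with the bucket name and account ID as placeholders.

import boto3

lakeformation = boto3.client("lakeformation")

# Register the S3 path as a data lake location using the service-linked role
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::yourPrefix-blog-datalake",
    UseServiceLinkedRole=True,
)

# Grant data location permissions on the registered path to datalake_user
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake_user"},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::yourPrefix-blog-datalake"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)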

Creating a database in the Data Catalog and granting permissions

Create a database in the Lake Formation Data Catalog named <yourPrefix>-blog-database, which stores the metadata tables. For instructions, see Create a Database in the Data Catalog.

After you create the database, you grant the LakeFormationWorkflowRole role permissions to create metadata tables in it; you use this role to run the workflows.

Creating and running the workflow

In this step, you copy the EMPLOYEES table from the source database using a Lake Formation blueprint. Provide the following information:

  • Blueprint type – Database snapshot
  • Database connection – <yourPrefix>-blog-datasource
  • Source data path – HR/EMPLOYEES
  • Target database – <yourPrefix>-blog-database
  • Target storage location – <yourPrefix>-blog-datalake
  • Workflow name – <yourPrefix>-datalake-quicksight
  • IAM role – LakeFormationWorkflowRole
  • Table prefix – blog

For instructions, see Use a Blueprint to Create a Workflow.

When the workflow is ready, you can start the workflow and check its status by choosing View graph. When the workflow is complete, you can see the employee table available in your Data Catalog under <yourPrefix>-blog-database. See the following screenshot.

You can also view the imported data using Athena, which is integrated with Lake Formation. To do so, choose View data on the Actions drop-down menu. See the following screenshot.

Granting Data Catalog permissions

In this step, you grant the IAM user datalake_user access to the Lake Formation Data Catalog. This is the same user that you added in QuickSight to create the dashboard. For Database permissions, select Create table and Alter for this use case; you can change the permission level based on your specific requirements. For instructions, see Granting Data Catalog Permissions.

When this step is complete, you see the permissions for your database <yourPrefix>-blog-database.
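
The same grant can be expressed with boto3 if you want to automate it; this sketch assumes placeholder account and database names.

import boto3

lakeformation = boto3.client("lakeformation")

# Give datalake_user Create table and Alter permissions on the database
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake_user"},
    Resource={"Database": {"Name": "yourPrefix-blog-database"}},
    Permissions=["CREATE_TABLE", "ALTER"],
)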

Enforcing column-level security in Lake Formation

Now that your table is imported into the data lake, enforce column-level security on the dataset. For this use case, you want to hide the Salary and Phone_Number columns from QuickSight business intelligence users.

  1. In the Lake Formation Data Catalog, choose Databases.
  2. From the list of databases, choose <yourPrefix>-blog-database.
  3. Choose View tables.
  4. Select blog_hr_employees.
  5. From the Actions drop-down menu, choose Grant.

  6. For Active Directory and Amazon QuickSight users and groups, provide the QuickSight user ARN.

You can find the ARN by running aws quicksight list-users --aws-account-id <your AWS account id> --namespace default --region us-east-1 in the AWS Command Line Interface (AWS CLI).

  7. For Database, choose <yourPrefix>-blog-database.
  8. For Table, choose blog_hr_employees.
  9. For Columns, choose Exclude columns.
  10. For Exclude columns, choose salary and phone_number.
  11. For Table permissions, select Select.

You should receive a confirmation on the console that says Permission granted for: datalake_user to Exclude: <yourPrefix>-blog-database.blog_hr_employees.[salary, phone_number].

You can also verify that appropriate permission is reflected for the QuickSight user on the Lake Formation console by navigating to the Permissions tab and filtering for your database and table.

You can also specify column-level permissions in the AWS CLI with the following code:

aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=<QuickSight User ARN> --permissions "SELECT" --resource '{ "TableWithColumns": {"DatabaseName":"<yourPrefix>-blog-database", "Name":"blog_hr_employees", "ColumnWildcard": {"ExcludedColumnNames": ["salary", "phone_number"]}}}'  --region us-west-2 --profile datalake_admin

Creating visualizations in QuickSight

In this step, you use QuickSight to access the blog_hr_employees table in your data lake. While accessing this dataset from QuickSight, you can see that QuickSight doesn’t show the salary and phone_number columns, which you excluded from the source table in the previous step.

  1. Log in to QuickSight using the datalake_user IAM user.
  2. Choose New analysis.
  3. Choose New dataset.
  4. For the data source, choose Athena.

  5. For your data source name, enter Athena-HRDB.
  6. For Database, choose <yourPrefix>-blog-database.
  7. For Tables, select blog_hr_employees.
  8. Choose Select.

  9. Choose Import to SPICE for quicker analysis or Directly query your data.

For this use case, choose Import to SPICE. This provides faster visualization in a production setup, and you can run a scheduled refresh to make sure your dashboards are referring to the current data. For more information, see Scheduled Refresh for SPICE Data Sets on Amazon QuickSight.

When you complete the previous steps, your data is imported into SPICE and you arrive at the QuickSight visualization dashboard. You can see that SPICE has excluded the salary and phone_number fields from the table. In the following screenshot, we created a pie chart visualization to show how many employees are present in each department.

Cleaning up

To avoid incurring future charges, delete the resources you created in this walkthrough, including your S3 bucket, Aurora cluster, and other associated resources.

Conclusion

Restricting different users' access to sensitive data in a data lake is a very common challenge. In this post, we demonstrated how to use Lake Formation to enforce column-level access for QuickSight dashboard users. You can enhance security further with Athena workgroups. For more information, see Creating a Data Set Using Amazon Athena Data and Benefits of Using Workgroups.

 


About the Author

Avijit Goswami is a Sr. Startups Solutions Architect at AWS, helping startup customers become tomorrow’s enterprises. When not at work, Avijit likes to cook, travel, watch sports, and listen to music.


How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With pollution indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters' lives more sustainable and convenient by bringing short-distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than will actually be used, and we position them just meters away from where our users need them, at the time of day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and quickly adapt to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.
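
Wind Mobility's actual functions aren't shown here, but a minimal Lambda handler that forwards change events to a Kinesis data stream looks roughly like the following sketch; the stream name and event shape are assumptions.

import json
import boto3

kinesis = boto3.client("kinesis")

def handler(event, context):
    # Forward each incoming change event to the stream for real-time consumers
    for record in event.get("Records", []):
        kinesis.put_record(
            StreamName="scooter-events",                  # assumed stream name
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=record.get("eventID", "unknown"),
        )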

After the data was in Amazon S3 and adequately partitioned for its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query it for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks for our team, because most of our data processing compute hours are dedicated to transforming the multiple data sources into our data warehouse: consolidating the raw data, modeling it, adding new attributes, and picking the data elements that constitute 95% of our analytics and predictive needs.

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.
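
The post doesn't include the actual job code, but the skeleton of an AWS Glue PySpark job that reads raw events from Amazon S3 and writes them to Amazon Redshift might look like the following; the bucket, connection, and table names are placeholders.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw event data from S3
events = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-raw-bucket/events/"]},
    format="json",
)

# Write the results to the Redshift data warehouse via a Glue connection
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=events,
    catalog_connection="redshift-connection",             # placeholder Glue connection
    connection_options={"dbtable": "analytics.events", "database": "dev"},
    redshift_tmp_dir="s3://example-raw-bucket/tmp/",
)

job.commit()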

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work on the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), and finally store their results back in Amazon S3. Executions of these pipelines are tracked via MLflow (on a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server), and the final result, indicating the required fleet operations, is fed into a Kepler map, which is then consumed by the operators in the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When we had to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

 


About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo's background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI-backed solution to reposition their fleet in real time. “In God we trust. All others must bring data.” – W. Edwards Deming


Process data with varying data ingestion frequencies using AWS Glue job bookmarks

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/process-data-with-varying-data-ingestion-frequencies-using-aws-glue-job-bookmarks/

We often have data processing requirements in which we need to merge multiple datasets with varying data ingestion frequencies. Some of these datasets are ingested one time in full, received infrequently, and always used in their entirety, whereas other datasets are incremental, received at certain intervals, and joined with the full datasets to generate output. To address this requirement, this post demonstrates how to build an extract, transform, and load (ETL) pipeline using AWS Glue.

Using AWS Glue

AWS Glue provides a serverless environment to extract, transform, and load a large number of datasets from several sources for analytics purposes. It has a feature called job bookmarks to process incremental data when rerunning a job on a scheduled interval. A job bookmark is composed of the states of various job elements, such as sources, transformations, and targets. AWS Glue persists this state information from each job run, which prevents it from reprocessing old data.

For this use case, we use an AWS Glue job with job bookmarks enabled to process files received in varying frequencies (a full dataset signifying files that are received one time, and incremental datasets signifying files that are received in certain regular intervals). These files are merged together. In addition to enabling job bookmarks, we also use an optional parameter transformation_ctx (transformation context) in an AWS Glue PySpark dynamic frame. This acts as a unique identifier for the ETL operator instance to identify state information within a job bookmark for a given operator. AWS Glue uses transformation_ctx to index the key to the bookmark state.

You can capture and maintain state information for incremental datasets and avoid reprocessing by using transformation context. Transformation context is omitted for the full dataset file, so job run state information isn't captured for it and the full dataset participates in each processing run in its entirety. Even though the job bookmark flag is enabled at the AWS Glue job level, because transformation context is omitted for the full dataset, every time the job runs, all the data from the full dataset is used as part of the job. In contrast, only newly arrived data is processed for the incremental datasets.

Solution overview

To demonstrate the job bookmark utility of AWS Glue, we use TLC Trip Record Data datasets. We use NYC yellow taxi trip data monthly files as the incremental dataset, and NYC taxi zone lookup as the full dataset. The monthly yellow taxi trip data has a field named PULocationID (where a customer was picked up), which is joined with the LocationID field from the NYC taxi zone lookup file to create an output dataset that contains Borough, Zone, and service_zone from the NYC taxi zone lookup dataset and all the fields (except the PULocationID field) from the monthly NYC taxi trip data file.

The following diagram depicts a high-level architecture of the process.

The architecture includes the following components:

  • Two raw Amazon S3 bucket locations store the incoming CSV source data: the NYC taxi monthly files (incremental dataset) and the NYC taxi zone lookup file (full dataset).
  • A bookmark-enabled AWS Glue job joins the monthly trip data file with the taxi zone lookup file to generate output Parquet files, and creates the NYC taxi trip table in the AWS Glue Data Catalog and the Amazon Redshift database.
  • A curated S3 bucket stores the processed monthly NYC taxi Parquet files.

Creating the AWS CloudFormation stack

You use the following AWS CloudFormation template to create the resources for this solution in your preferred AWS account and Region:

Additionally, make sure you have an Amazon EC2 key pair created in the account and Region you’re launching the stack from.

To provide the stack parameters, complete the following steps:

  1. For Stack name, enter BigDataBlog-GlueBookmark-Stack.

  2. For RedshiftClusterIdentifier, enter bigdatablogrscluster.
  3. For NodeType, choose large.
  4. For NumberOfNodes, choose 2.
  5. For DatabaseName, enter bigdatablogdev.

  6. For MasterUserName, enter bigdatabloguser.
  7. For MasterUserPassword, enter a password for the master user account.
  8. For Maintenancewindow, enter sun:05:00-sun:05:30.
  9. For EC2InstanceType, choose micro.
  10. For SubscriptionEmail, enter your preferred email.
  11. For MyIPAddressCidr, enter your IP address.

You can find your IP Address by browsing https://www.whatismyip.com/ and looking up the value for My Public IPv4 is:. Add /32 at the end to make it CIDR-compatible and most restrictive.

  12. For DestinationPrefixListId, enter your prefix list ID.

To find your ID, set AWS credentials by entering aws configure in the command prompt. Run aws ec2 describe-prefix-lists to get the PrefixListId where PrefixListName is com.amazonaws.<<AWS region>>.s3 from the output.

  13. For NewS3BucketName, enter the name of your S3 bucket.

  14. For gluedatabase, enter bigdatabloggluedb.
  15. For EC2KeyName, enter the name of your key pair.

For instructions on creating a stack, see Creating a Stack on the AWS CloudFormation Console.

Make sure the stack is complete before moving to the next steps.

Creating the AWS Glue job

To create your AWS Glue job, complete the following steps:

  1. Download NYC yellow monthly trip data for October 2019 and November 2019 and save them under the s3://<<Your S3 Bucket>>/tripdata/ prefix.
  2. Download the NYC Taxi Zone lookup table and save it under the s3://<<Your S3 Bucket>>/tripdata-lookup/ prefix.
  3. Use the following PySpark script and change the piece of the code enclosed inside <<…>>.

You can find the values for the following keys on the Outputs tab for the CloudFormation stack:

    • S3Bucket
    • Snstopic

You can find the values for the following keys on the Parameters tab for the CloudFormation stack:

    • EC2KeyName
    • MyIPAddressCidr
    • NewS3BucketName
    • SubscriptionEmail

  4. When the AWS Glue script is ready, upload it to the S3 bucket under the s3://<<Your S3 Bucket>>/glue-script/ prefix.

You refer to this when you create the AWS Glue job.

  5. On the AWS Glue console, under ETL, choose Jobs.
  6. Choose Create job.
  7. For Name, enter a name for the job. For more information about AWS Glue job naming, see Jobs.
  8. For IAM role, choose the role the CloudFormation template created. Use the value for the key Glueaccessrole from the stack outputs.
  9. For Type, choose Spark.
  10. For Glue version, choose Spark 2.4, Python 3 (Glue Version 1.0).
  11. For This job runs, choose An existing script that you provide.
  12. For S3 path where the script is stored, choose the script file that you saved earlier under the s3://<<Your S3 Bucket>>/glue-script/ prefix.
  13. In the Advanced properties section, for Job bookmark, choose Enable.
  14. For Catalog options, select Use Glue Data Catalog as the Hive metastore.
  15. For Connections, enter the value of the key GlueConnection from the stack outputs.
  16. Choose Save job and edit script.

Creating an Amazon Redshift database schema

Before you run the AWS Glue job, you need to connect to the Amazon Redshift cluster and create an Amazon Redshift database schema named Glue_bookmark_redshift_schema. To connect to the cluster, use one of the JDBC client-based SQL tools, such as SQL Workbench/J. For instructions, see How can I access a private Amazon Redshift cluster from my local machine?

To access the cluster, you use the Amazon Redshift master user bigdatabloguser (the value for MasterUserName on the Parameters tab of the CloudFormation stack) and the password you provided when creating the stack.
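
If you prefer not to set up a SQL client, the Amazon Redshift Data API is another way to run the DDL; the following boto3 sketch uses the cluster values from the CloudFormation parameters and the schema name given above.

import boto3

redshift_data = boto3.client("redshift-data")

# Create the schema that the AWS Glue job loads into
redshift_data.execute_statement(
    ClusterIdentifier="bigdatablogrscluster",
    Database="bigdatablogdev",
    DbUser="bigdatabloguser",
    Sql="CREATE SCHEMA IF NOT EXISTS glue_bookmark_redshift_schema;",
)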

Running AWS Glue job

The AWS Glue job takes only one argument: the name of the file being processed. Pass the file name, such as yellow_tripdata_2019-10.csv, when processing that file. This enables you to track the records that belong to a specific file so that it's easier to evaluate the result of multiple job runs using different files.
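
You can pass that argument when you start the job, for example with boto3; the job name and the argument key below are placeholders, because they depend on how you named the job and the parameter in your script.

import boto3

glue = boto3.client("glue")

# Start a run of the bookmark-enabled job, passing the file to process
glue.start_job_run(
    JobName="your-bookmark-job",                               # placeholder job name
    Arguments={"--file_name": "yellow_tripdata_2019-10.csv"},  # placeholder argument key
)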

When the Glue job run is successful, you can see the output Parquet files under the /tripdata-joined-output/ prefix inside the S3 bucket you created by running the CloudFormation template. You can also use Amazon Athena to query the data from the table created in the Data Catalog. For more information, see Running SQL Queries Using Amazon Athena.

Query the Amazon Redshift database table named redshift_bookmark_table and review the output.

Explaining the solution

A bookmark-enabled AWS Glue job (in PySpark) is created that reads the NYC yellow taxi trip's monthly file, joins it with the NYC taxi zone lookup file, produces files in Parquet format, and saves them in an Amazon S3 location.

A Data Catalog table is created that refers to the Parquet files’ location in Amazon S3. The resulting dataset is also loaded into an Amazon Redshift table using the AWS Glue PySpark job.

The AWS Glue job bookmark transformation context is used when the AWS Glue dynamic frame is created by reading a monthly NYC taxi file, whereas the transformation context is omitted when reading and creating the dynamic frame for the taxi zone lookup file (because the entire file is required for processing each monthly trip file). This allows you to process each monthly trip file exactly one time and reuse the entire taxi zone lookup file as many times as required, because without a transformation context, no bookmark state is set for the lookup file.

When a new NYC trip data monthly file arrives and the AWS Glue job runs, it only processes the newly arrived monthly file and ignores any previously processed monthly files. Similarly, when the Data Catalog table data is copied into Amazon Redshift, it only copies the newly processed underlying Parquet files’ data and appends it to the Amazon Redshift table. At this time the transformation context is enabled to utilize the job bookmark, and the AWS Glue dynamic frame is created by reading the Data Catalog table.

The following PySpark code uses the transformation context to create an AWS Glue dynamic frame while reading the monthly incremental file:

taxidata = GlueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"', "escaper": '"'},
    transformation_ctx="taxidata")

However, the following code omits transformation context when creating the AWS Glue dynamic frame for the lookup file:

Lookupdata = GlueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [InputLookupDir]},
    format="csv",
    format_options={"withHeader": True, "separator": ",", "quoteChar": '"', "escaper": '"'})

Additionally, the following code uses the transformation context while reading the Data Catalog table, which is loaded into an Amazon Redshift table:

datasource0 = GlueContext.create_dynamic_frame.from_catalog(
    database=Glue_catalog_database,
    table_name=Glue_table_name,
    transformation_ctx="datasource0")

You can see in the screenshot below that the 2019 October yellow taxi trip data file has arrived for processing (the incremental dataset).

To process each month’s data, you need the taxi zone lookup (full dataset).

The following screenshot shows the output of the AWS Glue job after processing the 2019 October trip data, saved in Parquet format.

The following two screenshots show the Amazon Redshift table: the first displays the record count for the October 2019 taxi data, and the second shows that only the October 2019 taxi data file has been processed so far.

The following screenshot shows that the November 2019 NYC taxi data file has arrived for processing.

The following screenshot shows the output of the AWS Glue job after processing the 2019 November trip data, saved in Parquet format. The job processed only the November data and didn't reprocess the October data, because the job bookmark and transformation context were enabled.

The following screenshot shows that the Amazon Redshift table now has both October and November data and shows the total record count.

The following screenshot shows individual record count for each month.

Querying with Athena

You can also review the dataset in Athena, which uses the same Glue Data Catalog. The following screenshot of an Athena query shows the Data Catalog table has both October and November data, with the total record count.

The following screenshot of an Athena query shows the individual record count for each month.

The following screenshot shows the location information, including borough, zone, and service zone, which is available in the taxi zone lookup and is joined with the October taxi trip data.

The following screenshot shows the output for the same query on the November data.

Cleaning up

When you’re done using this solution, you should delete the CloudFormation stack to avoid incurring any further charges.

Conclusion

This post describes how you can merge datasets received in different frequencies as part of your ETL pipeline processing using AWS Glue job bookmarks. The use case demonstrated how to use job bookmarks and transformation context to build an ETL pipeline for processing several incremental datasets.

 


About the Authors

Dipankar is a Senior Data Architect with AWS Professional Services, helping customers build analytics platforms and solutions. He has a keen interest in distributed computing. Dipankar enjoys spending time playing chess and watching old Hollywood movies.


Ashok Padmanabhan is a big data consultant with AWS Professional Services, helping customers build big data and analytics platforms and solutions. When not building and designing data lakes, Ashok enjoys spending time at beaches near his home in Florida.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on it. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies want all their data to be accessible to all users, all the time, with fast answers. The best solution for all those requirements is to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model in which cooler data is maintained at the lowest cost in Amazon S3 and the hottest data is kept in Amazon Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to keep costs under control. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used when building their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column            Type
O_ORDERKEY        int4
O_CUSTKEY         int4
O_ORDERSTATUS     varchar
O_TOTALPRICE      numeric
O_ORDERDATE       date
O_ORDERPRIORITY   varchar
O_CLERK           varchar
O_SHIPPRIORITY    int4
O_COMMENT         varchar
SKIP              varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('select o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate') 
TO 's3://tpc-bucket/orders/' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS PARQUET ALLOWOVERWRITE PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible based on the query patterns intended. The query pattern is often by date ranges; for this use case, use the o_orderdate field as the partition key.

Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results to one or more files per node slice (virtual worker in the Amazon Redshift cluster) which allows you to use the Amazon Redshift MPP architecture. However, this can potentially cause files created by every slice to be small. In Moovit’s use case, the default UNLOAD using PARALLEL ON yielded dozens of small (MBs) files. For Moovit, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the LEADER node and wrote it out as a single stream controlling the file size using the MAXFILESIZE option.

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently-used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
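
If you want to script the crawler instead of using the console, a minimal boto3 sketch looks like the following; the crawler name, role, and S3 path are placeholders.

import boto3

glue = boto3.client("glue")

# Create a crawler that catalogs the unloaded Parquet files
glue.create_crawler(
    Name="orders-crawler",                                         # placeholder name
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",         # placeholder role
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://tpc-bucket/orders/"}]},
)

glue.start_crawler(Name="orders-crawler")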

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.
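
You can also run the same kind of query programmatically; the following boto3 sketch assumes the crawler created a table named orders in a database named datalake, and that you have an S3 location for query results.

import time
import boto3

athena = boto3.client("athena")

# Submit a query against the crawled table
query_id = athena.start_query_execution(
    QueryString="SELECT o_orderstatus, count(*) AS order_count FROM orders GROUP BY o_orderstatus",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://tpc-bucket/athena-results/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print the result rows
while athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([col.get("VarCharValue") for col in row["Data"]])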

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Redshift clusters to query the same data in the lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when the SQL query references an external table stored in Amazon S3. Multiple large queries can run in parallel by using Amazon Redshift Spectrum on external tables to scan, filter, aggregate, and return rows from Amazon S3 back to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum 
FROM data catalog 
DATABASE 'datalake' 
iam_role 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS (
SELECT * FROM public.orders WHERE o_orderdate >= dateadd('day', -90, date_trunc('day', getdate()))
UNION ALL
SELECT * FROM spectrum.orders WHERE o_orderdate < dateadd('day', -90, date_trunc('day', getdate()))
) WITH NO SCHEMA BINDING;

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within Moovit can access the data, analyze it with various tools, and come up with new insights.

As an additional benefit, Moovit reduced their Amazon Redshift storage utilization, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Keeping only hot data on the Amazon Redshift cluster also means Moovit no longer has to delete data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

 


About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.


Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.


Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.


Analyzing Google Analytics data with Amazon AppFlow and Amazon Athena

Post Syndicated from Shimura Makoto original https://aws.amazon.com/blogs/big-data/analyzing-google-analytics-data-with-amazon-appflow-and-amazon-athena/

Software as a service (SaaS) applications are rapidly growing in importance, and their data is essential to include when performing analytics that influence business decisions. Amazon AppFlow is a fully managed integration service that helps you transfer SaaS data to your data lake securely. You can run data transfer flows on demand, on a schedule, or after an event. You can quickly analyze this data using Amazon Athena and join it with numerous datasets already stored on Amazon Simple Storage Service (Amazon S3). You can also join multiple SaaS datasets and combine them with operational data sitting in traditional databases such as Amazon Relational Database Service (Amazon RDS) via the Athena federated query feature.

This post walks you through extracting Google Analytics data using Amazon AppFlow and storing it in Amazon S3 so you can query it with Athena.

Architecture overview

The following diagram shows the flow described in this post. You first create a new flow inside Amazon AppFlow to transfer Google Analytics data to Amazon S3. The format of the transferred data is multi-line JSON, which Athena doesn't support. An AWS Lambda function transforms this JSON file into Apache Parquet format. This transformation enables you to run queries efficiently and cost-effectively. The function can also include other transformations, such as Amazon S3 prefix changes and storing the data using Hive-style partitions. Amazon AppFlow supports scheduled jobs to extract only new data, so you can develop an automated workflow using an Amazon S3 event trigger and a transformation Lambda function. Amazon AppFlow is currently available in 15 Regions; pick the Region where your S3 bucket is located. In this walkthrough, you use US East (N. Virginia).

In this post, you use a sample Google account, an OAuth client with appropriate permissions, and Google Analytics data. You can also use your own Google resources. To enable Google Analytics access from Amazon AppFlow, you should set up a new OAuth client in advance. Complete the following steps:

  1. On the Google API Console (https://console.developers.google.com), choose Library.
  2. Enter analytics in the search field.
  3. Choose Google Analytics API.
  4. Choose ENABLE and return to the previous page.
  5. Choose Google Analytics Reporting API listed in the search results.
  6. Choose ENABLE and return to the main page.
  7. Choose OAuth consent screen.
  8. Create a new Internal app (if you’re using your personal account, choose External).
  9. Add amazon.com as Authorized domains.
  10. Choose Add scope.
  11. Add ../auth/analytics.readonly as Scopes for Google APIs.
  12. Choose Save.
  13. Choose Credentials.
  14. Add OAuth client ID credentials.
  15. Choose Web application.
  16. Enter https://console.aws.amazon.com/ as an authorized JavaScript origins URL.
  17. Enter https://AWSREGION.console.aws.amazon.com/appflow/oauth as an authorized redirect URL. (Replace AWSREGION with the Region you’re working in. If you’re using Amazon AppFlow in us-east-1, enter https://console.aws.amazon.com/appflow/oauth.)
  18. Choose Save.

Setting up Lambda and Amazon S3

You need to start by creating a new S3 bucket as your Amazon AppFlow transfer destination. Then you develop a new Lambda function to transform JSON format data into Parquet format using pandas and pyarrow modules. Finally, you set an Amazon S3 event trigger to automatically call the Lambda function when a new Amazon S3 object is created.

Creating a new S3 bucket

To create an Amazon S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a name for your bucket; for example, appflow-ga-sample.
  3. Choose Create bucket.

Preparing a .zip file for your Lambda layer

To create a .zip file that includes the pandas and pyarrow modules, complete the following steps:

  1. Set up any environment that can run Docker.
  2. Run the following command:
mkdir python
docker run -it --rm -v $(pwd)/python:/python python:3.6.8 pip install -t /python pandas==0.23.4 pyarrow==0.11.1
zip -r pandas-pyarrow.zip python
  3. On the Amazon S3 console, choose appflow-ga-sample.
  4. Choose Create folder.
  5. Enter a name for your folder; for example, lambda-layer.
  6. Choose Save.
  7. Choose lambda-layer.
  8. Choose Upload.
  9. Choose pandas-pyarrow.zip and choose Upload.

Creating a Lambda layer for Parquet export

To create a Lambda layer, complete the following steps:

  1. On the Lambda console, choose Layers.
  2. Choose Create layer.
  3. For name, enter a name for your layer; for example, pandas-pyarrow.
  4. Select Upload a file from Amazon S3.
  5. For Amazon S3 link URL, enter the Amazon S3 path for your .zip file; for example, s3://appflow-ga-sample/lambda-layer/pandas-pyarrow.zip.
  6. For Compatible runtimes, choose Python 3.6.
  7. Choose Create.

Creating a Lambda function for data transformation

To create a Lambda function and trigger an Amazon S3 event, complete the following steps:

  1. On the Lambda console, choose Create function.
  2. Select Author from scratch.
  3. For Function name, enter a name for your function; for example, ga-converter.
  4. For Runtime, choose Python 3.6.
  5. Select Create a new role with basic Lambda permissions.
  6. Choose Create function.

  7. In the Lambda function configuration, enter the following code in the lambda_function area of the Function code section.

This Lambda function downloads AppFlow output file, extracts the necessary data from the Google Analytics JSON file, and transforms it into Parquet format. Finally, it uploads it to Amazon S3 again with a different key name. You can modify the script, especially in dimensions and values names, or any other transformations according to your needs.

import json
import pandas as pd
import boto3
from datetime import datetime as dt

def lambda_handler(event, context):
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_key = event['Records'][0]['s3']['object']['key']

    s3_client = boto3.client('s3')
    raw_object = s3_client.get_object(Bucket=bucket_name, Key=object_key)
    raw_data = json.loads(raw_object['Body'].read().decode('utf-8'))
    
    record_dates = [dt.strptime(r['dimensions'][0], '%Y%m%d%H') for r in raw_data['reports'][0]['data']['rows']]
    devices = [r['dimensions'][1] for r in raw_data['reports'][0]['data']['rows']]
    user_counts = [int(r['metrics'][0]['values'][0]) for r in raw_data['reports'][0]['data']['rows']]
    df = pd.DataFrame({
        'year': [r.year for r in record_dates],
        'month': [r.month for r in record_dates],
        'day': [r.day for r in record_dates],
        'hour': [r.hour for r in record_dates],
        'device': devices,
        'user_count': user_counts
    })
    
    output_file = dt.now().strftime('%Y%m%d%H%M%S')
    output_path = '/tmp/{}.parquet'.format(output_file)
    df.to_parquet(output_path)

    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    bucket.upload_file(output_path, 'ga-data/{}.parquet'.format(output_file))

Adding layers

To add layers to your Lambda function, complete the following steps:

  1. Choose Layers in the Designer section.
  2. Choose Add a layer.
  3. Select Select from list of runtime compatible layers.
  4. For Name, choose pandas-pyarrow.
  5. For Version, choose 1.
  6. Choose Add.

Increasing your timeout setting

To increase the Lambda timeout setting, complete the following steps:

  1. On the Configuration tab, under Basic settings, choose Edit.
  2. Set 30 sec as Timeout.
  3. Choose Save.
  4. On the Configuration tab, choose Save.

Adding an event trigger

To add an Amazon S3 event trigger, complete the following steps:

  1. In the Designer section, choose Add trigger.
  2. Choose S3, and choose the bucket you created.
  3. For Event type, choose All object create events.
  4. For Prefix, enter raw.
  5. Choose Add.

Adding permissions

To add appropriate permissions for this Lambda function to read and write Amazon S3 objects, complete the following steps:

  1. On the Permissions tab, note the name of the function's execution role; for example, ga-converter-role-zm6u0f4g.
  2. On the AWS Identity and Access Management (IAM) console, choose Policies.
  3. Choose Create Policy.
  4. On the JSON tab, enter the following policy (replace the target bucket name arn:aws:s3:::appflow-ga-sample* with your own bucket name).
{	
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
             ],
            "Resource": [
                "arn:aws:s3:::appflow-ga-sample*"
            ]
        }
    ]
}
  5. Choose Review policy.
  6. Enter a name for your new policy; for example, lambda-s3-ga-converter-policy.
  7. Choose Create policy.
  8. On the IAM console, choose Roles.
  9. Enter your role name (ga-converter-role-zm6u0f4g) in the search field.
  10. Choose your role.
  11. Choose Attach policies.
  12. Choose lambda-s3-ga-converter-policy.
  13. Choose Attach policy.

Setting up Amazon AppFlow

Now you can create a new Amazon AppFlow flow to transfer from Google Analytics to Amazon S3. To create a new Amazon AppFlow transfer flow, complete the following steps:

  1. On the Amazon AppFlow console, choose Create flow.
  2. Enter a name for your flow; for example, my-ga-flow.
  3. Choose Next.
  4. For Source name, choose Google Analytics.
  5. Choose Create new connection.
  6. Enter your OAuth client ID and client secret, then name your connection; for example, ga-connection.
  7. In the pop-up window, choose to allow amazon.com access to the Google Analytics API.
  8. For Choose Google Analytics object, choose Reports.
  9. For Choose Google Analytics view, choose All Web Site Data.
  10. For Destination name, choose Amazon S3.
  11. For Bucket details, choose the bucket you created.
  12. Enter raw as a prefix.
  13. Select Run on demand.

  14. Choose Next.
  15. Select Manually map fields.
  16. Select the following three fields for Source field name:
    • Time: DIMENSION: ga:dateHour
    • Platform or Device: DIMENSION: ga:deviceCategory
    • User: METRIC: ga:users
  17. Choose Map fields directly.

  18. Choose Next.
  19. In the Add filters section, choose Next.
  20. Choose Create flow.

Running the flow

After creating your new flow, you can run it on demand:

  1. On the Amazon AppFlow console, choose my-ga-flow.
  2. Choose Run flow.

For this walkthrough, you choose on-demand job execution for ease of understanding. In practice, you can choose a scheduled job and periodically extract only newly added data. The Amazon S3 event trigger also helps you transform data automatically.
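
You can also start the flow programmatically, which is convenient for testing or automation; a minimal boto3 sketch follows.

import boto3

appflow = boto3.client("appflow")

# Trigger an on-demand run of the flow created earlier
response = appflow.start_flow(flowName="my-ga-flow")
print(response["flowArn"], response["executionId"])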

Querying via Athena

You need to create an external table before querying. Complete the following steps:

  1. On the Athena console, enter create database appflow_data into the query editor.
  2. Choose Run query.
  3. Enter the following command in the query editor (replace the target bucket name appflow-ga-sample with your own bucket):
CREATE EXTERNAL TABLE appflow_data.ga_sample (
  `year` int,
  `month` int,
  `day` int,
  `hour` int,
  `device` string,
  `user_count` int
 )
STORED AS PARQUET
LOCATION 's3://appflow-ga-sample/ga-data'
tblproperties ("parquet.compression"="SNAPPY")
;
  1. Choose Run query.

Now you can query the Google Analytics data. Enter the following query and run it. This query shows which kinds of devices are popular for accessing your website, aggregated by day:

SELECT
  year
  , month
  , day
  , device
  , count(user_count) as cnt
FROM
  appflow_data.ga_sample
GROUP BY
  year
  , month
  , day
  , device
ORDER BY
  cnt DESC
LIMIT 10
; 

The following screenshot shows the query results.

Summary

This post demonstrated how you can transfer Google Analytics data to Amazon S3 using Amazon AppFlow and analyze it with Amazon Athena. You no longer need to build your own application to extract data from Google Analytics and other SaaS applications. Amazon AppFlow enables you to develop a fully automated data transfer and transformation workflow and an integrated query environment in one place.


About the Author

Makoto Shimura is a specialist solutions architect, analytics at Amazon Web Services. He helps customers develop efficient data pipelines on the AWS platform. Previously, he worked as a data engineer, developing a distributed data platform. Outside of work, he loves to spend time with his family, play with his dog, and also play video games.

 

Setting up trust between ADFS and AWS and using Active Directory credentials to connect to Amazon Athena with ODBC driver

Post Syndicated from Arun Alapati original https://aws.amazon.com/blogs/big-data/setting-up-trust-between-adfs-and-aws-and-using-active-directory-credentials-to-connect-to-amazon-athena-with-odbc-driver/

Amazon Athena is a serverless, interactive query service that allows you to easily analyze your raw and processed datasets in Amazon Simple Storage Service (Amazon S3) using standard SQL. The JDBC and ODBC drivers that Athena provides let you integrate your data analytics tools (such as Microsoft Power BI, Tableau, or SQL Workbench) with Athena and gain insights about your data in minutes.

Before November 2018, you had to connect to Athena with ODBC or JDBC drivers using your IAM user or role credentials. However, with the November 20, 2018 release of support for Microsoft Active Directory Federation Services (ADFS 3.0) and Security Assertion Markup Language (SAML 2.0) in the Athena ODBC/JDBC driver, you can now connect to Athena directly using your Microsoft Active Directory (AD) credentials.

Microsoft ADFS 3.0, a component of Windows Server, supports SAML 2.0 and is integrated with AWS Identity and Access Management (IAM). This integration allows Active Directory (AD) users to federate to AWS using corporate directory credentials, such as a user name and password from Microsoft Active Directory.

This post walks you through configuring ADFS 3.0 on a Windows Server 2012 R2 Amazon Elastic Compute Cloud (Amazon EC2) instance and setting up trust between the ADFS 3.0 IdP and AWS through SAML 2.0. The post then demonstrates how to install the Athena ODBC driver on an Amazon EC2 Linux (RHEL) instance and configure it to use ADFS for authentication.

Solution overview

The following architecture diagram shows how an AD user in your organization is authenticated in the Athena ODBC/JDBC driver:

The process includes the following steps:

  1. A user in your organization uses a client application with the JDBC or ODBC driver to request authentication from your organization’s IdP. The IdP is ADFS 3.0.
  2. The IdP authenticates the user against AD, which is your organization’s Identity Store.
  3. The IdP constructs a SAML assertion with information about the user and sends the assertion to the client application via the JDBC or ODBC driver.
  4. The JDBC or ODBC driver calls the AWS Security Token Service AssumeRoleWithSAML API operation (shown in the sketch after this list), passing it the following parameters:
    • The ARN of the SAML provider
    • The ARN of the role to assume
    • The SAML assertion from the IdP
  5. The API response to the client application via the JDBC or ODBC driver includes temporary security credentials.
  6. The client application uses the temporary security credentials to call Athena API operations, which allows your users to access Athena API operations.
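
To make step 4 concrete, the following sketch shows how a client could exchange a SAML assertion for temporary credentials with AWS STS. The ODBC/JDBC driver performs this call for you; the ARNs and the assertion value below are placeholders:

import boto3

# The base64-encoded SAML assertion returned by the IdP (ADFS) -- placeholder value.
saml_assertion_base64 = "<base64-encoded SAML assertion from ADFS>"

sts = boto3.client("sts")
response = sts.assume_role_with_saml(
    RoleArn="arn:aws:iam::123456789012:role/ArunADFSTest",              # placeholder role ARN
    PrincipalArn="arn:aws:iam::123456789012:saml-provider/MytestADFS",  # placeholder provider ARN
    SAMLAssertion=saml_assertion_base64,
)

# The temporary credentials can now be used to call Athena API operations.
credentials = response["Credentials"]
athena = boto3.client(
    "athena",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)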

This post walks you through configuring ADFS-AWS trust through SAML and using that trust to federate AD users in the Athena JDBC/ODBC driver.

To implement this solution, you complete the following steps:

  • Configure ADFS (3.0) on a Windows Server 2012 R2 Amazon EC2 instance
  • Set up trust between AWS and ADFS (3.0) through SAML 2.0 rules
  • Install the Athena ODBC driver 1.0.5 on an RHEL EC2 instance and configure it to use ADFS

Prerequisites

For this walkthrough, you need to have the following prerequisites:

  • An understanding of the concepts of Active Directory. The steps for configuring Active Directory on a Windows instance are outside the scope of this post.
  • An understanding of IAM roles and concepts.
  • DNS and networking set up between your Active Directory server and the instance on which the Athena ODBC or JDBC driver is installed. This post sets up Active Directory (which runs ADFS) on a Windows Amazon EC2 instance and the ODBC driver on another EC2 instance, both in the same AWS VPC and subnet. For your own use case, you need to provide connectivity between the ADFS server and the ODBC/JDBC instance.

Configuring ADFS (3.0) on a Windows Server 2012 R2 Amazon EC2 instance

In the following steps, you install ADFS 3.0 on a Windows Server 2012 R2 Amazon EC2 instance. As per the prerequisites, you already installed Active Directory on a Windows Server 2012 R2 EC2 instance. For this post, the domain name is arunad.local. For instructions on setting up an Active Directory domain controller on an EC2 instance, see Building Your First Domain Controller on 2012 R2 on the Microsoft TechNet website.

Installing prerequisites for ADFS 3.0

To configure ADFS 3.0 on a Windows domain controller, you must have the following:

  • An SSL certificate – For this post, you can create a self-signed certificate by installing IIS (Internet Information Services)
  • A service account – Create an Active Directory user and add it to the Domain Admins group
  1. Install IIS on the Windows Server 2012 R2 Amazon EC2 instance. For instructions, see How to install and configure IIS on Windows Server 2012 R2 on The Solving website. For this post, you can skip Step 2 in the preceding instructions.
  2. After you install IIS, create a self-signed certificate. For instructions, see How to Create a Self Signed SSL Certificate with Windows Server on the Sophos Community website. For this post, you can skip the step about binding the self-signed certificate in the preceding instructions.

To configure a service account in your domain controller, you create a user in your Active Directory with the name ADFSSVC and add the user to the Domain Admins group.

  1. Open Server Manager.
  2. Choose Tools.
  3. Choose Active Directory Users and Computers.
  4. Expand your domain (arunad.local).
  5. Choose User (right-click).
  6. Choose New.
  7. Choose User.
  1. Create a user with the name ADFSSVC.
  2. Set the password to never expire.

You can now add user ADFSSVC to the domain admins group.

  1. Choose Users.
  2. Choose ADFSSVC (right-click) and choose Add to group.
  3. In the search bar, enter domain.
  4. Choose Check names.
  5. Choose Domain Admins.
  6. Choose OK.

You receive a message that the user is added to the group, but should still verify it.

  1. Choose ADFSSVC (right-click) and choose Properties.
  2. On the Member Of tab, check that Domain Admins is listed.

Installing and configuring ADFS 3.0

Now that you have installed the prerequisites for ADFS 3.0, you can install and configure ADFS 3.0 on Windows Server 2012 R2 EC2 instance.

  1. Open Server Manager.
  2. Choose Add Roles and Features.
  3. Select Role-based or feature-based installation.

  1. Choose Next until you reach the Select server roles page.
  2. For Roles, select Active Directory Federation Services.

  1. Choose Next until you reach the Confirm installation selections page.
  2. Choose Install.

  1. Choose Configure the Federation Service for this server.
  2. Select Create the first federation server in a federation server farm.
  3. Choose Next.

  1. Choose Next until you reach the Specify Service Properties page.
  2. For SSL Certificate, choose the self-signed certificate you installed earlier.
  3. For Federation Service Display Name, enter ArunADFS.
  4. Choose Next.

  1. On the Specify Service Account page, select Use an existing domain user account or group Managed Service Account.
  2. Choose Select.

  1. In the text box, enter ADFSSVC.
  2. Choose Check names.
  3. When the name is populated, choose OK.

  1. Enter your password and choose Next.

  1. Select Create a database on this server using Windows Internal Database.
  2. Choose Next.

  1. Choose Next until you reach the Pre-requisite Checks page.
  2. Choose Configure.

When the server is successfully configured, you may see the following warning message: 

An error occurred during an attempt to set the SPN for the specified service account. Set the SPN for the service account manually. For more information about setting the SPN of the service account manually, see the AD FS Deployment Guide. Error message: The SPN required for this Federation Service is already set on another Active Directory account. Choose a different Federation Service name and try again.

To fix the problem, run the following command by opening PowerShell as an administrator:

setspn -a host/localhost adfssvc

The following code shows the output.

The ADFS 3.0 configuration is now complete.

  1. To download your ADFS server’s federation XML file, open a browser on your Windows Server and enter the following address: https://<yourservername>/FederationMetadata/2007-06/FederationMetadata.xml.

This file is required to set up trust between ADFS and AWS.

Alternatively, you can download the ADFS server’s federation XML file by running the following command as administrator in PowerShell 3.0+:

wget https://<your-server-name>/FederationMetadata/2007-06/FederationMetadata.xml -OutFile FederationMetadata.xml

To find your server name, choose Start (the Windows icon), right-click My Computer, choose Properties, and note the full computer name.

Copy the downloaded XML file onto your local machine so you can use it when creating a SAML-based role in IAM in the next step.

Establishing trust between Windows AD (using ADFS IDP) and AWS via SAML 2.0

Now that you have configured the ADFS 3.0 on a Windows Server, you can establish the trust between AWS and the IdP (ADFS) via SAML assertion claim rules. By establishing this trust, users in your AD can federate into AWS using an IAM role and access AWS resources such as Athena or the AWS Glue Data Catalog.

Setting up this trust requires configuration in both AWS and Active Directory. For AWS, you set up IAM roles and establish a SAML provider. For Active Directory, you write the SAML assertion and claim rules.

Setting up your SAML provider in IAM

To set up your SAML provider, complete the following steps:

  1. On the IAM console, choose Identity providers.
  2. Choose Create provider.
  3. For Provider Type, choose SAML.
  4. For Provider Name, enter MytestADFS.
  5. For Metadata Document, choose the XML file you downloaded earlier.

  1. Create a new role in IAM and choose the trusted entity as SAML 2.0 federation.
  2. For SAML provider, choose the provider you created earlier (MytestADFS).
  3. For Attribute, select SAML:aud.

  1. Add the necessary IAM permissions to this role and create the IAM role.

For this post, attach the AmazonAthenaFullAccess managed policy and name the role ArunADFSTest.

The role name you give in this step is crucial because any users and groups you create in Active Directory as part of establishing trust in the following steps are based on this name.
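
If you prefer to script this IAM setup instead of using the console, the following sketch creates the SAML provider and a role that trusts it with boto3. The metadata file path and the account-specific names are taken from this walkthrough and are placeholders for your own values:

import json
import boto3

iam = boto3.client("iam")

# Create the SAML identity provider from the ADFS federation metadata file.
with open("FederationMetadata.xml") as f:
    provider = iam.create_saml_provider(SAMLMetadataDocument=f.read(), Name="MytestADFS")

# Trust policy that lets federated users from the provider assume the role via SAML.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": provider["SAMLProviderArn"]},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}},
    }],
}

iam.create_role(RoleName="ArunADFSTest", AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(
    RoleName="ArunADFSTest",
    PolicyArn="arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
)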

Configuring the SAML assertion rules in ADFS 3.0 IdP

In this step, you configure the SAML assertion rules in your ADFS so that users can federate into AWS using the IAM role you created.

How you create and delineate your AD groups and IAM roles in AWS is crucial to how you secure access to your account and manage resources. One approach for creating AD groups that uniquely identify the IAM role mapping is to adopt a common group naming convention.

For this post, create a group in Active Directory with the naming convention AWS-<AccountID>-<IAMRolename>; for example, AWS-123456789012-ArunADFSTest.

This naming convention is crucial in the next steps because you write SAML assertion claim rules that fetch all the AD groups of your AD user that contain the string AWS-<AccountID>- and extract the last portion of the group name to map it to an IAM role in AWS. For example, if the AD user that you're authenticating to AWS is part of the AD group AWS-123456789012-ArunADFSTest, the claim rules, which you write later, find all groups of the AD user that match the string AWS-123456789012-, extract the last portion of the group name (ArunADFSTest), and send it as the role name to the AWS SAML endpoint in the format arn:aws:iam::<AccountID>:role/ArunADFSTest.

  1. In Server Manager, under Tools, choose Active directory users and computers.
  2. Choose your domain (right-click) and choose New.
  3. Choose Group.

The following screenshot shows creating an AD group with name AWS-123456789012-ArunADFSTest:

  1. After you create the group in AD with name AWS-123456789012-ArunADFSTest, create a new user in that group. For this post, name the user myldapuser1.

  1. Make sure the E-mail field of the user contains a valid email address, because this value is passed as the RoleSessionName to AWS when the SAML token is constructed.

  1. After you create the user, add the user to the AD group AWS-123456789012-ArunADFSTest.

Now that you’ve created the AD groups, AD users, and IAM roles, you create the relying party trust in ADFS and write the claim rules. The ADFS IdP needs to construct the following values in the SAML assertion and send the values to AWS for authentication:

    • NameID
    • RoleSessionName
    • Roles (which contains the ARNs of your SAML IdP in AWS and the role name)

For instructions on setting up the relying trust and claim rules in ADFS, see AWS Federated Authentication with Active Directory Federation Services (AD FS). For this walkthrough, you can start at the Active Directory Federation Services Configuration section.

  1. For Display name, enter My Amazon Portal.

After the configuration, your claim rules look similar to the following screenshots.

The following screenshot shows the rules for NameID.

The following screenshot shows the rules for RoleSessionName.

The following screenshot shows the rules for Get AD Groups.

The following screenshot shows the rules for Roles. The SAML provider in IAM was created with the name MytestADFS (ARN arn:aws:iam::123456789012:saml-provider/MytestADFS), so you need to use that same value (MytestADFS) in the claim script. If you used a different name, replace it with your IdP's name.

After you create these four rules, your ADFS relying trust setup is complete.

Verifying your IdP

To verify that you set up your IdP successfully, complete the following steps:

  1. Navigate to the following URL in the browser on your ADFS server Windows instance (use your Windows Server hostname): https://<windows-hostname>/adfs/ls/IdpInitiatedSignOn.aspx.

  1. Select your ADFS display name (My Amazon Portal).

  1. Enter the AD credentials for the user myldapuser1 you created earlier.

Because the domain name for this demo is arunad.local, specify the user name as ARUNAD\myldapuser1, or you can specify it as myldapuser1@arunad.local.

If everything is successful, you should be able to sign in to the AWS Management Console.

If you encounter any errors, review the preceding steps. For more information about common errors with SAML, see Troubleshooting SAML 2.0 Federation with AWS. For additional information about troubleshooting, see How to View a SAML Response in Your Browser for Troubleshooting.

When initiating a connection to your ADFS server, the Athena ODBC/JDBC driver also uses this same federation URL (https://<windows-hostname>/adfs/ls/IdpInitiatedSignOn.aspx) to retrieve the SAML assertion.

Installing the Athena ODBC driver 1.0.5 on an Amazon EC2 Linux instance and configuring it to use ADFS

Now that you have configured ADFS 3.0 and set up trust with AWS, the last step is to configure the Athena ODBC driver to use ADFS as its authentication mechanism.

As per the prerequisites, you launched the Active Directory Windows EC2 instance and the Athena ODBC driver Linux (RHEL) EC2 instance in the same VPC and subnet and allowed all traffic between both instances. However, in your environment, you need to make sure your ADFS server can communicate with the machine that has Athena JDBC/ODBC installed.

For this post, the domain name servers in the DHCP options set of the VPC (in which you launch the Linux EC2 instance) are modified as follows: 172.31.X.X, AmazonProvidedDNS, where 172.31.X.X is the IP address of the EC2 Windows instance on which ADFS is configured with the domain name arunad.local.

Setting up the environment on your EC2 instance

To set up your environment, complete the following steps:

  1. Launch a Linux EC2 instance with an AMI that supports the Red Hat Enterprise Distribution (for example, AMI with RHEL-7.6_HVM_GA-20181017-x86_64-0-Hourly2-GP2).
  2. SSH into the instance and enter the following commands:
sudo yum install telnet
sudo yum install nc
sudo yum install gcc
sudo yum install wget
sudo yum install vim
  1. Verify connectivity between your ADFS server and RHEL EC2 instance with a networking tool of your choice.

This post uses the ping utility. The following code shows the output:

  1. Install the open-ldap client and test if you can search for a user in AD from this Linux instance (replace the values with your user and domain name):
sudo yum install openldap-clients
ldapsearch -h arunad.local -p 389 -D "CN=myldapuser1,CN=Users,DC=arunad,DC=local" -x -W -b "DC=arunad,DC=local"

If these commands are successful, the RHEL EC2 instance can communicate with the AD server and retrieve the credentials.

Installing and configuring UnixODBC Driver Manager 2.3.4

The Athena ODBC driver on Linux requires you to have one of the following ODBC Driver Managers installed to set up the connection:

  • iODBC 3.52.9, 3.52.10, 3.52.11, or 3.52.12
  • unixODBC 2.3.2, 2.3.3, or 2.3.4
  1. Install the UnixODBC driver manager 2.3.4 on your RHEL EC2 instance. For instructions, see unixODBC-2.3.4 on the Beyond Linux From Scratch website.

The command needed for installation on your EC2 instance should look similar to the following code:

wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz
tar -zxvf unixODBC-2.3.4.tar.gz
cd unixODBC-2.3.4
./configure --prefix=/usr --sysconfdir=/etc/unixODBC && make
  1. Install the ODBC Driver Manager by switching to the root user. See the following code:
sudo -su root

make install &&

find doc -name "Makefile*" -delete                &&
chmod 644 doc/{lst,ProgrammerManual/Tutorial}/*   &&

install -v -m755 -d /usr/share/doc/unixODBC-2.3.4 &&
cp      -v -R doc/* /usr/share/doc/unixODBC-2.3.4

  1. After the ODBC Driver Manager is installed, make sure the following files are present, which indicate a successful installation of the ODBC driver manager on an RHEL EC2 instance:
    • /etc/unixODBC/odbcinst.ini
    • /etc/unixODBC/odbc.ini
  1. The ODBC Driver Manager library files created as part of the installation are present in the path /usr/lib. Set the shared library path to point to your ODBC Driver Manager libraries by entering the following command as a non-root user:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib

  1. Verify the ODBC Driver Manager environment configuration is loaded properly by entering the code odbcinst -j.

Installing and configuring the Simba Athena ODBC Driver (64-bit)

Now that the ODBC driver manager is configured, the final step is to install the Athena ODBC driver 1.0.5 on this Linux Instance and configure it to use ADFS as the authentication mechanism.

  1. To install the Athena ODBC driver on this EC2 instance, enter the following code:
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC_1.0.5/Linux/simbaathena-1.0.5.1006-1.x86_64.rpm

sudo yum --nogpgcheck localinstall simbaathena-1.0.5.1006-1.x86_64.rpm

The example ODBC configuration file of the downloaded driver is included in the path /opt/simba/athenaodbc/Setup/odbc.ini.

  1. To configure the ODBC driver to use ADFS as an authentication mechanism, log in as ec2-user and enter the following code:

vim .odbc.ini

  1. Insert the following configuration directives:
[ODBC]
Trace=no

[ODBC Data Sources]
Simba Athena 64-bit=Simba Athena ODBC Driver 64-bit


[Simba Athena 64-bit]
Description=Simba Athena ODBC Driver (64-bit) DSN
Driver=/opt/simba/athenaodbc/lib/64/libathenaodbc_sb64.so


# Connection configurations should be set here.
AwsRegion=us-west-2
Schema=default
S3OutputLocation=s3://aws-athena-query-results-123456780912-us-west-2
AuthenticationType=ADFS
UID=ARUNAD\myldapuser1
PWD=XXXXXXXX
IdP_Host=win-qikm653mpj9.arunad.local
IdP_Port=443
SSL_Insecure=true

Replace the values for IdP_Host and IdP_Port to point to your ADFS server. For this post, these values are win-qikm653mpj9.arunad.local and 443. Similarly, replace UID and PWD with the LDAP user name and password you created earlier; this post uses ARUNAD\myldapuser1. Also, replace the AwsRegion and S3OutputLocation values according to your environment. For production workloads, make sure that you set SSL_Insecure to false so the driver can verify the server certificate.

You can retrieve the HTTPS port number of your ADFS server by entering the following command in PowerShell on the Windows AD server instance:

Get-AdfsProperties

  1. To enable DEBUG level logging on your Athena ODBC driver, edit the file /opt/simba/athenaodbc/lib/64/simba.athenaodbc.ini and set the following values:
[Driver]
## - Note that this default DriverManagerEncoding of UTF-32 is for iODBC.
ErrorMessagesPath=/opt/simba/athenaodbc/ErrorMessages
LogLevel=5
LogPath=/home/ec2-user/odbclogs/
SwapFilePath=/tmp
  1. Now that the Athena ODBC driver is configured, you can test it by entering the following code:

isql -v "Simba Athena 64-bit"

In the preceding code, Simba Athena 64-bit refers to the name of the DSN you specified in /home/ec2-user/.odbc.ini when connecting to the ODBC driver.

If the connection succeeds, you have connected through the Athena ODBC driver by authenticating your user against ADFS.
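
As an alternative to isql, you can exercise the same DSN from a small script. The following sketch assumes the pyodbc package is installed on the instance and reuses the DSN name defined in ~/.odbc.ini:

import pyodbc

# Connect through the "Simba Athena 64-bit" DSN defined in ~/.odbc.ini.
# The driver authenticates against ADFS using the UID/PWD configured there.
conn = pyodbc.connect("DSN=Simba Athena 64-bit", autocommit=True)

cursor = conn.cursor()
cursor.execute("SELECT 1")  # simple smoke test against Athena
print(cursor.fetchone())
conn.close()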

You can also check the connection log to verify the connection URI used by your driver and values returned by ADFS to the Athena ODBC driver.

Entries from the connection log look similar to the following:

May 09 01:40:58.761 DEBUG 50743104 IAMAdfsCredentialsProvider::FormBasedAuthentication: verifySSL: false
May 09 01:40:58.761 DEBUG 50743104 IAMAdfsCredentialsProvider::FormBasedAuthentication: Using URI: https://win-qikm653mpj9.arunad.local:443/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices
May 09 01:40:58.821 DEBUG 50743104 IAMSamlPluginCredentialsProvider::GetAWSCredentialsWithSaml: Using RoleArn: arn:aws:iam::143280751103:role/ArunADFSTest, PrincipalArn: arn:aws:iam::143280751103:saml-provider/MytestADFS

Conclusion

This post demonstrated how to configure ADFS 3.0 on your Active Directory and use it as an IdP to federate into AWS using SAML. This post also showed how you can integrate your Athena ODBC driver to ADFS and use your Active Directory credentials directly to connect to Athena. Integrating your Active Directory with the Athena ODBC driver gives you the flexibility to access Athena from BI tools you’re already familiar with and analyze the data in Amazon S3 using SQL, without needing to create separate IAM users.

If your organization has single sign-on (SSO) into AWS enabled with the Okta service provider, you can use the latest version of the Athena JDBC driver, version 2.0.9, to use Okta as the authentication mechanism. For more information, see Using Okta in the Athena JDBC driver documentation.

If you have any questions or feedback, please leave a comment.

 


About the Author

Alapati Arun is a Cloud Support Engineer with AWS based out of Dallas. He focuses on supporting customers in using big data technologies. He enjoys travel and watching movies.

 

Measure Effectiveness of Virtual Training in Real Time with AWS AI Services

Post Syndicated from Rajeswari Malladi original https://aws.amazon.com/blogs/architecture/measure-effectiveness-of-virtual-training-in-real-time-with-aws-ai-services/

According to International Data Corporation (IDC), worldwide spending on digital transformation will reach $2.3 trillion in 2023. As organizations adopt digital transformation, training becomes an important aspect of this journey. Whether these are internal trainings to upskill an existing workforce or packaged content for commercial use, these trainings need to be efficient and cost effective. With the advent of education technology, it is common practice to deliver trainings via digital platforms. This makes them accessible to a larger population and is cost effective, but it is important that the trainings are interactive and effective. According to a recent article published by Forbes, immersive education and data-driven insights are among the top five Education Technology (EdTech) innovations. These are the key characteristics of creating an effective training experience.

An earlier blog series explored how to build a virtual trainer on AWS using Amazon Sumerian. This series illustrated how to easily build an immersive and highly engaging virtual training experience without needing additional devices or a complex virtual reality platform management. These trainings are easy to maintain and are cost effective.

In this blog post, we will further extend the architecture to gather real-time feedback about the virtual trainings and create data-driven insights to measure its effectiveness with the help of Amazon artificial intelligence (AI) services.

Architecture and its benefits

Virtual training on AWS and AI Services - Architecture

Virtual training on AWS and AI Services – Architecture

Consider a scenario where you are a vendor in the health care sector. You've developed a cutting-edge device, such as patient vital monitoring hardware, that goes through frequent software upgrades and is about to be rolled out across different U.S. hospitals. The nursing staff needs to be well trained before it can begin using the device. Let's take a look at an architecture to solve this problem. We first explain the architecture for building the training, and then we show how to measure its effectiveness.

At the core of the architecture is Amazon Sumerian. Sumerian is a managed service that lets you create and run 3D, Augmented Reality (AR), and Virtual Reality (VR) applications. Within Sumerian, real-life scenes from a hospital environment can be created by importing assets from the asset library. Scenes include one or more hosts: AI-driven animated characters with built-in animation, speech, and behavior. The hosts act as virtual trainers that interact with the nursing staff. The speech component assigns text to the virtual trainer for playback with Amazon Polly. Polly helps convert training content from Sumerian to lifelike speech in real time and ensures the nursing staff receives the latest content related to the equipment on which it's being trained.

The nursing staff accesses the training via web browsers on iOS or Android mobile devices or laptops, and authenticates using Amazon Cognito. Cognito is a service that lets you easily add user sign-up and authentication to your mobile and web apps. Sumerian then uses the Cognito identity pool to create temporary credentials to access AWS services.

The flow of the interactions within Sumerian is controlled using a visual state machine in the Sumerian editor. Within the editor, the dialogue component assigns an Amazon Lex chatbot to an entity, in this case the virtual trainer or host. Lex is a service for building conversational interfaces with voice and text. It provides you the ability to have interactive conversations with the nursing staff, understand its areas of interest, and deliver appropriate training material. This is an important aspect of the architecture where you can customize the training per users’ needs.

Lex has native interoperability with AWS Lambda, a serverless compute offering where you just write and run your code in Lambda functions. Lambda can be used to validate user inputs or apply any business logic, such as fetching the user selected training material from Amazon DynamoDB (or another database) in real time. This material is then delivered to Lex as a response to user queries.
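
A minimal Lambda sketch of that fulfillment pattern follows. It assumes a Lex V1-style event, a hypothetical DynamoDB table named TrainingMaterial keyed by topic, and a slot called Topic; none of these names come from the original architecture:

import boto3

# Hypothetical table holding training material, keyed by topic.
table = boto3.resource("dynamodb").Table("TrainingMaterial")

def lambda_handler(event, context):
    # Lex V1 passes the matched intent and its slots in the event payload.
    topic = event["currentIntent"]["slots"].get("Topic") or "general"
    item = table.get_item(Key={"topic": topic}).get("Item", {})

    # Return the fetched material to Lex as the bot's response.
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": item.get("content", "Sorry, I could not find material on that topic."),
            },
        }
    }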

You can extend the state machine within the Sumerian editor to introduce new interactive flows to collect user feedback. Amazon Lex collects user feedback, which is saved in Amazon Simple Storage Service (S3) and analyzed by Amazon Comprehend. Amazon Comprehend is a natural language processing service that uses AI to find meaning and insights/sentiments in text. Insights from user feedback are stored in S3, which is a highly scalable, durable, and highly available object storage.
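
As a sketch of how that analysis might look, the following code reads one piece of feedback from S3 and asks Amazon Comprehend for its sentiment; the bucket and key names are placeholders:

import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")

# Read a piece of user feedback collected by Amazon Lex and stored in S3 (placeholder names).
obj = s3.get_object(Bucket="training-feedback-bucket", Key="feedback/session-001.txt")
feedback = obj["Body"].read().decode("utf-8")

# Detect the overall sentiment of the feedback text.
result = comprehend.detect_sentiment(Text=feedback, LanguageCode="en")
print(result["Sentiment"], result["SentimentScore"])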

You can analyze the insights from user feedback using Amazon Athena, an interactive query service which analyzes data in S3 using standard SQL. You can then easily build visualizations using Amazon QuickSight.

By using this architecture, you not only deliver the virtual training to your nursing staff in an immersive environment created by Amazon Sumerian, but you can also gather the feedback interactively. You can gain insights from this feedback and iterate over it to make the training experience more effective.

Conclusion and next steps

In this blog post we reviewed the architecture to build interactive trainings and measure their effectiveness. The serverless nature of this architecture makes it cost effective, agile, and easy to manage, and you can apply it to a number of use cases. For example, an educational institution can develop training content designed for multiple learning levels and the training level can be adjusted in real time based on live interactions with the students. In the manufacturing scenario, you can build a digital twin of your process and train your resources to handle different scenarios with full interactions. You can integrate AWS services just like Lego blocks, and you can further expand this architecture to integrate with Amazon Kendra to build interactive FAQ or integrate with Amazon Comprehend Medical to build trainings for the healthcare industry. Happy building!

Build an AWS Well-Architected environment with the Analytics Lens

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/build-an-aws-well-architected-environment-with-the-analytics-lens/

Building a modern data platform on AWS enables you to collect data of all types, store it in a central, secure repository, and analyze it with purpose-built tools. Yet you may be unsure of how to get started and the impact of certain design decisions. To address the need for advice tailored to specific technology and application domains, AWS added the concept of Well-Architected Lenses in 2017. AWS is now happy to announce the Analytics Lens for the AWS Well-Architected Framework. This post provides an introduction to its purpose, topics covered, common scenarios, and services included.

The new Analytics Lens offers comprehensive guidance to make sure that your analytics applications are designed in accordance with AWS best practices. The goal is to give you a consistent way to design and evaluate cloud architectures, based on the following five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

The tool can help you assess the analytics workloads you have deployed in AWS by identifying potential risks and offering suggestions for improvements.

Using the Analytics Lens to address common requirements

The Analytics Lens models both the data architecture at the core of the analytics applications and the application behavior itself. These models are organized into the following six areas, which encompass the vast majority of analytics workloads deployed on AWS:

  1. Data ingestion
  2. Security and governance
  3. Catalog and search
  4. Central storage
  5. Processing and analytics
  6. User access

The following diagram illustrates these areas and their related AWS services.

There are a number of common scenarios where the Analytics Lens applies, such as the following:

  • Building a data lake as the foundation for your data and analytics initiatives
  • Efficient batch data processing at scale
  • Building a platform for streaming ingest and real-time event processing
  • Handling big data processing and streaming
  • Data-preparation operations

Whichever of these scenarios fits your needs, building to the principles of the Analytics Lens in the AWS Well-Architected Framework can help you implement best practices for success.

The Analytics Lens explains when and how to use the core services in the AWS analytics portfolio. These include Amazon Kinesis, Amazon Redshift, Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. It also explains how Amazon Simple Storage Service (Amazon S3) can serve as the storage for your data lake and how to integrate with relevant AWS security services. With reference architectures, best practices advice, and answers to common questions, the Analytics Lens can help you make the right design decisions.

Conclusion

Applying the lens to your existing architectures can validate the stability and efficiency of your design (or provide recommendations to address the gaps that are identified). AWS is committed to the Analytics Lens as a living tool; as the analytics landscape evolves and new AWS services come on line, we’ll update the Analytics Lens appropriately. Our mission will always be to help you design and deploy well-architected applications.

For more information about building your own Well-Architected environment using the Analytics Lens, see the Analytics Lens whitepaper.

Special thanks to the following individuals who contributed to building this resource, among many others who helped with review and implementation: Radhika Ravirala, Laith Al-Saadoon, Wallace Printz, Ujjwal Ratan, and Neil Mukerje.

Are there questions you’d like to see answered in the tool? Share your thoughts and questions in the comments.

 


About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at Amazon Web Services. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.

 

 


Radhika Ravirala is a specialist solutions architect at Amazon Web Services, where she helps customers craft distributed analytics applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley.

New – Announcing Amazon AppFlow

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/new-announcing-amazon-appflow/

Software as a service (SaaS) applications are becoming increasingly important to our customers, and adoption is growing rapidly. While there are many benefits to this way of consuming software, one challenge is that data is now living in lots of different places. To get meaningful insights from this data, we need to have a way to analyze it, and that can be hard when our data is spread out across multiple data islands.

Developers spend huge amounts of time writing custom integrations so they can pass data between SaaS applications and AWS services so that it can be analyzed; these integrations can be expensive and often take months to complete. If data requirements change, costly and complicated modifications have to be made to the integrations. Companies that don't have the luxury of engineering resources might find themselves manually importing and exporting data from applications, which is time-consuming, risks data leakage, and has the potential to introduce human error.

Today it is my pleasure to announce a new service called Amazon AppFlow that will solve this issue. Amazon AppFlow allows you to automate the data flows between AWS services and SaaS applications such as Salesforce, Zendesk, and ServiceNow. SaaS application administrators, business analysts, and BI specialists can quickly implement most of the integrations they need without waiting months for IT to finish integration projects.

As well as allowing data to flow in from SaaS applications to AWS services, it’s also capable of sending data from AWS services to SaaS applications. Security is our top priority at AWS, and so all of the data is encrypted while in motion. Some of the SaaS applications have integrated with AWS PrivateLink; this adds an extra layer of security and privacy. When the data flows between the SaaS application and AWS with AWS PrivateLink, the traffic stays on the Amazon network rather than using the public internet. If the SaaS application supports it, Amazon AppFlow automatically takes care of this connection, making private data transfer easy for everyone and minimizing the threat from internet-based attacks and the risk of sensitive data leakage.

You can run the data transfer on a schedule, in response to a business event, or on demand, giving you speed and flexibility with your data sharing.

To show you the power of this new service, I thought it would be interesting to show you how to set up a simple flow.

I run a Slack workspace in the United Kingdom and Ireland for web community organizers. Since Slack is one of the supported SaaS applications in Amazon AppFlow, I thought it would be nice to try and import some of the conversation data into S3. Once it was in S3 I could then start to analyze it using Amazon Athena and then ultimately create a visualization using Amazon QuickSight.

To get started, I go to the Amazon AppFlow console and click the Create Flow button.

In the next step, I enter the Flow name and a Flow description. There are also some options for data encryption. By default, all data is encrypted in transit, and in the case of S3 it’s also encrypted at rest. You have the option to supply your own encryption key, but for this demo, I’m just going to use the key in my account that is used by default.

On this step, you are also given the option to enter Tags for the resource. I have been getting into the habit of tagging demo infrastructure in my account as Demo, which makes it easier for me to know which resources I can delete.

On the next step, I select the source of my data. I pick Slack and go through the wizard to establish a connection with my Slack workspace. I also get to choose what data I want to import from my Slack workspace. I select the Conversations object in the general Slack channel. This imports any messages posted to the general channel and sends them to the destination I configure next.

There are a few destinations that I can pick, but to keep things simple, I ask for the data to be sent to an S3 bucket. I also set the frequency that I want to fetch the data on this step. I want the data to be retrieved every hour, so I select Run the flow on schedule and make the necessary configurations. Slack can be triggered on demand or on schedule; some other sources can be triggered by specific events, such as converting a lead in Salesforce.

The next step is to map the data fields. I am just going to go with the defaults, but you could customize this and combine fields or take only the specific fields required for analysis.

Now the flow has been created, and I have activated it; it runs automatically every hour, adding new data to my S3 bucket.

I won’t go into the specifics of Amazon Athena or Amazon QuickSight, but I used both of these AWS services to take my data stored in S3 and produce a word cloud of the most common words that are used in my Slack channel.

The cool thing about Athena is that you can run SQL queries directly over the encrypted data in S3 without needing any additional data warehouse. You can see the results in the image below. I could now easily share this as a dashboard with anyone in my organization.

Amazon AppFlow is launching today with support for S3 and 13 SaaS applications as sources of data, and S3, Amazon Redshift, Salesforce, and Snowflake as destinations, and you will see us add hundreds more as the service develops.

The service automatically scales up or down to meet the demands you place on it. It also allows you to transfer up to 100 GB in a single flow, which means you don’t need to break data down into batches. You can trust Amazon AppFlow with your most valuable data, as we have architected it to be highly available and resilient.

Amazon AppFlow is available from today in US East (Northern Virginia), US East (Ohio), US West (Northern California), US West (Oregon), Canada (Central), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Seoul), Asia Pacific (Mumbai), Europe (Paris), Europe (Ireland), Europe (Frankfurt), Europe (London), and South America (São Paulo), with more regions to come.

Happy Data Flowing

— Martin

Serving Billions of Ads in Just 100 ms Using Amazon Elasticache for Redis

Post Syndicated from Rodrigo Asensio original https://aws.amazon.com/blogs/architecture/serving-billions-of-ads-with-amazon-elasticache-for-redis/

This post was co-written with Lucas Ceballos, CTO of Smadex

Introduction

Showing ads may seem to be a simple task, but it’s not. Showing the right ad to the right user is an incredibly complex challenge that involves multiple disciplines such as artificial intelligence, data science, and software engineering. Doing it one million times per second with a 100-ms constraint is even harder.

In the ad-tech business, speed and infrastructure costs are the keys to success. The less the final user waits for an ad, the higher the probability of that user clicking on the ad. Doing that while keeping infrastructure costs under control is crucial for business profitability.

About Smadex

Smadex is the leading mobile-first programmatic advertising platform specifically built to deliver best user acquisition performance and complete transparency.

Its state-of-the-art demand-side platform (DSP) technology provides advertisers with the tools they need to achieve their goals and ROI, with measurable results from web forms, post-app install events, store visits, and sales.

Smadex advertising architecture

What does showing ads look like under the hood? At Smadex, our technology works based on the OpenRTB (Real-Time Bidding) protocol.

RTB is a means by which advertising inventory is bought and sold on a per-impression basis, via programmatic instantaneous auction, which is similar to financial markets.

To show ads, we participate in auctions deciding in real time which ad to show and how much to bid trying to optimize the cost of every impression.

High level diagram

  1. The final user browses the publisher’s website or app.
  2. Ad-exchange is called to start a new auction.
  3. Smadex receives the bid request and has to decide which ad to show and how much to offer in just 100 ms (and this is happening one million times per second).
  4. If Smadex won the auction, the ad must be sent and rendered on the publisher’s website or app.
  5. In the end, the user interacts with the ad, sending new requests and events to the Smadex platform.

Flow of data

As you can see in the previous diagram, showing ads is just one part of the challenge. After the ad is shown, the final user interacts with it in multiple ways, such as clicking it, installing an application, subscribing to a service, etc. This happens during a determined period that we call the “attribution window.” All of those interactions must be tracked and linked to the original bid transaction (using the request_id parameter).

Doing this is complicated: billions of bid transactions must be stored and available so that they can be quickly accessed every time the user interacts with the ad. The longer we store the transactions, the longer we can “wait” for an interaction to take place, and the better for our business and our clients, too.

Detailed diagram

Challenge #1: Cost

The challenge is: What kind of database can store billions of records per day, with at least a 30-day retention capacity (attribution window), be accessed by key-value, and all by spending as little as possible?

The answer is…none! Based on our research, all the available options that met the technical requirements were way out of our budget.

So…how to solve it? Here is where creativity and the combination of different AWS services come into play.

We started to analyze the time dispersion of the events trying to find some clues. The interesting thing we spotted was that 90% of what we call “post-bid events” (impression, click, install, etc.) happened within one hour after the auction took place.

That means that we can process 90% of post-bid events by storing just one hour of bids.

Under our current workload, in one hour we participate in approximately 3.7 billion auctions generating 100 million bid records of an average 600 bytes each. This adds up to 55 gigabytes per hour, an easier amount of data to process.

Instead of thinking about one single database to store all the bid requests, we decided to split bids into two different categories:

  • Hot Bid: A request that took place within the last hour (small amount and frequently accessed)
  • Cold Bid: A request that took place more than one hour ago (huge amount and infrequently accessed)

Amazon ElastiCache for Redis is the best option to store 55 GB of data in memory, which gives us the ability to query in a key-value way with the lowest possible latency.

Hot Bids flow

Hot Bids flow diagram

  1. Every new bid is a hot bid by definition so it’s going to be stored in the hot bids Redis cluster.
  2. At the moment of the user interaction with the ad, the Smadex tracker component receives an HTTPS notification, including the bid request UUID that originated it.
  3. Based on the date of occurrence extracted from the received UUID, the tracker component can determine if it’s looking for a hot bid or not. If it’s a hot bid, the tracker reads it directly from Redis performing a key-value lookup query.
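
A simplified sketch of this hot-bid store and lookup, using the redis-py client against an ElastiCache for Redis endpoint (the endpoint name is a placeholder), might look like this:

import json
import redis

# Connect to the ElastiCache for Redis cluster endpoint (placeholder host name).
r = redis.Redis(host="hot-bids.xxxxxx.cache.amazonaws.com", port=6379)

def store_hot_bid(request_id, bid):
    # Keep each bid for one hour only; after that, lookups fall through to the cold path.
    r.setex(request_id, 3600, json.dumps(bid))

def lookup_hot_bid(request_id):
    # Key-value lookup performed by the tracker when a post-bid event arrives.
    raw = r.get(request_id)
    return json.loads(raw) if raw else None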

It’s been easy so far but what to do with the other 29 days and 23 hours we need to store?

Challenge #2: Performance

As we previously mentioned, cold bids are a huge number of infrequently accessed records, with only 10% of post-bid events pointing to them. That sounds like a good use case for an inexpensive and slower data store like Amazon S3.

Thanks to the S3 low-cost storage prices combined with the ability to query S3 objects directly using Amazon Athena, we were able to optimize our costs by storing and querying cold bids by implementing a serverless architecture.

Cold Bids Flow

Cold Bids flow diagram

  1. Incoming bids are buffered by Fluentd and flushed to S3 every minute in JSON format. Every single file flushed to S3 contains all the bids processed by a specific EC2 instance for one minute.
  2. An AWS Lambda function is automatically triggered on every new PutObject event from S3. This function transforms the JSON records to Parquet format and saves them back to the S3 bucket, this time into a specific partition folder based on the file creation timestamp.
  3. As seen in the hot bids flow, the tracker component determines whether it’s looking for a hot or a cold bid based on the timestamp extracted from the request UUID. In this case, the cold bid is retrieved by running an Amazon Athena lookup query that leverages partitions and the Parquet format to reduce latency and the amount of data scanned as much as possible.
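
A minimal sketch of the transformation Lambda in step 2 might look like the following. It assumes the AWS SDK for pandas (awswrangler) is packaged as a Lambda layer and that each flushed file contains newline-delimited JSON bid records; the bucket layout and the dt partition column are placeholders:

import awswrangler as wr

def lambda_handler(event, context):
    # Triggered by S3 PutObject events for the JSON files flushed by Fluentd.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the newline-delimited JSON bids and derive a partition value
        # from the event time (file creation timestamp), for example "2020-05-09".
        df = wr.s3.read_json(path=f"s3://{bucket}/{key}", lines=True)
        df["dt"] = record["eventTime"][:10]

        # Write back as partitioned Parquet for the Athena lookup queries.
        wr.s3.to_parquet(
            df=df,
            path=f"s3://{bucket}/cold-bids/",
            dataset=True,
            partition_cols=["dt"],
        )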

Conclusion

Thanks to this combined approach using different technologies and a variety of AWS services we were able to extend our attribution window from 30 to 90 days while reducing the infrastructure costs by 45%.

 

 

Query, visualize, and forecast TruFactor web session intelligence with AWS Data Exchange

Post Syndicated from Jay Park original https://aws.amazon.com/blogs/big-data/query-visualize-and-forecast-trufactor-web-session-intelligence-with-aws-data-exchange/

Given the sheer volume and variety of data available, finding the right data set to gain business insights can be a challenge. You can improve your business by having access to a central repository of various data sets to query, visualize, and forecast. With AWS Data Exchange, finding the right data set has become much simpler. As an example, you can use data sets on web session visitation and demographics to help you understand which demographic groups visit your website most frequently. You can then improve your business through machine learning (ML) models and visitation forecasts.

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. After you subscribe to a data product within AWS Data Exchange, you can use the AWS Data Exchange API, AWS CLI, or the AWS Management Console to load data into Amazon S3 directly. You can then analyze the imported data with a wide variety of AWS services, ranging from analytics to machine learning.

This post showcases TruFactor Intelligence-as-a-Service data on AWS Data Exchange. TruFactor’s anonymization platform and proprietary AI ingest, filter, and transform more than 85 billion high-quality raw signals daily from wireless carriers, OEMs, and mobile apps into a unified phygital consumer graph across physical and digital dimensions. TruFactor intelligence is application-ready for use within any AWS analytics or ML service to power your models and applications running on AWS, with no additional processing required. Common use cases include the following:

  • Consumer segmentation – Web intelligence on internet browsing behavior in the US provides a complete view of the consumer, including interests, opinions, values, digital behavior, and sentiment, to inform segmentation of your customers and those of your competitors.
  • Customer acquisition or churn campaigns – Internet browsing behavior can identify affinity properties for new prospects as well as switching to competitors’ websites.

This walkthrough uses TruFactor’s Daily Mobile Web Session Index and Daily Demographics by Mobile Web Sessions data sets, which are both available for free subscription through the AWS Data Exchange console. While there are commercial data sets available for purchase in AWS Data Exchange, this post uses trial data sets to showcase the breadth and depth of analytics possible with TruFactor’s intelligence.

This TruFactor intelligence is aggregated on over 3 billion records from telco carrier networks and mobile apps per day, originating from approximately 30 million consistent users, distilled into session-level information that provides a complete view of user digital interests. The accuracy, breadth of data provided, and the persistency of the panel deliver a unified view of consumers that can inform insights or power analytic models or applications on AWS.

These two data sets have applications across verticals such as retail, financial services, and advertising. Common use cases include creating detailed customer segmentation (for example, full DNA maps of consumers based on visits to specific web HTTP hosts), identifying affinity properties, and estimating demand for apps or services. This intelligence is also ideal for identifying trends and changes over time.

Solution overview

The following diagram illustrates the architecture of the solution.

The workflow is comprised of the following steps:

  1. Subscribe to a data set from AWS Data Exchange and export to Amazon S3
  2. Run an AWS Glue crawler to load product data
  3. Perform queries with Amazon Athena
  4. Visualize the queries and tables with Amazon QuickSight
  5. Run an ETL job with AWS Glue
  6. Create a time series forecast with Amazon Forecast
  7. Visualize the forecasted data with Amazon QuickSight

This post looks at the demographic distributions across various websites and how to use ML to forecast website visitation.

Walkthrough overview

The walkthrough includes the following steps:

  1. Subscribe to a TruFactor data set from the AWS Data Exchange console and export the data set to Amazon S3
  2. Use an AWS Glue crawler to load the product data into an AWS Glue Data Catalog
  3. Use Amazon Athena for SQL querying
  4. Visualize the query views and tables with Amazon QuickSight
  5. Use AWS Glue jobs to extract, transform, and load your data for forecasting with Amazon Forecast
  6. Use Amazon Forecast to create a time series forecast of the transformed data
  7. Visualize the forecasted web visitation data with Amazon QuickSight

You do not have to perform additional processing or manipulation of the TruFactor intelligence for this walkthrough.

The data sets

The TruFactor data sets this post uses are in Parquet format and snappy compression. The following section provides additional details and schema for each data set.

TruFactor Daily Mobile Web Session Index (US – Nationwide) — Trial

The TruFactor Daily Mobile Web Session Index (US – Nationwide) — Trial data set provides aggregate information per HTTP host as a view of the internet browsing behavior in the US. TruFactor generates the data from high-quality packet layer data sourced from mobile carriers that includes the mobile internet traffic originating from a user’s device. TruFactor derives the projected counts from observed counts that are filtered for exclusion and anonymized to make sure users cannot be re-identified. It extrapolates values from US Census data using a proprietary algorithm. For the avoidance of doubt, this data set does not include user-level data.

The following screenshot shows the schema for the mobile web session data set by HTTP host, session time, MB transferred, number of events, sessions, users, and dates.

TruFactor Daily Demographics by Mobile Web Session (US) — Trial

The TruFactor Daily Demographics by Mobile Web Session (US) — Trial data set includes aggregate demographics: a projected distribution of users per HTTP host as a view of the internet browsing behavior in the US. TruFactor generates the data from high-quality packet layer data sourced from mobile carriers that includes the mobile internet traffic originating from a user’s device. TruFactor derives the distribution from observed counts that are filtered for exclusion and anonymized to make sure users cannot be re-identified. It extrapolates values from US Census data using a proprietary algorithm. Demographics include gender, age range, ethnicity, and income range.

The following screenshot shows the partial schema for the demographics by web session data set. The full schema includes the following attributes: HTTP host, age ranges, genders, ethnicity, income ranges, and date.

Prerequisites

To complete this walkthrough successfully, you must have the following resources:

  • An AWS account.
  • Familiarity with AWS core services and concepts.
  • The ability to launch new resources in your account. Some resources may not be eligible for Free Tier usage and might incur costs.
  • Subscription to TruFactor’s Daily Mobile Web Session Index (US – Nationwide) – Trial and Daily Demographics by Mobile Web Session (US) – Trial data sets. For instructions on subscribing to a data set on AWS Data Exchange, see AWS Data Exchange – Find, Subscribe To, and Use Data Products.

Using AWS Data Exchange, Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight

This section examines the key demographics of visitors to the top seven e-commerce websites. This information can help you understand which demographic groups are visiting your website most frequently and also help you target ads and cater to certain demographic groups. You use AWS Glue crawlers to crawl your data sets in Amazon S3, populate your AWS Glue Data Catalog, query the AWS Glue Data Catalog using Amazon Athena, and use Amazon QuickSight to visualize the queries.

Step 1: Exporting the data from AWS Data Exchange to Amazon S3

To export your TruFactor data set subscriptions into an Amazon S3 bucket, complete the following steps:

  1. Create an Amazon S3 bucket in your working account. For the purposes of our demo, we have named our S3 bucket trufactor-data-exchange-bucket.
  2. Create two folders within the S3 bucket: web_sess and demo_by_web_sess.

This post uses a trial data set with a sample of 14 days. A paid subscription to TruFactor’s Web Sessions data on AWS Data Exchange includes 6 months of historical data, which refreshes daily.

The following screenshot shows the two folders within the S3 bucket. You are now ready to export the data sets.

  1. On the AWS Data Exchange console, under Subscriptions, locate TruFactor Daily Mobile Web Sessions Index (US – Nationwide) – Trial.
  2. Under Revisions, choose the most recent Revision ID.
  3. Choose all assets except the manifest.json files.
  4. Choose Export to Amazon S3.
  5. In the window that opens, choose the S3 bucket and folder to export the product data into.
    • Export all the assets into the S3 bucket’s web_sess folder.
  6. Repeat the previous steps for the TruFactor Daily Demographics by Mobile Web Sessions (US) – Trial data set, with the following change:
    • Export the assets into the demo_by_web_sess folder in your S3 bucket.
  7. Check to make sure you successfully imported the TruFactor data sets in the Overview. The following screenshot shows that the data sets are partitioned into folders by date. Each folder contains Parquet files of web session data for each day.

Step 2: Populating your AWS Glue Data Catalog with the TruFactor data sets

Now that you have successfully exported the TruFactor data sets into an Amazon S3 bucket, you create and run an AWS Glue crawler to crawl your Amazon S3 bucket and populate the AWS Glue Data Catalog. Complete the following steps:

  1. On the AWS Glue console, under Data Catalog, choose Crawlers.
  2. Choose Add crawler.
  3. For Crawler name, enter a name; for example, trufactor-data-exchange-crawler.
  4. For Crawler source type, choose Data stores.
  5. Choose Next.
  6. For Choose a data store, choose S3.
  7. For Crawl data in, select Specified path in my account.
  8. For Include path, enter the path for the web_sess data set folder. The crawler points to the following path: s3://<trufactor-data-exchange-bucket>/web_sess.
  9. Choose Next.
  10. Select Yes to Add another data store.
  11. Choose Next.
  12. For Include path, enter the path for the demo_by_web_sess data set folder. The crawler points to the following path: s3://<trufactor-data-exchange-bucket>/demo_by_web_sess.
  13. Choose Next.
  14. In the Choose an IAM role section, select Create an IAM role. This is the role that the AWS Glue crawler and AWS Glue jobs use to access the Amazon S3 bucket and its content.
  15. For IAM role, enter the suffix demo-data-exchange.
  16. Choose Next.
  17. In the schedule section, leave the Frequency with the default Run on Demand.
  18. Choose Next.
  19. In the Output section, choose Add database.
  20. Enter a name for the database; for example, trufactor-db.
  21. Choose Next, then choose Finish. This database contains the tables that the crawler discovers and populates. With these data sets separated into different tables, you join and relationalize the data.
  1. In the Review all steps section, review the crawler settings and choose Finish.
  2. Under Data Catalog, choose Crawlers.
  3. Select the crawler you just created.
  4. Choose Run crawler. The AWS Glue crawler crawls the data sources and populates your AWS Glue Data Catalog. This process can take up to a few minutes. When the crawler is finished, you can see two tables added to your crawler details. See the following screenshot. You can now view your new tables.
  1. Under Databases, choose Tables.
  2. Choose your database.
  3. Choose View the tables. The table names correspond to the Amazon S3 folder directory you used to point your AWS Glue crawler. See the following screenshot.
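If you would rather create the crawler programmatically, the following boto3 sketch is a rough equivalent of the console steps above. It assumes the crawler, IAM role, and database names used in this walkthrough, and that the role already exists with read access to the bucket.

import boto3

glue = boto3.client("glue")
bucket = "trufactor-data-exchange-bucket"  # example bucket name from this post

glue.create_crawler(
    Name="trufactor-data-exchange-crawler",
    Role="AWSGlueServiceRole-demo-data-exchange",  # existing role with access to the bucket
    DatabaseName="trufactor-db",
    Targets={
        "S3Targets": [
            {"Path": f"s3://{bucket}/web_sess"},
            {"Path": f"s3://{bucket}/demo_by_web_sess"},
        ]
    },
)
glue.start_crawler(Name="trufactor-data-exchange-crawler")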

Step 3: Querying the data using Amazon Athena

After you populate the AWS Glue Data Catalog with TruFactor’s Mobile Web Session and Demographics data, you can use Amazon Athena to run SQL queries and create views for visualization. Complete the following steps:

  1. On the Amazon Athena console, choose Query Editor.
  2. On the Database drop-down menu, choose the database you created.
  3. To preview one of the tables in Amazon Athena, choose Preview table.
    In the Results section, you should see 10 records from the web_sess table. See the following screenshot. In this next step, you run a query that creates a view of the Web Session Index and Demographics data across a group of e-commerce HTTP hosts. This is broken down by the percentage of users categorized by age and gender, number of users, MB transferred, and number of sessions, ordered by date.
  4. Run the following SQL query in Amazon Athena:
    CREATE OR REPLACE VIEW e_commerce_web_sess_data AS 
    SELECT
      "date_parse"("a"."partition_0", '%Y%m%d') "date",
      "a"."http_host",
      "a"."users",
      "a"."mb_transferred",
      "a"."number_of_sessions",
      "b"."18_to_25",
      "b"."26_to_35",
      "b"."36_to_45",
      "b"."46_to_55",
      "b"."56_to_65",
      "b"."66_to_75",
      "b"."76_plus",
      "b"."male",
      "b"."female"
    FROM  
      ((
       SELECT
         "partition_0",
         "http_host",
         "users",
         "mb_transferred",
         "number_of_sessions"
       FROM
         "trufactor-db"."web_sess"
       WHERE ("http_host" IN ('www.amazon.com', 'www.walmart.com', 'www.ebay.com', 'www.aliexpress.com', 'www.etsy.com', 'www.rakuten.com', 'www.craigslist.com'))
    )  a
    LEFT JOIN (
       SELECT
         "http_host" "http_host_2",
         "partition_0" "partition_2",
         "age_ranges"."18_to_25",
         "age_ranges"."26_to_35",
         "age_ranges"."36_to_45",
         "age_ranges"."46_to_55",
         "age_ranges"."56_to_65",
         "age_ranges"."66_to_75",
         "age_ranges"."76_plus",
         "genders"."male",
         "genders"."female"
       FROM
         "trufactor-db"."demo_by_web_sess"
       WHERE ("http_host" IN ('www.amazon.com', 'www.walmart.com', 'www.ebay.com', 'www.aliexpress.com', 'www.etsy.com', 'www.rakuten.com', 'www.craigslist.com'))
    )  b ON (("a"."http_host" = "b"."http_host_2") AND ("a"."partition_0" = "b"."partition_2")))
    ORDER BY "date" ASC

  5. After you create the view, you can preview it by repeating the above steps for previewing a table. The following screenshot shows the results, which include the number of users, user percentages by age group and gender, and a list of e-commerce hosts listed by date.

Step 4: Visualizing with Amazon QuickSight

After you query your data sets in Amazon Athena, you can use Amazon QuickSight to visualize your results. You must first grant Amazon QuickSight access to the Amazon S3 bucket that holds the TruFactor data sets, which you can do through the Manage QuickSight setting on the Amazon QuickSight console. After you grant access to the Amazon S3 bucket, you visualize the tables and queries with Amazon QuickSight. Complete the following steps:

  1. In the Amazon QuickSight console, choose New Analysis.
  2. Choose New data set.
  3. Choose Athena as the data source.
  4. For Data source name, enter trufactor-data-exchange-source.
  5. From the drop-down menu, choose the database and view you created.
  6. Choose Directly query your data.
  7. Choose Visualize. Because TruFactor intelligence is application-ready, you can gain immediate insights by using Amazon Athena to query and Amazon QuickSight to visualize. This post includes visualizations of the data set for the first two weeks of October 2019. The following graph visualizes the number of users on different HTTP hosts. The following pie charts further filter the HTTP hosts by age range. The following bar chart offers another visualization of users by age range. You could add other fields such as income range, ethnicity, and gender.

Running AWS Glue Jobs and Amazon Forecast

This section discusses how to use AWS Glue jobs to query and export your data set for forecasting with Amazon Forecast. This walkthrough examines users’ visitation over 14 days across the top 50 HTTP hosts ranked by visitation. From there, you forecast users’ visitation for these HTTP hosts for the next three days.

Step 1: Creating and running an AWS Glue job

To create and run your AWS Glue job, complete the following steps:

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Add job.
  3. For Name, enter a name for the AWS Glue job; for example, demo-glue-job.
  4. For Type and Glue version, keep the default values.
  5. For This job runs, select A new script to be authored by you.
  6. In the Security configuration, script libraries, and job parameters (optional) section, set the Maximum capacity cluster size to 2. This reduces the cost of running the AWS Glue job. By default, the cluster size is set to 10 Data Processing Units (DPU).
  7. Choose Next.
  8. In the Connections section, keep the default values.
  9. Choose Save job and edit script.
  10. Enter the following code in the script section, and replace YOUR_BUCKET_NAME on line 42 with the name of your bucket.
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.job import Job
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    
    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    db_name = "trufactor-db"
    tbl_name = "web_sess"
    
    web_sess_dyf = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_name, transformation_ctx = "web_sess_dyf")
    web_sess_df = web_sess_dyf.toDF()
    web_sess_df.createOrReplaceTempView("webSessionTable")
    web_sess_sql_df = spark.sql("""
    SELECT to_date(partition_0, 'yyyyMMdd') AS date,
             http_host,
             users
    FROM 
        (SELECT partition_0,
             http_host,
             users,
             row_number()
            OVER ( PARTITION By partition_0
        ORDER BY users DESC ) AS rn
        FROM webSessionTable )
    WHERE rn<=50
    ORDER BY date""")
    
    web_sess_sql_df.coalesce(1).write.format("csv").option("header","false").save("s3://YOUR_BUCKET_NAME/amazon_forecast_demo/dataset/sampleset")
    job.commit()

    This code queries the top 50 HTTP hosts ranked by users’ visitation during the first half of October and returns the date, HTTP host, and users columns. The query results are uploaded to your Amazon S3 bucket in CSV format (Amazon Forecast requires CSV input).

  11. Choose Save and close the AWS Glue job screen. Before you can run the AWS Glue job, you need to modify the IAM role associated with AWS Glue. Currently, the IAM role only has permission to get and put objects in the directories you specified earlier. You need to update the IAM policy to allow permission to get and put objects in all subdirectories of the Amazon S3 bucket.
  12. On the IAM console, choose the role you used for this walkthrough: AWSGlueServiceRole-demo-data-exchange.
  13. In the Summary section for the IAM role, on the Permissions tab, choose the IAM policy associated with the Managed policy.
  14. Choose Edit policy.
  15. Change the view from Visual editor to JSON.
  16. Within this JSON object, under Resource, add another resource into the list of values. The following code is the updated IAM policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::trufactor-data-exchange-bucket/web_sess*",
                    "arn:aws:s3:::trufactor-data-exchange-bucket/demo_by_web_sess*",
                    "arn:aws:s3:::trufactor-data-exchange-bucket/*"
                ]
            }
        ]
    }

  17. Choose Review policy and Save changes.
  18. On the AWS Glue console, under ETL, choose Jobs. Select the job you created earlier.
  19. From the Action drop-down menu, choose Run job. On the History tab, you can see when the status changes to Succeeded. See the following screenshot. This job can take 15–20 minutes to complete.

Step 2: Creating a dataset group, training a predictor, and creating forecasts in Amazon Forecast

To create your dataset group, train a predictor, and create forecasts, complete the following steps:

  1. On the Amazon Forecast console, choose View dataset groups.
  2. Choose Create dataset group.
  3. For Dataset group name, enter a name; for example, users_visitation_sample_dataset_group.
  4. For Forecasting domain, choose Web traffic.
  5. Choose Next.
  6. On the Create target time series dataset page, for Dataset name, enter the name of your dataset; for example, users_visitation_sample_dataset.
  7. For Frequency of your data, choose 1 day.
  8. For Data schema, update the data schema JSON object with the following code:
    {
      "Attributes":[
        {
          "AttributeName": "timestamp",
          "AttributeType": "timestamp"
        },
        {
          "AttributeName": "item_id",
          "AttributeType": "string"
        },
        {
          "AttributeName": "value",
          "AttributeType": "float"
        }
      ]
    }

  9. Choose Next.
  10. On the Import target time series data page, for Dataset import name, enter your dataset name; for example, users_visitation_sample_dataset_import.
  11. For Timestamp format, enter yyyy-MM-dd.
  12. For IAM Role, create a new role and grant Amazon Forecast access to the S3 bucket that you are using for this demo.
  13. For Data Location, use the S3 path that you exported your CSV file to after the AWS Glue job: s3://<trufactor-data-exchange-bucket>/amazon_forecast_demo/dataset/sampleset.
  14. Review the settings for import target time series data and choose Start import. The process of importing the data can take approximately 10 minutes. When the status changes to Active, you can begin training a predictor.
  1. On the Dashboard page, choose Start next to Predictor training.
  2. On the Train predictor page, for Predictor name, enter a name for the predictor; for example, users_visitation_sample_dataset_predictor.
  3. For Forecast horizon, choose 3.
  4. For Forecast frequency, choose day.
  5. For Algorithm selection, select Manual. If you use the other algorithm option, AutoML, you allow Amazon Forecast to choose the right algorithm based on a pre-defined objective function, which is not necessary for this walkthrough.
  6. For Algorithm, choose Deep_AR_Plus (you use deep learning to forecast users’ visitation across 50 HTTP hosts).
  7. Leave all other options with the default values.
  8. Review the settings and choose Train predictor. The predictor training process can take 20–30 minutes. When the training completes, the status changes to Active. To evaluate the predictor’s (the ML model’s) accuracy, Amazon Forecast splits the input time series data into two data sets: training and test. This process tests a predictive model on historical data and is called backtesting. When it splits the input time series data, it maintains the data’s order, which is crucial for time series data. After training on the dataset, Amazon Forecast calculates the root mean square error (RMSE) and weighted quantile losses to determine how well the predictor performed. For more detailed information about backtesting and predictor metrics, see Evaluating Predictor Accuracy. When the predictor is finished training, you can create a forecast.
  9. On the Dashboard page, under Generate forecasts, choose Start.
  10. For Forecast name, enter a forecast name; for example, users_visitation_sample_forecast.
  11. For Predictor, choose your trained predictor.
  12. For Forecast types, you can enter any values between 0.01 and 0.99 and the mean. These quantiles express the probability that the actual value falls below the forecasted value. This post enters .50, .90, .99, mean.
  13. Choose Create a forecast. The forecast creation process can take 15–20 minutes.
  14. When the forecast is complete, choose Forecasts.
    You should see a single forecast. See the following screenshot.
    You can now export the generated forecast to a new folder within your existing Amazon S3 bucket for visualization with Amazon QuickSight.
  1. Choose the newly generated forecast.
  2. Under Exports, choose Create forecast export.
  3. For Export name, enter a name for the export; for example, users_visitation_sample_forecast_export.
  4. For Generated forecast, choose users_visitation_sample_forecast.
  5. For IAM Role, choose the role you created earlier.
  6. For S3 forecast export location, enter the S3 path to store the forecasts: s3://<trufactor-data-exchange-bucket>/amazon_forecast_demo/forecasts/sampleset.
  7. Choose Create forecast export. The exporting process can take up to 5 minutes. Alternatively, you can visualize the user visitation forecasts for the 50 HTTP hosts directly through the Amazon Forecast console or Query API.
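Amazon Forecast exposes the same operations through its API if you want to script this step. The following is a minimal sketch of the export call only; the forecast ARN, role ARN, and account ID are placeholders to replace with your own values.

import boto3

forecast = boto3.client("forecast")

# Placeholder ARNs; copy the real values from the Amazon Forecast and IAM consoles
forecast_arn = "arn:aws:forecast:us-east-1:111122223333:forecast/users_visitation_sample_forecast"
export_role_arn = "arn:aws:iam::111122223333:role/ForecastDemoRole"

forecast.create_forecast_export_job(
    ForecastExportJobName="users_visitation_sample_forecast_export",
    ForecastArn=forecast_arn,
    Destination={
        "S3Config": {
            "Path": "s3://trufactor-data-exchange-bucket/amazon_forecast_demo/forecasts/sampleset",
            "RoleArn": export_role_arn,
        }
    },
)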

Step 3: Querying a view using Amazon Athena and downloading the forecast file

Before you visualize users’ visitation forecast data, create a view in Amazon Athena for the top 50 HTTP hosts ranked by users’ visitation over 14 days. Complete the following steps:

  1. Run the following query in Amazon Athena:
    CREATE OR REPLACE VIEW "top_50_users" AS
    SELECT date_format(date_parse(partition_0,
             '%Y%m%d'),'%Y-%m-%d') AS "date", http_host, users
    FROM 
        (SELECT partition_0,
             http_host,
             users,
             row_number()
            OVER (PARTITION By partition_0
        ORDER BY  users DESC ) AS rn
        FROM "trufactor-db"."web_sess")
    WHERE rn<=50
    ORDER BY date

    The code queries the top 50 HTTP hosts ranked by users’ visitation sorted by date.

  2. In the Amazon S3 console, navigate to the S3 bucket and directory holding the files: s3://<trufactor-data-exchange-bucket>/amazon_forecast_demo/forecasts/sampleset. The following screenshot shows three different files inside the folder.
  1. Download the CSV file.
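If you prefer not to download the file through the console, the following sketch pulls every CSV object under the export prefix into the current directory:

import os
import boto3

s3 = boto3.client("s3")
bucket = "trufactor-data-exchange-bucket"  # example bucket name from this post
prefix = "amazon_forecast_demo/forecasts/sampleset"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith(".csv"):
            s3.download_file(bucket, key, os.path.basename(key))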

Step 4: Visualizing in Amazon QuickSight

To visualize the data in Amazon QuickSight, complete the following steps:

  1. On the Amazon QuickSight console, choose Manage data.
  2. Choose New data set.
  3. Choose Upload a file.
  4. Upload the CSV file that you downloaded.
  5. On the Confirm file upload settings page, choose Next.
  6. Choose Visualize.
  7. Return to the Amazon QuickSight console and choose Manage data.
  8. Choose New data set for the top 50 HTTP hosts view you queried earlier.
  9. On the Create a Data set page, find the data source you created earlier: trufactor-data-exchange-source.
  10. From the drop-down list, choose the database and view you created.
  11. Choose Directly query your data.
  12. Choose Visualize.
  13. On the new Amazon QuickSight analysis page, choose the pencil icon next to Data set.
  14. Choose Add data set.
  15. Choose the CSV file you uploaded.

You now have a single Amazon QuickSight analysis with multiple data sets to visualize.

The following graphs highlight the historical data for the users’ visitation across 50 HTTP hosts for the first two weeks of October and the mean forecast for users’ visitation for the next three days.

The following graphs highlight the historical data and forecasted P50, P90, and P99 quantile values for www.google.com.

Amazon Forecast makes it easier to get started with machine learning without having to create your own ML models from scratch. You can use this information to anticipate the web traffic for the upcoming week, which can aid in scaling your resources and applications accordingly.

Cleaning up

To avoid incurring future charges, delete the following resources that you created in this walkthrough:

  • The Amazon S3 bucket trufactor-data-exchange-bucket
  • The AWS Glue crawler trufactor-data-exchange-crawler
  • The AWS Glue job demo-glue-job
  • The AWS IAM role AWSGlueServiceRole-demo-data-exchange
  • The AWS Glue database trufactor-db
  • The Amazon QuickSight demo data sets and analysis
  • The following Amazon Forecast resources (in this order) for users_visitation_sample_dataset_group via the console:
    • Existing forecasts under Forecasts
    • Existing predictors under Predictors
    • Existing datasets under Datasets

Conclusion

This walkthrough detailed how to import a data set to Amazon S3 from AWS Data Exchange, use AWS Glue to run crawlers and an ETL job on the data, run SQL queries with Amazon Athena, create a time series forecast of the queried data with Amazon Forecast, and visualize the queried and forecasted data with Amazon QuickSight.

This post used TruFactor Intelligence-as-a-Service, one of the AWS Data Exchange launch partners, to power this walkthrough. TruFactor intelligence on AWS Data Exchange highlighted the ease of loading directly into Amazon S3 and layering advanced AWS services.

For more information about TruFactor and the AWS Data Exchange, see TruFactor on AWS Data Exchange on the TruFactor website. You can subscribe to TruFactor Intelligence directly on AWS Data Exchange or engage with TruFactor directly to identify the right offering from the larger product portfolio of anonymized consumer intelligence.


About the Authors

Jay Park is a solutions architect at AWS.

Ariana Rahgozar is a solutions architect at AWS.

Build a cloud-native network performance analytics solution on AWS for wireless service providers

Post Syndicated from Angelo Sampietro original https://aws.amazon.com/blogs/big-data/build-a-cloud-native-network-performance-analytics-solution-on-aws-for-wireless-service-providers/

This post demonstrates a serverless, cloud-based approach to building a network performance analytics solution using AWS services that can provide flexibility and performance while keeping costs under control with pay-per-use AWS services.

Without good network performance, you may struggle to deliver real-time, low-latency services and to keep up with the increase in the total bandwidth your customers consume.

Considering the large volume of data that you need to ingest, store, and process every second for optimal performance monitoring, standard on-premises monitoring approaches are no longer efficient.

A cloud-native approach allows you to invest in the solutions that generate business value and move from the typical capital expenditure model to an operational expenditure model by avoiding upfront costs and over-provisioning of infrastructure.

Data and voice network complexity for mobile service providers

According to Cisco’s global mobile data traffic forecast, there will be 13.1 billion mobile-connected devices by 2023, and 1.4 billion of those will be 5G capable.

As a mobile service provider, you must understand how to perform accurate network planning and sizing for your access and core networks.

The increase in the global demand for network throughput and the number of services such as VoLTE, IoT, and video streaming on mobile networks is forcing mobile service providers to implement new architectures to match the desired quality of service (QoS).

Addressing optimal QoS when a multitude of services are running on a converged network is not an easy task. The workflow is complex: it starts with collecting counters and statistics from a multitude of different network elements and ends with transforming the collected data into key performance indicators (KPIs) that you can link to the quality of one of the many services delivered over the network.

Modern mobile networks, with the deployment of 4G, 5G, and IoT services, have given rise to an increased number of cells deployed in the field, so you must collect counters and generate KPIs on thousands of different network elements.

Considering that every network element can generate a few thousand counters, the network performance system has to manage millions of measurements at every collection cycle.

This is difficult to manage at scale with on-premises deployments without a high-cost solution. Instead, you can use AWS services to design a modern network performance analytics solution that covers all the requirements of different departments of telecommunications service providers (TSPs).

Data and voice network architecture

The main problem that you may face as a service provider is the complexity of modern mobile networks, which originate from several evolutions of communication standards in the last few decades (from 2G to 5G for data core, from CS to VoLTE for voice core) and the hardware and capabilities of the network elements.

The following diagram illustrates a simplified schema of a mobile wireless network element currently deployed, where you can find:

  • Access network, with the network elements needed for coverage from 2G to 5G
  • Core network, which includes the network elements for all the functions needed to deliver the services, handle authentication, and maintain the database of all the users on the network
  • Services delivered by the network, which include voice (PSTN/PLMN), internet (data service), and application services

IoT network architecture and service differentiation

The IoT traffic paradigm is completely different from the traffic patterns of smartphones and tablets. These mobile devices normally open an always-on session (PDP context or bearer) and generate a considerable amount of user plane traffic. An IoT device opens a session, sends a few bytes, and closes the session to limit power consumption and avoid allocating resources on cells when not necessary.

The following diagram depicts a real-world use case in which a service provider deploys IoT services. In most cases, IoT traffic is delivered to a different 4G core for the following reasons:

  • EPC optimization needed to manage IoT traffic
  • Avoiding IoT traffic managed by the same core network of customers’ traffic for security reasons
  • IoT devices generate huge volumes of control plane traffic that can impact customer performance if managed by the same network

This usage pattern requires a different traffic model and forces mobile service providers to perform firmware upgrades to the current cells or deploy new cells into the network. This has an impact on the QoS due to the increase in the number of counters and cells.

What is network performance, and why is it important?

You can use network performance monitoring systems to generate reports and insights and track network performance. Multiple organizations like Operations, Network Planning, and Engineering use these tools to view the overall quality of networks and services and control important KPIs like availability, response times, and download and upload speeds.

Network performance is strictly connected to the QoS that, by definition, is the mechanism to control the performance, reliability, and usability of a telecommunications service.

Pain points for service providers

Every vendor of network elements usually also provides the element managers for the hardware. These systems are proprietary, but most of the time they comply with the 3rd Generation Partnership Project (3GPP) standard for the XML format of performance measurement (PM) files, which export the counters you need to measure network performance accurately.

Service providers normally also use multiple vendors for core and access network elements, so monitoring a non-heterogenous environment is one of the main challenges in today’s deployments.

Data types and 3GPP standards

Counters and statistics about the performance on a network element are exported in PM files. 3GPP standardized the format in a specific document, which you can download from the 3GPP website.

The network vendor is not guaranteed to follow this standard, but even if there are some customizations, the structure and format in most cases remain very similar to the open standard.

The following schema from 3GPP describes the XML format for PM that is exported from network elements.

The XML schema root element is “measCollecFile” and it has three child elements:

  1. “fileHeader”: contains information such as file sender and the beginning time of the measurements.
  2. “fileFooter”: contains the end time of the measurements.
  3. “measData”: contains all the measurements information such as measurement types and their values.

The most important tags are:

  • measInfo – Contains the family of the measurements, the granularity, and the counters list for each measValue
  • measValue – Contains multiple measResults with the results of the measurements

Building a network performance solution on AWS

In this section we describe a possible architecture that can be used to implement the solution on AWS following a serverless approach.

Prerequisites

To implement this solution, you must have the following prerequisites:

  • An AWS account in the same Region
  • The AdministratorAccess policy granted to your AWS account (for production, you should restrict access as needed)

This post uses the EU (Ireland) Region. However, you can choose another Region where the services used in this solution (AWS Transfer for SFTP, Amazon Kinesis Data Firehose, Amazon S3, AWS Lambda, AWS Glue, Amazon Athena, and Amazon QuickSight) are available.

For more information about AWS Regions and where AWS services are available, see Region Table.

The following diagram illustrates the high-level, end-to-end solution and the AWS services it uses. The workflow begins with the ingestion of the files using AWS Transfer for SFTP or Kinesis Data Firehose. The data is stored in Amazon S3, processed with a Lambda function, and cataloged with AWS Glue. You query the data with Athena and visualize it in Amazon QuickSight.

Collecting data using Kinesis Data Firehose or AWS SFTP

For information about collecting PM files via element managers, see chapter 5 of Technical Specification 3GPP TS 31.432 on the 3GPP website.

Element managers act as collectors of XML measurement files that the network elements provide. They can send the files to an S3 bucket via AWS SFTP or Kinesis Data Firehose.

For the data transfer, you can refer to the documentation about creating a Kinesis Data Firehose delivery stream and sending data to a Kinesis Data Firehose delivery stream. If you are planning to use SFTP transfer, you can read about how AWS Transfer for SFTP works.

Transforming data using a Lambda function

This post provides a Lambda function that is associated with the ObjectCreated event type of the destination S3 bucket rawxml prefix (s3://wireless-pm/rawxml). The function runs every time a new XML file is saved in this location.

The Lambda function is written in Python 3.7. You can download it from the GitHub repository for this blog post. The function uses a layer to resolve the dependency for the xmltodict library used in the code.

The following screenshot of the AWS Management Console shows some of the main properties of the Lambda function:

  1. The Lambda name and its associated Lambda Layer.
  2. The S3 bucket trigger event on the ObjectCreated event.

The function transposes the XML files, converts them into CSV or JSON (depending on the value set in the function’s output_format environment variable), and formats the files with one record per measurement (measData), measure type, and value (measInfo). The technical specification 3GPP TS 32.435 document on the 3GPP website provides three example XML files in the Annex section.

The following code is an example A.1 XML file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<measCollecFile xmlns="http://www.3gpp.org/ftp/specs/archive/32_series/32.435#measCollec"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.3gpp.org/ftp/specs/archive/32_series/32.435#measCollec http://www.3gpp.org/ftp/specs/archive/32_series/32.435#measCollec">
    <fileHeader fileFormatVersion="32.435 V7.0" vendorName="Company NN" dnPrefix="DC=a1.companyNN.com,SubNetwork=1,IRPAgent=1">
        <fileSender localDn="SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1" elementType="RNC"/>
        <measCollec beginTime="2000-03-01T14:00:00+02:00"/>
    </fileHeader>
    <measData>
        <managedElement localDn="SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1" userLabel="RNC Telecomville"/>
        <measInfo>
            <job jobId="1231"/>
            <granPeriod duration="PT900S" endTime="2000-03-01T14:14:30+02:00"/>
            <repPeriod duration="PT1800S"/>
            <measTypes>attTCHSeizures succTCHSeizures attImmediateAssignProcs succImmediateAssignProcs</measTypes>
            <measValue measObjLdn="RncFunction=RF-1,UtranCell=Gbg-997">
                <measResults>234 345 567 789</measResults>
            </measValue>
            <measValue measObjLdn="RncFunction=RF-1,UtranCell=Gbg-998">
                <measResults>890 901 123 234</measResults>
            </measValue>
            <measValue measObjLdn="RncFunction=RF-1,UtranCell=Gbg-999">
                <measResults>456 567 678 789</measResults>
                <suspect>true</suspect>
            </measValue>
        </measInfo>
    </measData>
    <fileFooter>
        <measCollec endTime="2000-03-01T14:15:00+02:00"/>
    </fileFooter>
</measCollecFile>

The Lambda function transforms the preceding file into the following JSON format (this code example only shows the first record of the transformed and transposed dataset):

{
    "fh_file_format_version": "32.435 V7.0",
    "fh_vendor_name": "Company NN",
    "fh_dn_prefix": "DC=a1.companyNN.com,SubNetwork=1,IRPAgent=1",
    "fh_fs_local_dn": "SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1",
    "fh_fs_element_type": "RNC",
    "fh_mc_begin_time": "2000-03-01T14:00:00+02:00",
    "ff_mc_end_time": "2000-03-01T14:15:00+02:00",
    "md_me_local_dn": "SubNetwork=CountryNN,MeContext=MEC-Gbg-1,ManagedElement=RNC-Gbg-1",
    "md_me_user_label": "RNC Telecomville",
    "md_me_sw_version": "",
    "md_mi_meas_info_id": "",
    "md_mi_job_jobid": "1231",
    "md_mi_gp_duration": "PT900S",
    "md_mi_gp_end_time": "2000-03-01T14:14:30+02:00",
    "md_mi_rp_duration": "PT1800S",
    "md_mi_meas_obj_ldn": "RncFunction=RF-1,UtranCell=Gbg-997",
    "md_mi_meas_name": "attTCHSeizures",
    "md_mi_meas_value": 234,
    "md_mi_meas_p": null,
    "md_mi_meas_suspect": null
}

You can change the output field names in the get_record_header static method of the GPPXml class defined in the Lambda function. The fields md_mi_meas_name and md_mi_meas_value contain the measure name and measure value, respectively.
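The complete function is available in the GitHub repository mentioned above. As a rough illustration of the approach, and not the actual implementation, the following sketch uses the same xmltodict library to flatten a measCollecFile document into one record per measurement name and value; the helper names and the reduced set of output fields are made up for this example.

import xmltodict

def _as_list(node):
    # xmltodict returns a dict for a single child element and a list for repeated ones
    return node if isinstance(node, list) else [node]

def transpose_pm_file(xml_string):
    """Flatten a 3GPP measCollecFile into one record per (measValue, measType) pair."""
    root = xmltodict.parse(xml_string)["measCollecFile"]
    header = root["fileHeader"]
    records = []
    for meas_info in _as_list(root["measData"]["measInfo"]):
        meas_types = meas_info["measTypes"].split()
        for meas_value in _as_list(meas_info["measValue"]):
            results = meas_value["measResults"].split()
            for name, value in zip(meas_types, results):
                records.append({
                    "fh_vendor_name": header["@vendorName"],
                    "md_mi_meas_obj_ldn": meas_value["@measObjLdn"],
                    "md_mi_meas_name": name,
                    "md_mi_meas_value": float(value),
                })
    return records

Running this sketch on the example A.1 file shown earlier yields 12 records (four counters for each of the three measValue elements).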

The transformed CSV and JSON files are saved in the raw_transform_csv and raw_transform_json prefixes, respectively, in the S3 bucket (only one format is created for each execution, depending on the value of the output_format environment variable). The following screenshot shows the S3 bucket overview on the Amazon S3 console.

For this use case, a Lambda function is triggered for the transformation task every time a new XML file arrives. Depending on the volume and size of the XML files, you can choose other compute options on AWS based on the right fit, such as in containers using AWS Batch, Amazon ECS, or Amazon EKS. For more information about configuring and running processes using Amazon ECS, see Building, deploying, and operating containerized applications with AWS Fargate.

Alternatively, you could ingest the raw XML file using a Kinesis Firehose stream with a Lambda function attached to it to transform the data to JSON and output directly to Amazon S3 in Parquet or ORC format using the record format conversion feature of Kinesis Firehose. You would need to predefine a table in the Data Catalog whose schema, serializer, and deserializer you use to convert the data.

Building a Data Catalog using AWS Glue

To create an AWS Glue table in the Data Catalog, first create an AWS Glue crawler. Complete the following steps:

  1. Open the AWS Glue console.
  2. In the navigation pane, under Data Catalog, choose Crawlers.
  3. Choose Add crawler.
  4. For Crawler name, enter a name for your crawler; for example, Wireless_PM_crawler.
  5. Choose Next.
  6. For Choose a data store, choose S3.
  7. For Crawl data in, select Specified path in my account.
  8. For Include path, enter the path to the input files in CSV format.
  9. Choose Next.
  10. In the Choose an IAM role section, select Choose an existing IAM role.
  11. For IAM role, enter the name of your role to grant read permission to access the S3 bucket.
  12. Choose Next. The following code is the AWSGlueServiceRole-S3 IAM policy for this post:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::npm-blog/raw_transform_csv/Parquet*",
                    "arn:aws:s3:::npm-blog/raw_transform_csv/*"
                ]
            }
        ]
    }

  13. In the Choose the crawler’s output section, for Database, choose the database in which the Data Catalog table is created; for example, wireless_pm.
  14. Choose Next. In these next steps, you create an ETL job that converts from CSV to Parquet (a columnar file format) and writes the Parquet files to an S3 bucket. The Parquet file format helps with the performance of downstream consumption of the data. To convert to Parquet, you can either use an AWS Glue generated script (which this post uses) or use your own PySpark or Scala script.
  15. In the Configure the job properties section, for Name, enter wireless_pm_parquet_conversion.
  16. For IAM role, enter AWSGlueServiceRole-wirelesspm.
  17. For Type, choose Spark.
  18. For Glue version, choose Spark 2.4, Python 3 (Glue Version 1.0).
  19. For This job runs, select A proposed script generated by AWS Glue.
  20. For Script file name, enter wireless_pm_parquet_conversion.
  21. In the Choose a data target section, select Create tables in your data target.
  22. For Data store, choose Amazon S3.
  23. For Format, choose Parquet.
  24. For Target path, enter the path to your S3 bucket.
  25. Choose Next.

You can configure a trigger to start the job by creating a workflow in AWS Glue that runs the job after the crawler. When the job is complete, a new set of Parquet files is stored in the destination S3 bucket.
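If you prefer to drive these two steps from a script instead of an AWS Glue workflow, the following boto3 sketch runs the crawler, waits for it to finish, and then starts the conversion job; the crawler and job names are the examples used in this section.

import time
import boto3

glue = boto3.client("glue")
crawler_name = "Wireless_PM_crawler"
job_name = "wireless_pm_parquet_conversion"

# Run the crawler and wait until it returns to the READY state
glue.start_crawler(Name=crawler_name)
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(30)

# Start the CSV-to-Parquet job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print(f"Job finished with state {state}")
        break
    time.sleep(30)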

You can analyze the processed data directly with Athena by selecting the database created in AWS Glue, and analyzing the data from the AWS Glue table.

The following screenshot shows the AWS Glue database and the two tables created.

The following screenshot shows an example Athena query, which you can run by selecting a few columns to validate the data from the Data Catalog table you created.

You can also use other AWS services for analytics such as Amazon Redshift or Amazon EMR for further processing, analysis, and KPI calculations from the data.

Visualizing data using Amazon QuickSight

Amazon QuickSight is a business analytics service you can use to build visualizations, perform ad hoc analysis, and get business insights from your data. For more information, see What Is Amazon QuickSight?

Amazon QuickSight provides out-of-the-box integration with Athena, which lets you run SQL queries on top of the metadata in your Data Catalog. For more information, see Creating a Data Set Using Amazon Athena Data.

The following screenshot shows an Amazon QuickSight analysis created from an Amazon QuickSight dataset. The dataset is based on the table that AWS Glue defined, and you query it via Athena.

You can create a visualization by loading the example dataset of PM counters. For example, the following screenshot shows the measurement values for the GPRS SuccessAttach and AbortedAttach and highlights a problem.

Summary

This post discussed the main pain points for wireless service providers and how AWS services can help you build a serverless solution that scales according to your network growth with no upfront costs.

This solution also helps you visualize and analyze data. Additionally, it provided insights that can help operations and network planning departments manage and evolve their network according to current and future standards and services.

As always, AWS welcomes feedback. Please submit comments or questions in the comments section.

 


About the Authors

Angelo Sampietro is a senior partner solutions architect at Amazon Web Services. Angelo has a strong background in cloud computing, with 20 years of experience in the telecommunications industry, working in the United States and Europe. With five AWS certifications, including the Big Data and Machine Learning Specialties, he currently works as a Senior Partner Solutions Architect, helping global system integrators succeed in their partnership with AWS. He loves to share new ideas with colleagues and friends and to propose new, out-of-the-box solutions.

Francesco Marelli is a senior solutions architect at Amazon Web Services. He lived and worked in London for 10 years, and has since worked in Italy, Switzerland, and other countries in EMEA. He specializes in the design and implementation of analytics, data management, and big data systems, mainly for enterprise and FSI customers. Francesco also has strong experience in systems integration and in the design and implementation of web applications. He loves sharing his professional knowledge, collecting vinyl records, and playing bass.

A public data lake for analysis of COVID-19 data

Post Syndicated from AWS Data Lake Team original https://aws.amazon.com/blogs/big-data/a-public-data-lake-for-analysis-of-covid-19-data/

As the COVID-19 pandemic continues to threaten and take lives around the world, we must work together across organizations and scientific disciplines to fight this disease. Innumerable healthcare workers, medical researchers, scientists, and public health officials are already on the front lines caring for patients, searching for therapies, educating the public, and helping to set policy. At AWS, we believe that one way we can help is to provide these experts with the data and tools needed to better understand, track, plan for, and eventually contain and neutralize the virus that causes COVID-19.

Today, we are making a public AWS COVID-19 data lake available – a centralized repository of up-to-date and curated datasets on or related to the spread and characteristics of the novel coronavirus (SARS-CoV-2) and its associated illness, COVID-19. Globally, there are several efforts underway to gather this data, and we are working with partners to make this crucial data freely available and keep it up-to-date. Hosted on the AWS cloud, we have seeded our curated data lake with COVID-19 case tracking data from Johns Hopkins and The New York Times, hospital bed availability from Definitive Healthcare, and over 45,000 research articles about COVID-19 and related coronaviruses from the Allen Institute for AI. We will regularly add to this data lake as other reliable sources make their data publicly available.

The breakthroughs that can win the battle against this disease arrive faster when it’s easy for everyone to access and experiment with this vital information. The AWS COVID-19 data lake allows experimenters to quickly run analyses on the data in place without wasting time extracting and wrangling data from all the available data sources. They can use AWS or third-party tools to perform trend analysis, do keyword search, perform question/answer analysis, build and run machine learning models, or run custom analyses to meet their specific needs. Since every stakeholder in this battle brings their own perspective, users can choose to work with the public data lake, combine it with their own data, or subscribe to the source datasets directly through AWS Data Exchange.

We imagine local health authorities could build dashboards to track infections and collaborate to efficiently deploy vital resources like hospital beds and ventilators. Or epidemiologists could complement their own models and datasets to generate better forecasts of hotspots and trends.

For example, at Chan Zuckerberg Biohub, a nonprofit where leaders in science and technology collaborate to cure, prevent, or manage disease, scientists are using the AWS COVID-19 data lake for new epidemiological insights. “Our team of researchers is now analyzing trends in disease spread, its geography, and time evolution by leveraging datasets from the AWS COVID-19 data lake, combined with our own data, in order to better predict COVID epidemiology,” said Jim Karkanias, Vice President of Data Science and Information Technology at Chan Zuckerberg Biohub.

This post walks you through examples of how to use the AWS COVID-19 data lake for analysis. This data lake is comprised of data in a publicly readable Amazon S3 bucket (s3://covid19-lake). The post shows how to set up the definitions for that data in an AWS Glue Data Catalog to expose it to analytics engines. You can then query the AWS COVID-19 data lake with Amazon Athena, a serverless SQL query engine.

Prerequisites

This post assumes you have the following:

  • Access to an AWS account
  • Permissions to create an AWS CloudFormation stack
  • Permissions to create AWS Glue resources (catalog databases and tables)

Configuring access to the data using a CloudFormation template

To make the data from the AWS COVID-19 data lake available in the Data Catalog in your AWS account, create a CloudFormation stack using the following template. If you are signed in to your AWS account, the following link fills out most of the stack creation form for you. All you need to do is choose Create stack. For instructions on creating a CloudFormation stack, see Get Started in the AWS CloudFormation documentation.

This template creates a covid-19 database in your Data Catalog and tables that point to the public AWS COVID-19 data lake. You do not need to host the data in your account, and you can rely on AWS to refresh the data as datasets are updated through AWS Data Exchange.
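If you prefer the AWS SDK to the console, you can create the same stack programmatically. In the sketch below, the stack name is arbitrary and the template URL is a placeholder; use the template link provided in this post.

import boto3

cfn = boto3.client("cloudformation")
stack_name = "covid-lake-stack"  # any name you like

cfn.create_stack(
    StackName=stack_name,
    TemplateURL="https://example-bucket.s3.amazonaws.com/CovidLakeStack.template.json",  # placeholder
)
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)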

Exploring the data through the Data Catalog in your AWS account

When the CloudFormation stack shows a status of CREATE_COMPLETE, access the Glue Data Catalog to see the tables that the template created. You should see the following tables:

  • Global Coronavirus (COVID-19) Data – Tracks confirmed COVID-19 cases in provinces, states, and countries across the world with a breakdown to the county level in the US.

 

Table Name | Description | Source | Provider
enigma_jhu | Confirmed COVID-19 cases | Johns Hopkins | Enigma

Table Name | Description | Source | Provider
nytimes_states | Data on COVID-19 cases at US state level | NY Times | Rearc
nytimes_counties | Data on COVID-19 cases at US county level | NY Times | Rearc

Table Name | Description | Source | Provider
covid_testing_states_daily | USA total test daily trend by state | COVID Tracking Project | Rearc
covid_testing_us_daily | USA total test daily trend | COVID Tracking Project | Rearc
covid_testing_us_total | USA total tests | COVID Tracking Project | Rearc

Table Name | Description | Source | Provider
hospital_beds | Hospital beds and their utilization in the US | Definitive Healthcare | Rearc

Table Name | Description | Source/Provider
alleninstitute_metadata | Metadata on papers pulled from the CORD-19 dataset. The sha column indicates the paper ID, which is the file name of the paper in the data lake. | Allen Institute for AI
alleninstitute_comprehend_medical | Results from Amazon Comprehend Medical run against the CORD-19 dataset. | Allen Institute for AI

  • Lookup tables to support visualizations.

Table Name | Description
country_codes | Lookup table for country codes
county_populations | Lookup table for the population for each county based on recent census data
us_state_abbreviations | Lookup table for US state abbreviations

In addition, you can see descriptions of the columns in these tables. For example, the following screenshot shows the metadata of the table containing COVID-19 cases from Johns Hopkins.

Querying data via Amazon Athena

This section demonstrates how to query these tables using Athena. Athena is a serverless interactive query service that makes it easy to analyze the data in the AWS COVID-19 data lake. Athena supports SQL, a common language that data analysts use for analyzing structured data. To query the data, complete the following steps:

  1. Sign in to the Athena console.

If this is the first time you are using Athena, you must specify a query result location on Amazon S3.

  1. From the drop-down menu, choose the covid-19 database.
  2. Enter your query.

The following query returns the growth of confirmed cases for the past 7 days joined side-by-side with hospital bed availability, broken down by US county:

SELECT 
  cases.fips, 
  admin2 as county, 
  province_state, 
  confirmed,
  growth_count, 
  sum(num_licensed_beds) as num_licensed_beds, 
  sum(num_staffed_beds) as num_staffed_beds, 
  sum(num_icu_beds) as num_icu_beds
FROM 
  "covid-19"."hospital_beds" beds, 
  ( SELECT 
      fips, 
      admin2, 
      province_state, 
      confirmed, 
      last_value(confirmed) over (partition by fips order by last_update) - first_value(confirmed) over (partition by fips order by last_update) as growth_count,
      first_value(last_update) over (partition by fips order by last_update desc) as most_recent,
      last_update
    FROM  
      "covid-19"."enigma_jhu" 
    WHERE 
      from_iso8601_timestamp(last_update) > now() - interval '7' day AND country_region = 'US') cases
WHERE 
  beds.fips = cases.fips AND last_update = most_recent
GROUP BY cases.fips, confirmed, growth_count, admin2, province_state
ORDER BY growth_count desc

The following screenshot shows the results of this query.

Athena also allows you to run these queries through REST APIs, for example, for building your own visualizations. Moreover, Athena is just one of the many engines that you can use on the data lake. For example, you can use Amazon Redshift Spectrum to join lake data with other datasets in your Redshift data warehouse, or use Amazon QuickSight to visualize your datasets.
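As a hedged example of that API route, the following boto3 sketch starts a query against the covid-19 database, waits for it to complete, and prints the first page of results. The output location is a placeholder bucket that you own.

import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString='SELECT fips, confirmed FROM "covid-19"."enigma_jhu" LIMIT 10',
    QueryExecutionContext={"Database": "covid-19"},
    ResultConfiguration={"OutputLocation": "s3://your-query-results-bucket/athena/"},  # placeholder
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])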

We have also created a public Amazon QuickSight dashboard from the COVID-19 case tracking data, testing data, and hospital bed data. You can track daily updates with this dashboard. You can also drill-down to see breakdowns by country, province, and county without having to write a line of SQL. The following is a recent screenshot of the dashboard.

CORD-19 research articles

The CORD-19 dataset is a collection of metadata and full-text of research articles about COVID-19, SARS-CoV-2, and related coronaviruses. You can index this data with Amazon Kendra for question/answer exploration, or enrich the data with Amazon Comprehend Medical. We have already done the latter and put it in the table called alleninstitute_comprehend_medical.

The alleninstitute_metadata table provides detailed fields for each paper, such as the title, authors, journal, and URL. The alleninstitute_comprehend_medical table contains key medical concepts such as medical condition, medication, dosage, strength, and frequency. With this metadata, you can quickly query over concepts, analyze or aggregate over authors and journals, and locate papers.

Aggregating over journals

Using IL-6 inhibitors is a possible therapy for COVID-19, and clinical trials are underway. To demonstrate how to use these tables, this post presents a use case in which you want to understand which journals discuss IL-6 the most by counting the papers they published. You can do this by running the following query:

SELECT m.journal,
       count(distinct(cm.paper_id)) as paper_count
FROM "covid-19".alleninstitute_metadata m
JOIN "covid-19".alleninstitute_comprehend_medical cm
    ON (contains(split(m.sha, '; '), cm.paper_id))
WHERE contains(generic_name, 'IL-6')
GROUP BY  m.journal
ORDER BY paper_count desc

The following screenshot shows an example of the results. The data provider updates this dataset over time, so your results may look different (here, we notice that the second highest count has no journal information).

Drilling down into papers

To see the URLs and the titles of the papers in one of these journals, you simply query both these tables again. For example, to drill into IL-6 related papers in the Crit Care journal, enter the following query:

SELECT distinct m.url, m.title
FROM "covid-19".alleninstitute_metadata m
JOIN "covid-19".alleninstitute_comprehend_medical cm
    ON (contains(split(m.sha, '; '), cm.paper_id))
WHERE contains(generic_name, 'IL-6')
      AND m.journal = 'Crit Care'

The following screenshot shows an example of the results.

These examples are a few of the innumerable analyses you can run on the public data lake. You incur no additional cost for accessing the AWS COVID-19 data lake beyond the standard charges for the AWS services that you use. For example, if you use Athena, you will incur the costs for running queries and the data storage in the S3 query result location, but incur no costs for accessing the data lake. In addition, if you want this data in raw form, you can subscribe to, download, and stay up-to-date through AWS Data Exchange. We encourage you to try using the public AWS COVID-19 data lake yourself.

Conclusion

Combining our efforts across organizations and scientific disciplines can help us win the fight against the COVID-19 pandemic. With the AWS COVID-19 data lake, anyone can experiment with and analyze curated data related to the disease, as well as share their own data and results. We believe that through an open and collaborative effort that combines data, technology, and science, we can inspire insights and foster breakthroughs necessary to contain, curtail, and ultimately cure COVID-19.

For daily updates on how AWS is addressing the crisis, see Amazon’s COVID-19 blog.

 


About the Authors

The AWS Data Lake Team members are Roy Ben-Alta, Jason Berkowitz, Chris Casey, Patrick Combes, Lucy Friedmann, Fred Lee, Megan Maxwell, Rourke McNamara, Herain Oberoi, Stephen Orban, Brian Ross, Nikki Rouda, Noah Schwartz, Noritaka Sekiyama, Mehul A. Shah, Ben Snively, and Ying Wang.

Monitoring and management with Amazon QuickSight and Athena in your CI/CD pipeline

Post Syndicated from Umair Nawaz original https://aws.amazon.com/blogs/devops/monitoring-and-management-with-amazon-quicksight-and-athena-in-your-ci-cd-pipeline/

One of the many ways to monitor and manage required CI/CD metrics is to use Amazon QuickSight to build customized visualizations. Additionally, by applying Lean management to software delivery processes, organizations can deliver features faster, pivot when needed, respond to compliance and security changes, and take advantage of instant feedback to improve the customer delivery experience. This blog post demonstrates how AWS resources and tools can provide monitoring and information pertaining to your CI/CD pipelines.

There are three principles in Lean management that this solution enables and contributes to:

  • Limiting work in progress by establishing constraints that drive process improvement and increase throughput.
  • Creating and maintaining dashboards displaying key quality information, productivity metrics, and current status of work (including defects).
  • Using data from development performance and operations monitoring tools to enable business decisions more frequently.

Overview

The following architectural diagram shows how to use AWS services to collect metrics from a CI/CD pipeline and deliver insights through Amazon QuickSight dashboards.

Architecture diagram showing an overview of how CI/CD metrics are extracted and transformed to create a dynamic QuickSight dashboard

In this example, the orchestrator for the CI/CD pipeline is AWS CodePipeline with the entry point as an AWS CodeCommit Git repository for source control. When a developer pushes a code change into the CodeCommit repository, the change goes through a series of phases in CodePipeline. AWS CodeBuild is responsible for performing build actions and, upon successful completion of this phase, AWS CodeDeploy kicks off the actions to execute the deployment.

For each action in CodePipeline, the following series of events occurs:

  • An Amazon CloudWatch rule creates a CloudWatch event containing the action’s metadata.
  • The CloudWatch event triggers an AWS Lambda function.
  • The Lambda function extracts relevant reporting data and writes it to a CSV file in an Amazon S3 bucket.
  • Amazon Athena queries the Amazon S3 bucket and loads the query results into SPICE (an in-memory engine for Amazon QuickSight).
  • Amazon QuickSight obtains data from SPICE to build dashboard displays for the management team.

Note: This solution is for an AWS account with one or more existing pipelines in AWS CodePipeline. If you do not have a pipeline, no metrics are collected.

Getting started

To get started, follow these steps:

  • Create a Lambda function and copy the following code snippet. Be sure to replace the bucket name with the one used to store your event data. This Lambda function takes the payload from a CloudWatch event, extracts the pipeline, time, state, execution, stage, and action fields, and transforms them into a CSV file.

Note: Athena’s performance can be improved by compressing, partitioning, or converting data into columnar formats such as Apache Parquet. In this use case, the dataset size is negligible; therefore, a transformation from CSV to Parquet is not required.


import boto3
import csv
import datetime
import os

# Analyze the payload from the CloudWatch event
def pipeline_execution(data):
    print(data)
    # Specify data fields to deliver to S3; the first row is the CSV header
    row = ['pipeline,time,state,execution,stage,action']

    if "stage" in data['detail'].keys():
        stage = data['detail']['stage']
    else:
        stage = 'NA'

    if "action" in data['detail'].keys():
        action = data['detail']['action']
    else:
        action = 'NA'
    # CodePipeline events expose the execution ID in the 'execution-id' detail field
    row.append(data['detail']['pipeline'] + ',' + data['time'] + ',' + data['detail']['state'] + ',' + data['detail']['execution-id'] + ',' + stage + ',' + action)
    values = '\n'.join(str(v) for v in row)
    return values

# Upload the CSV file to the S3 bucket
def upload_data_to_s3(data):
    s3 = boto3.client('s3')
    runDate = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M:%S:%f")
    csv_key = runDate + '.csv'
    response = s3.put_object(
        Body=data,
        Bucket='<example-bucket>',
        Key=csv_key
    )

def lambda_handler(event, context):
    upload_data_to_s3(pipeline_execution(event))
  • Create an Athena table to query the data stored in the Amazon S3 bucket. Execute the following SQL in the Athena query console and provide the bucket name that will hold the data.
CREATE EXTERNAL TABLE `devops`(
  `pipeline` string,
  `time` string,
  `state` string,
  `execution` string,
  `stage` string,
  `action` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<example-bucket>/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'skip.header.line.count'='1',
  'typeOfData'='file')
  • Create a CloudWatch event rule that passes events to the Lambda function created in Step 1. In the event rule configuration, set the Service Name as CodePipeline and, for Event Type, select All Events.
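If you prefer to script this last step, the following minimal sketch shows one way to create the rule and point it at the Lambda function using boto3. The Region, account ID, and function name are placeholders, not values from this post.

import json

import boto3

REGION = "us-east-1"                       # placeholder: your Region
ACCOUNT_ID = "123456789012"                # placeholder: your account ID
FUNCTION_NAME = "<your-lambda-function>"   # placeholder: the function from Step 1

events = boto3.client("events", region_name=REGION)
lambda_client = boto3.client("lambda", region_name=REGION)

# Match every CodePipeline event (Service Name: CodePipeline, Event Type: All Events)
rule = events.put_rule(
    Name="codepipeline-metrics-rule",
    EventPattern=json.dumps({"source": ["aws.codepipeline"]}),
    State="ENABLED",
)

# Allow CloudWatch Events to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId="AllowCloudWatchEventsInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)

# Send matched events to the Lambda function
events.put_targets(
    Rule="codepipeline-metrics-rule",
    Targets=[{
        "Id": "1",
        "Arn": f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{FUNCTION_NAME}",
    }],
)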

Sample Dataset view from Athena.

Sample Athena query and the results
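The exact query from the screenshot is not reproduced here, but the following sketch shows a plausible aggregation over the devops table created above, submitted through boto3. The database name and results bucket are placeholders.

import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT pipeline, state, COUNT(*) AS events
FROM devops
GROUP BY pipeline, state
ORDER BY pipeline, state
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},  # placeholder: the database holding the devops table
    ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/"},  # placeholder
)

query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])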

Amazon QuickSight visuals

After the initial setup is done, you are ready to create your QuickSight dashboard. Be sure to check that the Athena permissions are properly set before creating an analysis to be published as an Amazon QuickSight dashboard.

Below are diagrams and figures from Amazon QuickSight that can be generated using the event data queried from Athena. In this example, you can see how many executions happened in the account and how many were successful.

The following screenshot shows that most pipeline executions are failing. A manager might view this as a significant issue and prompt an investigation, allocating resources to improve delivery and efficiency.

QuickSight Dashboard showing total execution successes and failures

The visual for this solution is dynamic in nature. If the pipeline has more or fewer actions, the visual adjusts automatically to reflect all of them. After looking at the success and failure rates for each CodePipeline action in Amazon QuickSight, as shown in the following screenshot, users can take targeted actions quickly. For example, if the team sees a lot of failures due to vulnerability scanning, they can work on improving that problem area to drive value for future code releases.

QuickSight Dashboard showing the successes and failures of pipeline actions

Day-over-day visuals reflect date-specific activity and enable teams to see their progress over a period of time.

QuickSight Dashboard showing day over day results of successful CI/CD executions and failures

Amazon QuickSight offers controls that can be configured to apply filters to visuals. For example, the following screenshot demonstrates how users can toggle between visuals for different applications.

QuickSight's control function to switch between different visualization options

Cleanup (optional)

In order to avoid unintended charges, delete the following resources:

  • Amazon CloudWatch event rule
  • Lambda function
  • Amazon S3 Bucket (the location in which CSV files generated by the Lambda function are stored)
  • Athena external table
  • Amazon QuickSight data sets
  • Analysis and dashboard

Conclusion

In this blog, we showed how metrics can be derived from a CI/CD pipeline. Utilizing Amazon QuickSight to create visuals from these metrics allows teams to continuously deliver updates on the deployment process to management. The aggregation of the captured data over time allows individual developers and teams to improve their processes. That is the goal of creating a Lean DevOps process: to oversee the meta-delivery pipeline and optimize all future releases by identifying weak spots and points of risk during the entire release process.

___________________________________________________________

About the Authors

Umair Nawaz is a DevOps Engineer at Amazon Web Services in New York City. He works on building secure architectures and advises enterprises on agile software delivery. He is motivated to solve problems strategically by utilizing modern technologies.
Christopher Flores is an Engagement Manager at Amazon Web Services in New York City. He leads AWS developers, partners, and client teams in using the customer engagement accelerator framework. Christopher expedites stakeholder alignment, enterprise cohesion and risk mitigation while ensuring feedback loops to close the engagement lifecycle.
Carol Liao is a Cloud Infrastructure Architect at Amazon Web Services in New York City. She enjoys designing and developing modern IT solutions in the cloud where there is always more to learn, more problems to solve, and more to build.

 

How Siemens built a fully managed scheduling mechanism for updates on Amazon S3 data lakes

Post Syndicated from Pedro Bento original https://aws.amazon.com/blogs/big-data/how-siemens-built-a-fully-managed-scheduling-mechanism-for-consistent-updates-on-amazon-s3-data-lakes/

Siemens is a global technology leader with more than 370,000 employees and 170 years of experience. To protect Siemens from cybercrime, the Siemens Cyber Defense Center (CDC) continuously monitors Siemens’ networks and assets. To handle the resulting enormous data load, the CDC built a next-generation threat detection and analysis platform called ARGOS. ARGOS is a hybrid-cloud solution that makes heavy use of fully managed AWS services for streaming, big data processing, and machine learning.

Users such as security analysts, data scientists, threat intelligence teams, and incident handlers continuously access data in the ARGOS platform. Further, various automated components update, extend, and remove data to enrich information, improve data quality, enforce PII requirements, or mutate data due to schema evolution or additional data normalization requirements. Keeping the data always available and consistent presents multiple challenges.

While object-based data lakes are highly beneficial from a cost perspective compared to traditional transactional databases in such scenarios, they hardly allow for atomic updates without highly complex and costly extensions. To overcome this problem, Siemens designed a solution that enables atomic file updates on Amazon S3-based data lakes without compromising query performance and availability.

This post presents this solution, which is an easy-to-use scheduling service for S3 data update tasks. Siemens uses it for multiple purposes, including pseudonymization, anonymization, and removal of sensitive data. This post demonstrates how to use the solution to remove values from a dataset after a predefined amount of time. Adding further data processing tasks is straightforward because the solution has a well-defined architecture and the whole stack consists of fewer than 200 lines of source code. It is solely based on fully managed AWS services and therefore achieves minimal operational overhead.

Architecture overview

This post uses an S3-based data lake with continuous data ingestion and Amazon Athena as query mechanism. The goal is to remove certain values after a predefined time automatically after ingestion. Applications and users consuming the data via Athena are not impacted (for example, they do not observe downtimes or data quality issues like duplication).

The following diagram illustrates the architecture of this solution.

Siemens built the solution with the following services and components:

  1. Scheduling trigger – New data (for example, in JSON format) is continuously uploaded to an S3 bucket.
  2. Task scheduling – As soon as new files land, an AWS Lambda function processes the resulting S3 bucket notification events. As part of the processing, it creates a new item on Amazon DynamoDB that specifies a Time to Live (TTL) and the path to that S3 object.
  3. Task execution trigger – When the TTL expires, the DynamoDB item is deleted from the table and the DynamoDB stream triggers a Lambda function that processes the S3 object at that path.
  4. Task execution – The Lambda function derives meta information (like the relevant S3 path) from the TTL expiration event and processes the S3 object. Finally, the new S3 object replaces the older version.
  5. Data usage – The updated data is available for querying from Athena without further manual processing, and uses S3’s eventual consistency on read operations.

About DynamoDB Streams and TTL

TTL for DynamoDB lets you define when items in a table expire so that they can be deleted from the database automatically, at no extra cost. By enabling TTL on a table and setting a deletion timestamp on a per-item basis, you can limit storage to records that are still relevant without consuming provisioned throughput.

Solution overview

To implement this solution manually, complete the following steps:

  1. Create a DynamoDB table and configure DynamoDB Streams.
  2. Create a Lambda function to insert TTL records.
  3. Configure an S3 event notification on the target bucket.
  4. Create a Lambda function that performs data processing tasks.
  5. Use Athena to query the processed data.

If you want to deploy the solution automatically, you can skip these steps and use the AWS CloudFormation template provided.

Prerequisites

To complete this walkthrough, you must have the following:

  • An AWS account with access to the AWS Management Console.
  • A role with access to S3, DynamoDB, Lambda, and Athena.

Creating a DynamoDB table and configuring DynamoDB Streams

Start first with the time-based trigger setup. For this, you use S3 notifications, DynamoDB Streams, and a Lambda function to integrate both services. The DynamoDB table stores the items to process after a predefined time.

Complete the following steps:

  1. On the DynamoDB console, create a table.
  2. For Table name, enter objects-to-process.
  3. For Primary key, enter path and choose String.
  4. Select the table and click on Manage TTL next to “Time to live attribute” under table details.
  5. For TTL attribute, enter ttl.
  6. For DynamoDB Streams, choose Enable with view type New and old images.

Note that although DynamoDB lets you point TTL at any attribute name, expiration only works when that attribute holds a numeric (epoch timestamp) value.

DynamoDB TTL is not minute-precise. Expired items can take up to 48 hours to be deleted, although in practice the deviation from the actual TTL value is often only 10–30 minutes. For more information, see Time to Live: How It Works.
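If you prefer to script this section instead of using the console, the following sketch approximates the same table setup with boto3. On-demand billing is an assumption; the post does not specify a capacity mode.

import boto3

dynamodb = boto3.client("dynamodb")

# Create the table with a string partition key named 'path'
dynamodb.create_table(
    TableName="objects-to-process",
    AttributeDefinitions=[{"AttributeName": "path", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "path", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # assumption: capacity mode is not specified in this post
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
dynamodb.get_waiter("table_exists").wait(TableName="objects-to-process")

# Use the 'ttl' attribute as the item expiration timestamp
dynamodb.update_time_to_live(
    TableName="objects-to-process",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "ttl"},
)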

Creating a Lambda function to insert TTL records

The first Lambda function you create is for scheduling tasks. It receives an S3 notification as input, recreates the S3 path (for example, s3://<bucket>/<key>), and creates a new item on DynamoDB with two attributes: the S3 path and the TTL (in seconds). For more information about a similar S3 notification event structure, see Test the Lambda Function.

To deploy the Lambda function, on the Lambda console, create a function named NotificationFunction with the Python 3.7 runtime and the following code:

import boto3, time

# TTL in seconds; default 300 (5 minutes)
default_ttl = 300

s3_client = boto3.client('s3')
table = boto3.resource('dynamodb').Table('objects-to-process')

def parse_bucket_and_key(s3_notif_event):
    s3_record = s3_notif_event['Records'][0]['s3']
    return s3_record['bucket']['name'], s3_record['object']['key']

def lambda_handler(event, context):
    try:
        bucket_name, key = parse_bucket_and_key(event)
        head_obj = s3_client.head_object(Bucket=bucket_name, Key=key)
        tags = s3_client.get_object_tagging(Bucket=bucket_name, Key=key)
        # Schedule only non-empty objects that have not been tagged as processed yet;
        # this prevents the task execution function's writes from re-triggering the cycle
        if head_obj['ContentLength'] > 0 and len(tags['TagSet']) == 0:
            record_path = f"s3://{bucket_name}/{key}"
            table.put_item(Item={'path': record_path, 'ttl': int(time.time()) + default_ttl})
    except Exception:
        pass  # Ignore malformed or irrelevant events

Configuring S3 event notifications on the target bucket

You can take advantage of the scalability, security, and performance of S3 by using it as a data lake for storing your datasets. Additionally, you can use S3 event notifications to capture S3-related events, such as the creation or deletion of objects within a bucket. You can forward these events to other AWS services, such as Lambda.

To configure S3 event notifications, complete the following steps:

  1. On the S3 console, create an S3 bucket named data-bucket.
  2. Choose the bucket and go to the Properties tab.
  3. Under Advanced Settings, choose Events and add a notification.
  4. For Name, enter MyEventNotification.
  5. For Events, select All object create events.
  6. For Prefix, enter dataset/.
  7. For Send to, choose Lambda Function.
  8. For Lambda, choose NotificationFunction.

This configuration restricts the scheduling to events that happen within your previously defined dataset. For more information, see How Do I Enable and Configure Event Notifications for an S3 Bucket?
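The same notification can also be configured programmatically. The following sketch assumes you know the NotificationFunction ARN and that S3 has not yet been granted permission to invoke it.

import boto3

BUCKET = "data-bucket"
FUNCTION_ARN = "arn:aws:lambda:<region>:<account-id>:function:NotificationFunction"  # placeholder

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

# S3 must be allowed to invoke the function before the notification configuration is accepted
lambda_client.add_permission(
    FunctionName="NotificationFunction",
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{BUCKET}",
)

s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "Id": "MyEventNotification",
            "LambdaFunctionArn": FUNCTION_ARN,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "prefix", "Value": "dataset/"}]}},
        }]
    },
)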

Creating a Lambda function that performs data processing tasks

You have now created a time-based trigger for the deletion of the record in the DynamoDB table. However, when the system delete occurs and the change is recorded in DynamoDB Streams, no further action is taken. Lambda can poll the stream to detect these change records and trigger a function to process them according to the activity (INSERT, MODIFY, REMOVE).

This post is only concerned with deleted items because it uses DynamoDB TTL expirations, which are recorded in DynamoDB Streams, to trigger task executions. Lambda gives you the flexibility to either process the item by itself or to forward the processing effort somewhere else (such as an AWS Glue job or an Amazon SQS queue).

This post uses Lambda directly to process the S3 objects. The Lambda function performs the following tasks:

  1. Gets the S3 object from the DynamoDB item’s S3 path attribute.
  2. Modifies the object’s data.
  3. Overwrites the old S3 object with the updated content and tags the object as processed.

Complete the following steps:

  1. On the Lambda console, create a function named JSONProcessingFunction with Python 3.7 as the runtime and the following code:
    import json, boto3
    from functools import partial
    from urllib.parse import urlparse
    
    s3 = boto3.resource('s3')
    
    def parse_bucket_and_key(s3_url_as_string):
        s3_path = urlparse(s3_url_as_string)
        return s3_path.netloc, s3_path.path[1:]
    
    def extract_s3path_from_dynamo_event(event):
        # Only REMOVE records (TTL expirations) carry work to do
        if event["Records"][0]["eventName"] == "REMOVE":
            return event["Records"][0]["dynamodb"]["Keys"]["path"]["S"]
    
    def modify_json(json_dict, column_name, value):
        json_dict[column_name] = value
        return json_dict
        
    def get_obj_contents(bucketname, key):
        obj = s3.Object(bucketname, key)
        return obj.get()['Body'].iter_lines()
    
    # Blank out the file_contents attribute of every JSON record
    clean_column_2_func = partial(modify_json, column_name="file_contents", value="")
    
    def lambda_handler(event, context):
        s3_url_as_string = extract_s3path_from_dynamo_event(event)
        if s3_url_as_string:
            bucket_name, key = parse_bucket_and_key(s3_url_as_string)
            # Clean each JSON line, then overwrite the object and mark it as processed
            updated_json = "\n".join(map(json.dumps, map(clean_column_2_func, map(json.loads, get_obj_contents(bucket_name, key)))))
            s3.Object(bucket_name, key).put(Body=updated_json, Tagging="PROCESSED=True")
        else:
            print(f"Invalid event: {str(event)}")

  2. On the Lambda function configuration webpage, click on Add trigger.
  3. For Trigger configuration, choose DynamoDB.
  4. For DynamoDB table, choose objects-to-process.
  5. For Batch size, enter 1.
  6. For Batch window, enter 0.
  7. For Starting position, choose Trim horizon.
  8. Select Enable trigger.

You use batch size = 1 because each S3 object represented on the DynamoDB table is typically large. If these files are small, you can use a larger batch size. The batch size is essentially the number of files that your Lambda function processes at a time.

Because any new object on S3 (in a versioning-enabled bucket) creates an object creation event, even if its key already exists, you must make sure that your task scheduling Lambda function ignores any object creation events that your task execution function generates. Otherwise, it creates an infinite loop. This post uses tags on S3 objects: when the task execution function processes an object, it adds a processed tag. The task scheduling function ignores those objects in subsequent executions.

Using Athena to query the processed data

The final step is to create a table for Athena to query the data. You can do this manually or by using an AWS Glue crawler that infers the schema directly from the data and automatically creates the table for you. This post uses a crawler because it can handle schema changes and add new partitions automatically. To create this crawler, use the following code:

aws glue create-crawler --name data-crawler \
--role <AWSGlueServiceRole-crawler> \
--database-name data_db \
--description 'crawl data bucket!' \
--targets \
"{\
  \"S3Targets\": [\
    {\
      \"Path\": \"s3://<data-bucket>/dataset/\"\
    }\
  ]\
}"

Replace <AWSGlueServiceRole-crawler> and <data-bucket> with the name of your AWSGlueServiceRole and S3 bucket, respectively.

When the crawling process is complete, you can start querying the data. You can use the Athena console to interact with the table while its underlying data is being transparently updated. See the following code:

SELECT * FROM data_db.dataset LIMIT 1000

Automated setup

You can use the following AWS CloudFormation template to create the solution described on this post on your AWS account. To launch the template, choose the following link:

This CloudFormation stack requires the following parameters:

  • Stack name – A meaningful name for the stack, for example, data-updater-solution.
  • Bucket name – The name of the S3 bucket to use for the solution. The stack creation process creates this bucket.
  • Time to Live – The number of seconds to expire items on the DynamoDB table. Referenced S3 objects are processed on item expiration.

Stack creation takes up to a few minutes. Check and refresh the AWS CloudFormation Resources tab to monitor the process while it is running.

When the stack shows the state CREATE_COMPLETE, you can start using the solution.

Testing the solution

To test the solution, download the mock_uploaded_data.json dataset created with the Mockaroo data generator. The use case is a web service in which users can upload files. The goal is to remove the uploaded file contents a predefined time after the upload to reduce storage and query costs. To this end, the provided code looks for the attribute file_contents and replaces its value with an empty string.

You can now upload new data into your data-bucket S3 bucket under the dataset/ prefix. Your NotificationFunction Lambda function processes the resulting bucket notification event for the upload, and a new item appears on your DynamoDB table. Shortly after the predefined TTL time, the JSONProcessingFunction Lambda function processes the data and you can check the resulting changes via an Athena query.

You can also confirm that an S3 object was processed successfully if the DynamoDB item corresponding to this S3 object is no longer present in the DynamoDB table and the S3 object has the processed tag.
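A minimal check along those lines, assuming the bucket and key of the object you uploaded, could look like the following:

import boto3

BUCKET = "data-bucket"                       # assumption: the bucket created earlier
KEY = "dataset/mock_uploaded_data.json"      # assumption: the object you uploaded

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("objects-to-process")

# The scheduling item should be gone once the TTL expired and the stream record was processed
item = table.get_item(Key={"path": f"s3://{BUCKET}/{KEY}"}).get("Item")
print("DynamoDB item still present:", item is not None)

# The processed object should carry the PROCESSED tag added by JSONProcessingFunction
tags = s3.get_object_tagging(Bucket=BUCKET, Key=KEY)["TagSet"]
print("PROCESSED tag present:", any(tag["Key"] == "PROCESSED" for tag in tags))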

Conclusion

This post showed how to automatically re-process objects on S3 after a predefined amount of time by using a simple and fully managed scheduling mechanism. Because you use S3 for storage, you automatically benefit from S3’s eventual consistency model, simply by using identical keys (names) both for the original and processed objects. This way, you avoid query results with duplicate or missing data. Also, incomplete or only partially uploaded objects do not result in data inconsistencies because S3 only creates new object versions for successfully completed file transfers.

You may have previously used Spark to process objects hourly. This requires you to monitor objects that must be processed, to move and process them in a staging area, and to move them back to their actual destination. The main drawback is the final step because, due to Spark's parallel nature, files are generated with different names and contents. That prevents direct file replacement in the dataset and leads to downtimes or potential data duplicates when data is queried during a move operation. Additionally, because each copy/delete operation could potentially fail, you have to deal with possible partially processed data manually.

From an operations perspective, AWS serverless services simplify your infrastructure. You can combine the scalability of these services with a pay-as-you-go plan to start with a low-cost POC and scale to production quickly—all with a minimal code base.

Compared to hourly Spark jobs, you could potentially reduce costs by up to 80%, which makes this solution both cheaper and simpler.

Special thanks to Karl Fuchs, Stefan Schmidt, Carlos Rodrigues, João Neves, Eduardo Dixo and Marco Henriques for their valuable feedback on this post’s content.

 


About the Authors

Pedro Completo Bento is a senior big data engineer working at Siemens CDC. He holds a Master's in Computer Science from the Instituto Superior Técnico in Lisbon. He started his career as a full-stack developer, later specializing in big data challenges. Working with AWS, he builds highly reliable, performant, and scalable systems in the cloud, while keeping costs at bay. In his free time, he enjoys playing board games with his friends.

 

 

Arturo Bayo is a big data consultant at Amazon Web Services. He promotes a data-driven culture in enterprise customers around EMEA, providing specialized guidance on business intelligence and data lake projects while working with AWS customers and partners to build innovative solutions around data and analytics.

 

 

 

 

ICYMI: Serverless Q4 2019

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2019/

Welcome to the eighth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

The three months comprising the fourth quarter of 2019

AWS re:Invent

AWS re:Invent 2019

re:Invent 2019 dominated the fourth quarter at AWS. The serverless team presented a number of talks, workshops, and builder sessions to help customers increase their skills and deliver value more rapidly to their own customers.

Serverless talks from re:Invent 2019

Chris Munns presenting 'Building microservices with AWS Lambda' at re:Invent 2019

We presented dozens of sessions showing how customers can improve their architecture and agility with serverless. Here are some of the most popular.

Videos

Decks

You can also find decks for many of the serverless presentations and other re:Invent presentations on our AWS Events Content.

AWS Lambda

For developers needing greater control over performance of their serverless applications at any scale, AWS Lambda announced Provisioned Concurrency at re:Invent. This feature enables Lambda functions to execute with consistent start-up latency, making them ideal for building latency-sensitive applications.

As shown in the below graph, provisioned concurrency reduces tail latency, directly impacting response times and providing a more responsive end user experience.

Graph showing performance enhancements with AWS Lambda Provisioned Concurrency

Lambda rolled out enhanced VPC networking to 14 additional Regions around the world. This change brings dramatic improvements to startup performance for Lambda functions running in VPCs due to more efficient usage of elastic network interfaces.

Illustration of AWS Lambda VPC to VPC NAT

New VPC to VPC NAT for Lambda functions

Lambda now supports three additional runtimes: Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new version-specific features and benefits, which are covered in the linked release posts. Like the Node.js 10 runtime, these new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda released a number of controls for both stream and async-based invocations:

  • You can now configure error handling for Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams. It’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

AWS Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set separate destinations for success and failure. This unlocks new patterns for distributed event-based applications and can replace custom code previously used to manage routing results.

Illustration depicting AWS Lambda Destinations with success and failure configurations

Lambda Destinations
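As a rough sketch of how these asynchronous invocation controls and destinations fit together (the function name and destination ARNs below are placeholders, not values from any launch post):

import boto3

lambda_client = boto3.client("lambda")

lambda_client.put_function_event_invoke_config(
    FunctionName="my-async-function",  # placeholder
    MaximumEventAgeInSeconds=3600,     # discard events older than one hour
    MaximumRetryAttempts=1,            # retry a failed asynchronous invocation once
    DestinationConfig={
        "OnSuccess": {"Destination": "arn:aws:sqs:us-east-1:123456789012:success-queue"},  # placeholder
        "OnFailure": {"Destination": "arn:aws:sns:us-east-1:123456789012:failure-topic"},  # placeholder
    },
)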

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Kinesis Data Streams and DynamoDB Streams. This enables faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.

Illustration of multiple AWS Lambda invocations per Kinesis Data Streams shard

Lambda Parallelization Factor diagram
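The stream-related controls mentioned above, including Parallelization Factor, can be applied to an existing event source mapping. The following is a hedged sketch; the mapping UUID and queue ARN are placeholders.

import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_event_source_mapping(
    UUID="11111111-2222-3333-4444-555555555555",  # placeholder: an existing Kinesis/DynamoDB Streams mapping
    ParallelizationFactor=5,           # up to 5 concurrent invocations per shard
    MaximumRetryAttempts=2,            # cap retries for a failing batch
    MaximumRecordAgeInSeconds=3600,    # skip records older than one hour
    BisectBatchOnFunctionError=True,   # split batches to isolate a "poison pill" record
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:us-east-1:123456789012:stream-dlq"}  # placeholder
    },
)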

Lambda introduced Amazon SQS FIFO queues as an event source. “First in, first out” (FIFO) queues guarantee the order of record processing, unlike standard queues. FIFO queues support message batching via a MessageGroupId attribute, which enables parallel Lambda consumers of a single FIFO queue and high-throughput record processing by Lambda.

Lambda now supports Environment Variables in the AWS China (Beijing) Region and the AWS China (Ningxia) Region.

You can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics show the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, discover outliers, and find hard-to-spot situations that affect customer experience for a subset of your users.

Amazon API Gateway

Screen capture of creating an Amazon API Gateway HTTP API in the AWS Management Console

Amazon API Gateway announced the preview of HTTP APIs. In addition to significant performance improvements, most customers see an average cost savings of 70% when compared with API Gateway REST APIs. With HTTP APIs, you can create an API in four simple steps. Once the API is created, additional configuration for CORS and JWT authorizers can be added.

AWS SAM CLI

Screen capture of the new 'sam deploy' process in a terminal window

The AWS SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One powerful feature of AWS Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. In Q4, Step Functions expanded its integration with Amazon SageMaker to simplify machine learning workflows. Step Functions also added a new integration with Amazon EMR, making EMR big data processing workflows faster to build and easier to monitor.

Screen capture of an AWS Step Functions step with Amazon EMR

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor trends and react to usage on your AWS account.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.

Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates. This increases the limits to 1,500 state transitions per second and a default start rate of 300 state machine executions per second in US East (N. Virginia), US West (Oregon), and Europe (Ireland). Click the above link to learn more about the limit increases in other Regions.

Screen capture of choosing Express Workflows in the AWS Management Console

Step Functions released AWS Step Functions Express Workflows. With the ability to support event rates greater than 100,000 per second, this feature is designed for high-performance workloads at a reduced cost.

Amazon EventBridge

Illustration of the Amazon EventBridge schema registry and discovery service

Amazon EventBridge announced the preview of the Amazon EventBridge schema registry and discovery service. This service allows developers to automate discovering and cataloging event schemas for use in their applications. Additionally, once a schema is stored in the registry, you can generate and download a code binding that represents the schema as an object in your code.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQ) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and re-submit them or analyze them to locate processing issues.

Amazon CloudWatch

Amazon CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

Screenshot of Amazon CloudWatch ServiceLens in the AWS Management Console

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

CloudWatch introduced Embedded Metric Format, which helps you ingest complex high-cardinality application data as logs and easily generate actionable metrics. You can publish these metrics from your Lambda function by using the PutLogEvents API or using an open source library for Node.js or Python applications.

Finally, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

AWS X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Identifiers show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. X-Ray support for CloudWatch Synthetics enables tracing canary scripts throughout the application, providing metrics on performance and application issues.

Screen capture of AWS X-Ray Service map in the AWS Management Console

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

Amazon DynamoDB announced support for customer-managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows customers to bring your own key (BYOK) giving you full control over how you encrypt and manage the security of your DynamoDB data.

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

Another new DynamoDB capability to identify frequently accessed keys and database traffic trends is currently in preview. With this, you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

Screen capture of Amazon CloudWatch Contributor Insights for DynamoDB in the AWS Management Console

CloudWatch Contributor Insights for DynamoDB

DynamoDB also released adaptive capacity. Adaptive capacity helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of over provisioning for uneven data access patterns.

Amazon RDS

Amazon Relational Database Services (RDS) announced a preview of Amazon RDS Proxy to help developers manage RDS connection strings for serverless applications.

Illustration of Amazon RDS Proxy

The RDS Proxy maintains a pool of established connections to your RDS database instances. This pool enables you to support a large number of application connections so your application can scale without compromising performance. It also increases security by enabling IAM authentication for database access and enabling you to centrally manage database credentials using AWS Secrets Manager.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge appears next to your name in the SAR and links to your GitHub profile.

Screen capture of SAR Verified developer badge in the AWS Management Console

SAR Verified developer badges

AWS Developer Tools

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has passed through specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

Screen capture of AWS CodeBuild

CodeBuild test trends in the AWS Management Console

Amazon CodeGuru

AWS announced a preview of Amazon CodeGuru at re:Invent 2019. CodeGuru is a machine learning based service that makes code reviews more effective and aids developers in writing code that is more secure, performant, and consistent.

AWS Amplify and AWS AppSync

AWS Amplify added iOS and Android as supported platforms. Now developers can build iOS and Android applications using the Amplify Framework with the same category-based programming model that they use for JavaScript apps.

Screen capture of 'amplify init' for an iOS application in a terminal window

The Amplify team has also improved offline data access and synchronization by announcing Amplify DataStore. Developers can now create applications that allow users to continue to access and modify data, without an internet connection. Upon connection, the data synchronizes transparently with the cloud.

For a summary of Amplify and AppSync announcements before re:Invent, read: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

Illustration of AWS AppSync integrations with other AWS services

Q4 serverless content

Blog posts

October

November

December

Tech talks

We hold several AWS Online Tech Talks covering serverless tech talks throughout the year. These are listed in the Serverless section of the AWS Online Tech Talks page.

Here are the ones from Q4:

Twitch

October

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

AWS Serverless Heroes

We are excited to welcome some new AWS Serverless Heroes to help grow the serverless community. We look forward to some amazing content to help you with your serverless journey.

AWS Serverless Application Repository (SAR) Apps

In this edition of ICYMI, we are introducing a section devoted to SAR apps written by the AWS Serverless Developer Advocacy team. You can run these applications and review their source code to learn more about serverless and to see examples of suggested practices.

Still looking for more?

The Serverless landing page has much more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. We’re also kicking off a fresh series of Tech Talks in 2020 with new content providing greater detail on everything new coming out of AWS for serverless application developers.

Throughout 2020, the AWS Serverless Developer Advocates are crossing the globe to tell you more about serverless, and to hear more about what you need. Follow this blog to keep up on new launches and announcements, best practices, and examples of serverless applications in action.

You can also follow all of us on Twitter to see latest news, follow conversations, and interact with the team.

Chris Munns: @chrismunns
Eric Johnson: @edjgeek
James Beswick: @jbesw
Moheeb Zara: @virgilvox
Ben Smith: @benjamin_l_s
Rob Sutter: @rts_rob
Julian Wood: @julian_wood

Happy coding!

Collect and distribute high-resolution crypto market data with ECS, S3, Athena, Lambda, and AWS Data Exchange

Post Syndicated from Jared Katz original https://aws.amazon.com/blogs/big-data/collect-and-distribute-high-resolution-crypto-market-data-with-ecs-s3-athena-lambda-and-aws-data-exchange/

This is a guest post by Floating Point Group. In their own words, “Floating Point Group is on a mission to bring institutional-grade trading services to the world of cryptocurrency.”

The need and demand for financial infrastructure designed specifically for trading digital assets may not be obvious. There’s a rather pervasive narrative that these coins and tokens are effectively natively digital counterparts to traditional assets such as currencies, commodities, equities, and fixed income. This narrative often manifests in the form of pithy one-liners recycled by pundits attempting to communicate the value proposition of various projects in the space (such as, “Bitcoin is just a currency with an algorithmically controlled, tamper-proof monetary policy,” or, “Ether is just a commodity like gasoline that you can use to pay for computational work on a global computer.”). Unsurprisingly, we at FPG often hear the question, “What’s so special about cryptocurrencies that they warrant dedicated financial services? Why do we need solutions for problems that have already been solved?”

The truth is that these assets and the widespread public interest surrounding them are entirely unprecedented. The decentralized ledger technology that serves as an immutable record of network transactions, the clever use of proof-of-work algorithms to economically incentivize rational actors to help uphold the security of the network (the proof-of-work concept dates back at least as far as 1993, but it was not until bitcoin that the technology showed potential for widespread adoption), the irreversible nature of transactions that poses unique legal challenges in cases such as human error or extortion, the precariousness of self-custody (third-party custody solutions don’t exactly have track records that inspire trust), the regulatory uncertainties that come with the difficulty of both classifying these assets as well as arbitrating their exchange which must ultimately be reconciled by entities like the IRS, SEC, and CFTC—it is all very new, and very weird. With 24-hour market volume regularly exceeding $100 billion, we decided to direct our focus towards problems related specifically to trading these assets. Granted, crypto trading has undoubtedly matured since the days of bartering for bitcoin in web forums and witnessing 10% price spreads between international exchanges. But there is still a long path ahead.

One major pain point we are aiming to address for institutional traders involves liquidity (or, more precisely, the lack thereof). Simply put, the buying and selling of cryptocurrencies occurs across many different trading venues (exchanges), and liquidity (the offers to buy or sell a certain quantity of an asset at a certain price) continues to become more fragmented as new exchanges emerge. So say you’re trying to buy 100 bitcoins. You must buy from people who are willing to sell. As you take the best (cheapest) offers, you’re left with increasingly expensive offers. By the time you fill your order (in this example, buy all 100 bitcoins), you may have paid a much higher average price than, say, the price you paid for the first bitcoin of your order. This phenomenon is referred to as slippage. One easy way to minimize slippage is by expanding your search for offers. So rather than looking at the offers on just one exchange, look at the offers across hundreds of exchanges. This process, traditionally referred to as smart order routing (SOR), is one of the core services we provide. Our SOR service allows traders to easily submit orders that our system can match against the best offers available across multiple trading venues by actively monitoring liquidity across dozens of exchanges.

Fanning out large orders in search of the best prices is a rather intuitive and widely applicable concept—roughly 75% of equities are purchased and sold via SOR. But the value of such a service for crypto markets is particularly salient: a perpetual cycle of new exchanges surging in popularity while incumbents falter has resulted in a seemingly incessant fragmentation of liquidity across trading venues—yet traders tend to assume an exchange-agnostic mindset, concerned exclusively with finding the best price for a given quantity of an asset.

Access to both real-time and historical market data is essential to the functionality of our SOR service. The highest resolution data we could hope to obtain for a given market would include every trade and every change applied to the order book, effectively allowing us to recreate the state of a market at any given point in time. The updates provided through the WebSocket streams are not sufficient for reconstructing order books. We also need to periodically fetch snapshots of the order books and store those, which we can do using an exchange’s REST API. We can fetch a snapshot and apply the corresponding updates from the streams to “replay” the order book.

Fortunately, this data is freely available, because many exchanges offer real-time feeds of market data via WebSocket APIs. We found several third-party vendors selling subscriptions to these data sets, typically in the form of CSV dumps delivered at a weekly or monthly cadence. This presented the question of build vs. buy. Given that we felt capable of building a robust and reliable system for ingesting real-time market data in a relatively short amount of time and at a fraction of the cost of purchasing the data from a vendor, we were already leaning in favor of building. Further investigation made buying look like an increasingly unattractive option. Disclaimers that multiple vendors issued about their inability to guarantee data quality and consistency did not inspire confidence. Inspecting sample data sets revealed that some essential fields provided in the original data streams were missing—fields necessary for achieving our goal of recreating the state of a market at an arbitrary point in time. We also recognized that a weekly or monthly delivery schedule would restrict our ability to explore relatively recent market data.

This post provides a high-level overview of how we ingest and store real-time market data and how we use the AWS Data Exchange API to organize and publish our data sets programmatically. Our system’s functionality extends well beyond data ingestion, normalization, and persistence; we run dedicated services for data validation, caching the most recent trade and order book for every market, computing and storing derivative metrics, and other services that help safeguard data accuracy and minimize the latency of our trading systems.

Data ingestion

The WebSocket streams we connect to for data consumption are often the same APIs responsible for providing real-time updates to an exchange’s trading dashboard.

WebSocket connections transmit data as discrete messages. We can inspect the content of individual messages as they stream into the browser. For example, the following screenshot shows a batch of order book updates.

The updates are expressed as arrays of bids and asks that were either added to the book or removed from it. Client-side code processes each update, resulting in a real-time rendering of the market’s order book. In practice, our data ingestion service (Ingester) does not read a single stream, but rather thousands of different streams, covering various data feeds for all markets across multiple exchanges. All the connections required for such broad coverage and the resulting flood of incoming data raise some obvious concerns about data loss. We’ve taken several measures to mitigate such concerns, including a redundant system design that allows us to spin up an arbitrary number of instances of the Ingester service. Like most of our microservices, Ingester is a Dockerized service run on Amazon ECS and deployed via Terraform.

All these instances consume the same data feeds as each other while a downstream mechanism handles deduplication (this is covered in more detail later in this post). We also set up Amazon CloudWatch alerts to notify us when we detect non-contiguous messages, indicating a gap in the incoming data. The alerts don’t directly mitigate data loss, but they do serve the important function of prompting an investigation.

Ingester builds up separate buffers of incoming messages, split out by data-type/exchange/market. Then, after a fixed time interval, each buffer is flushed into Amazon S3 as a gzipped JSON file. The buffer-flush cycle repeats.
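Our production Ingester is more involved, but the following simplified sketch (the bucket and function names are illustrative, not our actual code) captures the buffer-and-flush idea: messages are grouped by data type, exchange, and market, then written to S3 as gzipped JSON once per interval.

import gzip
import json
from collections import defaultdict

import boto3

BUCKET = "<market-data-bucket>"  # placeholder
s3 = boto3.client("s3")

# One buffer per (data_type, exchange, market)
buffers = defaultdict(list)

def buffer_message(data_type, exchange, market, message):
    buffers[(data_type, exchange, market)].append(message)

def flush_buffers(interval_key):
    # Write each buffer as a gzipped JSON-lines file, then start a new cycle
    for (data_type, exchange, market), messages in buffers.items():
        if not messages:
            continue
        body = gzip.compress("\n".join(json.dumps(m) for m in messages).encode("utf-8"))
        key = f"{data_type}/{exchange}/{market}/{interval_key}.json.gz"
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    buffers.clear()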

The following screenshot shows a portion of the file content.

This code snippet is a single, pretty-printed JSON record from the file in the screenshot above.

{
   "event_type":"trade",
   "timestamp":1571980320422,
   "ticker_pair":"BTCUSDT",
   "trade_id":194230159,
   "price":"7405.69000000",
   "quantity":"3.20285300",
   "buyer_order_id":730178987,
   "seller_order_id":730178953,
   "trade_timestamp":1571980320417,
   "buyer_market_maker":false,
   "M":true
}

Ingester handles additional functionality, such as applying pre-defined mappings of venue-specific field names to our internal field names. Data normalization is one of many processes necessary to enable our systems to build a holistic understanding of market dynamics.

As with most distributed system designs, our services are written with horizontal scalability as a first-order priority. We took the same approach in designing our data ingestion service, but it has some features that make it a bit different than the archetypical horizontally scalable microservice. The most common motivations for adjusting the number of instances of a given service are load-balancing and throttling throughput. Either your system is experiencing backpressure and a consumer service scales to alleviate that pressure, or the consumer is over-provisioned and you scale down the number of instances for the sake of parsimony. For our data ingestion service, however, our motivation for running multiple instances is to minimize data loss via redundancy. The CPU usage for each instance is independent of instance count, because each instance does identical work.

For example, rather than helping alleviate backpressure by pulling messages from a single queue, each instance of our data ingestion service connects to the same WebSocket streams and performs the same amount of work.

Another somewhat unusual and confounding aspect of horizontally scaling our data ingestion service is related to state: we batch records in memory and flush the records to S3 every minute (based on the incoming message's timestamp, not the system timestamp, because those would be inconsistent). Redundancy is our primary measure for minimizing data loss, but we also need each instance to write the files to S3 in such a way that we don't end up with duplicate records.

Our first thought was that we'd need a mechanism for coordinating activity across the instances, such as maintaining a cache that would allow us to check if a record had already been persisted. But we realized that we could perform this deduplication without any coordination between instances at all. Most of the message streams we consume publish messages with sequence IDs. We can combine the sequence IDs with the incoming message timestamp to achieve our deduplication mechanism: we can deterministically generate the same exact file names containing the exact same data by writing our service code to check that the message added to the batch has the appropriate sequence ID relative to the previous message in the batch and using the timestamp on the incoming message to determine the exact start and end of each batch (we typically get a UNIX timestamp and check when we've rolled over to the next clock minute). This allows us to simply rely on a key collision in S3 for deduplication.
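To make the coordination-free deduplication concrete, here is a simplified, hypothetical sketch of the idea: the batch boundary is derived from the message timestamp and the object key is generated deterministically, so redundant instances end up writing the same bytes to the same key.

import datetime

def minute_bucket(message_timestamp_ms):
    # Map a message's UNIX timestamp (in milliseconds) to the clock minute that names its batch
    dt = datetime.datetime.utcfromtimestamp(message_timestamp_ms / 1000)
    return dt.strftime("%Y/%m/%d/%H%M")

def is_next_in_sequence(previous_id, sequence_id):
    # Only append messages whose sequence ID immediately follows the previous one
    return previous_id is None or sequence_id == previous_id + 1

def object_key(data_type, exchange, market, message_timestamp_ms):
    # Every instance computes the same key for the same batch, so duplicate
    # writes simply collide on the same S3 object
    return f"{data_type}/{exchange}/{market}/{minute_bucket(message_timestamp_ms)}.json.gz"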

AWS suggests a similar solution for a slightly different problem, relating to Amazon Kinesis Data Streams. For more information, see Handling Duplicate Records.

With this scheme, even if records are processed more than one time, the resulting Amazon S3 file has the same name and has the same data. The retries only result in writing the same data to the same file more than one time.

After we store the data, we can perform simple analytics queries on the billions of records we’ve stored in S3 using Amazon Athena, a query service that requires minimal configuration and zero infrastructure overhead. Athena has a concept of partitions (inherited from one of its underlying services, Apache Hive). Partitions are mappings between virtual columns (in our case: pair, year, month, and day) and the S3 directories in which the corresponding data is stored.

S3’s file system is not actually hierarchical. Files are prepended with long key prefixes that are rendered as directories in the AWS console when browsing a bucket’s contents. This has some non-trivial performance consequences when querying or filtering on large data sets.

The following screenshot illustrates a typical directory path.

By pointing Athena directly to a particular subset of data, a well-defined partitioning scheme can drastically reduce query run times and costs. Though the ability to perform ad hoc business analytics queries is primarily a convenience, taking time to choose a sane multi-level partitioning scheme for Athena based on some of our most common access patterns seemed worthwhile. A poorly designed partition structure can result in Athena unnecessarily scanning huge swaths of data and ultimately render the service unusable.
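For illustration only (the table, columns, and buckets here are placeholders rather than our actual schema), a partition that maps the virtual columns to an S3 prefix can be added with a DDL statement submitted through boto3:

import boto3

athena = boto3.client("athena")

DDL = """
ALTER TABLE market_data.trades ADD IF NOT EXISTS
PARTITION (pair = 'BTCUSDT', year = '2019', month = '10', day = '25')
LOCATION 's3://<market-data-bucket>/trades/BTCUSDT/2019/10/25/'
"""

athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/"},  # placeholder
)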

Data publication

Our pipeline for transforming thousands of small gzipped JSON files into clean CSVs and loading them into AWS Data Exchange involves three distinct jobs, each expressed as an AWS Lambda function.

Job 1

Job 1 is initiated shortly after midnight UTC by a cron-scheduled CloudWatch event. As mentioned previously, our data ingestion service’s batching mechanism flushes each batch to S3 at a regular time interval. A timestamp on the incoming message (applied server-side) determines the rollover from one interval to the next, as opposed to the ingestion service’s system timestamp, so in the rare case that a non-trivial amount of time elapses between the consumption of the final message of batch n and the first message of batch n+1, we kick off the first Lambda function 20 minutes after midnight UTC to minimize the likelihood of omitting data pending write.

Job 1 formats values for the date and data source into an Athena query template and outputs the query results as a CSV to a specified prefix path in S3. (Every Athena query produces a .metadata file and a CSV file of the query results, though DDL statements do not output a CSV.) This PUT request to S3 triggers an S3 event notification.

We run a full replica data ingestion system as an additional layer of redundancy. Using the coalesce conditional expression, the Athena query in Job 1 merges data from our primary system with the corresponding data from our replica system, and fills in any gaps while deduplicating redundant records.

We experimented fairly extensively with AWS Glue and PySpark for the ETL-related work performed in Job 1. When we realized that we could merge all the small source files into one, join the primary and replica data sets, and sort the results with a single Athena query, we decided to stick with this seemingly simpler and more elegant approach.

The following code shows one of our Athena query templates.

Job 2

Job 2 is triggered by the S3 event notification from Job 1. Job 2 simply copies the query results CSV file to a different key within the same S3 bucket.

The motivation for this step is twofold. First, we cannot dictate the name of an Athena query results CSV file; it is automatically set to the Athena query ID. Second, when adding an S3 object as an asset to an AWS Data Exchange revision, the asset’s name is automatically set to the S3 object’s key. So to dictate how the CSV file name appears in AWS Data Exchange, we must first rename it, which we accomplish by copying it to a specified S3 key.
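A minimal sketch of Job 2, assuming the event comes from the S3 notification emitted by Job 1 and that the destination key encodes the file name we want to appear in AWS Data Exchange:

import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    source_key = urllib.parse.unquote_plus(record["object"]["key"])

    # placeholder: choose a key whose final component is the asset name you want in AWS Data Exchange
    destination_key = "staged-for-data-exchange/2019-10-25_binance_trades.csv"

    s3.copy_object(
        Bucket=bucket,
        Key=destination_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )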

Job 3

Job 3 handles all work related to AWS Data Exchange and AWS Marketplace Catalog via their respective APIs. We use boto3, AWS’s Python SDK, to interface with these APIs. The AWS Marketplace Catalog API is necessary for adding data set revisions to products that have already been published. For more information, see Tutorial: Adding New Data Set Revisions to a Published Data Product.

Our code explicitly defines mappings with the following structure:

data source / DataSet / Product

The following code shows how we configure relationships between data sources, data sets, and products.

Our data sources are typically represented by a trading venue and data type combination (such as Binance trades or CoinbasePro order books). Each new file for a given data source is delivered as a single asset within a single new revision for a particular data set.

An S3 trigger kicks off the Lambda function. The trigger is scoped to a specified prefix that maps to a single data set. The function alias feature of AWS Lambda allows us to define the unique S3 triggers for each data set while reusing the same underlying Lambda function. Job 3 carries out the following steps (note that steps 1 through 5 refer to the AWS Data Exchange API while steps 6 and 7 refer to the AWS Marketplace Catalog API):

  1. Submits a request to create a new revision for the corresponding data set via CreateRevision.
  2. Adds the file that was responsible for triggering the Lambda function to the newly created revision via CreateJob using the IMPORT_ASSETS_FROM_S3 job type. To submit this job, we need to supply a few values: the S3 bucket and key values for the file are pulled from the Lambda event message, while the RevisionID argument comes from the response to the CreateRevision call in the previous step.
  3. Kicks off the job with StartJob, sourcing the JobID argument from the response to the CreateJob call in the previous step.
  4. Polls the job’s status via GetJob (using the job ID from the response to the StartJob call in the previous step) to check that our file (the asset) was successfully added to the revision.
  5. Finalizes the revision via UpdateRevision.
  6. Requests a description of the marketplace entity using DescribeEntity, passing in the product ID stored in our hardcoded mappings as the EntityID.
  7. Kicks off a ChangeSet via StartChangeSet, passing in the entity identifier and entity type from the DescribeEntity response in the previous step, the revision ARN parsed from the response to our earlier call to CreateRevision as RevisionArn, and the data set ARN as DataSetArn, which we fetch at the start of the code’s runtime using the AWS Data Exchange API’s GetDataSet.

Here’s a thin wrapper class we wrote to carry out the steps detailed above:

from time import sleep
import logging
import json

import boto3

from config import (
    DATA_EXCHANGE_REGION,
    MARKETPLACE_CATALOG_REGION,
    LambdaS3TriggerMappings
)

logger = logging.getLogger()


class CustomDataExchangeClient:
    def __init__(self):
        self._de_client = boto3.client('dataexchange', region_name=DATA_EXCHANGE_REGION)
        self._mc_client = boto3.client('marketplace-catalog', region_name=MARKETPLACE_CATALOG_REGION)
    
    def _get_s3_data_source(self, bucket, prefix):
        return LambdaS3TriggerMappings[(bucket, prefix)]

    # Job State can be one of: WAITING | IN_PROGRESS | ERROR | COMPLETED | CANCELLED | TIMED_OUT
    def _wait_for_de_job_completion(self, job_id):
        while True:
            get_job_resp = self._de_client.get_job(JobId=job_id)
            if get_job_resp['State'] == 'COMPLETED':
                logger.info(f"Job '{job_id}' succeeded:\n\t{get_job_resp}")
                break
            elif get_job_resp['State'] in ('ERROR', 'CANCELLED'):
                raise Exception(f"Job '{job_id}' failed:\n\t{get_job_resp}")
            else:
                sleep(5)
                logger.info(f"Still waiting on job {job_id}...")
        return get_job_resp

    # ChangeSet Status can be one of: PREPARING | APPLYING | SUCCEEDED | CANCELLED | FAILED
    def _wait_for_mc_change_set_completion(self, change_set_id):
        while True:
            describe_change_set_resp = self._mc_client.describe_change_set(
                Catalog='AWSMarketplace',
                ChangeSetId=change_set_id
                )
            if describe_change_set_resp['Status'] == 'SUCCEEDED':
                logger.info(
                    f"ChangeSet '{change_set_id}' succeeded:\n\t{describe_change_set_resp}"
                )
                break
            elif describe_change_set_resp['Status'] in ('FAILED', 'CANCELLED'):
                raise Exception(
                    f"ChangeSet '{change_set_id}' failed:\n\t{describe_change_set_resp}"
                )
            else:
                sleep(1)
                logger.info(f"Still waiting on ChangeSet {change_set_id}...")
        return describe_change_set_resp

    def process_s3_event(self, s3_event):
        source_bucket = s3_event['Records'][0]['s3']['bucket']['name']
        source_key = s3_event['Records'][0]['s3']['object']['key']
        source_prefix = '/'.join(source_key.split('/')[0:-1])
        s3_data_source = self._get_s3_data_source(source_bucket, source_prefix)
        obj_name = source_key.split('/')[-1]
        
        s3_data_source.validate_object_name(obj_name)
        
        for data_set in s3_data_source.lambda_s3_trigger_target_data_sets:
            # Create revision
            create_revision_resp = self._de_client.create_revision(
                DataSetId=data_set.id,
                Comment=obj_name
            )
            logger.debug(create_revision_resp)
            revision_id = create_revision_resp['Id']
            revision_arn = create_revision_resp['Arn']

            # Create job
            create_job_resp = self._de_client.create_job(
                Type='IMPORT_ASSETS_FROM_S3',
                Details={
                    'ImportAssetsFromS3': {
                      'AssetSources': [
                          {
                              'Bucket': source_bucket,
                              'Key': source_key
                          },
                      ],
                      'DataSetId': data_set.id,
                      'RevisionId': revision_id
                    }
                }
            )
            logger.debug(create_job_resp)

            # Start job
            job_id = create_job_resp['Id']
            start_job_resp = self._de_client.start_job(JobId=job_id)
            logger.debug(start_job_resp)

            # Wait for Data Exchange job completion
            get_job_resp = self._wait_for_de_job_completion(job_id)
            logger.debug(get_job_resp)

            # Finalize revision
            update_revision_resp = self._de_client.update_revision(
                DataSetId=data_set.id,
                RevisionId=revision_id,
                Finalized=True
            )
            logger.debug(update_revision_resp)

            # Ensure revision finalization succeeded
            finalized_status = update_revision_resp['Finalized']
            if finalized_status is not True:
                raise Exception(f"Failed to finalize revision:\n{update_revision_resp}")

            # Publish the new revision to each product associated with the data set
            for product in data_set.products:
                # Describe the AWS Marketplace entity corresponding to the Data Exchange product
                describe_entity_resp = self._mc_client.describe_entity(
                    Catalog='AWSMarketplace',
                    EntityId=product.id
                )
                logger.debug(describe_entity_resp)

                entity_type = describe_entity_resp['EntityType']
                entity_id = describe_entity_resp['EntityIdentifier']

                # Isolate the target data set in the DescribeEntity response
                describe_entity_resp_data_sets = json.loads(describe_entity_resp['Details'])['DataSets']
                describe_entity_resp_data_set = list(
                    filter(lambda ds: ds['DataSetArn'] == data_set.arn, describe_entity_resp_data_sets)
                )
                # We should get the data set of interest in describe_entity_resp and only that data set
                assert len(describe_entity_resp_data_set) == 1

                # Start a ChangeSet to add the newly finalized revision to an existing product
                start_change_set_resp = self._mc_client.start_change_set(
                    Catalog='AWSMarketplace',
                    ChangeSet=[
                        {
                            "ChangeType": "AddRevisions",
                            "Entity": {
                                "Identifier": entity_id,
                                "Type": entity_type
                            },
                            "Details": json.dumps({
                                "DataSetArn": data_set.arn,
                                "RevisionArns": [revision_arn]
                            })
                        }
                    ]
                )
                logger.debug(start_change_set_resp)

                # Wait for the ChangeSet workflow to complete
                change_set_id = start_change_set_resp['ChangeSetId']
                describe_change_set_resp = self._wait_for_mc_change_set_completion(change_set_id)
                logger.debug(describe_change_set_resp)

The following screenshot shows the S3 trigger for Job 3.
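
Expressed in code rather than the console, a prefix-scoped trigger pointing at a Lambda alias looks roughly like the following sketch; the bucket, prefix, and function ARN are placeholders.

import boto3

s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='my-data-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'Id': 'job3-binance-trades',
                # Alias ARN, so one underlying function can back many data sets.
                # The alias must also allow s3.amazonaws.com to invoke it
                # (lambda add_permission).
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:111122223333:function:job3:binance-trades',
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'data-exchange-ready/binance-trades/'}
                        ]
                    }
                }
            }
        ]
    }
)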

The following screenshot shows an example of CloudWatch logs for Job 3.

The following screenshot shows a CloudWatch alarm for Job 3.

Finally, we can verify that our revisions were successfully added to their corresponding data sets and products through the AWS console.

AWS Data Exchange allows you to create private offers for your AWS account IDs, providing a convenient means of checking that revisions show up in each product as expected.

Conclusion

This post demonstrated how you can integrate AWS Data Exchange into an existing data pipeline frictionlessly. We’re pleased to have been invited to participate in the AWS Data Exchange private preview, and even more pleased with the service itself, which has proven to be a sophisticated yet natural extension of our system.

I want to offer special thanks to both Kyle Patsen and Rafic Melhem of the AWS Data Exchange team for generously fielding my questions (and patiently enduring my ramblings) for the better part of the past year. I also want to thank Lucas Adams for helping me design the system discussed in this post and, more importantly, for his unwavering vote of confidence.

If you are interested in learning more about FPG, don’t hesitate to contact us.

 

Extract, Transform and Load data into S3 data lake using CTAS and INSERT INTO statements in Amazon Athena

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/extract-transform-and-load-data-into-s3-data-lake-using-ctas-and-insert-into-statements-in-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze the data stored in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. You can reduce your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. To learn more about best practices to boost query performance and reduce costs, see Top 10 Performance Tuning Tips for Amazon Athena.

Overview

This blog post discusses how to use Athena for extract, transform and load (ETL) jobs for data processing. This example optimizes the dataset for analytics by partitioning it and converting it to a columnar data format using Create Table as Select (CTAS) and INSERT INTO statements.

CTAS statements create new tables using standard SELECT queries to filter data as required. You can also partition the data, specify compression, and convert the data into columnar formats like Apache Parquet and Apache ORC using CTAS statements. As part of the execution, the resultant tables and partitions are added to the AWS Glue Data Catalog, making them immediately available for subsequent queries.

INSERT INTO statements insert new rows into a destination table based on a SELECT query that runs on a source table. If the source table’s underlying data is in CSV format and the destination table’s data is in Parquet format, INSERT INTO can transform and load the data into the destination table’s format. CTAS and INSERT INTO statements can be used together to perform an initial batch conversion of data as well as incremental updates to the existing table.

Here is an overview of the ETL steps to be followed in Athena for data conversion:

  1. Create a table on the original dataset.
  2. Use a CTAS statement to create a new table in which the format, compression, partition fields and location of the new table can be specified.
  3. Add more data into the table using an INSERT INTO statement.

This example uses a subset of NOAA Global Historical Climatology Network Daily (GHCN-D), a publicly available dataset on Amazon S3.

This subset of data is available at the following S3 location:

s3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/
Total objects: 41727 
Size of CSV dataset: 11.3 GB
Region: us-east-1

Procedure

Follow these steps to use Athena for an ETL job.

Create a table based on original dataset

The original data is in CSV format with no partitions in Amazon S3. The following files are stored at the Amazon S3 location:

2019-10-31 13:06:57  413.1 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0000
2019-10-31 13:06:57  412.0 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0001
2019-10-31 13:06:57   34.4 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0002
2019-10-31 13:06:57  412.2 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0100
2019-10-31 13:06:57  412.7 KiB artifacts/athena-ctas-insert-into-blog/2010.csv0101

Note that the file sizes are quite small. Merging them into larger files and reducing the total number of files leads to faster query execution. CTAS and INSERT INTO can help achieve this.

Execute the following queries in the Athena console (preferably in us-east-1 to avoid inter-Region Amazon S3 data transfer charges). First, create a database for this demo:

CREATE DATABASE blogdb

Now, create a table from the data above.

CREATE EXTERNAL TABLE `blogdb`.`original_csv` (
  `id` string, 
  `date` string, 
  `element` string, 
  `datavalue` bigint, 
  `mflag` string, 
  `qflag` string, 
  `sflag` string, 
  `obstime` bigint)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-bigdata-blog/artifacts/athena-ctas-insert-into-blog/'
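
You can also submit the statements in this walkthrough programmatically instead of using the console. The following minimal boto3 sketch assumes a placeholder results bucket; the remaining steps continue with the console for clarity.

import time

import boto3

athena = boto3.client('athena', region_name='us-east-1')
RESULTS_LOCATION = 's3://your-bucket/athena-query-results/'  # placeholder

def run_query(sql: str, database: str = 'blogdb') -> str:
    """Submit a statement to Athena and block until it finishes."""
    kwargs = {
        'QueryString': sql,
        'ResultConfiguration': {'OutputLocation': RESULTS_LOCATION},
    }
    if database:
        # Pass database=None for the CREATE DATABASE statement itself.
        kwargs['QueryExecutionContext'] = {'Database': database}
    qid = athena.start_query_execution(**kwargs)['QueryExecutionId']
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
        if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
            break
        time.sleep(2)
    if state != 'SUCCEEDED':
        raise RuntimeError(f'Query {qid} finished in state {state}')
    return qid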

Use CTAS to partition data and convert it into Parquet format with Snappy compression

Now, convert the data to Parquet format with Snappy compression and partition the data on a yearly basis. All these actions are performed using the CTAS statement. For the purpose of this blog, the initial table only includes data from 2015 to 2019. You can add new data to this table using the INSERT INTO command.

The table created in Step 1 has a date field with the date formatted as YYYYMMDD (e.g. 20100104). The new table is partitioned on year. Extract the year value from the date field using the Presto function substr(“date”,1,4).

CREATE table new_parquet
WITH (format='PARQUET', 
parquet_compression='SNAPPY', 
partitioned_by=array['year'], 
external_location = 's3://your-bucket/optimized-data/') 
AS
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) >= 2015
        AND cast(substr("date",1,4) AS bigint) <= 2019

Once the query succeeds, check the Amazon S3 location specified in the CTAS statement above. You should see the partitions, with Parquet files in each one, as shown in the following examples:

  1. Partitions:
    $ aws s3 ls s3://your-bucket/optimized-data/
                               PRE year=2015/
                               PRE year=2016/
                               PRE year=2017/
                               PRE year=2018/
                               PRE year=2019/

  2. Parquet files:
    $ aws s3 ls s3://your-bucket/optimized-data/ --recursive --human-readable | head -5
    
    2019-10-31 14:51:05    7.3 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_1be48df2-3154-438b-b61d-8fb23809679d
    2019-10-31 14:51:05    7.0 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_2a57f4e2-ffa0-4be3-9c3f-28b16d86ed5a
    2019-10-31 14:51:05    9.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_34381db1-00ca-4092-bd65-ab04e06dc799
    2019-10-31 14:51:05    7.5 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_354a2bc1-345f-4996-9073-096cb863308d
    2019-10-31 14:51:05    6.9 MiB optimized-data/year=2015/20191031_215021_00001_3f42d_42da4cfd-6e21-40a1-8152-0b902da385a1

Add more data into table using INSERT INTO statement

Now, add more data and partitions into the new table created above. The original dataset has data from 2010 to 2019. Since you added 2015 to 2019 using CTAS, add the rest of the data now using an INSERT INTO statement:

INSERT INTO new_parquet
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) < 2015

List the Amazon S3 location of the new table:

 $ aws s3 ls s3://your-bucket/optimized-data/
                           PRE year=2010/
                           PRE year=2011/
                           PRE year=2012/
                           PRE year=2013/
                           PRE year=2014/
                           PRE year=2015/
                           PRE year=2016/
                           PRE year=2017/
                           PRE year=2018/
                           PRE year=2019/ 

You can see that INSERT INTO is able to determine that “year” is a partition column and writes the data to Amazon S3 accordingly. There is also a significant reduction in the total size of the dataset thanks to compression and columnar storage in the Parquet format:

Size of dataset after conversion to Parquet with Snappy compression: 1.2 GB

You can also run INSERT INTO statements if more CSV data is added to original table. Assume you have new data for the year 2020 added to the original Amazon S3 dataset. In that case, you can run the following INSERT INTO statement to add this data and the relevant partition(s) to the new_parquet table:

INSERT INTO new_parquet
SELECT id,
         date,
         element,
         datavalue,
         mflag,
         qflag,
         sflag,
         obstime,
         substr("date",1,4) AS year
FROM original_csv
WHERE cast(substr("date",1,4) AS bigint) = 2020

Query the results

Now that you have transformed data, run some queries to see what you gained in terms of performance and cost optimization:

First, find the number of distinct IDs for every value of the year:

    1. Query on the original table:
      SELECT substr("date",1,4) as year, 
             COUNT(DISTINCT id) 
      FROM original_csv 
      GROUP BY 1 ORDER BY 1 DESC

    2. Query on the new table:
      SELECT year, 
        COUNT(DISTINCT id) 
      FROM new_parquet 
      GROUP BY  1 ORDER BY 1 DESC

      Table            Run time        Data scanned    Cost
      Original table   16.88 seconds   11.35 GB        $0.0567
      New table        3.79 seconds    428.05 MB       $0.002145
      Savings: 77.5% faster and 96.2% cheaper

       

Next, calculate the average maximum temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the Earth in 2018:

      1. Query on the original table:
        SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value
        FROM original_csv
        WHERE element IN ('TMIN', 'TMAX', 'PRCP') AND substr("date",1,4) = '2018'
        GROUP BY  1

      2. Query on the new table:
        SELECT element, round(avg(CAST(datavalue AS real)/10),2) AS value
        FROM new_parquet 
        WHERE element IN ('TMIN', 'TMAX', 'PRCP') and year = '2018'
        GROUP BY  1

        Table            Run time        Data scanned    Cost
        Original table   18.65 seconds   11.35 GB        $0.0567
        New table        1.92 seconds    68.08 MB        $0.000345
        Savings: 90% faster and 99.4% cheaper

         

Conclusion

This post showed you how to perform ETL operations using CTAS and INSERT INTO statements in Athena. You can perform the first set of transformations using a CTAS statement. When new data arrives, use an INSERT INTO statement to transform and load data to the table created by the CTAS statement. Using this approach, you converted data to the Parquet format with Snappy compression, converted a non-partitioned dataset to a partitioned dataset, reduced the overall size of the dataset and lowered the costs of running queries in Athena.

 


About the Author

 Pathik Shah is a big data architect for Amazon EMR at AWS.


Connect Amazon Athena to your Apache Hive Metastore and use user-defined functions

Post Syndicated from Janak Agarwal original https://aws.amazon.com/blogs/big-data/connect-amazon-athena-to-your-apache-hive-metastore-and-use-user-defined-functions/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This post details the two new preview features that you can start using today: connecting to Apache Hive Metastore and using user-defined functions.

Connecting Athena to your Apache Hive Metastore

Several customers use the Hive Metastore as a common metadata catalog for their big data environments. Such customers run Apache Spark, Presto, and Apache Hive on Amazon EC2 and Amazon EMR clusters with a self-hosted Hive Metastore as the common catalog. AWS also offers the AWS Glue Data Catalog, which is a fully managed catalog and drop-in replacement for the Hive Metastore. With this release, you can now use the Hive Metastore in addition to the Data Catalog with Athena, and you can connect Athena to multiple Hive Metastores alongside your existing Data Catalog.

To connect to a self-hosted Hive Metastore, you need a metastore connector. You can download a reference implementation of this connector, which runs as a Lambda function in your account. The current version of the implementation supports only SELECT queries, and DDL support is limited to basic metadata syntax. For more information, see the Considerations and Limitations for this feature. You can also write your own Hive Metastore connector, using the reference implementation as an example, deploy it as a Lambda function, and then use it with Athena. For more information about the feature, see the Using Athena Data Connector for External Hive Metastore (Preview) documentation.
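
Once a connector Lambda function is deployed, it can be registered with Athena as an additional catalog. The following boto3 sketch shows roughly what that looks like; the catalog name and Lambda ARN are placeholders, and the registration API shown here may differ from the preview workflow described in this post.

import boto3

athena = boto3.client('athena', region_name='us-east-1')
athena.create_data_catalog(
    Name='hms-catalog',  # placeholder catalog name
    Type='HIVE',
    Description='Self-hosted Hive Metastore exposed through a Lambda connector',
    Parameters={
        # Placeholder ARN of the deployed metastore connector function
        'metadata-function': 'arn:aws:lambda:us-east-1:111122223333:function:hms-connector'
    }
)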

Using user-defined functions in Athena

Athena also offers preview support for scalar user-defined functions (UDFs). UDFs enable you to write functions and invoke them in SQL queries. A scalar UDF is applied one row at a time and returns a single column value. Athena invokes your scalar UDF with batches of rows to limit the performance impact associated with making a remote call for the UDF itself.

With the latest release as of this writing, you can use the Athena Query Federation SDK to define your functions and invoke them inline in SQL queries. You can now compress and decompress row values, scrub personally identifiable information (PII) from columns, transform dates to a different format, read image metadata, and execute proprietary custom code in your queries. You can also execute UDFs in both the SELECT and FILTER phase of the query and invoke multiple UDFs in the same query.
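
As an illustration of how a UDF invocation might look from code, here is a minimal sketch submitted through the Athena API; the UDF name, Lambda function name, table, column, and results bucket are hypothetical, and the exact SQL syntax may differ between the preview and later releases.

import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Hypothetical UDF that scrubs PII from a string column
QUERY = """
USING EXTERNAL FUNCTION redact_pii(value VARCHAR) RETURNS VARCHAR LAMBDA 'my-udf-function'
SELECT redact_pii(customer_email) AS customer_email_redacted
FROM my_database.my_table
LIMIT 10
"""

athena.start_query_execution(
    QueryString=QUERY,
    WorkGroup='AmazonAthenaPreviewFunctionality',  # preview queries must use this workgroup (see next section)
    ResultConfiguration={'OutputLocation': 's3://your-bucket/athena-query-results/'}
)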

For more information about UDFs, see our documentation. For common UDF example implementations, see the GitHub repo. For more information about writing functions, see the Athena Query Federation SDK documentation.

Testing the preview features

All Athena queries originating from the workgroup AmazonAthenaPreviewFunctionality are considered Preview test queries.

Create a new workgroup AmazonAthenaPreviewFunctionality using Athena APIs or the Athena console. For more information, see Create a Workgroup.
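
For example, a minimal boto3 sketch for creating the workgroup might look like the following (the description text is arbitrary):

import boto3

# Preview functionality is available only in us-east-1
athena = boto3.client('athena', region_name='us-east-1')
athena.create_work_group(
    Name='AmazonAthenaPreviewFunctionality',
    Description='Workgroup for testing Athena preview features'
)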

The following considerations are important when using preview features.

Do not edit the workgroup name. You can edit other workgroup properties, such as enabling Amazon CloudWatch metrics and requester pays. You can use the Athena console, JDBC/ODBC drivers, or APIs to submit your test queries. Specify the workgroup AmazonAthenaPreviewFunctionality when you submit test queries.

Preview functionality is available only in the us-east-1 Region. If you use Athena in any other Region and submit queries using AmazonAthenaPreviewFunctionality, your query fails. Cross-Region calls are not supported in preview mode.

During the preview, you do not incur charges for the data scanned from federated data sources. However, you are charged standard Athena rates for data scanned from S3. Additionally, you are charged standard rates for the AWS services that you use with Athena, such as S3, AWS Lambda, AWS Glue, Amazon SageMaker, and AWS SAM. For example, you are charged S3 rates for storage, requests, and inter-Region data transfer. By default, query results are stored in an S3 bucket of your choice and are billed at standard S3 rates. If you use Lambda, you are charged based on the number of requests for your functions and the duration (the time it takes for your code to execute).

We do not recommend onboarding production workloads to AmazonAthenaPreviewFunctionality.

Query performance may vary between the preview workgroup and the other workgroups in your account. Additionally, new features and bug fixes may not be backwards compatible.

Summary

In summary, we introduced the two new Athena features released today in preview.

Customers who use the Apache Hive Metastore for metadata management and were previously unable to use Athena can now connect their Hive Metastore to Athena and run queries. Customers can also use the Athena Query Federation SDK to define their own functions and invoke them in SQL queries in Athena.

Both of these features are available in preview in the AWS us-east-1 Region. Begin your preview now by following the steps in the Athena FAQ. We welcome your feedback at [email protected]

 


About the Author

Janak Agarwal is a product manager for Athena at AWS.