Tag Archives: Analytics

How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime

Post Syndicated from Guillermo Menendez Corral original https://aws.amazon.com/blogs/big-data/how-to-migrate-from-ibm-netezza-to-amazon-redshift-with-no-downtime/

A large EMEA company recently decided to migrate their on-premises IBM Netezza data warehouse to Amazon Redshift. Given the volume of data (270TB uncompressed and more than 27K tables), the number of interconnected systems that made use of this data (more than 4.5K business processes), and the zero downtime requirements, we understood that the project would be a challenge. Since the company planned to decommission the data center where the data warehouse was deployed in less than a year’s time, there were also time constraints in place.

The data warehouse is a central piece for the company; it allows users across units to gather data and generate the daily reports required to run the business. In just a few years, the number of business units accessing the cluster grew almost 3x, with 5x the initial number of users executing 50x the number of daily queries the cluster had been designed for. The legacy data warehouse could no longer scale to cover the business's needs, resulting in nightly ETL processes overrunning their time windows and live queries taking too long.

The general dissatisfaction among the business users — along with the proximity of the data center decommissioning — moved the company to plan the migration, putting its IT department in charge of the definition of the new architecture, and the execution of the project.

Amazon Redshift, Amazon Web Services' (AWS) fast, scalable OLAP data warehouse that makes it simple and cost-effective to analyze all your data across your data warehouse and data lake, was the perfect fit to solve their problems. Not only does Amazon Redshift provide full elasticity for future growth and features such as concurrency scaling to cover high-demand peaks, it also offers a whole ecosystem of analytics services that integrate with it easily.

In this article, we explain how this customer performed a large-scale data warehouse migration from IBM Netezza to Amazon Redshift without downtime, by following a thoroughly planned migration process, and leveraging AWS Schema Conversion Tool (SCT) and Amazon Redshift best practices.

Preparing the migration

Large enterprise customers typically use data warehouse systems as a central repository for data analytics, aggregating operational data from heterogeneous transactional databases and running analytical workloads to serve analyst teams through business intelligence applications and reports. Using AWS, customers can benefit from the flexibility of having multiple compute resources processing different workloads, each workload scaling as the demand grows.

In this section, we describe the steps that we followed to prepare the migration of this data warehouse from IBM Netezza to Amazon Redshift.

Identifying workloads and dependencies

Customers typically have three different types of workloads running in their data warehouses:

  1. Batch processes: Long-running processes that require many resources and low concurrency, such as ETL jobs.
  2. Ad hoc queries: Short queries with high concurrency, such as analysts querying data.
  3. Business workloads: Typically mixed workloads, such as BI applications, reports, and dashboards.

In this customer’s case, they were building business data marts through complex ETL jobs, running statistical models, generating reports, and allowing analysts to run ad hoc queries. In essence, these applications are divided into two groups of workloads: batch and ad hoc queries. The on-premises platform was always saturated and struggling to deal with the level of concurrency demanded by these workloads while offering acceptable performance.

The following diagram shows the architecture that the customer had before the migration:

By using Amazon Redshift, the customer is able to fulfill the requirements of every analytical workload. Within the old data warehouse, shown in the above diagram, two different managed service providers managed two sets of independent data and workloads. For the new architecture, the decision was to split the data warehouse into two different Amazon Redshift clusters to serve those different workloads, as described in the following section. Within each of these clusters, the customer is able to manage resources and concurrency for different applications under the same workload by configuring Amazon Redshift Workload Management (WLM). A typical WLM setup is to match every workload with a queue, so in this case, each cluster had two queues configured: batch and ad hoc, each with a different number of slots and assigned memory.
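
As an illustration, such a two-queue setup can be applied through the cluster's parameter group. The following sketch uses the AWS SDK for Python; the parameter group name, slot counts, and memory percentages are assumptions for the example (not the customer's actual values), and the WLM JSON keys should be checked against the Amazon Redshift documentation:

import json
import boto3

redshift = boto3.client('redshift')

# Illustrative two-queue WLM configuration: a low-concurrency queue for batch
# ETL, a higher-concurrency queue for ad hoc queries, and the mandatory
# default queue at the end. All values are example assumptions.
wlm_config = [
    {"query_group": ["batch"], "query_concurrency": 5, "memory_percent_to_use": 60},
    {"query_group": ["adhoc"], "query_concurrency": 15, "memory_percent_to_use": 30},
    {"query_concurrency": 5},  # default queue
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName='dwh-wlm-params',  # hypothetical parameter group name
    Parameters=[{
        'ParameterName': 'wlm_json_configuration',
        'ParameterValue': json.dumps(wlm_config),
    }],
)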

Sizing the target architecture

For heterogeneous migrations, like this one, a comprehensive analysis should be performed on the source database, collecting enough data to design the new architecture that supports both data and applications.

The AWS Schema Conversion Tool was the perfect companion for this project, as the customer was able to automate the reports and generate an assessment that helped estimate the migration complexity for different objects, e.g. data types, UDFs, and stored procedures.

In a typical database migration, customers categorize large tables by number of rows. However, when migrating to columnar databases, such as Amazon Redshift, it is essential to also assess table width (that is, number of columns) from the very beginning. While columnar databases are generally more efficient than row-based databases in storing data, wide tables with few rows may have a negative impact on columnar databases. To estimate the minimum table size required for each table in Amazon Redshift, use this formula from the AWS Knowledge Center.
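
As a minimal sketch of that estimate, assuming the Knowledge Center formula of one 1 MB block per column (plus three hidden system columns), per populated slice, per table segment (sorted tables keep a sorted and an unsorted region, hence two segments):

def min_table_size_mb(user_columns, populated_slices, has_sort_key=False):
    """Rough lower bound for a table's footprint in Amazon Redshift (MB)."""
    block_size_mb = 1                    # Amazon Redshift stores data in 1 MB blocks
    segments = 2 if has_sort_key else 1  # sorted + unsorted regions
    return block_size_mb * (user_columns + 3) * populated_slices * segments

# Example: a 10-column table with a sort key on a cluster with 32 slices.
print(min_table_size_mb(user_columns=10, populated_slices=32, has_sort_key=True))  # 832 MB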

For this customer, there was a clear separation between core applications, and a large isolated business application with minimum data dependencies and a different set of users. As a result, one of the main architectural decisions was to use two different Amazon Redshift clusters:

  • Primary cluster: Holds the core schemas and most of the data, and serves most of the business applications. Due to high storage requirements and long batch processes that run here, the recommended Amazon Redshift node type for this cluster is the dense storage family.
  • Secondary cluster: Purpose-built cluster for a single application that demands high I/O. The recommended Amazon Redshift node type for this cluster is the dense compute family.

Planning the migration

There are several approaches when it comes to database migration. A must-have requirement for many migrations is to minimize downtime, which was the main driver for the migration pattern described in this post.

One of the challenges when migrating databases is keeping the data updated in both systems, capturing the changes on the source and applying them to the destination database during the migration. By definition, data warehouses shouldn't be used to run transactional (OLTP) workloads, but rather long-running ETL processes and analytical (OLAP) workloads. Usually, those ETL processes update data in batches, typically on a daily basis. This simplifies the migration: when the ETL processes that load data into the data warehouse run in parallel against both the source and the target system during the migration, change data capture (CDC) is not required.

The following image summarizes how we planned this migration with the customer, following a parallel approach to minimize downtime on their production data warehouse:

The main steps of this process are (1) data migration, (2) technical validation, (3) data sync, and (4) business validation, as described in the following section.

Running the migration

Full data migration

The initial data migration was the first milestone of the project. The main requirements for this phase were: (1) minimize the impact on the data source, and (2) transfer the data as fast as possible. To do this, AWS offers several options, depending on the size of the database, network performance (AWS Direct Connect or AWS Snowball), and whether the migration is heterogeneous or not (AWS Database Migration Service or AWS Schema Conversion Tool).

For this heterogeneous migration, the customer used the AWS Schema Conversion Tool (SCT). The SCT enabled them to run the data migration, provisioning multiple virtual machines in the same data center where IBM Netezza was installed, each running an AWS SCT Data Extractor agent. These data extractors are Java processes that connect directly to the source database and migrate data in chunks to the target database.

Sizing data extraction agents

To estimate the number of data extractor agents needed for the migration, consider this rule of thumb: One data extractor agent per 1 TB of compressed data on the source. Another recommendation is to install extraction agents on individual computers.

For each agent, consider the following general hardware requirements:

  • CPU: 4 cores. Lots of transformations and a large number of packets to process during data migration.
  • RAM: 16 GB. Data chunks are kept in memory before dumping to disk.
  • Disk: 100 GB / ~500 IOPS. Intermediate results are stored on disk.
  • Network: at least 1 Gbit (10 Gbit recommended). While provisioning the resources, it is recommended to reduce the number of network hops from the source to the AWS SCT data extraction agents.

Follow this documentation to go through the installation steps for the data extraction agents.

Depending on the size of the data to be migrated and the network speed, you can also run data extractor agents on Amazon EC2. For large data warehouses, to minimize downtime and optimize data transfer, it is recommended to deploy the data extractor agents as close as possible to the source. For example, in this migration project, 24 individual SCT extractor agents were installed in the on-premises data center for concurrent data extraction, speeding up the process. Due to the stress that these operations generate on the data source, every extraction phase was run during weekends and off-hours.

The diagram below depicts the migration architecture deployed during the data migration phases:

Creating data extraction tasks

The source tables were migrated in parallel, on a table-by-table basis, using the deployed SCT extraction agents. These extraction agents authenticate using a valid user on the data source, which allows adjusting the resources available to that user during the extraction. Data was processed locally by SCT agents and uploaded to S3 through the network (via AWS Direct Connect). Note that other migration scenarios might require the use of AWS Snowball devices. Check the Snowball documentation to decide which transfer method is better for your scenario.

As part of the analysis performed while planning the migration, the customer identified large tables, such as tables with more than 20 million rows or 1 TB in size. To extract data from those tables, they used the virtual partitioning feature of AWS SCT, creating several sub-tasks and parallelizing the data extraction process for each such table. We recommend creating two groups of tasks for each schema being migrated: one for small tables, and another for large tables using virtual partitions.

These tasks can be defined and created before running, so everything is ready for the migration window. Visit the following documentation to create, run, and monitor AWS SCT Data Extraction Tasks.

Technical validation

Once the initial extracted data was loaded to Amazon Redshift, data validation tests were performed in parallel, using validation scripts developed by the partner teams involved in the migration. The goal at this stage is to validate production workloads, comparing IBM Netezza and Amazon Redshift outputs from the same inputs.

Typical activities covered during this phase are the following (a scripted example follows the list):

  • Count the number of objects and rows in each table.
  • Compare the same random subset of data in both IBM Netezza and Amazon Redshift for all migrated tables, validating that the data is exactly the same row by row.
  • Check for incorrect column encodings.
  • Identify skewed table data.
  • Annotate queries not benefiting from sort keys.
  • Identify inappropriate join cardinality.
  • Deal with tables with large varchar columns.
  • Confirm that processes do not crash when connected to the target environment.
  • Validate daily batch job runs (job duration, number of rows processed).

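Several of these checks can be scripted. As an example, the following sketch compares row counts table by table over two generic DB-API connections; the actual validation scripts were developed by the partner teams involved in the migration and were more extensive:

def compare_row_counts(netezza_conn, redshift_conn, tables):
    """Return the tables whose row counts differ between the two systems.

    Both arguments are open DB-API 2.0 connections (for example, via
    pyodbc for IBM Netezza and psycopg2 for Amazon Redshift); the table
    names are assumed to come from a trusted, pre-built list.
    """
    mismatches = []
    for table in tables:
        counts = []
        for conn in (netezza_conn, redshift_conn):
            cursor = conn.cursor()
            cursor.execute('SELECT COUNT(*) FROM ' + table)
            counts.append(cursor.fetchone()[0])
            cursor.close()
        if counts[0] != counts[1]:
            mismatches.append((table, counts[0], counts[1]))
    return mismatches
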
You’ll find the right techniques to execute most of those activities in Top 10 Performance Tuning Techniques for Amazon Redshift.

Data synchronization

During this phase, the customer again migrated the tables and schemas that had lost synchronization with the source during the technical validation phase. Using the same mechanism described in the Full data migration section, and with the ETL processes that generate the data marts already running against the future system, the data was kept up to date after this synchronization phase.

Business validation

After the second data migration was successfully performed and the data movement was technically validated, the last remaining task was to involve the data warehouse users in the final validation. These users from different business units across the company accessed the data warehouse using a variety of tools and methods: JDBC/ODBC clients, Python code, PL/SQL procedures, custom applications, and so on. It was central to the migration to make sure that every end user had verified and adapted their processes to work seamlessly with Amazon Redshift before the final cut-over was performed.

This phase took around three months and consisted of several tasks:

  • Adapt business users’ tools, applications, and scripts to connect to Amazon Redshift endpoints.
  • Modify users' data load and dump procedures, replacing data movement to/from shared storage via ODBC/JDBC with COPY/UNLOAD operations from/to S3.
  • Modify any incompatible query, taking into account Amazon Redshift PostgreSQL implementation nuances.
  • Run business processes against both IBM Netezza and Amazon Redshift, comparing results and execution times, and report any issue or unexpected result to the team in charge of the migration so the case could be analyzed in detail.
  • Tune query performance, taking into account table sort keys and making extensive use of the EXPLAIN command in order to understand how Amazon Redshift plans and executes queries.

This business validation phase was key to have all end users aligned and ready for the final cut-over. Following Amazon Redshift best practices enabled end users to leverage the capabilities of their new data warehouse.

Soft cut-over

After all the migration and validation tasks had been performed, every ETL, business process, external system, and user tool was successfully connected and tested against Amazon Redshift.

This is when every process can be disconnected from the old data warehouse, which can be safely powered off and decommissioned.

Conclusion

In this blog post, we described the steps taken to perform a successful large-scale data warehouse migration from an on-premises IBM Netezza to Amazon Redshift. These same steps can be extrapolated to any other source data warehouse.

While this article describes a pure lift-and-shift migration, this is just the beginning of a transformation process towards a full-fledged corporate data lake. A series of next steps is necessary to gain full advantage of the powerful analytics tools and services provided by AWS:

  • Activate Amazon Redshift's concurrency scaling feature on interactive query queues so clusters scale seamlessly during high-usage periods without having to be provisioned for peak capacity.
  • Create a data lake in S3 and offload less frequently accessed data, keeping warm and hot data on the Amazon Redshift clusters for higher performance.
  • Leverage Amazon Redshift Spectrum to combine cold and hot data in analytic queries when the business requires it.
  • Use Amazon Athena to query cold data without affecting data warehouse performance.

It is worth pointing out several takeaways that we feel are central to achieving a successful large-scale migration to Amazon Redshift:

  • Start with a PoC to make an accurate initial sizing of the Amazon Redshift cluster.
  • Create a detailed migration plan, which includes a clear procedure for every affected system.
  • Have end users fully aligned with the migration process, and make sure all their processes are validated before the final cutover is performed.
  • Follow Amazon Redshift best practices and techniques to leverage its full capabilities and performance.
  • Engage with the AWS account team from early stages and throughout the whole process. They are the point of contact with AWS specialists, Professional Services, and partners in order to bring the migration project to a successful conclusion.

We hope you found this post useful. Please feel free to leave a comment or question.



About the Authors

Guillermo Menendez Corral is a solutions architect at Amazon Web Services. He has over 12 years of experience designing and building software applications and currently provides architectural guidance to AWS customers, with a focus on Analytics and Machine Learning.


Arturo Bayo is a big data consultant at Amazon Web Services. He promotes a data-driven culture in enterprise customers around EMEA, providing specialized guidance on business intelligence and data lake projects while working with AWS customers and partners to build innovative solutions around data and analytics.


Perform biomedical informatics without a database using MIMIC-III data and Amazon Athena

Post Syndicated from James Wiggins original https://aws.amazon.com/blogs/big-data/perform-biomedical-informatics-without-a-database-using-mimic-iii-data-and-amazon-athena/

Biomedical researchers require access to accurate, detailed data. The MIT MIMIC-III dataset is a popular resource. Using Amazon Athena, you can execute standard SQL queries against MIMIC-III without first loading the data into a database. Your analyses always reference the most recent version of the MIMIC-III dataset.

This post describes how to make the MIMIC-III dataset available in Athena and provide automated access to an analysis environment for MIMIC-III on AWS. We also compare a MIMIC-III reference bioinformatics study using a traditional database to that same study using Athena.

Overview

A dataset capturing a variety of measures longitudinally over time, across many patients, can drive analytics and machine learning toward research discovery and improved clinical decision-making. These features describe the MIT Laboratory of Computational Physiology (LCP) MIMIC-III dataset. In the words of LCP researchers:

“MIMIC-III is a large, publicly available database comprising de-identified health-related data associated with approximately 60K admissions of patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. …MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is publicly and freely available, it encompasses a diverse and large population of ICU patients, and it contains high temporal resolution data including lab results, electronic documentation, and bedside monitor trends and waveforms.”

Recently, the Registry of Open Data on AWS (RODA) program made the MIMIC-III dataset available through the AWS Cloud. You can now use the MIMIC-III dataset without having to download, copy, or pay to store it. Instead, you can analyze the MIMIC-III dataset in the AWS Cloud using AWS services like Amazon EC2, Athena, AWS Lambda, or Amazon EMR. AWS Cloud availability enables quicker and cheaper research into the dataset.

Services like Athena also offer you new analytical approaches to the MIMIC-III dataset. Using Athena, you can execute standard SQL queries against MIMIC-III without first loading the data into a database. Because you can reference the MIMIC-III dataset hosted in the RODA program, your analyses always reference the most recent version of the MIMIC-III dataset. Live hosting reduces upfront time and effort, eliminates data synchronization issues, improves data analysis, and reduces overall study costs.

Transforming MIMIC-III data

Historically, the MIMIC team distributed the MIMIC-III dataset in compressed (gzipped) CSV format. The choice of CSV format reflects the loading of MIMIC-III data into a traditional relational database for analysis.

In contrast, the MIMIC team provides the MIMIC-III dataset to the RODA program in the following formats:

  • The traditional CSV format
  • An Apache Parquet format optimized for modern data processing technologies such as Athena

Apache Parquet stores data by column, so queries that fetch specific columns can run without reading the whole table. This optimization helps improve the performance and lower the cost of many query types common to biomedical informatics.

To convert the original MIMIC-III CSV dataset to Apache Parquet, we created a data transformation job using AWS Glue. The MIMIC team schedules the AWS Glue job to run as needed, updating the Parquet files in the RODA program with any changes to the CSV dataset. These scheduled updates keep the Parquet files current without interfering with the MIMIC team’s CSV dataset creation procedures. Find the code for this AWS Glue job in the mimic-code GitHub repo. Use the same code as a basis to develop other CSV-to-Parquet conversions.

The code that converts the MIMIC-III NOTEEVENTS table offers a valuable resource. This table stores long medical notes made up of multiple lines, with commas, double-quotes, and other characters that can be challenging to transform. For example, you can use the following Spark read statement to interpret that CSV file:

df = spark.read.csv(
    's3://' + mimiccsvinputbucket + '/NOTEEVENTS.csv.gz',
    header=True,
    schema=schema,      # schema is defined earlier in the Glue job
    multiLine=True,     # notes span multiple lines
    quote='"',
    escape='"',         # embedded double-quotes are escaped by doubling
)
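
Once parsed, the job can write the DataFrame back out in Parquet for Athena to query efficiently. A sketch, where the output bucket variable is hypothetical (named by analogy with the input bucket above):

# Write the parsed notes as Parquet; mimicparquetoutputbucket is a placeholder.
df.write.mode('overwrite').parquet('s3://' + mimicparquetoutputbucket + '/NOTEEVENTS/')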

Executing the indwelling arterial catheter study using MIMIC-III

As an example, the MIMIC team provides code for the MIMIC indwelling arterial catheter (IAC) aline study. The study uses the MIMIC-III dataset to reproduce findings from the published study, The Association Between Indwelling Arterial Catheters and Mortality in Hemodynamically Stable Patients With Respiratory Failure.

You can obtain the aline study code, as well as many other analytic examples, in the MIT Laboratory of Computational Physiology MIMIC Code Repository in GitHub. Learn more about the provided code in the JAMA paper, The MIMIC Code Repository: enabling reproducibility in critical care research.

The aline study code comprises about 1400 lines of SQL. It was developed for a PostgreSQL database, and is executed and further analyzed by Python and R code. We ran the aline study against the Parquet MIMIC-III dataset using Athena instead of PostgreSQL to answer the following questions:

  • What modifications would be required, if any, to run the study using Athena, instead of PostgreSQL?
  • How would performance differ?
  • How would cost differ?

Some SQL features used in the study work differently in Athena than they do in PostgreSQL. As a result, about 5% of the SQL statements needed modification. Specifically, the SQL statements require the following types of modification for use with Athena:

  • Instead of using materialized views, in Athena, create a new table. For instance, the CREATE MATERIALIZED VIEW statement can be written as CREATE TABLE.

  • Instead of using schema context statements like SET SEARCH_PATH, in Athena, explicitly declare the schema of referenced tables. For instance, instead of having a SET SEARCH_PATH statement before your query and referencing tables without a schema as follows:

SET SEARCH_PATH TO mimiciii;

left join chartevents ce

You declare the schema as a part of every table reference:

left join mimiciii.chartevents ce

  • Epoch time extraction and timestamp arithmetic work differently in Athena than in PostgreSQL. Instead of arithmetic, use the date_diff() function and specify seconds as the unit. For instance, consider the following PostgreSQL expression:

extract(epoch from endtime-starttime)/24.0/60.0/60.0 

The preceding statement can be written as:

date_diff('second',starttime, endtime)/24.0/60.0/60.0

  • Athena uses the “double” data type instead of the “numeric” data type.

ROUND( cast(f.height_first as numeric), 2) AS height_first,

The preceding statement can be written as:

ROUND( cast(f.height_first as double), 2) AS height_first,

  • In Athena, empty VARCHAR fields are not detected as NULL. Instead, compare the VARCHAR to an empty string. For instance, consider the following filter:

where ce.value IS NOT NULL

If ce.value is a VARCHAR, it can be written as the following:

where ce.value <> ''

After making these minor edits to the aline study SQL statements, the study executes successfully using Athena instead of PostgreSQL. The output produced from both methods is identical.

Next, consider the speed of Athena compared to PostgreSQL in analyzing the aline study. As shown in the following graph, the aline study using Athena queries of the Parquet-formatted MIMIC-III dataset run 10 times faster than queries of the same data in an Amazon RDS PostgreSQL database.

Compare the costs of running the aline study in Athena to running it in PostgreSQL. RDS PostgreSQL databases bill in one-second increments for the time the database runs, while Athena queries bill per gigabyte of data each query scans. Because of this fundamental pricing difference, you must make some assumptions to compare costs.

You could assume that running the aline study in RDS PostgreSQL involves the following steps, taking up an eight-hour working day:

  1. Deploy a new RDS PostgreSQL database.
  2. Load the MIMIC-III dataset.
  3. Connect a Jupyter notebook.
  4. Run the aline study.
  5. Terminate the RDS database.

You'd then compare the cost of running an RDS PostgreSQL database for eight hours (at $0.37 per hour for a db.m5.xlarge with 100 GB of EBS storage, about $2.96) to the cost of running the aline study using Athena (9.93 GB of data scanned at $0.005/GB, about $0.05). As the following graph shows, this results in an almost 60x cost savings.

Working with MIMIC-III in AWS

MIMIC-III is a publicly available dataset containing detailed information about the clinical care of patients. For this reason, the MIMIC team requires that you complete a training course and submit a formal request before gaining access. For more information, see Requesting access.

After you obtain access to the MIMIC-III dataset, log in to the PhysioNet website and, under your profile settings, provide your AWS account ID. Then choose the request access link, under Files, on the MIMIC-III Clinical Database page. You then have access to the MIMIC-III dataset in AWS.

Choose Launch Stack below to begin working with the MIMIC-III dataset in your AWS account. This launches an AWS CloudFormation template, which deploys an index of the MIMIC-III Parquet-formatted dataset into your AWS Glue Data Catalog. It also deploys an Amazon SageMaker notebook instance pre-loaded with the aline study code and many other MIMIC-provided analytic examples.

After the AWS CloudFormation template deploys, choose Outputs, and follow the link to access Jupyter Notebooks. From there, you can open and execute the aline study code by browsing the filesystem to the path ./mimic-code/notebooks/aline-aws/aline-awsathena.ipynb.

After you've completed your analysis, you can stop your Jupyter notebook instance to suspend compute charges. Any time that you want to continue working, just start the instance and continue where you left off.
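
If you prefer to script this, the notebook instance can be stopped and started with the AWS SDK for Python; the instance name below is a placeholder for the one created by the CloudFormation stack:

import boto3

sagemaker = boto3.client('sagemaker')
notebook = 'mimic-iii-analysis'  # placeholder; use the name from your stack

# Stop the instance to suspend compute charges...
sagemaker.stop_notebook_instance(NotebookInstanceName=notebook)

# ...and start it again later to continue where you left off.
sagemaker.start_notebook_instance(NotebookInstanceName=notebook)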

Conclusion

The RODA program provides quick and inexpensive access to the global biomedical research resources of the MIMIC-III dataset. Cloud-native analytic tools like AWS Glue and Athena accelerate research, and groups like the MIT Laboratory of Computational Physiology are pioneering data availability in modern data formats like Apache Parquet.

The analytic approaches and code demonstrated in this post can be applied generally to help increase agility and decrease cost. Consider using them as you enhance existing or perform new biomedical informatics research.



About the Author


James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.



Alistair Johnson is a research scientist at the Massachusetts Institute of Technology. Alistair has extensive experience and expertise in working with ICU data, having published the MIMIC-III and eICU-CRD datasets. His current research focuses on clinical epidemiology and machine learning for decision support.


Creating custom Pinpoint dashboards using Amazon QuickSight, part 2

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/creating-custom-pinpoint-dashboards-using-amazon-quicksight-part-2/

Note: This post was written by Manan Nayar and Aprajita Arora, Software Development Engineers on the AWS Digital User Engagement team.


In our previous post, we discussed the process of visualizing specific, pre-aggregated Amazon Pinpoint metrics—such as delivery rate or open rate—using the Amazon Pinpoint Metrics APIs. In that example, we showed how to create a Lambda function that retrieves your metrics, and then make those metrics available for creating visualizations in Amazon QuickSight.

This post shows a different approach to exporting data from Amazon Pinpoint and using it to build visualizations. Rather than retrieving specific metrics, you can use the event stream feature in Amazon Pinpoint to export raw event data. You can use this data in Amazon QuickSight to create in-depth analyses of your data, as opposed to visualizing pre-calculated metrics. As an added benefit, when you use this solution, you don't have to modify any code, and the underlying data is updated every few minutes.

Step 1: Configure the event stream in Amazon Pinpoint

The Amazon Pinpoint event stream includes information about campaign events (such as campaign.send) and application events (such as session.start). It also includes response information related to all of the emails and SMS messages that you send from your Amazon Pinpoint project, regardless of whether they were sent from campaigns or on a transactional basis. When you enable event streams, Amazon Pinpoint automatically sends this data to your S3 bucket (via Amazon Kinesis Data Firehose) every few minutes.

To set up the event stream

  1. Sign in to the Amazon Pinpoint console at http://console.aws.amazon.com/pinpoint. In the list of projects, choose the project that you want to enable event streaming for.
  2. Under Settings, choose Event stream.
  3. Choose Edit, and then configure the event stream to use Amazon Kinesis Data Firehose. If you don’t already have a Kinesis Data Firehose stream, follow the link to create one in the Kinesis console. Configure the stream to send data to an S3 bucket. For more information about creating streams, see Creating an Amazon Kinesis Data Firehose Delivery Stream.
  4. Under IAM role, choose Automatically create a role. Choose Save.
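
If you prefer to configure the event stream programmatically, the same setup can be expressed with the AWS SDK for Python. The application ID and ARNs below are placeholders:

import boto3

pinpoint = boto3.client('pinpoint')

# Point the project's event stream at an existing Kinesis Data Firehose
# delivery stream; all identifiers here are illustrative placeholders.
pinpoint.put_event_stream(
    ApplicationId='yourPinpointProjectId',
    WriteEventStream={
        'DestinationStreamArn': 'arn:aws:firehose:us-east-1:111122223333:deliverystream/pinpoint-events',
        'RoleArn': 'arn:aws:iam::111122223333:role/pinpoint-event-stream-role',
    },
)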

Step 2: Add a data set in Amazon QuickSight

Now that you’ve started streaming your Amazon Pinpoint data to S3, you can set Amazon QuickSight to look for data in the S3 bucket. You connect QuickSight to sources of data by creating data sets.

To create a data set

    1. In a text editor, create a new file. Paste the following code:
      {
          "fileLocations": [
              {
                  "URIPrefixes": [ 
                      "s3://<bucketName>/"          
                  ]
              }
          ],
          "globalUploadSettings": {
              "format": "JSON"
          }
      }

      In the preceding code, replace <bucketName> with the name of the S3 bucket that you’re using to store the event stream data. Save the file as manifest.json.

    2. Sign in to the QuickSight console at https://quicksight.aws.amazon.com.
    3. Create a new S3 data set. When prompted, choose the manifest file that you created in step 1. For more information about creating S3 data sets, see Creating a Data Set Using Amazon S3 Files in the Amazon QuickSight User Guide.
    4. Create a new analysis. From here, you can start creating visualizations of your data. To learn more, see Creating an Analysis in the Amazon QuickSight User Guide.

Step 3: Set the refresh rate for the data set

You can configure your data sets in Amazon QuickSight to automatically refresh on a certain schedule. In this section, you configure the data set to refresh every day, one minute before midnight.

To set the refresh schedule

  1. Go to the QuickSight start page at https://quicksight.aws.amazon.com/sn/start. Choose Manage data.
  2. Choose the data set that you created in the previous section.
  3. Choose Schedule refresh. Follow the prompts to set up a daily refresh schedule.

Step 4: Create your visualizations

From this point, you can start creating visualizations of your data. To learn more about creating visualizations, see Creating an Analysis in the Amazon QuickSight User Guide.

Build, secure, and manage data lakes with AWS Lake Formation

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/building-securing-and-managing-data-lakes-with-aws-lake-formation/

A data lake is a centralized store of a variety of data types for analysis by multiple analytics approaches and groups. Many organizations are moving their data into a data lake. In this post, I explore how you can use AWS Lake Formation to build, secure, and manage data lakes.

Traditionally, organizations have kept data in a rigid, single-purpose system, such as an on-premises data warehouse appliance. Similarly, they have analyzed data using a single method, such as predefined BI reports. Moving data between databases or for use with different approaches, like machine learning (ML) or ad hoc SQL querying, required "extract, transform, load" (ETL) processing before analysis. At best, these traditional methods have created inefficiencies and delays. At worst, they have complicated security.

By contrast, cloud-based data lakes open structured and unstructured data for more flexible analysis. Any amount of data can be aggregated, organized, prepared, and secured by IT staff in advance. Analysts and data scientists can then access it in place with the analytics tools of their choice, in compliance with appropriate usage policies.

Data lakes let you combine analytics methods, offering valuable insights unavailable through traditional data storage and analysis. In a retail scenario, ML methods discovered detailed customer profiles and cohorts on non-personally identifiable data gathered from web browsing behavior, purchase history, support records, and even social media. The exercise showed the deployment of ML models on real-time, streaming, interactive customer data.

Such models could analyze shopping baskets and serve up “next best offers” in the moment, or deliver instant promotional incentives. Marketing and support staff could explore customer profitability and satisfaction in real time and define new tactics to improve sales. Around a data lake, combined analytics techniques like these can unify diverse data streams, providing insights unobtainable from siloed data.

The challenges of building data lakes

Unfortunately, the complex and time-consuming process for building, securing, and starting to manage a data lake often takes months. Even building a data lake in the cloud requires many manual and time-consuming steps:

  • Setting up storage.
  • Moving, cleaning, preparing, and cataloging data.
  • Configuring and enforcing security policies for each service.
  • Manually granting access to users.

You want data lakes to centralize data for processing and analysis with multiple services. But organizing and securing the environment requires patience.

Currently, IT staff and architects spend too much time creating the data lake, configuring security, and responding to data requests. They could spend this time acting as curators of data resources, or as advisors to analysts and data scientists. Analysts and data scientists must wait for access to needed data throughout the setup.

The following diagram shows the data lake setup process:

Setting up storage

Data lakes hold massive amounts of data. Before doing anything else, you must set up storage to hold all that data. If you are using AWS, configure Amazon S3 buckets and partitions. If you are building the data lake on premises, acquire hardware and set up large disk arrays to store all the data.

Moving data

Connect to different data sources — on-premises and in the cloud — then collect data from IoT devices. Next, collect and organize the relevant datasets from those sources, crawl the data to extract the schemas, and add metadata tags to the catalog. You can use a collection of file transfer and ETL tools to do this.

Cleaning and preparing data

Next, collected data must be carefully partitioned, indexed, and transformed to columnar formats to optimize for performance and cost. You must clean, de-duplicate, and match related records.

Today, organizations accomplish these tasks using rigid and complex SQL statements that perform unreliably and are difficult to maintain. This complex process of collecting, cleaning, and transforming the incoming data requires manual monitoring to avoid errors. Many customers use AWS Glue for this task.

Configuring and enforcing policies

Customers and regulators require that organizations secure sensitive data. Compliance involves creating and applying data access, protection, and compliance policies. For example, you restrict access to personally identifiable information (PII) at the table, column, or row level, encrypt all data, and keep audit logs of who is accessing the data.

Today, you can secure data using access control lists on S3 buckets or third-party encryption and access control software. You create and maintain data access, protection, and compliance policies for each analytics service requiring access to the data. For example, if you are running analysis against your data lake using Amazon Redshift and Amazon Athena, you must set up access control rules for each of these services.

Many customers use AWS Glue Data Catalog resource policies to configure and control metadata access to their data. Some choose to use Apache Ranger. But these approaches can be painful and limiting. S3 policies provide at best table-level access. And you must maintain data and metadata policies separately. With Apache Ranger, you can configure metadata access to only one cluster at a time. Also, policies can become wordy as the number of users and teams accessing the data lake grows within an organization.

Making it easy to find data

Users with different needs, like analysts and data scientists, may struggle to find and trust relevant datasets in the data lake. To make it easy for users to find relevant and trusted data, you must clearly label the data in a data lake catalog. Provide users with the ability to access and analyze this data without making requests to IT.

Today, each of these steps involves a lot of manual work. Customer labor includes building data access and transformation workflows, mapping security and policy settings, and configuring tools and services for data movement, storage, cataloging, security, analytics, and ML. With all these steps, a fully productive data lake can take months to implement.

The wide range of AWS services provides all the building blocks of a data lake, including many choices for storage, computing, analytics, and security. In the nearly 13 years that AWS has been operating Amazon S3 with exabytes of data, it's also become the clear first choice for data lakes. AWS Glue adds a data catalog and serverless transformation capabilities. Amazon EMR brings managed big data processing frameworks like Apache Spark and Apache Hadoop. Amazon Redshift Spectrum offers data warehouse functions directly on data in Amazon S3. Athena brings serverless SQL querying.

With all these services available, customers have been building data lakes on AWS for years. AWS runs over 10,000 data lakes on top of S3, many using AWS Glue for the shared AWS Glue Data Catalog and data processing with Apache Spark.

AWS has learned from the thousands of customers running analytics on AWS that most customers who want to do analytics also want to build a data lake. But many of you want this process to be easier and faster than it is today.

AWS Lake Formation (now generally available)

At AWS re:Invent 2018, AWS introduced Lake Formation: a new managed service to help you build a secure data lake in days. If you missed it, watch Andy Jassy’s keynote announcement. Lake Formation has several advantages:

  • Identify, ingest, clean, and transform data: With Lake Formation, you can move, store, catalog, and clean your data faster.
  • Enforce security policies across multiple services: After your data sources are set up, you then define security, governance, and auditing policies in one place, and enforce those policies for all users and all applications.
  • Gain and manage new insights: With Lake Formation, you build a data catalog that describes available datasets and their appropriate business uses. This catalog makes your users more productive by helping them find the right dataset to analyze.

The following screenshot illustrates Lake Formation and its capabilities.

How to create a data lake

S3 forms the storage layer for Lake Formation. If you already use S3, you typically begin by registering existing S3 buckets that contain your data. Lake Formation can also create new buckets for the data lake and import data into them. AWS always stores this data in your account, and only you have direct access to it.

There is no lock-in to Lake Formation for your data. Because AWS stores data in standard formats like CSV, ORC, or Parquet, it can be used with a wide variety of AWS or third-party analytics tools.

Lake Formation also optimizes the partitioning of data in S3 to improve performance and reduce costs. The raw data you load may reside in partitions that are too small (requiring extra reads) or too large (reading more data than needed). Lake Formation organizes your data by size, time, or relevant keys to allow fast scans and parallel, distributed reads for the most commonly used queries.

How to load data and catalog metadata

Lake Formation uses the concept of blueprints for loading and cataloging data. You can run blueprints one time for an initial load or set them up to be incremental, adding new data and making it available.

With Lake Formation, you can import data from MySQL, Postgres, SQL Server, MariaDB, and Oracle databases running in Amazon RDS or hosted in Amazon EC2. You can also import from on-premises databases by connecting with Java Database Connectivity (JDBC).

Point Lake Formation to the data source, identify the location to load it into the data lake, and specify how often to load it. Blueprints discover the source table schema, automatically convert data to the target data format, partition the data based on the partitioning schema, and track data that was already processed. All these actions can be customized.

Blueprints rely on AWS Glue as a support service. AWS Glue crawlers connect to and discover the raw data to be ingested. AWS Glue code generation and jobs generate the ingest code to bring that data into the data lake. Lake Formation uses the same data catalog for organizing the metadata. AWS Glue stitches together crawlers and jobs and allows for monitoring of individual workflows. In these ways, Lake Formation is a natural extension of AWS Glue capabilities.

The following graphics show the Blueprint Workflow and Import screens:

How to transform and prepare data for analysis

In addition to supporting all the same ETL capabilities as AWS Glue, Lake Formation introduces new Amazon ML Transforms. This feature includes a fuzzy logic blocking algorithm that can de-duplicate 400M+ records in less than 2.5 hours, orders of magnitude better than earlier approaches.

To match and de-duplicate your data using Amazon ML Transforms, first merge related datasets. Amazon ML Transforms divides these sets into training and testing samples, then scans for exact and fuzzy matches. You can provide more data and examples for greater accuracy, putting these into production to process new data as it arrives in your data lake. The partitioning algorithm requires minimal tuning. The confidence level reflects the quality of the grouping, improving on earlier, more improvised algorithms. The following diagram shows this matching and de-duplicating workflow.

Amazon.com is currently using and vetting Amazon ML Transforms internally, at scale, for retail workloads. Lake Formation now makes these algorithms available to customers, so you can avoid the frustration of creating complex and fragile SQL statements to handle record matching and de-duplication. Amazon ML Transforms help improve data quality before analysis. For more information, see Fuzzy Matching and Deduplicating Data with Amazon ML Transforms for AWS Lake Formation.

How to set access control permissions

Lake Formation lets you define policies and control data access with simple "grant and revoke" permissions on data at granular levels. You can assign permissions to IAM users, roles, groups, and Active Directory users using federation. You specify permissions on catalog objects (like tables and columns) rather than on buckets and objects.

You can easily view and audit all the data policies granted to a user—in one place. Search and view the permissions granted to a user, role, or group through the dashboard; verify permissions granted; and when necessary, easily revoke policies for a user. The following screenshots show the Grant permissions console:

How to make data available for analytics

Lake Formation offers unified, text-based, faceted search across all metadata, giving users self-serve access to the catalog of datasets available for analysis. This catalog includes discovered schemas (as discussed previously) and lets you add attributes like data owners, stewards, and other business-specific attributes as table properties.

At a more granular level, you can also add data sensitivity level, column definitions, and other attributes as column properties. You can explore data by any of these properties. But access is subject to user permissions. See the following screenshot of the AWS Glue tables tab:

How to monitor activity

With Lake Formation, you can also see detailed alerts in the dashboard, and then download audit logs for further analytics.

Amazon CloudWatch publishes all data ingestion events and catalog notifications. In this way, you can identify suspicious behavior or demonstrate compliance with rules.

To monitor and control access using Lake Formation, first define the access policies, as described previously. Users who want to conduct analysis access data directly through an AWS analytics service, such as Amazon EMR for Spark, Amazon Redshift, or Athena. Or, they access data indirectly with Amazon QuickSight or Amazon SageMaker.

A service forwards the user credentials to Lake Formation for the validation of access permissions. Then Lake Formation returns temporary credentials granting access to the data in S3, as shown in the following diagrams. After a user gains access, actual reads and writes of data operate directly between the analytics service and S3. This approach removes the need for an intermediary in the critical data-processing path.

The following screenshot and diagram show how to monitor and control access using Lake Formation.

Conclusion

With just a few steps, you can set up your data lake on S3 and start ingesting data that is readily queryable. To get started, go to the Lake Formation console and add your data sources. Lake Formation crawls those sources and moves the data into your new S3 data lake.

Lake Formation can automatically lay out the data in S3 partitions; change it into formats for faster analytics, like Apache Parquet and ORC; and increase data quality through machine-learned record matching and de-duplication.

From a single dashboard, you can set up all the permissions for your data lake. Those permissions are implemented for every service accessing this data – including analytics and ML services (Amazon Redshift, Athena, and Amazon EMR for Apache Spark workloads). Lake Formation saves you the hassle of redefining policies across multiple services and provides consistent enforcement of and compliance with those policies.

Learn how to start using AWS Lake Formation.


About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at AWS. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.


Prajakta Damle is a Principal Product Manager at Amazon Web Services.


AWS Lake Formation – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-now-generally-available/

As soon as companies started to have data in digital format, it became possible for them to build a data warehouse, collecting data from their operational systems, such as customer relationship management (CRM) and enterprise resource planning (ERP) systems, and to use this information to support their business decisions.

The reduction in costs of storage, together with an even greater reduction in complexity for managing large quantities of data, made possible by services such as Amazon S3, has allowed companies to retain more information, including raw data that is not structured, such as logs, images, video, and scanned documents.

This is the idea of a data lake: to store all your data in one, centralized repository, at any scale. We are seeing this approach with customers like Netflix, Zillow, NASDAQ, Yelp, iRobot, FINRA, and Lyft. They can run their analytics on this larger dataset, from simple aggregations to complex machine learning algorithms, to better discover patterns in their data and understand their business.

Last year at re:Invent we introduced in preview AWS Lake Formation, a service that makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analytics and machine learning. I am happy to share that Lake Formation is generally available today!

With Lake Formation you have a central console to manage your data lake, for example to configure the jobs that move data from multiple sources, such as databases and logs, to your data lake. Having such a large and diversified amount of data makes configuring the right access permission also critical. You can secure access to metadata in the Glue Data Catalog and data stored in S3 using a single set of granular data access policies defined in Lake Formation. These policies allow you to define table and column-level data access.

One of the things I like most about Lake Formation is that it works with your data already in S3! You can easily register your existing data with Lake Formation, and you don't need to change the existing processes loading your data to S3. Since data remains in your account, you have full control.

You can also use Glue ML Transforms to easily deduplicate your data. Deduplication is important not only to reduce the amount of storage you need, but also to make analyzing your data more efficient, because you avoid both the overhead and the possible confusion of looking at the same data twice. This problem is trivial if duplicate records can be identified by a unique key, but becomes very challenging when you have to do a "fuzzy match". A similar approach can be used for record linkage, that is, when you are looking for similar items in different tables, for example to do a "fuzzy join" of two databases that do not share a unique key.

In this way, implementing a data lake from scratch is much faster, and managing a data lake is much easier, making these technologies available to more customers.

Creating a Data Lake
Let’s build a data lake using the Lake Formation console. First I register the S3 buckets that are going to be part of my data lake. Then I create a database and grant permission to the IAM users and roles that I am going to use to manage my data lake. The database is registered in the Glue Data Catalog and holds the metadata required to analyze the raw data, such as the structure of the tables that are going to be automatically generated during data ingestion.

Managing permissions is one of the most complex tasks for a data lake. Consider for example the huge amount of data that can be part of it, the sensitive, mission-critical nature of some of the data, and the different structured, semi-structured, and unstructured formats in which data can reside. Lake Formation makes it easier with a central location where you can give IAM users, roles, groups, and Active Directory users (via federation) access to databases, tables, optionally allowing or denying access to specific columns within a table.

To simplify data ingestion, I can use blueprints that create the necessary workflows, crawlers and jobs on AWS Glue for common use cases. Workflows enable orchestration of your data loading workloads by building dependencies between Glue entities, such as triggers, crawlers and jobs, and allow you to track visually the status of the different nodes in the workflows on the console, making it easier to monitor progress and troubleshoot issues.

Database blueprints help load data from operational databases. For example, if you have an e-commerce website, you can ingest all your orders in your data lake. You can load a full snapshot from an existing database, or incrementally load new data. In case of an incremental load, you can select a table and one or more of its columns as bookmark keys (for example, a timestamp in your orders) to determine previously imported data.

Log file blueprints simplify ingesting logging formats used by Application Load Balancers, Elastic Load Balancers, and AWS CloudTrail. Let’s see how that works more in depth.

Security is always a top priority, and I want to be able to have a forensic log of all management operations across my account, so I choose the CloudTrail blueprint. As source, I select a trail collecting my CloudTrail logs from all regions into an S3 bucket. In this way, I'll be able to query account activity across all my AWS infrastructure. This works similarly for a larger organization having multiple AWS accounts: when configuring the trail in the CloudTrail console, they just need to apply the trail to their whole organization.

I then select the target database, and the S3 location for the data lake. As data format I use Parquet, a columnar storage format that will make querying the data faster and cheaper. The import frequency can be hourly to monthly, with the option to choose the day of the week and the time. For now, I want to run the workflow on demand. I can do that from the console or programmatically, for example using any AWS SDK or the AWS Command Line Interface (CLI).
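
Because blueprints materialize as AWS Glue workflows, an on-demand run is a single SDK call. A sketch with the AWS SDK for Python, using a placeholder workflow name:

import boto3

glue = boto3.client('glue')

# Start an on-demand run of the workflow created by the blueprint.
run = glue.start_workflow_run(Name='cloudtrail-ingest-workflow')  # placeholder name
print(run['RunId'])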

Finally, I give the workflow a name, the IAM role to use during execution, and a prefix for the tables that will be automatically created by this workflow.

I start the workflow from the Lake Formation console and select to view the workflow graph. This opens the AWS Glue console, where I can visually see the steps of the workflow and monitor the progress of this run.

When the workflow is completed a new table is available in my data lake database. The source data remain as logs in the S3 bucket output of CloudTrail, but now I have them consolidated, in Parquet format and partitioned by date, in my data lake S3 location. To optimize costs, I can set up an S3 lifecycle policy that automatically expires data in the source S3 bucket after a safe amount of time has passed.
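
A minimal sketch of such a lifecycle rule with the AWS SDK for Python, assuming a placeholder bucket name, prefix, and a 90-day retention window:

import boto3

s3 = boto3.client('s3')

# Expire raw CloudTrail logs 90 days after creation; bucket, prefix,
# and retention period are illustrative assumptions.
s3.put_bucket_lifecycle_configuration(
    Bucket='my-cloudtrail-logs',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-raw-trail-logs',
            'Filter': {'Prefix': 'AWSLogs/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 90},
        }],
    },
)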

Securing Access to the Data Lake
Lake Formation provides secure and granular access to data stores in the data lake, via a new grant/revoke permissions model that augments IAM policies. It is simple to set up these permissions, for example using the console:

I simply select the IAM user or role I want to grant access to. Then I select the database and optionally the tables and the columns I want to provide access to. It is also possible to select which type of access to provide. For this demo, simple select permissions are sufficient.
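
The same grant can also be expressed with the AWS CLI; this is a sketch in which the account ID, database, table, and column names are illustrative placeholders:

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/limitedview \
  --permissions "SELECT" \
  --resource '{
    "TableWithColumns": {
      "DatabaseName": "mylake",
      "Name": "my_trail_cloudtrail",
      "ColumnNames": ["eventtime", "eventsource", "sourceipaddress"]
    }
  }'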

Accessing the Data Lake
Now I can query the data using tools like Amazon Athena or Amazon Redshift. For example, I open the query editor in the Athena console. First, I want to use my new data lake to look into which source IP addresses are most common in my AWS Account activity:

SELECT sourceipaddress, count(*)
FROM my_trail_cloudtrail
GROUP BY  sourceipaddress
ORDER BY  2 DESC;

Looking at the result of the query, you can see which AWS API endpoints I use the most. Then, I'd like to check which user identity types are used. That information is stored in JSON format inside one of the columns. I can use some of the JSON functions available in Amazon Athena to get that information in my SQL statements:

SELECT json_extract_scalar(useridentity, '$.type'), count(*)
FROM "mylake"."my_trail_cloudtrail"
GROUP BY  json_extract_scalar(useridentity, '$.type')
ORDER BY  2 DESC;

Most of the time, AWS services are the ones generating activity in my trail. These queries are just an example, but they quickly give me deeper insight into what is happening in my AWS account.

Think of what a similar impact could mean for your business! Using database and log file blueprints, you can quickly create workflows to ingest data from multiple sources within your organization, set the right column-level permissions for who can access the information collected, clean and prepare your data using machine learning transforms, and correlate and visualize the information using tools like Amazon Athena, Amazon Redshift, and Amazon QuickSight.

Customizing Data Access with Column-Level Permissions
To follow data privacy guidelines and compliance requirements, the mission-critical data stored in a data lake often requires custom views for different stakeholders inside the company. Let's compare the visibility of two IAM users in my AWS account: one that has full permissions on a table, and one that has select access only to a subset of the columns of the same table.

I already have a user with full access to the table containing my CloudTrail data, it’s called danilop. I create a new limitedview IAM user and I give it access to the Athena console. In the Lake Formation console, I only give this new user select permissions on three of the columns.

To verify the different access to the data in the table, I log in with one user at a time and go to the Athena console. On the left I can explore which tables and columns the logged-in user can see in the Glue Data Catalog. Here’s a comparison for the two users, side-by-side:

The limited user has access only to the three columns that I explicitly configured, and to the four columns used for partitioning the table, whose access is required to see any data. When I query the table in the Athena console with a select * SQL statement, logged in as the limitedview user, I only see data from those seven columns:
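
For example, a query along these lines (run as the limitedview user in Athena) returns only the permitted columns, even though it asks for everything:

SELECT * FROM "mylake"."my_trail_cloudtrail" LIMIT 10;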

Available Now
There is no additional cost for using AWS Lake Formation; you pay for the use of the underlying services such as Amazon S3 and AWS Glue. One of the core benefits of Lake Formation is the security policies it introduces. Previously, you had to use separate policies to secure data and metadata access, and those policies only allowed table-level access. Now you can give each user access, from a central location, to only the columns they need to use.

AWS Lake Formation is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Redshift integration with Lake Formation requires Redshift cluster version 1.0.8610 or higher; your clusters should have been automatically updated by the time you read this. Support for Apache Spark with Amazon EMR will follow over the next few months.

I only scratched the surface of what you can do with Lake Formation. Building and managing a data lake for your business is now much easier; let me know how you are using these new capabilities!

Danilo

Setting alerts in Amazon Elasticsearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/setting-alerts-in-amazon-elasticsearch-service/

Customers often use Amazon Elasticsearch Service for log analytics. Amazon ES lets you collect logs from your infrastructure, transform each log line into a JSON document, and send those documents to the bulk API.

A transformed log line contains many fields, each containing values. For instance, an Apache web log line includes a source IP address field, a request URL field, and a status code field (among others). Many users build dashboards—using Kibana—to monitor their infrastructure visually, surfacing application usage, bugs, or security problems evident from the data in those fields. For example, you can graph the count of HTTP 5xx status codes, then watch and react to changes. If you see a sudden jump in 5xx codes, you likely have a server issue. But with this system, you must monitor Kibana manually.

On April 8, Amazon ES launched support for event monitoring and alerting. To use this feature, you work with monitors—scheduled jobs—that have triggers, which are specific conditions that you set, telling the monitor when it should send an alert. An alert is a notification that the triggering condition occurred. When a trigger fires, the monitor takes action, sending a message to your destination.

This post uses a simulated IoT device farm to generate and send data to Amazon ES.

Simulator overview

This simulation consists of two important parts: sensors and devices.

Sensors

The core class for the simulator is the sensor. Devices have sensors that simulate different patterns of floating-point values. When called, each sensor’s report method updates and returns the value of its sensor. There are several subclasses for Sensor:

  • SineSensor: Produces a sine wave, based on the current timestamp.
  • ConstSensor: Produces a constant value. The class includes a random “fuzz” factor to drift around a particular value.
  • DriftingSensor: Allows continuous, random drift with a starting value.
  • MonotonicSensor: Increments its value by a constant delta, with random fuzz.

For this post, I used MonotonicSensor, whose value constantly increases, to force a breach in an alert that I set up.

You can identify a sensor by a universally unique identifier (UUID) and a label for the metric that it tracks. The report function for the sensor class returns a timestamp, the UUID of the sensor, the metric label, and the metric’s value at that instant.
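
The full simulator is in the GitHub repository linked below; as a rough sketch of the idea (the class and field names here are illustrative, not necessarily the repository's exact code), a base sensor and the monotonic variant might look like this:

import random
import time
import uuid


class Sensor:
    """Base class: tracks an ID, a metric label, and a current value."""

    def __init__(self, label, value=0.0):
        self.sensor_id = str(uuid.uuid4())
        self.label = label
        self.value = value

    def update(self):
        # Subclasses change self.value here.
        pass

    def report(self):
        # Update the value, then return a single reading.
        self.update()
        return {
            "timestamp": time.time(),
            "sensor_id": self.sensor_id,
            "metric": self.label,
            "value": self.value,
        }


class MonotonicSensor(Sensor):
    """Increments its value by a constant delta, with random fuzz."""

    def __init__(self, label, start=50.0, delta=1.0, fuzz=0.5):
        super().__init__(label, start)
        self.delta = delta
        self.fuzz = fuzz

    def update(self):
        self.value += self.delta + random.uniform(-self.fuzz, self.fuzz)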

Devices

Devices are collections of sensors. For this post, I created a collection of devices that simulate IoT devices in a field, measuring temperature and humidity and reporting the device's CPU utilization. Each device has a report method that recursively calls the report methods of all its sensors, returning a collection of the sensor reports. I made the code available in the Open Distro for Elasticsearch sample code repository on GitHub.

I set the CPU sensor of one device to drift constantly upward, simulating a problem in the device. You can see the intended “bad behavior” in the following line graph:

In the next sections, I set up an alert at 90% CPU so that I can catch and correct the situation.

Prerequisites

To follow along with this solution, you need an AWS account. Set up your own Amazon ES domain, to form the basis of your monitors and alerts.

Step 1: Set up your destination

When you create alerts in Amazon ES, you assign one or more destinations. A destination is a delivery channel where your domain sends notifications when your alerts trigger. You can use Amazon SNS, your Slack channel, or Amazon Chime as your destination. Or, you can set up a custom webhook (a URL) to receive messages. You set up headers and the message body, and Amazon ES Alerts posts the message to the destination URL.

I use SNS to receive alerts from my Amazon ES domain in this example, but Amazon ES provides many options for setting up your topic and subscription. I created the topic to receive notifications and subscribed to it for email delivery. Your SNS topic can have many subscriptions, supporting delivery via HTTP/S endpoints, email, Amazon SQS, AWS Lambda, and SMS.

To set up your destination, navigate to the AWS Management Console. Sign in and open the SNS console.

Choose Topics, Create Topic.

In the Create topic page, fill out values for Name and Display name. I chose sensor-alerting for both. Choose Create topic.

Now subscribe to your topic. You can do this from the topic page, as the console automatically returns you there when you complete topic creation. You can also subscribe from the Subscriptions tab in the left navigation pane. From the topic page, choose Create subscription.

On the Create Subscription page, for Protocol, choose Email. Fill in your email address in the Endpoint box and choose Create subscription. Make a note of the Topic ARN here, as you refer to it again later.

Finally, confirm your subscription by clicking the confirmation link in the email that SNS sends to you.

Step 2: Set up a role

To let Amazon ES publish alerts to your topic, create an IAM role with the proper permissions. Before you get started, copy the Topic ARN from the SNS topic page in Step 1.

Your role has two components: trusted entities and permissions for entities that assume the role. The console doesn’t support creating a role with Amazon ES as a trusted entity. Create a role with EC2 as the trusted entity and then edit the JSON trust document to change the entity.

In the AWS Management Console, open the IAM console and choose Roles, Create role.

On the Create role page, choose AWS Service and EC2. Choose Next: Permissions.

On the permissions page, choose Create policy. This brings you to a new window to create the policy. Don’t close the old tab, as you return to it in a moment.

The policy that you create in this step defines the permissions for entities that assume the role. Add a policy document that allows various entities (Amazon ES in this case) to publish to your SNS topic.

On the Create policy page, choose the JSON tab and copy-paste to replace the JSON text with the following code. Replace the sns-topic-arn in the code with the ARN for the topic that you created earlier. After you have done this, choose Review policy.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sns:Publish",
    "Resource": "sns-topic-arn"
  }]
}

On the Review policy page, give your policy a name. I chose SensorAlertingPolicy in this example. Choose Create policy.

Return to the Create role window or tab. Use the refresh button to reload the policies and type the name of your policy in the search box. Select the check box next to your policy. Choose Next: Tags, then choose Next: Review. You can also add tags to make your role easier to search.

On the Review page, give your role a name. I used SensorAlertingRole in this example. Choose Create role.

To change the trusted entity for the role to Amazon ES, in the IAM console, choose Roles. Type SensorAlertingRole in the search box, and choose the link (not the check box) to view that role. Choose Trust relationships, Edit trust relationship.

Edit the Policy Document code to replace ec2.amazonaws.com with es.amazonaws.com. Your completed policy document should look like the following code example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "es.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Choose Update Trust Policy. Make a note of your role ARN, as you refer to it again.

Step 3: Set up Amazon ES alerting

I pointed my IoT sensor simulator at my Amazon ES domain. This creates data that serves as the basis for the monitors and alerts. To do this yourself, navigate to your Kibana endpoint in your browser and choose Alerting in the left navigation pane. At the top of the window, choose Destinations, Add Destination.

In the Add Destination dialog, give your destination a name. For Type, choose SNS, and set the SNS topic ARN to the topic ARN that you created in Step 1. Set the IAM role ARN to the role ARN that you created in Step 2. Choose Create. You can set as many destinations as you like, allowing you to alert multiple people in the event of a problem.

Step 4: Set up a monitor

Monitors in Open Distro for Amazon ES allow you to specify a value to monitor. You can select the value either graphically or by specifying an Amazon ES query. You define a monitor first and then define triggers for the monitored value.

In Kibana, choose Monitors, Create Monitor.

Give your monitor a name. I named my monitor Device CPUs. You can set the frequency to one of the predefined intervals, or use a cron expression for more granular control. I chose Every 1 minute.

Scroll to the Define Monitor section of the page. Use this set of controls to specify the value to monitor. You can enter a value for Index or Indexes, Time field, and a target value. Choose Define using visual graph from the How do you want to define your monitor? list. You can also enter information for Define using extraction query, allowing you to provide a query that produces a value to monitor. For simple thresholds, the visual interface is fast and easy.

Select the Index value to monitor from the list. The list contains individual indexes. To use a wildcard, you can also type in the text box. For the value to register, you must press Enter after typing the index name (for example, “logs-*” <enter>).

Choose a value for Time field from the list. This reveals several selectors on top of a graph. Choose Count() and open the menu to see the aggregations for computing the value. Choose max(), then choose CPU for Select a field. Finally, set FOR THE LAST to 5 minute(s). Choose Create.

You can create your monitor visually or provide a query to produce the value to monitor.
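
For reference, the extraction-query equivalent of this visual definition would be an Elasticsearch aggregation over the same window. This is a sketch only; the exact index pattern and field names (timestamp, CPU) are assumptions based on the simulator data:

{
  "size": 0,
  "query": {
    "range": {
      "timestamp": { "gte": "now-5m", "lte": "now" }
    }
  },
  "aggregations": {
    "max_cpu": {
      "max": { "field": "CPU" }
    }
  }
}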

I chose the logs-* index to monitor the max value of the CPU field, but this doesn’t create a trigger yet. Choose Create. This brings you to the Define Trigger page.

Step 5: Create a trigger

To create a trigger, specify the threshold value for the field that you’re monitoring. When the value of the field exceeds the threshold, the monitor enters an Active state. I created a trigger called CPU Too High, with a threshold value of 90 and a severity level of 1.

When you set the trigger conditions, set the action or actions that Amazon ES performs.

To add actions, scroll through the page. I added one action to send a message to my SNS topic—including the monitor name, trigger, severity, and the period over which the alarm has been active. You can use Mustache scripting to create a template for the message that you receive.
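
A message template along these lines is a reasonable starting point; the exact variables available come from the alerting plugin's ctx context, so treat this as a sketch:

Monitor {{ctx.monitor.name}} entered the {{ctx.trigger.name}} state
(severity {{ctx.trigger.severity}}).
Period: {{ctx.periodStart}} to {{ctx.periodEnd}}.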

After you finish adding actions, choose Create at the bottom of the page.

Wrap up

When you return to the Alerting Dashboard, your alert appears in the Completed state. Alerts can exist in a variety of states. Completed signals that the monitor successfully queried your target, and that the trigger is not engaged.

To send the alert into the Active state, I sent simulated sensor data with a failing device whose CPU ramped up from 50% to 100%. When it hit 90%, I received the following email:

Conclusion

In this post, I demonstrated how Amazon ES alerting lets you monitor the critical data in your log files so that you can respond quickly when things start to go wrong. By identifying KPIs, setting thresholds, and distributing alerts to your first responders, you can improve your response time for critical issues.

If you have questions or feedback, leave them below, or reach out on Twitter!


About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Get Cloudflare insights in your preferred analytics provider

Post Syndicated from Simon Steiner original https://blog.cloudflare.com/cloudflare-partners-with-analytics-providers/

Today, we’re excited to announce our partnerships with Chronicle Security, Datadog, Elastic, Looker, Splunk, and Sumo Logic to make it easy for our customers to analyze Cloudflare logs and metrics using their analytics provider of choice. In a joint effort, we have developed pre-built dashboards that are available as a Cloudflare App in each partner’s platform. These dashboards help customers better understand events and trends from their websites and applications on our network.

Cloudflare insights in the tools you’re already using

Data analytics is a frequent theme in conversations with Cloudflare customers. Our customers want to understand how Cloudflare speeds up their websites and saves them bandwidth, see their fastest and slowest pages ranked, and be alerted if they are under attack. While providing insights is a core tenet of Cloudflare’s offering, the data analytics market has matured, and many of our customers have started using third-party providers to analyze data—including Cloudflare logs and metrics. By aggregating data from multiple applications, infrastructure, and cloud platforms in one dedicated analytics platform, customers can create a single pane of glass and benefit from better end-to-end visibility over their entire stack.

While these analytics platforms provide great benefits in terms of functionality and flexibility, they can take significant time to configure: from ingesting logs, to specifying data models that make data searchable, all the way to building dashboards to get the right insights out of the raw data. We see this as an opportunity to partner with the companies our customers are already using to offer a better and more integrated solution.

Providing flexibility through easy-to-use integrations

To address these complexities of aggregating, managing, and displaying data, we have developed a number of product features and partnerships to make it easier to get insights out of Cloudflare logs and metrics. In February we announced Logpush, which allows customers to automatically push Cloudflare logs to Google Cloud Storage and Amazon S3. Both of these cloud storage solutions are supported by the major analytics providers as a source for collecting logs, making it possible to get Cloudflare logs into an analytics platform with just a few clicks. With today’s announcement of Cloudflare’s Analytics Partnerships, we’re releasing a Cloudflare App—a set of pre-built and fully customizable dashboards—in each partner’s app store or integrations catalogue to make the experience even more seamless.

By using these dashboards, customers can immediately analyze events and trends of their websites and applications without first needing to wade through individual log files and build custom searches. The dashboards feature all 55+ fields available in Cloudflare logs and include 90+ panels with information about the performance, security, and reliability of customers’ websites and applications.

Ultimately, we want to provide flexibility to our customers and make it easier to use Cloudflare with the analytics tools they already use. Improving our customers’ ability to get better data and insights continues to be a focus for us, so we’d love to hear about what tools you’re using—tell us via this brief survey. To learn more about each of our partnerships and how to get access to the dashboards, please visit our developer documentation or contact your Customer Success Manager. Similarly, if you’re an analytics provider who is interested in partnering with us, use the contact form on our analytics partnerships page to get in touch.

Amazon Managed Streaming for Apache Kafka (MSK) – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-managed-streaming-for-apache-kafka-msk-now-generally-available/

I am always amazed at how our customers are using streaming data. For example, Thomson Reuters, one of the world’s most trusted news organizations for businesses and professionals, built a solution to capture, analyze, and visualize analytics data to help product teams continuously improve the user experience. Supercell, the social game company providing games such as Hay Day, Clash of Clans, and Boom Beach, is delivering in-game data in real-time, handling 45 billion events per day.

Since we launched Amazon Kinesis at re:Invent 2013, we have continually expanded the ways in which customers work with streaming data on AWS. Some of the available tools are:

  • Kinesis Data Streams, to capture, store, and process data streams with your own applications.
  • Kinesis Data Firehose, to transform and collect data into destinations such as Amazon S3, Amazon Elasticsearch Service, and Amazon Redshift.
  • Kinesis Data Analytics, to continuously analyze data using SQL or Java (via Apache Flink applications), for example to detect anomalies or for time series aggregation.
  • Kinesis Video Streams, to simplify processing of media streams.

At re:Invent 2018, we introduced in open preview Amazon Managed Streaming for Apache Kafka (MSK), a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data.

I am excited to announce that Amazon MSK is generally available today!

How it works

Apache Kafka (Kafka) is an open-source platform that enables customers to capture streaming data like click stream events, transactions, IoT events, application and machine logs, and have applications that perform real-time analytics, run continuous transformations, and distribute this data to data lakes and databases in real time. You can use Kafka as a streaming data store to decouple applications producing streaming data (producers) from those consuming streaming data (consumers).

While Kafka is a popular enterprise data streaming and messaging framework, it can be difficult to set up, scale, and manage in production. Amazon MSK takes care of these management tasks and makes it easy to set up, configure, and run Kafka, along with Apache ZooKeeper, in an environment following best practices for high availability and security.

Your MSK clusters always run within an Amazon VPC managed by the MSK service. Your MSK resources are made available to your own VPC, subnet, and security group through elastic network interfaces (ENIs) which will appear in your account, as described in the following architectural diagram:

Customers can create a cluster in minutes, use AWS Identity and Access Management (IAM) to control cluster actions, authorize clients using TLS private certificate authorities fully managed by AWS Certificate Manager (ACM), encrypt data in-transit using TLS, and encrypt data at rest using AWS Key Management Service (KMS) encryption keys.

Amazon MSK continuously monitors server health and automatically replaces servers when they fail, automates server patching, and operates highly available ZooKeeper nodes as a part of the service at no additional cost. Key Kafka performance metrics are published in the console and in Amazon CloudWatch. Amazon MSK is fully compatible with Kafka versions 1.1.1 and 2.1.0, so that you can continue to run your applications, use Kafka’s admin tools, and use Kafka-compatible tools and frameworks without having to change your code.

Based on our customer feedback during the open preview, Amazon MSK added many features, such as:

  • Encryption in-transit via TLS between clients and brokers, and between brokers
  • Mutual TLS authentication using ACM private certificate authorities
  • Support for Kafka version 2.1.0
  • 99.9% availability SLA
  • HIPAA eligibility
  • Cluster-wide storage scale up
  • Integration with AWS CloudTrail for MSK API logging
  • Cluster tagging and tag-based IAM policy application
  • Defining custom, cluster-wide configurations for topics and brokers

AWS CloudFormation support is coming in the next few weeks.

Creating a cluster

Let’s create a cluster using the AWS management console. I give the cluster a name, select the VPC I want to use the cluster from, and the Kafka version.

I then choose the Availability Zones (AZs) and the corresponding subnets to use in the VPC. In the next step, I select how many Kafka brokers to deploy in each AZ. More brokers allow you to scale the throughput of a cluster by allocating partitions to different brokers.

I can add tags to search and filter my resources, apply IAM policies to the Amazon MSK API, and track my costs. For storage, I leave the default storage volume size per broker.

I select to use encryption within the cluster and to allow both TLS and plaintext traffic between clients and brokers. For data at rest, I use the AWS-managed customer master key (CMK), but you can select a CMK in your account, using KMS, for further control. You can use private TLS certificates to authenticate the identity of clients that connect to your cluster. This feature uses private certificate authorities (CAs) from ACM. For now, I leave this option unchecked.

In the advanced settings, I leave the default values. For example, I could have chosen a different instance type for my brokers here. Some of these settings can be updated using the AWS CLI.

I create the cluster and monitor the status from the cluster summary, including the Amazon Resource Name (ARN) that I can use when interacting via CLI or SDKs.

When the status is active, the client information section provides specific details to connect to the cluster, such as:

  • The bootstrap servers I can use with Kafka tools to connect to the cluster.
  • The ZooKeeper connection string (the list of hosts and ports).

I can get similar information using the AWS CLI:

  • aws kafka list-clusters to see the ARNs of your clusters in a specific region
  • aws kafka get-bootstrap-brokers --cluster-arn <ClusterArn> to get the Kafka bootstrap servers
  • aws kafka describe-cluster --cluster-arn <ClusterArn> to see more details on the cluster, including the Zookeeper connect string

Quick demo of using Kafka

To start using Kafka, I create two EC2 instances in the same VPC, one will be a producer and one a consumer. To set them up as client machines, I download and extract the Kafka tools from the Apache website or any mirror. Kafka requires Java 8 to run, so I install Amazon Corretto 8.

On the producer instance, in the Kafka directory, I create a topic to send data from the producer to the consumer:

bin/kafka-topics.sh --create --zookeeper <ZookeeperConnectString> \
--replication-factor 3 --partitions 1 --topic MyTopic

Then I start a console-based producer:

bin/kafka-console-producer.sh --broker-list <BootstrapBrokerString> \
--topic MyTopic

On the consumer instance, in the Kafka directory, I start a console-based consumer:

bin/kafka-console-consumer.sh --bootstrap-server <BootstrapBrokerString> \
--topic MyTopic --from-beginning

Here’s a recording of a quick demo where I create the topic and then send messages from a producer (top terminal) to a consumer of that topic (bottom terminal):

Pricing and availability

Pricing is per Kafka broker-hour and per provisioned storage-hour. There is no cost for the ZooKeeper nodes used by your clusters. AWS data transfer rates apply for data transfer in and out of MSK. You will not be charged for data transfer within the cluster in a region, including data transfer between brokers and data transfer between brokers and ZooKeeper nodes.

You can migrate your existing Kafka cluster to MSK using tools like MirrorMaker (which comes with open source Kafka) to replicate data from your clusters into an MSK cluster.

Upstream compatibility is a core tenet of Amazon MSK. Our code changes to the Kafka platform are released back to open source.

Amazon MSK is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), EU (Frankfurt), EU (Ireland), EU (Paris), and EU (London).

I look forward to seeing how you are going to use Amazon MSK to simplify building and migrating streaming applications to the cloud!

Tourists on GrabChat!

Post Syndicated from Grab Tech original https://engineering.grab.com/tourist-chat-data-story

Just over two years ago we introduced GrabChat, Southeast Asia’s first of its kind in-app messaging platform. Since then we’ve added all sorts of useful features to it. Auto-translated messages, the ability to send photos, and even voice messages! It’s been a great tool to facilitate smoother communications between our driver-partners and our passengers, and one group in particular has found it incredibly useful: tourists!

Now, we’ve analysed tourist data before, but we were curious about how GrabChat in particular has served this demographic. So we looked for interesting insights using sampled tourist chat data from Singapore, Malaysia, and Indonesia for the period of December 2018 to March 2019. That’s more than 3.7 million individual GrabChat messages sent by tourists! Here’s what we found.

Average chats per booking per country

Looking at the volume of the chats being transmitted per booking, we can see that the “chattiest” tourists are from East Timor, Nigeria, and Ukraine with averages of 6.0, 5.6, and 5.1 chats per booking respectively.

Then we wondered: if tourists from all over the world are talking this much to our driver-partners, how are they actually communicating if their mother-tongue is not the local language?

Need a Translator?

When we go to another country, we eat all the heavenly good food, fall in love with the culture, and admire the scenery. Language and communication barriers shouldn’t get in the way of all of that. That’s why Grab’s Chat feature has got it covered!

With Grab’s in-house translation solutions, any Grab passenger can send messages in their preferred language without fear of being misunderstood – or not understood at all! Their messages will be automatically translated into Bahasa Indonesia, Bahasa Melayu, Simplified Chinese, Thai, or Vietnamese depending on where they are. This applies not only to Grab’s transport services: GrabChat can be used when ordering GrabFood too!

Percentage of translated GrabChat messages
Indonesia saw the highest usage of translations on a by-booking basis!

 

Let’s look deeper into the tourist translation statistics for each country with the donut charts below. We can see that the most popular translation route for tourists in Indonesia was from English to Indonesian. The story is different for Singapore and Malaysia: we can see that there are translations to and from a more diverse set of languages, reflecting a more multicultural demographic.

Percentage of translated GrabChat messages
The most popular translation routes for tourist bookings in Indonesia, Malaysia, and Singapore.

 

Tap for Templates!

GrabChat also provides a chat template feature. Templates are prewritten messages that you can send with just one tap! Did we mention that they are translated automatically too? Passengers and drivers can have a fast, simple, and translated conversation with each other without typing a single word. And sometimes, templates are really all you need.

Examples of chat templates, as they appear in GrabChat!
Examples of chat templates, as they appear in GrabChat!

 

As if all this wasn’t convenient enough, you can also make your own custom templates! Use them for those repetitive, identical messages you always seem to be sending out like telling your drivers where the hotel lobby is, or how to navigate right to your doorstep, or even to send a quick description of what you look like to make it easier for a driver to find you!

Template message usage

Taking a look at individual country data, tourists in Indonesia used templates the most with almost 60% of all of them using a template in their conversations at least once. Malaysia and Singapore saw lower but still sizeable utilisation rates of this feature, at 53% and 33% respectively.

Template message usage percentage
Indonesia saw the highest usage of templates on a by-booking basis.

 

In our analysis, we found an interesting insight! There was a positive correlation between template usage and the success rate of rides. Overall, bookings that used templates in their conversations saw 10% more completions over bookings that didn’t.

Template vs completed bookings

Picture this: a hassle-free experience

A picture says a thousand words, and for tourists using GrabChat’s image feature, those thousand words don’t even need to be translated. Instead of typing out a description of where they are standing for pickup, they can just click, snap, and send an image!

Our data revealed that GrabChat’s image functionality is most frequently used in areas where the tourist traffic is the highest. In fact, the image function in GrabChat saw the most use in pickup areas such as airports, large shopping malls, public transport stations, and hotels, because it was harder for drivers to find their passengers in these crowded areas. Even with our super convenient Entrances feature, every little bit of information goes a long way to help your driver find you!

Pickup locations

If we take it a step further and look at the actual areas within the cities where images were sent the most, we see that our initial hypothesis still holds fast.

Pickup locations
The top 5 pickup areas per country in which images were the most prevalent in GrabChat (for tourists).

 

In Singapore, we see the most images being sent in the Downtown Core area, which contains the majestic Marina Bay Sands, the Merlion statue, and the Esplanade, amongst other iconic attractions.

In Malaysia, the highest image usage occurs at none other than the Kuala Lumpur City Centre (KLCC) itself. This area includes the Twin Towers, a plethora of malls and hotels, Bukit Bintang (a bustling and lively night-life zone), and even an aquarium.

Indonesia’s top location for image chats is Kuta. A beach village in Bali, Kuta is a tourist hotspot with surfing, water parks, bars, budget-friendly yet delicious food, and numerous cultural attractions.

Speak up!

Allowing for two-way communication via GrabChat empowers both passengers and drivers to improve their journeys by divulging useful information, and asking clarifying questions: how many bags do you have? Does your car accommodate my pet dog? I’m standing by the lobby with my two kids- these are the sorts of things that are talked about in GrabChat messages.

During the analysis of our multitudes of wide-ranging GrabChat conversations, we picked up some pro-tips for you to get a Grab ride with even more convenience and ease, whether you’re a tourist or not:

Tip #1: Did some shopping on your trip? Swamped with bags? Send a message to your driver to let them know how many pieces of luggage you have with you.

As one might expect, chats that have keywords such as “luggage” or “baggage” (or any other related term) occur the most when riders are going to, or leaving, an airport. Most of the tourists on GrabChat asked the drivers if there was space for all of their things in the car. Interestingly, some of them also told the drivers how to recognise them for pickup based on descriptions of their bags!

Tip #2: Your children make good landmarks! If you’re in a crowded spot and you’re worried your driver can’t find you, drop them a message to let them know you’re that family with a baby and a little girl in pigtails.

When it comes to children, we found that passengers mainly use them to help identify themselves to the driver. Messages like “I’m with my two kids” or “We are a family with a baby” came up numerous times, and served as descriptions to facilitate fast pickup. These sorts of chats were the most prevalent in crowded areas like airports and shopping centres.

Tip #3: Don’t get caught off guard- be sure your furry friends have a seat!

Taking a look at pet related chats, we learned that our tourists have used GrabChat to ask clarifying questions to the driver. Passengers have likely considered that not every driver or vehicle is accommodating towards animals. The most common type of message was about whether pets are allowed in the vehicle. For example: “Is it okay if I bring a puppy?” or “I have a dog with me in a carrier, is that alright?”. Better safe than sorry! Alternatively, if you’re travelling with a pet, why not see if GrabPet is available in your country?

From the chat content analysis we have learned that tourists do indeed use GrabChat to talk to their drivers about specific details of their trip. We see that the chat feature is an invaluable tool that anyone can use to clear up any ambiguities and make their journeys more pleasant.

Bubble Tea Craze on GrabFood!

Post Syndicated from Grab Tech original https://engineering.grab.com/bubble-tea-craze-on-grabfood

Bigger and More Bubble Tea!

Bubble tea orders on GrabFood have been increasing constantly and dramatically, with an impressive regional average growth rate of 3,000% in 2018! Just look at the percentage increase over the year, across all countries!

Country      | Bubble tea growth by percentage in 2018*
Indonesia    | >8,500% growth from Jan 2018 to Dec 2018
Philippines  | >3,500% growth from June 2018 to Dec 2018
Thailand     | >3,000% growth from Jan 2018 to Dec 2018
Vietnam      | >1,500% growth from May 2018 to Dec 2018
Singapore    | >700% growth from May 2018 to Dec 2018
Malaysia     | >250% growth from May 2018 to Dec 2018

*Time period: January 2018 to December 2018, or from the time GrabFood was launched.

What’s driving this growth is not just die-hard bubble tea fans who can’t go a week without drinking this sweet treat, but a growing bubble tea fan club in Southeast Asia. The number of bubble tea lovers on GrabFood grew over 12,000% in 2018 – and there’s no sign of stopping!

With increasing consumer demand, how is Southeast Asia’s bubble tea supply catching up?  As of December 2018, GrabFood has close to 4,000 bubble tea outlets from a network of over 1,500 brands – a 200% growth in bubble tea outlets in Southeast Asia!

Bubble-Tea-Lover growth on GrabFood

If this stat doesn’t stick, here is a map to show you how much bubble tea orders in different Southeast Asian cities have grown!

Maps of bubble tea merchants on GrabFood

And here is a little shoutout to our star merchants including Chatime, Coco Fresh Tea & Juice, Macao Imperial Tea, Ochaya, Koi Tea, Cafe Amazon, The Alley, iTEA, Gong Cha, and Serenitea.

Just how much do you drink?

On average, Southeast Asians drink 4 cups of bubble tea per person per month on GrabFood. Thai consumers top the regional average by 2 cups, consuming about 6 cups of bubble tea per person per month. This is closely followed by Filipino consumers, who drink an average of 5 cups per person per month.

Average bubble tea consumption by cups per person per month

Favourite Flavours!

Have a look at the dazzling array of Bubble Tea flavours available on GrabFood today and you’ll find some uniquely Southeast Asian flavours like Chendol, Durian, and Gula Melaka, as well as rare flavours like salted cream and cheese! Can you spot your favourite flavours here?

Bubble tea flavour consumption per month

Let’s break it down by the country that GrabFood serves, and see who likes which flavours of Bubble Tea more!

Bubble tea flavour consumption per month by country

Top the Toppings!

Pearl seems to be the unbeatable top topping in most of the countries, except Vietnam, whose No. 1 topping turned out to be cheese pudding! The top 3 toppings for your favourite bubble tea are:

Top list of toppings

Best Time for Bubble Tea!

Don’t we all need a cup of sweet bubble tea in the afternoon to get us through the day? Across Southeast Asia, GrabFood’s data reveals that most people order bubble tea to accompany their meals at lunch, or as a perfect midday energizer!

Times of the day when most people order bubble tea

Conclusion

So hazelnut or chocolate, pearl or (and) pudding (who says we can’t have the best of both worlds!)? The options are abundant and the choice is yours to enjoy!

If you have a sweet tooth, or simply want to reward yourself with Southeast Asia’s most popular drink, go ahead – you are only a couple of taps away from savouring this cup full of delight.

Amazon Kinesis Data Firehose custom prefixes for Amazon S3 objects

Post Syndicated from Rajeev Chakrabarti original https://aws.amazon.com/blogs/big-data/amazon-kinesis-data-firehose-custom-prefixes-for-amazon-s3-objects/

In February 2019, Amazon Web Services (AWS) announced a new feature in Amazon Kinesis Data Firehose called Custom Prefixes for Amazon S3 Objects. It lets customers specify a custom expression for the Amazon S3 prefix where data records are delivered. Previously, Kinesis Data Firehose allowed only specifying a literal prefix. This prefix was then combined with a static date-formatted prefix to create the output folder in a fixed format. Customers asked for flexibility, so AWS listened and delivered.

Kinesis Data Firehose is most commonly used to consume event data from streaming sources, such as applications or IoT devices.  The data then is typically stored in a data lake, so it can be processed and eventually queried.  When storing data on Amazon S3, it is a best practice to partition or group related data and store it together in the same folder.  This provides the ability to filter the partitioned data and control the amount of data scanned by each query, thus improving performance and reducing cost.

A common way to group data is by date.  Kinesis Data Firehose automatically groups data and stores it into the appropriate folders on Amazon S3 based on the date.  However, the naming of folders in Amazon S3 is not compatible with Apache Hive naming conventions. This makes data more difficult to catalog using AWS Glue crawlers and analyze using big data tools.

This post discusses a new capability that lets us customize how Kinesis Data Firehose names the output folders in Amazon S3. It covers how custom prefixes work, the intended use cases, and includes step-by-step instructions to try the feature in your own account.

The need for custom prefixes for Amazon S3 objects

Previously, Kinesis Data Firehose created a static Universal Coordinated Time (UTC) based folder structure in the format YYYY/MM/DD/HH. It then appended it to the provided prefix before writing objects to Amazon S3. For example, if you provided a prefix “mydatalake/”, the generated folder hierarchy would be “mydatalake/2019/02/09/13”.  However, to be compatible with Hive naming conventions, the folder structure is expected to follow the format “/partitionkey=partitionvalue”.  Using this naming convention, data can be easily cataloged with AWS Glue crawlers, resulting in proper partition names.

Other methods for managing partitions also become possible, such as running MSCK REPAIR TABLE in Amazon Athena or Apache Hive on Amazon EMR, which can add all partitions through a single statement. Furthermore, you can use other date-based partitioning patterns like “/dt=2019-02-09-13/” instead of expanding the date out into folders. This helps reduce the total number of partitions that need to be maintained as the table grows over time, and it also simplifies range queries. Providing the ability to specify custom prefixes obviates the need for an additional ETL step to put the data in the right folder structure, improving the time to insight.
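
For example, once an AWS Glue crawler or a manually defined external table has registered the Hive-style folders, a single Athena statement can pick up all the date-based partitions, and partition pruning then limits the data scanned. The table name and partition values below are hypothetical:

MSCK REPAIR TABLE mydatalake_events;

-- Partition pruning limits the data scanned by subsequent queries:
SELECT COUNT(*)
FROM mydatalake_events
WHERE year = '2019' AND month = '02' AND day = '09';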

How custom prefixes for Amazon S3 objects works

This new capability does not let you use any date or timestamp value from your event data, nor can you use any other arbitrary value in the event. Kinesis Data Firehose uses an internal timestamp field called ApproximateArrivalTimestamp. Each data record includes an ApproximateArrivalTimestamp (in UTC) that is set when a stream successfully receives and stores the record. This is commonly referred to as a server-side timestamp. Kinesis Data Firehose buffers incoming records according to the configured buffering hints and delivers them into Amazon S3 objects for the Amazon S3 destination. The resulting objects in Amazon S3 may contain multiple records, each with a different ApproximateArrivalTimestamp. When evaluating timestamps, Kinesis Data Firehose uses the ApproximateArrivalTimestamp of the oldest record that’s contained in the Amazon S3 object being written.

Kinesis Data Firehose also provides the ability to deliver records to a different error output location when there is a delivery failure, an AWS Lambda transformation failure, or a format conversion failure. Previously, the error output location could not be configured and was determined by the type of delivery failure. With this release, the error output location (ErrorOutputPrefix) can also be configured. One benefit of this new capability is that you can separate failed records into date-partitioned folders for easy reprocessing.

So how do you specify the custom Prefix and the ErrorOutputPrefix? You use an expression of the form: !{namespace:value}, where the namespace can be either firehose or timestamp. The value can be either “random-string” or “error-output-type” for the firehose namespace or a date pattern for the timestamp namespace in the Java DateTimeFormatter format. In a single expression, you can use a combination of the two namespaces although the !{firehose: error-output-type} can be used only in the ErrorOutputPrefix. For more information and examples, see Custom Prefixes for Amazon S3 Objects.

Writing streaming data into Amazon S3 with Kinesis Data Firehose

This walkthrough describes how streaming data can be written into Amazon S3 with Kinesis Data Firehose using a Hive compatible folder structure.  It then shows how AWS Glue crawlers can infer the schema and extract the proper partition names that we designated in Kinesis Data Firehose, and catalog them in AWS Glue Data Catalog.  Finally, we run sample queries to show that partitions are indeed being recognized.

To demonstrate this, we use python code to generate sample data.  We also use a Lambda transform on Kinesis Data Firehose to forcibly create failures. This demonstrates how data can be saved to the error output location. The code that you need for this walkthrough is included here in GitHub.

For this walkthrough, this is the architecture that we are building:

Step 1: Create an Amazon S3 bucket

Create an S3 bucket to be used by Kinesis Data Firehose to deliver event records. We use the AWS Command Line Interface (AWS CLI) to create the Amazon S3 bucket in the US East (N. Virginia) Region. Remember to substitute the bucket name in the example for your own.

aws s3 mb s3://kdfs3customprefixesexample --region us-east-1

Step 2: Lambda Transform (optional)

The incoming events have an ApproximateArrivalTimestamp field in the event payload.  This is sufficient to create a proper folder structure on Amazon S3.  However, when querying the data it may be beneficial to expose this timestamp value as a top level column for easy filtering and validation.  To accomplish this, we create a Lambda function that adds the ApproximateArrivalTimestamp as a top level field in the data payload. The data payload is what Kinesis Data Firehose writes as an object in Amazon S3. Additionally, the Lambda code also artificially generates some processing errors that are delivered to the “ErrorOutputPrefix” location specified for the delivery destination to illustrate the use of expressions in the “ErrorOutputPrefix.”
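
A minimal sketch of such a transform is shown below. This is not the exact code from the GitHub repository; the promoted field name and the artificial 5% failure rate are illustrative assumptions, but the record structure (recordId, base64 data, approximateArrivalTimestamp) and the required return format match the Kinesis Data Firehose Lambda transformation contract:

import base64
import json
import random


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Promote the server-side timestamp (milliseconds since epoch)
        # to a top-level field in the data payload.
        payload["ApproximateArrivalTimestamp"] = record["approximateArrivalTimestamp"]

        # Artificially fail a small fraction of records to exercise
        # the ErrorOutputPrefix location.
        if random.random() < 0.05:
            result = "ProcessingFailed"
            data = record["data"]  # return the original record unchanged
        else:
            result = "Ok"
            data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")

        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": data,
        })

    return {"records": output}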

Create an IAM role for the Lambda transform function

First, create a role for the Lambda function called LambdaBasicRole. The TrustPolicyForLambda.json file is included in the GitHub repository.

$ aws iam create-role --role-name KDFLambdaBasicRole --assume-role-policy-document file://TrustPolicyForLambda.json

After the role is created, attach the managed Lambda basic execution policy to it.

$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole --role-name KDFLambdaBasicRole

Lambda function

To create the Lambda function, start with the Python Kinesis Data Firehose blueprint “General Firehose Processing” and then modify it. For more information about the structure of the records and what must be returned, see Amazon Kinesis Data Firehose Data Transformation.

Zip up the Python file, and then create the Lambda function using the AWS CLI. The CreateLambdaFunctionS3CustomPrefixes.json file is included in the GitHub repository.

aws lambda create-function --zip-file "fileb://lambda_function.zip" --cli-input-json file://CreateLambdaFunctionS3CustomPrefixes.json

Step 3: Delivery stream

Next, create the Kinesis Data Firehose delivery stream. The createdeliverystream.json file is included in the GitHub repository.

 aws firehose create-delivery-stream --cli-input-json file://createdeliverystream.json

In the previous configuration, we defined a Prefix and an ErrorOutputPrefix under the “ExtendedS3DestinationConfiguration” element. We defined the same for the “S3BackupConfiguration” element. Note that when the “ProcessingConfiguration” element is set to “Disabled”, the ErrorOutputPrefix parameter of the “ExtendedS3DestinationConfiguration” element exists only for consistency. It otherwise has no significance.

We’ve chosen a prefix that will result in a folder structure compatible with Hive-style partitioning. This is the prefix we used:

“fhbase/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/”

Kinesis Data Firehose first creates a base folder called “fhbase” directly under the Amazon S3 bucket. Second, it evaluates the expressions !{timestamp:YYYY}, !{timestamp:MM}, !{timestamp:dd}, and !{timestamp:HH} to year, month, day and hour using the Java DateTimeFormatter format. For example, an ApproximateArrivalTimestamp of 1549754078390 in UNIX epoch time, which is 2019-02-09T16:13:01.000000Z in UTC would evaluate to “year=2019”, “month=02”, “day=09” and “hour=16”.  Therefore, the location in Amazon S3 where data records that are delivered evaluate to “fhbase/year=2019/month=02/day=09/hour=16/”.

Similarly, the ErrorOutputPrefix “fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/” results in a base folder called “fherroroutputbase” directly under the S3 bucket. The expression !{firehose:random-string} evaluates to an 11 character random string like “ztWxkdg3Thg”.  If you use this more than once in the same expression, every instance evaluates to a new random string. The expression !{firehose:error-output-type} evaluates to one of the following:

  1. “processing-failed” for Lambda transformation delivery failures
  2. “elasticsearch-failed” for an Amazon ES destination delivery failures
  3. “splunk-failed” for Splunk destination delivery failures
  4. “format-conversion-failed” for data format conversion failures

So, the location for an Amazon S3 object containing the delivery failed records for a Lambda transformation could evaluate to: fherroroutputbase/ztWxkdg3Thg/processing-failed/2019/02/09/.

You can run aws firehose describe-delivery-stream --delivery-stream-name KDFS3customPrefixesExample to describe the delivery stream created.

Next, enable encryption-at-rest for the delivery stream:

aws firehose start-delivery-stream-encryption --delivery-stream-name KDFS3customPrefixesExample

Alternatively, create the delivery stream using the AWS console:

  1. Choose the source. For this example, I use Direct PUT.
  2. Choose if you would like to transform the incoming records with a Lambda transformation. I chose Enabled, and chose the name of the Lambda function that I had created earlier.
  3. Choose the destination. I chose the Amazon S3 destination.
  4. Choose the Amazon S3 bucket. I chose the Amazon S3 bucket that I had created earlier in this exercise.
  5. Specify the Amazon S3 Prefix and the Amazon S3 error prefix. This corresponds to the “Prefix” and “ErrorOutputPrefix” explained earlier in the context of the AWS CLI input JSON.
  6. Choose whether you would like to back up the raw (before transformation) records to another Amazon S3 location. I chose Enabled and specified the same bucket (you could choose a different bucket). I also specified a different prefix from the transformed records – the base folder is different but the folder structure below that is the same. This would make it more efficient to crawl this location using an AWS Glue crawler or create external tables in Athena or Redshift Spectrum pointing to this location.
  7. Specify the buffering hints for the Amazon S3 destination. I chose 1 MB and 240 seconds.
  8. Choose the S3 compression and encryption settings. I chose no compression for the transformed records’ location. I chose to encrypt the Amazon S3 location at rest by using the service-managed AWS KMS customer master key (CMK).
  9. Choose whether you want to enable Error Logging in CloudWatch. I chose Enabled.
  10. Specify the IAM role that you want Kinesis Data Firehose to assume to access resources on your behalf. Choose either Create new or Choose to display a new screen. Choose Create a new IAM role, name the role, and then choose Allow.
  11. Choose Create Delivery Stream.

The delivery stream is now created and active. You can send events to it.

 Test with sample data

I used Python code to generate sample data. The structure of the generated data is as follows:

{'sector': 'HEALTHCARE', 'price': 194.07, 'ticker_symbol': 'UFG', u'EventTime': '2019-02-12T07:10:52.649000Z', 'change': 20.56}
{'sector': 'HEALTHCARE', 'price': 124.01, 'ticker_symbol': 'QXZ', u'EventTime': '2019-02-12T07:10:53.745000Z', 'change': 3.32}
{'sector': 'MANUFACTURING', 'price': 26.95, 'ticker_symbol': 'QXZ', u'EventTime': '2019-02-12T07:10:54.864000Z', 'change': 24.53}

Sample code to generate data and push it into Kinesis Data Firehose is included in the GitHub repository.
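
If you just want a feel for what that sample code does without opening the repository, a rough sketch follows. The sectors, ticker symbols, delivery stream name, and pacing are made up; the actual repository code differs in the details.

import datetime
import json
import random
import time

import boto3

firehose = boto3.client('firehose')

SECTORS = ['HEALTHCARE', 'MANUFACTURING', 'TECHNOLOGY', 'RETAIL']
TICKERS = ['UFG', 'QXZ', 'ALY', 'PLM']

def generate_event():
    return {
        'sector': random.choice(SECTORS),
        'price': round(random.uniform(10, 200), 2),
        'ticker_symbol': random.choice(TICKERS),
        'EventTime': datetime.datetime.utcnow().isoformat() + 'Z',
        'change': round(random.uniform(-25, 25), 2),
    }

while True:
    # Newline-delimited JSON keeps the delivered objects easy to query later.
    firehose.put_record(
        DeliveryStreamName='KDFS3customPrefixesExample',
        Record={'Data': (json.dumps(generate_event()) + '\n').encode('utf-8')},
    )
    time.sleep(0.1)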

After you start sending events to the Kinesis Data Firehose delivery stream, objects should start appearing under the specified prefixes in Amazon S3.

I wanted to illustrate Lambda invoke errors and the appearance of files in the ErrorOutputPrefix location for Lambda transform errors. Therefore, I did not give permissions to the “firehose_delivery_role” to invoke my Lambda function. The following file showed up in the location specified by the ErrorOutputPrefix.

aws s3 ls s3://kdfs3customprefixesexample/fherroroutputbase/FxvO2Tf9MQP/processing-failed/2019/02/12/

2019-02-12 16:57:24     260166 KDFS3customPrefixesExample-1-2019-02-12-16-53-20-5262db81-0f3a-48bf-8fc6-2249124923ff

Here is a snippet of the contents of the error file that I previously mentioned.

{"attemptsMade":4,"arrivalTimestamp":1549990400391,"errorCode":"Lambda.InvokeAccessDenied","errorMessage":"Access was denied. Ensure that the access policy allows access to the Lambda function.","attemptEndingTimestamp":1549990478018,"rawData":"eyJzZWN0b3IiOiAiSEVBTFRIQ0FSRSIsICJwcmljZSI6IDE4Ny45NCwgInRpY2tlcl9zeW1ib2wiOiAiVUZHIiwgIkV2ZW50VGltZSI6ICIyMDE5LTAyLTEyVDE2OjUzOjE5Ljk5MzAwMFoiLCAiY2hhbmdlIjogOS4yNn0=","lambdaArn":"arn:aws:lambda:us-east-1:<account-id>:function:KDFS3CustomPrefixesTransform:$LATEST"}

After I gave the “firehose_delivery_role” the appropriate permissions, the data objects showed up in the “Prefix” location specified for the Amazon S3 destination.

aws s3 ls s3://kdfs3customprefixesexample/fhbase/year=2019/month=02/day=12/hour=17/

2019-02-12 17:17:26    1392167 KDFS3customPrefixesExample-1-2019-02-12-17-14-51-fc63e8f6-7421-491d-8417-c5002fca1722

2019-02-12 17:18:39    1391946 KDFS3customPrefixesExample-1-2019-02-12-17-16-43-e080a18a-3e1e-45ad-8f1a-98c7887f5430

Also, because the Lambda code in my Lambda transform set the ProcessingFailed status for 10 percent of the records, those records showed up in the ErrorOutputPrefix location for Lambda transform errors.

aws s3 ls s3://kdfs3customprefixesexample/fherroroutputbase/ztWxkdg3Thg/processing-failed/2019/02/12/

2019-02-12 17:25:54     180092 KDFS3customPrefixesExample-1-2019-02-12-17-21-53-3bbfe7c0-f505-47d0-b880-797ce9035f73

Here is a snippet of the content of the error file:

{"attemptsMade":1,"arrivalTimestamp":1549992113419,"errorCode":"Lambda.ProcessingFailedStatus","errorMessage":"ProcessingFailed status set for record","attemptEndingTimestamp":1549992138424,"rawData":"eyJ0aWNrZXJfc3ltYm9sIjogIlFYWiIsICJzZWN0b3IiOiAiSEVBTFRIQ0FSRSIsICJwcmljZSI6IDE3LjUyLCAiY2hhbmdlIjogMTcuNTUsICJFdmVudFRpbWUiOiAiMjAxOS0wMi0xMlQxNzoyMTo1My4zOTY2NDdaIn0=","lambdaArn":"arn:aws:lambda:us-east-1:<account-id>:function:KDFS3CustomPrefixesTransform:$LATEST"}

You’re now ready to create an AWS Glue crawler. For more information about using the AWS Glue Data Catalog, see Populating the AWS Glue Data Catalog.

  1. In the AWS Glue console, go to Crawlers, and choose Add Crawler.
  2. Add information about your crawler, then choose Next.
  3. In the Include Path, specify the Amazon S3 bucket name that you entered under the Amazon S3 destination. Also include the static prefix used when you created the Kinesis Data Firehose delivery stream. Do not include the custom prefix expression.
  4. Choose Next.
  5. Choose Next, No, Next.
  6. Specify the IAM role that AWS Glue would use. I chose to create a new IAM Role. Choose Next.
  7. Specify a schedule to run the crawler. I chose to Run it on Demand. Choose Next.
  8. Specify where the crawler adds the crawled and discovered tables. I chose the default database. Choose Next.
  9. Choose Finish.
  10. The crawler has been created and is ready to be run. Choose Run crawler.
  11. In the AWS Glue console, go to Tables. You can see that a table has been created with the name of the base folder. Choose fhbase.

The crawler has discovered and populated the table and its properties.

You can see the discovered schema. The crawler has identified and created the partitions based on the folder structure specified by the prefix expression.
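
If you would rather script the crawler instead of clicking through the console, a minimal boto3 sketch could look like the following. The crawler name and IAM role are placeholders; the bucket and database match the ones used earlier in this walkthrough.

import boto3

glue = boto3.client('glue')

glue.create_crawler(
    Name='kdf-custom-prefixes-crawler',        # placeholder name
    Role='AWSGlueServiceRole-KDFCrawler',      # an existing Glue service role
    DatabaseName='default',
    # Crawl only the static base folder; do not include the prefix expressions.
    Targets={'S3Targets': [{'Path': 's3://kdfs3customprefixesexample/fhbase/'}]},
)
glue.start_crawler(Name='kdf-custom-prefixes-crawler')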

Open the Amazon Athena console, and select the default database from the drop-down menu. Write the following query in the New query1 window, then choose Run query.

SELECT * FROM "default"."fhbase"

where year = '2019' and day = '12' and hour = '17'

order by approxarrtimestamputcfh desc

Notice that Amazon Athena recognizes the fhbase table as a partitioned table. The query can take advantage of the partitions in the query to filter the results.

Conclusion

As this post illustrates, Custom Prefixes for Amazon S3 objects provide a great deal of flexibility to customize the folder structure in which Kinesis Data Firehose delivers data records and failure records in Amazon S3. Having control over the folder structure and naming in Amazon S3 simplifies data discovery, cataloging, and access. As a result, it helps you get insights more quickly and better manage the cost of your queries.

About the Author

Rajeev Chakrabarti is a Kinesis specialist solutions architect.


Build and run streaming applications with Apache Flink and Amazon Kinesis Data Analytics for Java Applications

Post Syndicated from Steffen Hausmann original https://aws.amazon.com/blogs/big-data/build-and-run-streaming-applications-with-apache-flink-and-amazon-kinesis-data-analytics-for-java-applications/

Stream processing facilitates the collection, processing, and analysis of real-time data and enables the continuous generation of insights and quick reactions to emerging situations. This capability is useful when the value of derived insights diminishes over time. Hence, the faster you can react to a detected situation, the more valuable the reaction is going to be. Consider, for instance, a streaming application that analyzes and blocks fraudulent credit card transactions while they occur. Compare that application to a traditional batch-oriented approach that identifies fraudulent transactions at the end of every business day and generates a nice report for you to read the next morning.

It is quite common for the value of insights to diminish over time. Therefore, using stream processing can substantially improve the value of your analytics application. However, building and operating a streaming application that continuously receives and processes data is much more challenging than operating a traditional batch-oriented analytics application.

In this post, we discuss how you can use Apache Flink and Amazon Kinesis Data Analytics for Java Applications to address these challenges. We explore how to build a reliable, scalable, and highly available streaming architecture based on managed services that substantially reduce the operational overhead compared to a self-managed environment. We particularly focus on how to prepare and run Flink applications with Kinesis Data Analytics for Java Applications. To this end, we use an exemplary scenario that includes source code and AWS CloudFormation templates. You can follow along with this example using your own AWS account or adapt the code according to your specific requirements.

Challenges of running streaming applications

When you build a streaming application, the downstream systems naturally rely on a continuous and timely generation of output. Accordingly, there are much higher requirements on the availability of the streaming application. There is also much less time to address operational issues compared to a traditional batch-based approach. In a batch-processing scenario, when a job that runs once at the end of a business day fails, you can often restart the failed job and still complete the computation by the next morning, when the results are needed. In contrast, when a streaming application fails, downstream systems that consume the output might be affected within minutes, or even sooner, when the expected output no longer arrives in time.

Moreover, in case of failure, you can’t simply delete all intermediate results and restart a failed processing job, as it is commonly done in the batch-processing case. The output of a streaming job is continuously consumed by downstream systems. Output that has already been consumed cannot easily be retracted. Therefore, the entire processing pipeline is much more sensitive to duplicates that are introduced by an application that is restarted on failure. Furthermore, the computations of a streaming application often rely on some kind of internal state that can be corrupted or even lost when the application fails.

Last but not least, streaming applications often deal with a varying amount of throughput. Therefore, scaling the application according to the current load is highly desirable. When the load increases, the infrastructure that supports the streaming application must scale to keep the application from becoming overloaded, falling behind, and producing results that are no longer relevant. On the other hand, when the load decreases, the infrastructure should scale in again to remain cost effective by not provisioning more resources than are needed.

A reliable and scalable streaming architecture based on Flink and Kinesis Data Analytics for Java Applications

Apache Flink is an open-source project that is tailored to stateful computations over unbounded and bounded datasets. Flink addresses many of the challenges that are common when analyzing streaming data by supporting different APIs (including Java and SQL), rich time semantics, and state management capabilities. It can also recover from failures while maintaining exactly-once processing semantics. Therefore, Flink is well suited for analyzing streaming data with low latency.

In this post, we illustrate how to deploy, operate, and scale a Flink application with Kinesis Data Analytics for Java Applications. We use a scenario to analyze the telemetry data of a taxi fleet in New York City in near-real time to optimize the fleet operation. In this scenario, every taxi in the fleet is capturing information about completed trips. The tracked information includes the pickup and drop-off locations, number of passengers, and generated revenue. This information is ingested into a Kinesis data stream as a simple JSON blob. From there, the data is processed by a Flink application, which is deployed to Kinesis Data Analytics for Java Applications. This application identifies areas that are currently requesting a high number of taxi rides. The derived insights are finally persisted into Amazon Elasticsearch Service, where they can be accessed and visualized using Kibana.

This scenario leads to the following architecture, which is separated into three stages for the ingestion, processing, and presentation of data.

Separating the different aspects of the infrastructure is a common approach in this domain and has several benefits over a more tightly coupled architecture.

First, the Kinesis data stream serves as a buffer that decouples the producers from the consumers. Taxis can persist the events that they generate into the data stream regardless of the condition of, for instance, the processing layer, which might be currently recovering from a node failure. Likewise, the derived data remains available through Kibana even if the ingestion or processing layer is currently unavailable due to some operational issues. Last but not least, all components can be scaled independently and can use infrastructure that is specifically tailored according to their individual requirements.

This architecture also allows you to experiment and adopt new technologies in the future. Multiple independent applications can concurrently consume the data stored in the Kinesis data stream. You can then test how a new version of an existing application performs with a copy of the production traffic. But you can also introduce a different tool and technology stack to analyze the data, again without affecting the existing production application. For example, it is common to persist the raw event data to Amazon S3 by adding a Kinesis Data Firehose delivery stream as a second consumer to the Kinesis data stream. This facilitates long-term archiving of the data, which you can then use to evaluate ad hoc queries or analyze historic trends.

All in all, separating the different aspects of the architecture into ingestion, processing, and presentation nicely decouples different components, making the architecture more robust. It furthermore allows you to choose different tools for different purposes and gives you a lot of flexibility to change or evolve the architecture over time.

For the rest of this post, we focus on using Apache Flink and Kinesis Data Analytics for Java Applications to identify areas that currently request a high number of taxi rides. We also derive the average trip duration to the New York City airports. But with this architecture, you also have the option to consume the incoming events using other tools, such as Apache Spark Structured Streaming and Kinesis Data Firehose, instead of, or in addition to, what is described here.

Let’s kick the tires!

To see the described architecture in action, execute the following AWS CloudFormation template in your own AWS account. The template first builds the Flink application that analyzes the incoming taxi trips, including the Flink Kinesis Connector that is required to read data from a Kinesis data stream. It then creates the infrastructure and submits the Flink application to Kinesis Data Analytics for Java Applications.

The entire process of building the application and creating the infrastructure takes about 20 minutes. After the AWS CloudFormation stack is created, the Flink application has been deployed as a Kinesis Data Analytics for Java application. It then waits for events in the data stream to arrive. Checkpointing is enabled so that the application can seamlessly recover from failures of the underlying infrastructure while Kinesis Data Analytics for Java Applications manages the checkpoints on your behalf. In addition, automatic scaling is configured so that Kinesis Data Analytics for Java Applications automatically allocates or removes resources and scales the application (that is, it adapts its parallelism) in response to changes in the incoming traffic.

To populate the Kinesis data stream, we use a Java application that replays a public dataset of historic taxi trips made in New York City into the data stream. The Java application has already been downloaded to an Amazon EC2 instance that was provisioned by AWS CloudFormation. You just need to connect to the instance and execute the JAR file to start ingesting events into the stream.

You can obtain all of the following commands, including their correct parameters, from the output section of the AWS CloudFormation template that you executed previously.

$ ssh ec2-user@«Replay instance DNS name»

$ java -jar amazon-kinesis-replay-1.0.jar -stream «Kinesis data stream name» -region «AWS region» -speedup 3600

The speedup parameter determines how much faster the data is ingested into the Kinesis data stream relative to the actual occurrence of the historic events. With the given parameters, the Java application ingests an hour of historic data within one second. This results in a throughput of roughly 13k events and 6 MB of data per second, which completely saturates the Kinesis data stream (more on this later).

You can then go ahead and inspect the derived data through the Kibana dashboard that has been created. Or you can create your own visualizations to explore the data in Kibana.

https://«Elasticsearch endpoint»/_plugin/kibana/app/kibana#/dashboard/nyc-tlc-dashboard

The prepared Kibana dashboard contains a heatmap and a line graph. The heatmap visualizes locations where taxis are currently requested, and it shows that the highest demand for taxis is Manhattan. Moreover, the JFK and LaGuardia airports are also spots on the map where substantially more rides are requested compared to their direct neighborhoods. The line graph visualizes the average trip duration to these two airports. The following image shows how it steadily increases throughout the day until it abruptly drops in the evening.

For this post, the Elasticsearch cluster is configured to accept connections from the IP address range specified as a parameter of the AWS CloudFormation template. For production workloads, it’s much more desirable to further tighten the security of your Elasticsearch domain, for instance, by using Amazon Cognito for Kibana access control.

Scaling the architecture to increase its throughput

For this post, the Kinesis data stream was deliberately underprovisioned so that the Java application is completely saturating the data stream. When you closely inspect the output of the Java application, you’ll notice that the “replay lag” is continuously increasing. This means that the producer cannot ingest events as quickly as it is required according to the specified speedup parameter.

You can dive deeper into the metrics of the data stream by accessing it through an Amazon CloudWatch Dashboard. You can then see that the WriteProvisionedThroughputExceeded metric is slightly increased: Roughly 0.4 percent of the records are not accepted into the stream as the respective requests are throttled. In other terms, the data stream is underprovisioned, in particular as the producer pauses the ingestion of new events when too many events are in flight.

To increase the throughput of the data stream, you can simply update the number of shards from 6 to 12, either with a couple of clicks on the console or through an API call. For production environments, you might even want to automate this procedure. For details on how to automatically scale a Kinesis data stream, see the blog post Scaling Amazon Kinesis Data Streams with AWS Application Auto Scaling.
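
As a sketch of what that API call looks like from Python with boto3 (the stream name is a placeholder; the AWS CloudFormation template defines the actual name):

import boto3

kinesis = boto3.client('kinesis')

# Double the stream capacity from 6 to 12 shards.
kinesis.update_shard_count(
    StreamName='taxi-trip-events',
    TargetShardCount=12,
    ScalingType='UNIFORM_SCALING',
)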

When the scaling operation of the stream finishes, you can observe how the “replay lag” decreases and more events are ingested into the stream.

However, as a direct result, more events need to be processed. So now the Kinesis Data Analytics for Java application becomes overloaded and can no longer keep up with the increased number of incoming events. You can observe this through the millisBehindLatest metric, which is published to CloudWatch. The metric reports the time difference between the oldest record currently read by the Kinesis Data Analytics for Java application and the latest record in the stream according to the ingestion time in milliseconds. So it indicates how far the processing is behind the tip of the stream.

As these metrics show, 10 minutes after the scaling operation finishes, processing is already more than 3 minutes behind the latest event in the stream. Even worse, it steadily keeps falling behind, continuously widening this gap.

However, in contrast to Kinesis Data Streams, Kinesis Data Analytics for Java Applications natively supports auto scaling. After a couple of minutes, you can see the effect of the scaling activities in the metrics. The millisBehindLatest metric starts to decrease until it reaches zero, when the processing has caught up with the tip of the Kinesis data stream.

However, notice how the millisBehindLatest metric spikes just before it starts to decline. This is caused by the way that scaling a Kinesis Data Analytics for Java application works today. To scale a running application, the internal state of the application is persisted into a so-called savepoint. This savepoint is exposed as a snapshot by Kinesis Data Analytics for Java Applications. Subsequently, the running instance of the application is terminated, and a new instance of the same application with more resources and a higher parallelism is created. The new instance of the application then populates its internal state from the snapshot and resumes the processing from where the now terminated instance left off.

Accordingly, the scaling operation causes a brief interruption of the processing, which explains the spike in metric. However, this operation is transparent to the producers and consumers. Producers can continue to write to the Kinesis data stream because they are nicely decoupled from the application. Likewise, consumers can still use Kibana to view their dashboards, although they might not see the latest data because it hasn’t yet been processed.

Let’s step back for a moment and review what you just did: You created a fully managed, highly available, scalable streaming architecture. You ingested and analyzed up to 25k events per second. You doubled the throughput of the architecture by scaling the Kinesis data stream and the Kinesis Data Analytics for Java application with a couple of clicks. You did all this while the architecture remained fully functional and kept receiving and processing events, not losing a single event. You also could have scaled the Elasticsearch cluster as seamlessly as the other components. But we’ll leave that as an exercise for the interested reader.

Try to imagine what it would have taken you to build something similar from scratch.

Prepare Flink applications for Kinesis Data Analytics for Java Applications

Now that you have seen the streaming application in action, let’s look at what is required to deploy and run a Flink application with Kinesis Data Analytics for Java Applications.

Similar to other deployment methods, the Flink application is first built and packaged into a fat JAR, which contains all the necessary dependencies for the application to run. The resulting fat JAR is then uploaded to Amazon S3. The location of the fat JAR on S3 and some additional configuration parameters are then used to create an application that can be executed by Kinesis Data Analytics for Java Applications. So instead of logging in to a cluster and directly submitting a job to the Flink runtime, you upload the respective fat JAR to S3. You then create a Kinesis Data Analytics for Java application that you can interact with using API calls, the console, and the AWS CLI, respectively.
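
For illustration, that final step could be scripted with the AWS SDK for Python (boto3) roughly as follows. The application name, role ARN, bucket, and JAR key are placeholders, and the AWS CloudFormation template used in this post already performs the equivalent step for you.

import boto3

kda = boto3.client('kinesisanalyticsv2')

kda.create_application(
    ApplicationName='taxi-trip-analytics',                       # placeholder
    RuntimeEnvironment='FLINK-1_6',
    ServiceExecutionRole='arn:aws:iam::123456789012:role/kda-service-role',
    ApplicationConfiguration={
        'ApplicationCodeConfiguration': {
            # Location of the fat JAR that was uploaded to Amazon S3.
            'CodeContent': {
                'S3ContentLocation': {
                    'BucketARN': 'arn:aws:s3:::my-artifact-bucket',
                    'FileKey': 'flink-taxi-consumer-1.0.jar',
                }
            },
            'CodeContentType': 'ZIPFILE',
        },
    },
)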

Adapt the Flink configuration and runtime parameters

To obtain a valid Kinesis Data Analytics for Java application, the fat JAR of the Flink application must include certain dependencies. When you use Apache Maven to build your Flink application, you can simply add another dependency to the .pom file of your project.

<!-- pom.xml -->
<project>
    ...
    <dependencies>
        ...
        <dependency>
            <groupId>com.amazonaws</groupId>
            <artifactId>aws-kinesisanalytics-runtime</artifactId>
            <version>1.0.1</version>
        </dependency>
    </dependencies>
    ...
</project>

You can then specify parameters that are passed to the resulting Kinesis Data Analytics for Java application when it is created or updated. These parameters are basically key-value pairs that are contained in a property map that is part of a property group.

"ApplicationConfiguration": {
    "EnvironmentProperties": {
        "PropertyGroups": [
            {
                "PropertyGroupId": "FlinkApplicationProperties",
                "PropertyMap": {
                    "InputStreamName": "...",
                    ...
                }
            }
        ]
    },
    ...
}

You can then obtain the values of these parameters in the application code from the Kinesis Data Analytics for Java Applications runtime. For example, the following code snippet gets the name of the Kinesis data stream that the application should connect to from the FlinkApplicationProperties property group.

Map<String, Properties> applicationProperties = KinesisAnalyticsRuntime.getApplicationProperties();

Properties flinkProperties = applicationProperties.get("FlinkApplicationProperties");

String kinesisStreamName = flinkProperties.getProperty("InputStreamName");

You use the same mechanism to configure other properties for the Kinesis Data Analytics for Java application (for example, checkpointing and the parallelism of the application) that are usually specified as a parameter or configuration option directly to the Flink runtime.

"ApplicationConfiguration": {
    "FlinkApplicationConfiguration": {
        "CheckpointConfiguration": {
            "ConfigurationType": "DEFAULT"
        },
        "MonitoringConfiguration": {
            "ConfigurationType": "CUSTOM",
            "MetricsLevel": "TASK",
            "LogLevel": "INFO"
        },
        "ParallelismConfiguration": {
            "ConfigurationType": "DEFAULT"
        }
    },
    ...
}

With this configuration, the checkpointing and parallelism settings are left at their default. This enables checkpointing and auto scaling and sets the initial parallelism of the Kinesis Data Analytics for Java application to one. Moreover, the log level is increased to INFO and CloudWatch metrics are collected for every subtask of the application.

Build the Flink Kinesis Connector

When you are building a Flink application that reads data from a Kinesis data stream, you might notice that the Flink Kinesis Connector is not available from Maven central. You actually need to build it yourself. The following steps build the connector for any recent Apache Flink release. However, because Kinesis Data Analytics for Java Applications is based on Flink 1.6.2, you can use this specific version for now.

$ wget -qO- https://github.com/apache/flink/archive/release-1.6.2.zip | bsdtar -xf-

$ cd flink-release-1.6.2

$ mvn clean package -B -DskipTests -Dfast -Pinclude-kinesis -pl flink-connectors/flink-connector-kinesis

Note that the connector has already been built and stored on S3 by the AWS CloudFormation template. You can simply download the JAR file of the connector from there and put it in your local Maven repository using the following Maven command:

$ mvn install:install-file -Dfile=flink-connector-kinesis_2.11-1.6.2.jar -DpomFile=flink-connector-kinesis_2.11-1.6.2.pom.xml

Integrate the Flink Elasticsearch sink with Amazon Elasticsearch Service

Beginning with the 1.6 release, Apache Flink comes with an Elasticsearch connector that supports the Elasticsearch APIs over HTTP. Therefore, it can natively talk to the endpoints that are provided by Amazon Elasticsearch Service.

You just need to decide how to authenticate requests against the public endpoint of the Elasticsearch cluster. You can whitelist individual IPs for access to the cluster. However, the recommended way of authenticating against the Amazon Elasticsearch Service endpoint is to add authentication information to AWS requests using IAM credentials and the Signature Version 4 signing process.

To extend the Flink Elasticsearch connector, which is not aware of the AWS specific signing process, you can use the open-source aws-signing-request-interceptor, which is available from Maven central. You just need to add an interceptor to the Elasticsearch sink that is called just before the request is sent to the Amazon Elasticsearch Service endpoint. The interceptor can then sign the request using the permission of the role that has been configured for the Kinesis Data Analytics for Java application.

final List<HttpHost> httpHosts = Arrays.asList(HttpHost.create("https://..."));

ElasticsearchSink.Builder<T> esSinkBuilder = new ElasticsearchSink.Builder<>(
    httpHosts,
    new ElasticsearchSinkFunction<T>() {
      ...
    }
);

final Supplier<LocalDateTime> clock = () -> LocalDateTime.now(ZoneOffset.UTC);
final AWSCredentialsProvider credentialsProvider = new DefaultAWSCredentialsProviderChain();
final AWSSigner awsSigner = new AWSSigner(credentialsProvider, "eu-west-1", "es", clock);

esSinkBuilder.setRestClientFactory(
    restClientBuilder -> restClientBuilder.setHttpClientConfigCallback(
        callback -> callback.addInterceptorLast(new AWSSigningRequestInterceptor(awsSigner))
    )
);

esSinkBuilder.build();

Note that the actual code in the GitHub repository is a bit more sophisticated because you need to obtain a serializable request interceptor. But the basic approach to sign requests remains the same.

Monitor and debug the Flink application

When running a Kinesis Data Analytics for Java application, you don’t get direct access to the cluster that runs Flink. This is because the underlying infrastructure is completely managed by the service. You merely interact with the service through an API. However, you can still obtain metrics and logging information through CloudWatch and CloudWatch Logs, respectively.

The Kinesis Data Analytics for Java application exposes a lot of operational metrics, ranging from metrics for the entire application down to metrics for individual processes of operators of the application. You can control which level of detail is adequate or required for your purposes. In fact, the metrics used in the previous section were all obtained through CloudWatch.

In addition to operational metrics, you can configure the Kinesis Data Analytics for Java application to write messages to CloudWatch Logs. This capability seamlessly integrates with common logging frameworks, such as Apache Log4j and the Simple Logging Facade for Java (SLF4J). So it is useful for debugging and identifying the cause of operational issues.

To enable logging for your Kinesis Data Analytics for Java application, specify an existing CloudWatch log stream as a logging option when you start the application. In addition to the messages that the Flink runtime produces, you can then emit your own log messages from the application code through SLF4J, as follows:

final Logger LOG = LoggerFactory.getLogger(...);

LOG.info("Starting to consume events from stream {}", flinkProperties.getProperty("InputStreamName"));
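
If you manage the application through the AWS SDK rather than the console, attaching the CloudWatch logging option itself could look roughly like the following boto3 sketch; the application name and log stream ARN are placeholders.

import boto3

kda = boto3.client('kinesisanalyticsv2')

# The update call requires the current application version.
version = kda.describe_application(
    ApplicationName='taxi-trip-analytics'
)['ApplicationDetail']['ApplicationVersionId']

kda.add_application_cloud_watch_logging_option(
    ApplicationName='taxi-trip-analytics',
    CurrentApplicationVersionId=version,
    CloudWatchLoggingOption={
        'LogStreamARN': 'arn:aws:logs:eu-west-1:123456789012:log-group:kda-application-logs:log-stream:taxi-trip-analytics'
    },
)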

After the log messages are persisted into CloudWatch Logs, you can easily query and analyze them through CloudWatch Logs Insights.

Conclusion

In this post, you not only built a reliable, scalable, and highly available streaming application based on Apache Flink and Kinesis Data Analytics for Java Applications. You also scaled the different components while ingesting and analyzing up to 25k events per second in near-real time. In large parts, this scenario was enabled by using managed services, so you didn’t need to spend time on provisioning and configuring the underlying infrastructure.

The sources of the application and the AWS CloudFormation template used in this post are available from GitHub for your reference. You can dive into all the details of the Flink application and the configuration of the underlying services. I’m curious to see what you will build when you can focus on analyzing data in a streaming fashion rather than spending time on managing and operating infrastructure.

About the Author

Steffen Hausmann is a specialist solutions architect with AWS.


Improve clinical trial outcomes by using AWS technologies

Post Syndicated from Mayank Thakkar original https://aws.amazon.com/blogs/big-data/improve-clinical-trial-outcomes-by-using-aws-technologies/

We are living in a golden age of innovation, where personalized medicine is making it possible to cure diseases that we never thought curable. Digital medicine is helping people with diseases get healthier, and we are constantly discovering how to use the body’s immune system to target and eradicate cancer cells. According to a report published by ClinicalTrials.gov, the number of registered studies hit 293,000 in 2018, representing a 250x growth since 2000.

However, an internal rate of return (IRR) analysis by Endpoints News, using data from EvaluatePharma, highlights some interesting trends. A flourishing trend in pharma innovation is supported by strong growth in registered studies. However, the IRR shows a rapidly declining trend, from around 17 percent in 2000 to below the cost of capital in 2017 and projected to go to 0 percent by 2020.

This blog post is the first installment in a series that focuses on the end-to-end workflow of collecting, storing, processing, visualizing, and acting on clinical trials data in a compliant and secure manner. The series also discusses the application of artificial intelligence and machine learning technologies to the world of clinical trials. In this post, we highlight common architectural patterns that AWS customers use to modernize their clinical trials. These incorporate mobile technologies for better evidence generation, cost reduction, increasing quality, improving access, and making medicine more personalized for patients.

Improving the outcomes of clinical trials and reducing costs

Biotech and pharma organizations are feeling the pressure to use resources as efficiently as possible. This pressure forces them to search for any opportunity to streamline processes, get faster, and stay more secure, all while decreasing costs. More and more life sciences companies are targeting biologics, CAR-T, and precision medicine therapeutics, with focus shifting towards smaller, geographically distributed patient segments. This shift has resulted in an increasing mandate to capture data from previously unavailable, nontraditional sources. These sources include mobile devices, IoT devices, and in-home and clinical devices. Life sciences companies merge data from these sources with data from traditional clinical trials to build robust evidence around the safety and efficacy of a drug.

Early last year, the Clinical Trials Transformation Initiative (CTTI) provided recommendations about using mobile technologies for capturing holistic, high quality, attributable, real-world data from patients and for submission to the U.S. Food and Drug Administration (FDA). By using mobile technologies, life sciences companies can reduce barriers to trial participation and lower costs associated with conducting clinical trials. Global regulatory groups such as the FDA, Health Canada, and Medicines and Healthcare products Regulatory Agency (MHRA), among others, are also in favor of using mobile technologies. Mobile technologies can make patient recruitment more efficient, reach endpoints faster, and reduce the cost and time required to conduct clinical trials.

Improved data ingestion using mobile technologies can speed up outcomes, reduce costs, and improve the accuracy of clinical trials. This is especially true when mobile data ingestion is supplemented with artificial intelligence and machine learning (AI/ML) technologies.

Together, they can usher in a new age of smart clinical trials.

At the same time, traditional clinical trial processes and technology designed for mass-marketed blockbuster drugs can’t effectively meet emerging industry needs. This leaves life sciences and pharmaceutical companies in need of assistance for evolving their clinical trial operations. These circumstances result in making clinical trials one of the largest areas of investment for bringing a new drug to market.

Using mobile technologies with traditional technologies in clinical trials can improve the outcomes of the trials and simultaneously reduce costs. Some of the use cases that the integration of various technologies enables include these:

  • Identifying and tracking participants in clinical trials
    • Identifying participants for clinical trials recruitment
    • Educating and informing patients participating in clinical trials
    • Implementing standardized protocols and sharing associated information to trial participants
    • Tracking adverse events and safety profiles
  • Integrating genomic and phenotypic data for identifying novel biomarkers
  • Integrating mobile data into clinical trials for better clinical trial management
  • Creation of a patient-control arm based on historical data
  • Stratifying cohorts based on treatment, claims, and registry datasets
  • Building a collaborative, interoperable network for data sharing and knowledge creation
  • Building compliance-ready infrastructure for clinical trial management

The AWS Cloud provides HIPAA eligible services and solutions. As an AWS customer, you can use these to build solutions for global implementation of mobile devices and sensors in trials, secure capture of streaming Internet of Things (IoT) data, and advanced analytics through visualization tools or AI/ML capabilities. Some of the use cases these services and solutions enable are finding and recruiting patients using smart analytics, facilitating global data management, and remote or in-patient site monitoring. Others include predicting lack of adherence, detecting adverse events, and accelerating trial outcomes along with optimizing trial costs.

Clinical Trials 2.0 (CT2.0) at AWS is geared toward facilitating wider adoption of cloud-native services to enable data ingestion from disparate sources, cost-optimized and reliable storage, and holistic analytics. At the same time, CT2.0 provides the granular access control, end-to-end security, and global scalability needed to conduct clinical trials more efficiently.

Reference architecture

One of the typical architectures for managing a clinical trial using mobile technologies is shown following. This architecture focuses on capturing real-time data from mobile sources and providing a way to process it.

* – Additional considerations such as data security, access control, and compliance need to be incorporated into the architecture and are discussed in the remainder of this post.

Managing a trial by using this architecture consists of the following five major steps.

Step 1: Collect data

Mobile devices, personal wearables, instruments, and smart devices are being used extensively (or are being considered) by global pharmaceutical companies in patient care and clinical trials to provide data for activity tracking, vital signs monitoring, and so on, in real time. Devices like infusion pumps, personal-use dialysis machines, and so on require tracking and alerting on device consumables and calibration status. Remote settings management is also a major use case for these kinds of devices. The end-user mobile devices used in the clinical trial emit a lot of telemetry data that requires real-time data capture, data cleansing, transformation, and analysis.

Typically, these devices are connected to an edge node or a smart phone. Such a connection provides sufficient computing resources to stream data to AWS IoT Core. AWS IoT Core can then be configured to write data to Amazon Kinesis Data Firehose in near real time. Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3. S3 provides online, flexible, cost efficient, pay-as-you-go storage that can replicate your data on three Availability Zones within an AWS Region. The edge node or smart phone can use the AWS IoT SDKs to publish and subscribe device data to AWS IoT Core with MQTT. This process uses the AWS method of authentication (called ‘SigV4’), X.509 certificate–based authentication, and customer-created token-based authentication (through custom authorizers). This authenticated approach enables you to map your choice of policies to each certificate and remotely control device or application access. You can also use the Kinesis Data Firehose encryption feature to enable server-side data encryption.
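
To make the routing from AWS IoT Core to Kinesis Data Firehose concrete, the following boto3 sketch creates an IoT topic rule that forwards device telemetry to a delivery stream. The rule name, topic filter, role ARN, and stream name are all placeholders.

import boto3

iot = boto3.client('iot')

iot.create_topic_rule(
    ruleName='clinical_telemetry_to_firehose',
    topicRulePayload={
        # Match telemetry published by any device in the trial.
        'sql': "SELECT * FROM 'clinicaltrial/+/telemetry'",
        'actions': [{
            'firehose': {
                'roleArn': 'arn:aws:iam::123456789012:role/iot-to-firehose',
                'deliveryStreamName': 'clinical-telemetry-stream',
                'separator': '\n',
            }
        }],
        'ruleDisabled': False,
    },
)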

You can also capture additional data such as Case Report Forms (CRF), Electronic Medical Records (EMR), and medical images using Picture Archiving and Communication Systems (PACS). In addition, you can capture laboratory data (Labs) and other Patient Reported Outcomes data (ePRO). AWS provides multiple tools and services to effectively and securely connect to these data sources, enabling you to ingest data in various volumes, variety, and velocities. For more information about creating a HealthCare Data Hub and ingesting Digital Imaging and Communications in Medicine (DICOM) data, see the AWS Big Data Blog post Create a Healthcare Data Hub with AWS and Mirth Connect.

Step 2: Store data

After data is ingested from the devices and wearables used in the clinical trial, Kinesis Data Firehose is used to store the data on Amazon S3. This stored data serves as a raw copy and can later be used for historical analysis and pattern prediction. Using Amazon S3 lifecycle policies, you can periodically move your data to lower-cost storage such as Amazon S3 Glacier to further optimize your storage costs. Amazon S3 Intelligent-Tiering can automatically optimize costs when data access patterns change, without performance impact or operational overhead, by moving data between two access tiers—frequent access and infrequent access. You can also choose to encrypt data at rest and in motion using the various encryption options available on S3.
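
As an example of such a lifecycle policy, the following boto3 sketch archives objects under a hypothetical raw/ prefix after 90 days; the bucket name, prefix, and retention period are placeholders to adapt to your own retention requirements.

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='clinical-trial-raw-data',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-raw-telemetry',
            'Filter': {'Prefix': 'raw/'},
            'Status': 'Enabled',
            # Transition raw telemetry to Amazon S3 Glacier after 90 days.
            'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
        }]
    },
)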

Amazon S3 offers an extremely durable, highly available, and infinitely scalable data storage infrastructure, simplifying most data processing, backup, and replication tasks.

Step 3: Data processing—fast lane

After collecting and storing a raw copy of the data, Amazon S3 is configured to publish events to AWS Lambda and invoke a Lambda function by passing the event data as a parameter. The Lambda function is used to extract the key performance indicators (KPIs) such as adverse event notifications, medication adherence, and treatment schedule management from the incoming data. You can use Lambda to process these KPIs and store them in Amazon DynamoDB, along with encryption at rest, which powers a near-real-time clinical trial status dashboard. This alerts clinical trial coordinators in real time so that appropriate interventions can take place.
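
A highly simplified sketch of such a Lambda function follows. The DynamoDB table name and the telemetry field names are hypothetical and would need to be mapped to your actual device schema.

import json

import boto3

s3 = boto3.client('s3')
kpi_table = boto3.resource('dynamodb').Table('TrialKpis')   # placeholder table name

def lambda_handler(event, context):
    # Triggered by S3 object-created events on the raw data bucket.
    for entry in event['Records']:
        bucket = entry['s3']['bucket']['name']
        key = entry['s3']['object']['key']
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        for line in body.splitlines():
            record = json.loads(line)
            # Store a small set of KPIs that power the near-real-time dashboard.
            kpi_table.put_item(Item={
                'patient_id': record['patient_id'],
                'event_time': record['event_time'],
                'adverse_event': record.get('adverse_event', False),
                'medication_taken': record.get('medication_taken', True),
            })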

In addition to this, using a data warehouse full of medical records, you can train and implement a machine learning model. This model can predict which patients are about to switch medications or might exhibit adherence challenges in the future. Such prediction can enable clinical trial coordinators to narrow in on those patients with mitigation strategies.

Step 4: Data processing—batch

For historical analysis and pattern prediction, the staged data (stored in S3) is processed in batches. A Lambda function is used to trigger the extract, transform, and load (ETL) process every time new data is added to the raw data S3 bucket. This Lambda function triggers an ETL process using AWS Glue, a fully managed ETL service that makes it easy for you to prepare and load your data for analytics. This approach helps in mining current and historical data to derive actionable insights, which is stored on Amazon S3.
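
A sketch of that trigger is shown below; the Glue job name and the job arguments are placeholders.

import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Triggered when new objects land in the raw data S3 bucket.
    for entry in event['Records']:
        glue.start_job_run(
            JobName='clinical-trial-etl',
            Arguments={
                '--source_bucket': entry['s3']['bucket']['name'],
                '--source_key': entry['s3']['object']['key'],
            },
        )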

From there, data is loaded on to Amazon Redshift, a cost-effective, petabyte-scale data warehouse offering from AWS. You can also use Amazon Redshift Spectrum to extend data warehousing out to exabytes without loading any data to Amazon Redshift, as detailed in the Big Data blog post Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required. This enables you to provide an all-encompassing picture of the entire clinical trial to your clinical trial coordinators, enabling you to react and respond faster.

In addition to this, you can train and implement a machine learning model to identify patients who might be at risk for adherence challenges. This enables clinical trial coordinators to reinforce patient education and support.

Step 5: Visualize and act on data

After the data is processed and ready to be consumed, you can use Amazon QuickSight, a cloud-native business intelligence service from AWS that offers native Amazon Redshift connectivity. Amazon QuickSight is serverless and so can be rolled out to your audiences in hours. You can also use a host of third-party reporting tools, which can use AWS-supplied JDBC or ODBC drivers or open-source PostgreSQL drivers to connect with Amazon Redshift. These tools include TIBCO Spotfire Analytics, Tableau Server, Qlik Sense Enterprise, Looker, and others. Real-time data processing (step 3 preceding) combines with historical-view batch processing (step 4). Together, they empower contract research organizations (CROs), study managers, trial coordinators, and other entities involved in the clinical trial journey to make effective and informed decisions at a speed and frequency that was previously unavailable. Using Amazon QuickSight’s unique Pay-per-Session pricing model, you can optimize costs for your bursty usage models by paying only when users access the dashboards.

Using Amazon Simple Notification Service (Amazon SNS), real-time feedback based on incoming data and telemetry is sent to patients by using text messages, mobile push, and emails. In addition, study managers and coordinators can send Amazon SNS notifications to patients. Amazon SNS provides a fully managed pub/sub messaging for micro services, distributed systems, and serverless applications. It’s designed for high-throughput, push-based, many-to-many messaging. Alerts and notifications can be based on current data or a combination of current and historical data.
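
For illustration, publishing such a notification with boto3 could look like the following; the topic ARN, phone number, and message text are placeholders.

import boto3

sns = boto3.client('sns')

# Notify all subscribers of a trial-wide topic.
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:trial-participant-alerts',
    Subject='Medication reminder',
    Message='Please remember to take your 8:00 PM study medication.',
)

# Or send a direct SMS to a single participant.
sns.publish(
    PhoneNumber='+15555550100',
    Message='Your next site visit is scheduled for Friday at 10:00 AM.',
)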

To encrypt messages published to Amazon SNS, you can follow the steps listed in the post Encrypting messages published to Amazon SNS with AWS KMS, on the AWS Compute Blog.   

Data security, data privacy, data integrity, and compliance considerations

At AWS, customer trust is our top priority. We deliver services to millions of active customers, including enterprises, educational institutions, and government agencies in over 190 countries. Our customers include financial services providers, healthcare providers, and governmental agencies, who trust us with some of their most sensitive information.

To facilitate this, along with the services mentioned earlier, you should also use AWS Identity and Access Management (IAM) service. IAM enables you to maintain segregation of access, fine-grained access control, and securing end user mobile and web applications. You can also use AWS Security Token Service (AWS STS) to provide secure, self-expiring, time-boxed, temporary security credentials to third-party administrators and service providers, greatly strengthening your security posture. You can use AWS CloudTrail to log IAM and STS API calls. Additionally, AWS IoT Device Management makes it easy to securely onboard, organize, monitor, and remotely manage IoT devices at scale.

With AWS, you can add an additional layer of security to your data at rest in the cloud. AWS provides scalable and efficient encryption features for services like Amazon EBS, Amazon S3, Amazon Redshift, Amazon SNS, AWS Glue, and many more. Flexible key management options, including AWS Key Management Service, enable you to choose whether to have AWS manage the encryption keys or to keep complete control over your own keys. In addition, AWS provides APIs for you to integrate encryption and data protection with any of the services that you develop or deploy in an AWS environment.

As a customer, you maintain ownership of your data, and select which AWS services can process, store, and host the content. Generally speaking, AWS doesn’t access or use customers’ content for any purpose without their consent. AWS never uses customer data to derive information for marketing or advertising.

When evaluating the security of a cloud solution, it’s important that you understand and distinguish between the security of the cloud and security in the cloud. The AWS Shared Responsibility Model details this relationship.

To assist you with your compliance efforts, AWS continues to add more services to the various compliance regulations, attestations, certifications, and programs across the world. To decide which services are suitable for you, see the services in scope page.

You can also use various services like, but not limited to, AWS CloudTrail, AWS Config, Amazon GuardDuty, and AWS Key Management Service (AWS KMS) to enhance your compliance and auditing efforts. Find more details in the AWS Compliance Solutions Guide.

Final thoughts

With the ever-growing interconnectivity and technological advances in the field of medical devices, mobile devices and sensors can improve numerous aspects of clinical trials. They can help in recruitment, informed consent, patient counseling, and patient communication management. They can also improve protocol and medication adherence, clinical endpoints measurement, and the process of alerting participants on adverse events. Smart sensors, smart mobile devices, and robust interconnecting systems can be central in conducting clinical trials.

Every biopharma organization conducting or sponsoring a clinical trial activity faces the conundrum of advancing their approach to trials while maintaining overall trial performance and data consistency. The AWS Cloud enables a new dimension for how data is collected, stored, and used for clinical trials. It thus addresses that conundrum as we march towards a new reality of how drugs are brought to market. The AWS Cloud abstracts away technical challenges such as scaling, security, and establishing a cost-efficient IT infrastructure. In doing so, it allows biopharma organizations to focus on their core mission of improving patient lives through the development of effective, groundbreaking treatments.

About the Author

Mayank Thakkar – Global Solutions Architect, AWS HealthCare and Life Sciences


Deven Atnoor, Ph.D. – Industry Specialist, AWS HealthCare and Life Sciences


Best practices for successfully managing memory for Apache Spark applications on Amazon EMR

Post Syndicated from Karunanithi Shanmugam original https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

In the world of big data, a common use case is performing extract and transform (ET) operations and data analytics on huge amounts of data from a variety of data sources. Often, you then analyze the data to get insights. One of the most popular cloud-based solutions to process such vast amounts of data is Amazon EMR.

Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. Amazon EMR enables organizations to spin up a cluster with multiple instances in a matter of minutes. It also enables you to process various data engineering and business intelligence workloads through parallel processing. By doing this, you can greatly reduce the data processing times, effort, and costs involved in establishing and scaling a cluster.

Apache Spark is a cluster-computing software framework that is open-source, fast, and general-purpose. It is widely used in distributed processing of big data. Apache Spark relies heavily on cluster memory (RAM) as it performs parallel computing in memory across nodes to reduce the I/O and execution times of tasks.

Generally, you perform the following steps when running a Spark application on Amazon EMR:

  1. Upload the Spark application package to Amazon S3.
  2. Configure and launch the Amazon EMR cluster with configured Apache Spark.
  3. Install the application package from Amazon S3 onto the cluster and then run the application.
  4. Terminate the cluster after the application is completed.

It’s important to configure the Spark application appropriately based on data and processing requirements for it to be successful. With default settings, Spark might not use all the available resources of the cluster and might end up with physical or virtual memory issues, or both. Thousands of questions have been raised on stackoverflow.com about this specific topic.

This blog post is intended to assist you by detailing best practices to prevent memory-related issues with Apache Spark on Amazon EMR.

Common memory issues in Spark applications with default or improper configurations

Listed following are a few sample out-of-memory errors that can occur in a Spark application with default or improper configurations.

Out of Memory Error, Java Heap Space

WARN TaskSetManager: Loss was due to 
java.lang.OutOfMemoryError
java.lang.OutOfMemoryError: Java heap space

Out of Memory Error, Exceeding Physical Memory

Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding limits.
12.4 GB of 12.3 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead.
Error: ExecutorLostFailure Reason: Container killed by YARN for exceeding limits.
4.5GB of 3GB physical memory used limits.
Consider boosting spark.yarn.executor.memoryOverhead.

Out of Memory Error, Exceeding Virtual Memory

Container killed by YARN for exceeding memory limits.
1.1gb of 1.0gb virtual memory used. Killing container.

Out of Memory Error, Exceeding Executor Memory

Required executor memory (1024+384 MB) is above 
the max threshold (896 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb'
and/or 'yarn.nodemanager.resource.memory-mb'.

These issues occur for various reasons, some of which are listed following:

  1. When the number of Spark executor instances, the amount of executor memory, the number of cores, or parallelism is not set appropriately to handle large volumes of data.
  2. When the Spark executor’s physical memory exceeds the memory allocated by YARN. In this case, the total of Spark executor instance memory plus memory overhead is not enough to handle memory-intensive operations. Memory-intensive operations include caching, shuffling, and aggregating (using reduceByKey, groupBy, and so on). Or, in some cases, the total of Spark executor instance memory plus memory overhead can be more than what is defined in yarn.scheduler.maximum-allocation-mb.
  3. The memory required to perform system operations such as garbage collection is not available in the Spark executor instance.

In the following sections, I discuss how to properly configure to prevent out-of-memory issues, including but not limited to those preceding.

Configuring for a successful Spark application on Amazon EMR

The following steps can help you configure a successful Spark application on Amazon EMR.

1. Determine the type and number of instances based on application needs

Amazon EMR has three types of nodes:

  1. Master: An EMR cluster has one master, which acts as the resource manager and manages the cluster and tasks.
  2. Core: The core nodes are managed by the master node. Core nodes run YARN NodeManager daemons, Hadoop MapReduce tasks, and Spark executors to manage storage, execute tasks, and send a heartbeat to the master.
  3. Task: The optional task-only nodes perform tasks and don’t store any data, in contrast to core nodes.

Best practice 1: Choose the right type of instance for each of the node types in an Amazon EMR cluster. Doing this is one key to success in running any Spark application on Amazon EMR.

There are numerous instance types offered by AWS with varying ranges of vCPUs, storage, and memory, as described in the Amazon EMR documentation. Based on whether an application is compute-intensive or memory-intensive, you can choose the right instance type with the right compute and memory configuration.

For memory-intensive applications, prefer R type instances over the other instance types. For compute-intensive applications, prefer C type instances. For applications balanced between memory and compute, prefer M type general-purpose instances.

To understand the possible use cases for each instance type offered by AWS, see Amazon EC2 Instance Types on the EC2 service website.

After deciding the instance type, determine the number of instances for each of the node types. You do this based on the size of the input datasets, application execution times, and frequency requirements.

2. Determine the Spark configuration parameters

Before we dive into the details on Spark configuration, let’s get an overview of how the executor container memory is organized using the diagram following.

As the preceding diagram shows, the executor container has multiple memory compartments. Of these, only one (execution memory) is actually used for executing the tasks. These compartments should be properly configured for running the tasks efficiently and without failure.

Calculate and set the following Spark configuration parameters carefully for the Spark application to run successfully:

  • spark.executor.memory – Size of memory to use for each executor that runs the task.
  • spark.executor.cores – Number of virtual cores.
  • spark.driver.memory – Size of memory to use for the driver.
  • spark.driver.cores – Number of virtual cores to use for the driver.
  • spark.executor.instances – Number of executors. Set this parameter unless spark.dynamicAllocation.enabled is set to true.
  • spark.default.parallelism – Default number of partitions in resilient distributed datasets (RDDs) returned by transformations like join, reduceByKey, and parallelize when no partition number is set by the user.

Amazon EMR provides high-level information on how it sets the default values for Spark parameters in the release guide. These values are automatically set in the spark-defaults settings based on the core and task instance types in the cluster.

To use all the resources available in a cluster, set the maximizeResourceAllocation parameter to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets these parameters in the spark-defaults settings. Even with this setting, generally the default numbers are low and the application doesn’t use the full strength of the cluster. For example, the default for spark.default.parallelism is only 2 x the number of virtual cores available, though parallelism can be higher for a large cluster.

Spark on YARN can dynamically scale the number of executors used for a Spark application based on the workloads. Using Amazon EMR release version 4.4.0 and later, dynamic allocation is enabled by default (as described in the Spark documentation).

The problem with the spark.dynamicAllocation.enabled property is that it requires you to set subproperties, such as spark.dynamicAllocation.initialExecutors, minExecutors, and maxExecutors. In most cases, these subproperties are required to use the right number of executors in a cluster for an application, especially when you need multiple applications to run simultaneously. Setting them requires a lot of trial and error to get the numbers right; if they’re not right, capacity might be reserved but never actually used, wasting resources or causing memory errors for other applications.

Best practice 2: Set spark.dynamicAllocation.enabled to true only if the numbers are properly determined for spark.dynamicAllocation.initialExecutors/minExecutors/maxExecutors parameters. Otherwise, set spark.dynamicAllocation.enabled to false and control the driver memory, executor memory, and CPU parameters yourself. To do this, calculate and set these properties manually for each application (see the example following).

Let’s assume that we are going to process 200 terabytes of data spread across thousands of files stored in Amazon S3. Further, let’s assume that we do this through an Amazon EMR cluster with 1 r5.12xlarge master node and 19 r5.12xlarge core nodes. Each r5.12xlarge instance has 48 virtual cores (vCPUs) and 384 GB RAM. All these calculations are for the --deploy-mode cluster, which we recommend for production use.

The following list describes how to set some important Spark properties, using the preceding case as an example.

spark.executor.cores

Assigning executors with a large number of virtual cores leads to a low number of executors and reduced parallelism. Assigning a low number of virtual cores leads to a high number of executors, causing a larger amount of I/O operations. Based on historical data, we suggest that you have five virtual cores for each executor to achieve optimal results in any sized cluster.

For the preceding cluster, the property spark.executor.cores should be assigned as follows: spark.executor.cores = 5 (vCPU)

spark.executor.memory

After you decide on the number of virtual cores per executor, calculating this property is much simpler. First, get the number of executors per instance by dividing the instance’s virtual cores by the number of virtual cores per executor. Subtract one virtual core from the total to reserve it for the Hadoop daemons.

Number of executors per instance = (total number of virtual cores per instance - 1) / spark.executor.cores

Number of executors per instance = (48 - 1)/ 5 = 47 / 5 = 9 (rounded down)

Then, get the total executor memory by using the total RAM per instance and number of executors per instance. Leave 1 GB for the Hadoop daemons.

Total executor memory = (total RAM per instance - 1) / number of executors per instance
Total executor memory = (384 - 1) / 9 = 383 / 9 = 42 (rounded down)

This total executor memory includes the executor memory and overhead (spark.yarn.executor.memoryOverhead). Assign 10 percent from this total executor memory to the memory overhead and the remaining 90 percent to executor memory.

spark.executor.memory = total executor memory * 0.90
spark.executor.memory = 42 * 0.9 = 37 (rounded down)

spark.yarn.executor.memoryOverhead = total executor memory * 0.10
spark.yarn.executor.memoryOverhead = 42 * 0.1 = 5 (rounded up)

spark.driver.memory

We recommend setting this to equal spark.executor.memory.

spark.driver.memory = spark.executor.memory

spark.driver.cores

We recommend setting this to equal spark.executor.cores.

spark.driver.cores = spark.executor.cores

spark.executor.instances

Calculate this by multiplying the number of executors per instance by the number of core instances, and then subtracting one executor for the driver.

spark.executor.instances = (number of executors per instance * number of core instances) minus 1 for the driver

spark.executor.instances = (9 * 19) - 1 = 170

spark.default.parallelism

Set this property using the following formula.

spark.default.parallelism = spark.executor.instances * spark.executor.cores * 2

spark.default.parallelism = 170 * 5 * 2 = 1,700

Warning: Although this calculation gives partitions of 1,700, we recommend that you estimate the size of each partition and adjust this number accordingly by using coalesce or repartition.

For DataFrames, configure the parameter spark.sql.shuffle.partitions along with spark.default.parallelism.
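
To make the preceding arithmetic concrete, the following minimal Python sketch reproduces the calculations for any instance shape. The 10 percent overhead ratio and the one virtual core and 1 GB reserved for the Hadoop daemons are the assumptions stated above.

def spark_settings(vcpus_per_instance, ram_gb_per_instance, core_instances,
                   cores_per_executor=5, overhead_fraction=0.10):
    """Reproduce the sizing arithmetic above (assumptions: 1 vCPU and 1 GB
    per instance reserved for Hadoop daemons, 10% memory overhead)."""
    executors_per_instance = (vcpus_per_instance - 1) // cores_per_executor
    total_executor_memory = (ram_gb_per_instance - 1) // executors_per_instance
    executor_memory = int(total_executor_memory * (1 - overhead_fraction))
    memory_overhead = total_executor_memory - executor_memory
    executor_instances = executors_per_instance * core_instances - 1  # one slot for the driver
    parallelism = executor_instances * cores_per_executor * 2
    return {
        "spark.executor.cores": cores_per_executor,
        "spark.executor.memory": f"{executor_memory}G",
        "spark.yarn.executor.memoryOverhead": f"{memory_overhead}G",
        "spark.executor.instances": executor_instances,
        "spark.default.parallelism": parallelism,
    }

# 19 r5.12xlarge core nodes: 48 vCPUs and 384 GB RAM each
print(spark_settings(48, 384, 19))
# {'spark.executor.cores': 5, 'spark.executor.memory': '37G',
#  'spark.yarn.executor.memoryOverhead': '5G',
#  'spark.executor.instances': 170, 'spark.default.parallelism': 1700}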

Though the preceding parameters are critical for any Spark application, the following parameters also help in running the applications smoothly to avoid other timeout and memory-related errors. We advise that you set these in the spark-defaults configuration file.

  • spark.network.timeout – Timeout for all network transactions.
  • spark.executor.heartbeatInterval – Interval between each executor’s heartbeats to the driver. This value should be significantly less than spark.network.timeout.
  • spark.memory.fraction – Fraction of JVM heap space used for Spark execution and storage. The lower this is, the more frequently spills and cached data eviction occur.
  • spark.memory.storageFraction – Expressed as a fraction of the size of the region set aside by spark.memory.fraction. The higher this is, the less working memory might be available to execution. This means that tasks might spill to disk more often.
  • spark.yarn.scheduler.reporterThread.maxFailures – Maximum number of executor failures allowed before YARN can fail the application.
  • spark.rdd.compress – When set to true, this property can save substantial space at the cost of some extra CPU time by compressing the RDDs.
  • spark.shuffle.compress – When set to true, this property compresses the map output to save space.
  • spark.shuffle.spill.compress – When set to true, this property compresses the data spilled during shuffles.
  • spark.sql.shuffle.partitions – Sets the number of partitions for joins and aggregations.
  • spark.serializer – Sets the serializer to serialize or deserialize data. As a serializer, I prefer Kryo (org.apache.spark.serializer.KryoSerializer), which is faster and more compact than the default Java serializer.

To understand more about each of the parameters mentioned preceding, see the Spark documentation.

We recommend you consider these additional programming techniques for efficient Spark processing:

  • coalesce – Reduces the number of partitions to allow for less data movement.
  • repartition – Reduces or increases the number of partitions and performs a full shuffle of data, as opposed to coalesce.
  • partitionBy – Distributes data horizontally across partitions.
  • bucketBy – Decomposes data into more manageable parts (buckets) based on hashed columns.
  • cache/persist – Pulls datasets into a clusterwide in-memory cache. Doing this is useful when data is accessed repeatedly, such as when querying a small lookup dataset or when running an iterative algorithm.
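
To illustrate a few of these techniques together, here is a brief PySpark sketch; the S3 paths and column names are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TechniquesDemo").getOrCreate()
df = spark.read.parquet("s3://my-bucket/input/")  # illustrative path

# coalesce: shrink the partition count without a full shuffle
recent = df.filter(df["year"] == 2019).coalesce(50)

# repartition: change the partition count (and optionally the key) with a full shuffle
evened = df.repartition(1700, "customer_id")

# cache: keep a repeatedly used dataset in cluster memory
lookup = spark.read.parquet("s3://my-bucket/lookup/").cache()
joined = evened.join(lookup, "customer_id")

# partitionBy at write time distributes data horizontally across directories
joined.write.partitionBy("year").parquet("s3://my-bucket/output/")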

Best practice 3: Carefully calculate the preceding additional properties based on application requirements. Set these properties appropriately in spark-defaults, when submitting a Spark application (spark-submit), or within a SparkConf object.
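
For example, if you build the SparkSession yourself in PySpark, the same properties can be set on a SparkConf object. This sketch simply carries over the values calculated earlier for the example cluster; spark-defaults and spark-submit flags work equally well.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Values calculated earlier for the 19-node r5.12xlarge example cluster.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "false")
        .set("spark.executor.cores", "5")
        .set("spark.executor.memory", "37G")
        .set("spark.yarn.executor.memoryOverhead", "5G")
        .set("spark.executor.instances", "170")
        .set("spark.default.parallelism", "1700"))

spark = SparkSession.builder.appName("SizedApp").config(conf=conf).getOrCreate()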

3. Implement a proper garbage collector to clear memory effectively

Garbage collection can lead to out-of-memory errors in certain cases. These include cases when there are multiple large RDDs in the application. Other cases occur when there is an interference between the task execution memory and RDD cached memory.

You can use multiple garbage collectors to evict old objects and place new ones into memory. However, the latest Garbage First Garbage Collector (G1GC) overcomes the latency and throughput limitations of the older garbage collectors.

Best practice 4: Always set up a garbage collector when handling large volume of data through Spark.

The parameter -XX:+UseG1GC specifies that the G1GC garbage collector should be used. (The default is -XX:+UseParallelGC.) To understand the frequency and execution time of the garbage collection, use the parameters -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps. To initiate garbage collection sooner, set InitiatingHeapOccupancyPercent to 35 (the default is 45). Doing this helps avoid a full garbage collection over the total heap, which can take a significant amount of time. An example follows.

"spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
"spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",

4. Set the YARN configuration parameters

Even if all the Spark configuration properties are calculated and set correctly, virtual out-of-memory errors can still occur occasionally because the OS bumps up virtual memory aggressively. To prevent these application failures, set the following flags in the YARN site settings.

Best practice 5: Always set the virtual and physical memory check flag to false.

"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"

5. Perform debugging and monitoring

To get details on where the Spark configuration options are coming from, you can run spark-submit with the --verbose option. You can also use Ganglia and the Spark UI to monitor application progress, cluster RAM usage, network I/O, and so on.

In the following example, we compare the outcomes between configured and non-configured Spark applications using Ganglia graphs.

When configured following the methods described, a Spark application can process 10 TB data successfully without any memory issues on an Amazon EMR cluster whose specs are as follows:

  • 1 r5.12xlarge master node
  • 19 r5.12xlarge core nodes
  • 8 TB total RAM
  • 960 total virtual CPUs
  • 170 executor instances
  • 5 virtual CPUs/executor
  • 37 GB memory/executor
  • Parallelism equals 1,700

Following, you can find Ganglia graphs for reference.

If you run the same Spark application with default configurations on the same cluster, it fails with an out-of-physical-memory error. This is because the default configurations (two executor instances, parallelism of 2, one vCPU/executor, 8-GB memory/executor) aren’t enough to process 10 TB data. Though the cluster had 7.8 TB memory, the default configurations limited the application to use only 16 GB memory, leading to the following out-of-memory error.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 8, executor 7): ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.5 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

Also, for large datasets, the default garbage collectors don’t clear the memory efficiently enough for the tasks to run in parallel, causing frequent failures. The following charts help in comparing the RAM usage and garbage collection with the default and G1GC garbage collectors. With G1GC, the RAM used is maintained below 5 TB (see the blue area in the graph).

With the default garbage collector (CMS), the RAM used goes above 5 TB. This can lead to the failure of the Spark job when running many tasks continuously.

Example: EMR instance template with configuration

There are different ways to set the Spark and YARN configuration parameters. One way is to pass them when creating the EMR cluster.

To do this, in the Amazon EMR console’s Edit software settings section, you can enter the appropriately updated configuration template (Enter configuration). Or the configuration can be passed from S3 (Load JSON from S3).

Following is a configuration template with sample values; treat the numbers as placeholders and substitute the values you calculated for your own cluster, as described earlier. At a minimum, calculate and set the following parameters for a successful Spark application.

{
      "InstanceGroups":[
         {
            "Name":"AmazonEMRMaster",
            "Market":"ON_DEMAND",
            "InstanceRole":"MASTER",
            "InstanceType":"r5.12xlarge",
            "InstanceCount":1,
            "Configurations":[
               {
                 "Classification": "yarn-site",
                 "Properties": {
                   "yarn.nodemanager.vmem-check-enabled": "false",
                   "yarn.nodemanager.pmem-check-enabled": "false"
                 }
               },
               {
                 "Classification": "spark",
                 "Properties": {
                   "maximizeResourceAllocation": "false"
                 }
               },
               {
                 "Classification": "spark-defaults",
                 "Properties": {
                   "spark.network.timeout": "800s",
                   "spark.executor.heartbeatInterval": "60s",
                   "spark.dynamicAllocation.enabled": "false",
                   "spark.driver.memory": "21000M",
                   "spark.executor.memory": "21000M",
                   "spark.executor.cores": "5",
                   "spark.executor.instances": "171",
                   "spark.yarn.executor.memoryOverhead": "21000M",
                   "spark.yarn.driver.memoryOverhead": "21000M",
                   "spark.memory.fraction": "0.80",
                   "spark.memory.storageFraction": "0.30",
                   "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
                   "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
                   "spark.yarn.scheduler.reporterThread.maxFailures": "5",
                   "spark.storage.level": "MEMORY_AND_DISK_SER",
                   "spark.rdd.compress": "true",
                   "spark.shuffle.compress": "true",
                   "spark.shuffle.spill.compress": "true",
                   "spark.default.parallelism": "3400"
                 }
               },
               {
                 "Classification": "mapred-site",
                 "Properties": {
                   "mapreduce.map.output.compress": "true"
                 }
               },
               {
                 "Classification": "hadoop-env",
                 "Configurations": [{
                   "Classification": "export",
                   "Configurations": [],
                   "Properties": {
                     "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                   }
                 }],
                 "Properties": {}
               },
               {
                 "Classification": "spark-env",
                 "Configurations": [{
                   "Classification": "export",
                   "Properties": {
                     "JAVA_HOME": "/usr/lib/jvm/java-1.8.0"
                   }
                 }],
                 "Properties": {}
               }
            ]
        },
        {
            "Name":"AmazonEMRCore",
            "Market":"ON_DEMAND",
             "InstanceRole":"CORE",
             "InstanceType":"r5.12xlarge",
             "InstanceCount":19,
             "Configurations":[        
        ..............
        ..............
        ..............
        }
      ],
      "Ec2KeyName":"KEY_NAME"
  } 
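
If you create clusters programmatically rather than through the console, the same JSON can be passed in code. The following boto3 sketch assumes the template shown above is saved locally; the region, EMR release label, file path, and roles are illustrative placeholders, not values prescribed by this post.

import json

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # illustrative region

# Load the configuration template shown above (path is illustrative).
with open("emr-spark-config.json") as f:
    template = json.load(f)

response = emr.run_job_flow(
    Name="spark-memory-tuned-cluster",
    ReleaseLabel="emr-5.26.0",  # illustrative EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": template["InstanceGroups"],
        "Ec2KeyName": template["Ec2KeyName"],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])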

Conclusion

In this blog post, I detailed the possible out-of-memory errors, their causes, and a list of best practices to prevent these errors when submitting a Spark application on Amazon EMR.

My colleagues and I formed these best practices after thorough research and understanding of various Spark configuration properties and testing multiple Spark applications. These best practices apply to most out-of-memory scenarios, though there might be some rare scenarios where they don’t. However, we believe that this blog post provides all the details you need to tweak parameters and successfully run a Spark application.



About the Author

Karunanithi Shanmugam is a data engineer with AWS Tech and Finance.

Federate Amazon Redshift access with Okta as an identity provider

Post Syndicated from Rajiv Gupta original https://aws.amazon.com/blogs/big-data/federate-amazon-redshift-access-with-okta-as-an-identity-provider/

Managing database users and access can be a daunting and error-prone task. In the past, database administrators had to determine which groups a user belongs to and which objects a user/group is authorized to use. These lists were maintained within the database and could easily get disjointed from the corporate directory.

With federation, you can manage users and groups within the enterprise identity provider (IdP) and pass them to Amazon Redshift at login. In a previous post, Federate Database User Authentication Easily with IAM and Amazon Redshift, I discussed the internals of the federation workflow using Active Directory Federation Service (AD FS) as our identity provider.

In this post, I focus on Okta as the identity provider. I provide step-by-step guidance showing how you can set up a trial Okta.com account, build users and groups within your organization’s directory, and enable single sign-on (SSO) into Amazon Redshift. You can do all of this while also maintaining group-level access controls within your data warehouse.

The steps in this post are structured into the following sections:

  • Identity provider (Okta) configuration – You set up Okta, which contains your users organized into logical groups.
  • AWS configuration – You set up a role that establishes a trust relationship between your identity provider and AWS and a role that Okta uses to access Amazon Redshift.
  • Identity provider (Okta) advanced configuration – You finalize the Okta configuration by inputting the roles that you just created. You also inform Okta about which groups are allowed to be passed to Amazon Redshift.
  • Amazon Redshift server/client setup – You set up groups within the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Identity provider (Okta) configuration

In this first step, you set up Okta, add users, and organize them into logical groups. You then add the Amazon Web Services Redshift Okta application.

Step 1: Create an Okta account

If you don’t already have access to an Okta account, you can start a 30-day free trial: https://www.okta.com/free-trial/.

Step 2: Set up your Okta directory

Sign in to Okta.com using the following URL, where <prefix> is specific to your account and was created at account setup:

https://<prefix>-admin.okta.com/admin/dashboard

Navigate to the Directory page to add people and groups into Okta that match your organization. Be sure to use lowercase group names because Amazon Redshift expects the group names to be lowercase.

In the following example, I added three users and two groups, where one of the users (Jorge) belongs to both the “sales” and “marketing” groups.

First, choose Admin in the upper-right corner.

To add users, choose Add Person. The following example shows the users that were created.

To add groups into Okta, choose Add Group. The following example shows three groups.

Step 3: Add the “Amazon Web Services Redshift” Okta application

Navigate to the Applications page. Choose Add Application, and search for the Amazon Web Services Redshift application. Proceed with the default settings.

Step 4: Download the Okta application metadata

Make sure that you have navigated to the Amazon Web Services Redshift application’s settings page, which appears as follows.

Choose Sign On, and then choose the Identity Provider metadata link to download the metadata file in xml format (for example, metadata.xml).

AWS configuration

Next, you set up a role that establishes a trust relationship between the identity provider and AWS. You also create a role that Okta uses to access Amazon Redshift.

Step 5: Create the SAML IAM identity provider

Switch to the AWS Management Console and sign in using your AWS credentials. Then open the AWS Identity and Access Management (IAM) console.

On the IAM console, choose Identity providers, and then choose Create Provider, as shown following.

Provide a name for your IdP, and upload the metadata file that you downloaded in the previous step.

Step 6: Create the IAM SAML 2.0 federation role

On the IAM console, navigate to Roles and create a new SAML 2.0 federation role. Reference the IdP that you created in the previous step, and choose Allow programmatic and AWS Management Console access.

Step 7: Add other permissions to query Amazon Redshift

Choose Next: Assign Permissions. Then choose Create policy.

Create the following custom policy, replacing the region, account, and cluster parameters. These permissions allow the role to use Amazon Redshift to query data, create users, and allow users to join groups.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
           "Action": [
                "redshift:CreateClusterUser",
                "redshift:JoinGroup",
                "redshift:GetClusterCredentials",
                "redshift:ListSchemas",
                "redshift:ListTables",
                "redshift:ListDatabases",
                "redshift:ExecuteQuery",
                "redshift:FetchResults",
                "redshift:CancelQuery",
                "redshift:DescribeClusters",
                "redshift:DescribeQuery",
                "redshift:DescribeTable"],
           "Resource": "arn:aws:redshift:<region>:<account>:cluster:<cluster>"}]
}

There are a few important things to note:

  • The group membership lasts only for the duration of the user session.
  • There is no CreateGroup permission because groups need to be manually created and granted DB privileges.

The following image shows the summary page for the role.

Identity provider (Okta) advanced configuration

In this section, you finalize the Okta configuration by adding the roles that you just created. You also tell Okta which groups are allowed to be passed to Amazon Redshift.

Step 8: Configure the advanced sign-on settings

Switch back to Okta.com. Navigate to the settings page for the Amazon Web Services Redshift application. In the Sign-On section, scroll to Advanced Sign-On Settings.

Enter the previously created IdP and role ARNs, which are globally unique and ensure that Okta is directed to your AWS account. Allowed DB Groups is a list of the groups that are allowed to be sent to Amazon Redshift in the DbGroups SAML assertion.

Don’t use the asterisk (*) wildcard. This would cause the Everyone group to be passed, and Amazon Redshift would complain because it expects the group names to be lowercase. Note that ${user.username} is sent in the DbUser SAML assertion.

Step 9: Authorize users

Authorize users to use the Amazon Web Services Redshift application by selecting their respective groups or individual user accounts. In this example, I authorized users by group.

Amazon Redshift server/client setup

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Step 10: Set up groups

Log in to your Amazon Redshift cluster with an admin account. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas.

CREATE GROUP sales;
CREATE GROUP marketing;
ALTER DEFAULT PRIVILEGES IN SCHEMA sales 
GRANT ALL on TABLES to GROUP sales; 
ALTER DEFAULT PRIVILEGES IN SCHEMA marketing 
GRANT ALL on TABLES to GROUP marketing;

Step 11: Configure the JDBC SQL client

Assuming that the Amazon Redshift JDBC driver is installed, set up a new connection to your cluster using your IdP credentials. In the following example, I am using SQLWorkbenchJ. For the URL, be sure to enter “iam” to instruct the driver to authenticate using IAM. For Username and Password, enter the values that you set in Okta.

Enter the extended properties as follows. For app_id and idp_host, refer to the URL for the application in your web browser:

https://<prefix>-admin.okta.com/admin/app/amazon_aws_redshift/instance/<app_id>
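
As an illustration, a SQLWorkbenchJ profile might look like the following. The endpoint, port, database, and Okta values are placeholders, and the iam scheme in the URL is what instructs the driver to authenticate through IAM; the plugin class name is the Okta credentials provider documented for the Amazon Redshift JDBC driver.

URL: jdbc:redshift:iam://<cluster-endpoint>:<port>/<database>

Extended properties:
plugin_name=com.amazon.redshift.plugin.OktaCredentialsProvider
idp_host=<prefix>.okta.com
app_id=<app_id>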

Step 12: Configure the ODBC SQL client

Assuming that the Amazon Redshift ODBC driver is installed, set up a new connection to your cluster using your IdP credentials. In the following example, I modified the ~/Library/ODBC/odbc.ini file. See the previous instructions for determining the <app_id> and <prefix> values.

[ODBC Data Sources]
Redshift DSN=Installed

[Redshift DSN]
Driver=/opt/amazon/redshift/lib/libamazonredshiftodbc.dylib
Host=<endpoint>
Port=<port>
Database=<database>
locale=en-US
app_id=<app_id>
plugin_name=okta
idp_host=<prefix>.okta.com
iam=1

Step 13: Test user access

You should now be able to sign in with the users you created. In our example, [email protected] has access to the tables in the “sales” schema only. The user [email protected] has access to tables in the “marketing” schema only. And [email protected] has access to tables in both schemas. Using the [email protected] user, you get the following results when trying to query data from each of the schemas:

select storeid From sales.stores


storeid	
-------
1234
2345
3456
[…]


select * From marketing.campaign


An error occurred when executing the SQL command:
select * From marketing.campaign

[Amazon](500310) Invalid operation: permission denied for schema marketing;
1 statement failed.

Execution time: 0.16s

Summary

In this post, I provided a step-by-step guide for configuring and using Okta as your Identity Provider (IdP) to enable single sign-on to an Amazon Redshift cluster. I also showed how group membership within your IdP can be passed along, enabling you to manage user access to Amazon Redshift resources from within your IdP.

If you have questions or suggestions, please comment below.



About the Author

Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.

Granting fine-grained access to the Amazon Redshift Management Console

Post Syndicated from Raj Jayaraman original https://aws.amazon.com/blogs/big-data/granting-fine-grained-access-to-the-amazon-redshift-management-console/

As a fully managed service, Amazon Redshift is designed to be easy to set up and use. In this blog post, we demonstrate how to grant access to users in an operations group to perform only specific actions in the Amazon Redshift Management Console. If you implement a custom IAM policy, you can set it up so these users can monitor and terminate running queries. At the same time, you can prevent these users from performing other more privileged operations such as modifying, restarting, or deleting an Amazon Redshift cluster.

An overview of Amazon Redshift access control

Since its release in February 2013, Amazon Redshift has quickly become a popular cloud-based data warehousing platform for thousands of customers worldwide.

Access to Amazon Redshift requires credentials that AWS can use to authenticate your requests. Those credentials must have permissions to access Amazon Redshift resources, such as an Amazon Redshift cluster or a snapshot. For more details on these credentials, see Authentication and Access Control for Amazon Redshift in the Amazon Redshift documentation.

Every AWS resource is owned by an AWS account, and permissions to create or access the resources are governed by AWS Identity and Access Management (IAM) policies. An AWS account administrator can attach permissions policies to IAM identities (users, groups, and roles). In particular, an AWS account administrator can attach an IAM permissions policy to a specific user. Such a policy grants permissions for that user to manage an Amazon Redshift resource, such as a snapshot or an event subscription.

When granting permissions, you can decide who gets the permissions and which Amazon Redshift resources they get permissions for. You can also decide on the specific actions that you want to allow on those resources. Policies attached to an IAM identity are referred to as identity-based IAM policies, and policies attached to a resource are referred to as resource-based policies. Amazon Redshift supports only identity-based IAM policies.

Use case: Setting limited access for a user

Consider the following use case. Suppose that an IAM user who is a member of a customer’s operations group needs to monitor and terminate queries running in an Amazon Redshift cluster. It’s best if they do so through the Amazon Redshift console. This user is not allowed to modify or delete any other Amazon Redshift resources.

To implement this use case, we need to implement a custom IAM policy that ensures this IAM user has read-only access to the Amazon Redshift console. Doing this means that the user can get descriptions of the available clusters and navigate to the Queries tab. Additionally, we want the IAM user to be able to cancel a running query through the Amazon Redshift console. To allow this, we use the redshift:CancelQuerySession IAM action. For descriptions of other allowed Amazon Redshift actions in an IAM policy and what each action means, see Actions Defined by Amazon Redshift in the Amazon Redshift documentation.

To create such a custom IAM policy, follow these instructions:

  1. Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  2. In the navigation pane on the left, choose Policies.
  3. Choose Create policy.
  4. Choose the JSON tab and input the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "redshift:Describe*",
                "redshift:CancelQuerySession",
                "redshift:ViewQueriesInConsole",
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeAddresses",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcs",
                "ec2:DescribeInternetGateways",
                "sns:Get*",
                "sns:List*",
                "cloudwatch:Describe*",
                "cloudwatch:List*",
                "cloudwatch:Get*"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

  5. On the Review policy page, type a value for Name and optionally for Description for the policy that you are creating. Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
  6. Attach this policy to an existing or a new IAM user. A scripted example follows.
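
If you prefer to script these steps, the following boto3 sketch creates the policy and attaches it to a user. The policy file path, policy name, and user name are illustrative placeholders.

import boto3

iam = boto3.client("iam")

# Load the JSON policy document shown above (path is illustrative).
with open("redshift-console-readonly.json") as f:
    policy_document = f.read()

policy = iam.create_policy(
    PolicyName="RedshiftConsoleMonitorTerminate",  # illustrative name
    PolicyDocument=policy_document,
    Description="Read-only Redshift console access plus query cancellation",
)

# Attach the policy to an existing IAM user (user name is illustrative).
iam.attach_user_policy(
    UserName="ops-user",
    PolicyArn=policy["Policy"]["Arn"],
)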

With this permission policy, an IAM user can select an Amazon Redshift cluster, list all running queries in the Queries tab, and terminate a query if needed. All the permissions are read-only. Thus, the user can’t create a new Amazon Redshift cluster or modify or delete an existing cluster. However, the user can view available clusters, cluster snapshots, parameter groups, and cluster subnet groups, and view other properties of existing clusters.

Validating the use case

With the above IAM policy in place, after the IAM user logs into the Amazon Redshift Management Console, the user can select and view details about the Amazon Redshift cluster or clusters in the account. After navigating to the Queries tab, the user can see both the running and completed queries.

To cancel or terminate a long running query, the user can select the query from the list and choose Terminate Query. However, this user can’t modify or delete anything else in the Amazon Redshift console. As an example, if the user tries to modify an Amazon Redshift cluster (to change its endpoint), that user encounters the following error.

Conclusion

In this post, we have walked through a detailed customer use case of providing fine-grained access to the Amazon Redshift console. Using a set of carefully tailored IAM policies, a customer’s operations personnel can have read-only access to the Amazon Redshift console. These personnel can cancel or terminate running queries without the ability to modify, add, or delete any other Amazon Redshift resources.

We want to acknowledge our fellow AWS co-workers Ryan Mich, Sulay Shah and Hunter Grider for their many useful comments and suggestions.

If you have any questions or suggestions, leave your feedback in the comment section. If you need any further assistance to optimize your Amazon Redshift implementation, contact your AWS account team or a trusted AWS partner.



About the authors

Raj Jayaraman is a cloud support engineer with AWS Support at Amazon Web Services.

Po Hong, Ph.D. is a senior data architect within the Global Data & Analytics Specialty Practice at AWS Professional Services.

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Post Syndicated from Grab Tech original https://engineering.grab.com/peak-shift-demand-travel-trends

Credits: Photo by rawpixel on Unsplash


Stuck in traffic in a Grab ride? Pass the time by opening your Grab app and checking out the Feed – just scroll down! You’ll find widgets for games, polls, videos, news and even food recommendations!

Beyond serving your everyday needs, we want to provide our users with information that is interesting, useful and relevant. That’s why we’re always coming up with new widgets.

Building each widget takes close collaboration across multiple different teams – from Product Management to Design, Engineering, Behavioral Science, and Data Science and Analytics. Sounds like a lot of people, doesn’t it? But you’ll be surprised to hear that this behind-the-scenes collaboration works rapidly, usually in the span of one month! Which means we’re often moving from ideation phase to product release in just a few weeks.

Travel Trends Widget

This fast-and-furious process is anchored on one word – “customer-centric”. And that’s how it all began with our  “Travel Trends Widget” – a widget that provides passengers with an overview of historical supply and demand trends for their current location and nearby time periods.

Because we had so much fun developing this widget, we wanted to write a blog post to share with you what we did and how we did it!

Inspiration: Where it all started

Transport demand can be rather lumpy. Owing to organic patterns (e.g. office hours), a lot of passengers tend to request for cars around the same time. In periods like this, the increase in demand could outpace the arrival of driver supply, increasing the waiting time for passengers.

Our goal at Grab is to make sure people get a ride when they want it and at the price they want, so we got to thinking about how we can ease this friction by leveraging our treasure trove – Big Data! – to help our passengers better plan their trips.

As we were looking at the data, we noticed that there is a seasonality to demand and supply: at certain times and days, imbalances appear, peak and disappear, and the process repeats itself. Studies say that humans in general, unless shown a compelling reason or benefit for change, are habitual beings subject to inertia. So we set out to achieve exactly that: To create a widget to surface information to our passengers that may help them alter their decisions on when they choose to book a ride, thereby redistributing some of the present peak demands to periods just before and after peak – also known as “peak shifting the demand”!

While this widget is the first-of-its-kind in the ride-hailing industry, “peak-shifting” was actually coined and introduced long ago!

London Transport Museum Trends

As you can see from this post from the London Transport Museum (Source: Transport for London), London tube tried peak-shifting long before anyone else: Original Ad from 1928 displayed on the left, and Ad from 2015 displayed on the right, comparing the trends to 1928.

Trends from a hotel in Beijing

You may also have seen something similar at the last hotel you stayed at. Notice here a poster in an elevator at a Beijing hotel, announcing the best times to eat breakfast in comfort and avoid the crowd. (Photo credits to Prashant, our Product Manager, who saw this on holiday.)

To apply “peak-shifting” and help our users better plan their trips, we decided to dig in and leverage our data. It was way more complex than we had initially thought, as market conditions could be different on different days. This meant that generic statements like “5PM-8PM are peak hours and prices will be high” would not hold true. Contrary to general perception, we observed that even during peak hours, there are buckets of time when there is no surge or only a low surge.

For instance, plots 1 and 2 below show what a typical Monday and Tuesday surge look like in a given month, respectively. One of the key insights is that the surge trend during peak hours on Monday differs from that on Tuesday. This reinforces our initial hypothesis that every day is unique.

So we used machine learning techniques to build a forecasting widget which can help our users and give them the power to plan their trips beforehand. This widget is able to provide the pricing trends for the next 2 hours. So with a bit of flexibility, riders can ride the tide!

Grab trends

So how exactly does this widget work?!

Historical trends for Monday

It pulls together historically observed imbalances between supply and demand, for the consumer’s current location and nearby time periods. Aggregated data is displayed to consumers in easily interpreted visualisations, so that they can plan to leave at times when there are more supply, and with potentially more savings for fares.

How did we build the widget? Loop, agile working process, POC & workstream

Widget-building is an agile, collaborative, and simultaneous process. First, we started the process with analysis from Product Analytics team, pulling out data on traffic trends, surge patterns, and behavioral insights of both passengers and drivers in Singapore.

When we noticed the existence of seasonality for each day of the week, we came up with more precise analytical and business questions to dig deeper into the data. Upon verification of hypotheses, we decided that we will build a widget.

Then the Behavioural Science, UX (User Experience) Design, and Product Management teams joined in and started giving shape to the problem we were solving. Our Behavioural Scientists shared their expertise on how information, suggestions, and choices should be presented to enable easy assimilation and beneficial action. Daily whiteboarding breakouts, endless back-and-forth conversations, and a healthy amount of challenge-and-accept culture ensured that we distilled the idea down to its core. We then presented the relevant information with just the right level of detail, and with the right amount of messaging, to allow users to take the intended action, i.e. shift their demand outside of peak periods if possible.

Our amazing regional Copywriting team then swung in to put our intent into words in 7 different languages for our users across South-East Asia. Simultaneously, our UX designers and Full-stack Engineers started exploring the best visual components to communicate data on time trends to users. More on this later, but suffice to say that plenty of ideas were explored and discarded in a collaborative process, which aimed to create something that’s intuitive and engaging while being robust and scalable to work across all types of devices.

While these designs made their way up to engineering, the Data Science team worked on finding the most rigorous method to deduce the historical trend of surge across all our cities and areas, and time periods within them. There were discussions on how to best store and update this data reliably so that the widget itself can access it with great performance.

Soon after, we went into the development process, and voila! We had the first iteration of the widget ready on our staging (internal testing) servers in just 2 weeks! This prototype was opened up to the core team for influx of feedback.

And just two weeks later, the widget made its way to our Singapore and Jakarta Feeds, accessible to the world at large! Feedback from our users started pouring in almost immediately (thanks to the rich feedback functionality that comes with each widget), ranging from great to sometimes not-so-great, and we listened to all of it with a keen ear! And thus began a new cycle of iterations and continuous improvement, more of which we will share in a subsequent post.

In the trenches with the creators: How multiple teams got together to make this come true

Various disciplines within our cross-functional team came together to whip out this widget, each contributing their expertise to the end product.

Using Behavioural Science to simplify choices and design good outcomes

Behavioural Science helped to explore many facets of consumer behaviour in order to plan and design the widget: understanding how consumers think and conceptualizing a widget that can be easily understood and used by the consumers.

While fares are governed entirely by market conditions, it’s important for us to explain the economics to customers. As a customer-centric company, we aim to make the consumers feel like they own their decisions, which they can take based on full information. And this is the role of Behavioral Scientists at Grab!

In guiding the customers through the information, Behavioural Science team had the following three objectives in mind while building this Travel Trends widget:

  1. Offer transparency on the fares: By exposing our historic surge levels for a 4 hour period, we wanted to ensure that the passenger is aware of the surge levels and does not treat the fare as a nasty shock.
  2. Give information that helps them plan: By showing them surge levels for the future 2 hours, we wanted to help customers who have the flexibility, plan for a better time, hence, giving them the power to decide based on transparent information.
  3. Provide helpful tips: Every bar gives users tips on the conditions at that time and the immediate future. For instance, a low surge bar, followed by a high surge bar gives the tip “Psst… Leave now, It might get busy later!”, helping people understand the graph better and nudging them to take an action. If you are interested in saving fares, may we suggest tapping around all the bars to reveal the secret pro-tips?

Designing interfaces that lead to consumer success by abstracting complexity

The Design team is the one behind the colors and shapes that make up the widget that you see and interact with! The team took inspiration from Google’s Popular Times.

Source/Credits: Google Live Popular Times


Right from the offset, our content and product teams were keen to surface additional information and actions with each bar to keep the widget interactive and useful. One of the early challenges was to arrive at the right gesture that invites the user to interact and intuitively navigate the bars on the widget, but also does not conflict with other gestures (e.g. scrolling and scrubbing) that the user was pre-trained to perform on the feed. We found that tapping was simultaneously an unused and yet intuitive gesture that we could use for interaction with the bars.

We then went into rounds of iteration on the visual design of the widget. In this process, multiple stakeholders were involved ranging from Product to Content to Engineering. We had to overcome a number of constraints i.e. the limited canvas of a widget and the context of a user when she is exploring the feed. By re-using existing libraries and components, we managed to keep the development light and ship something fast.

GrabCar trends near you

Dozens of revisions and four iterations later, we landed with a design that we felt equipped the feature for its user-facing goal, and did so in a manner which was aesthetically appealing!

And finally we managed to deliver on the feature’s goal, by surfacing just the right detail of information in a manner that is intuitive yet effective to peak-shift demand.  

Bringing all of this to fruition through high performance engineering

Our Development Engineering team was in charge of developing the widget and making it available to our users in just a few weeks’ time – materialising the work of the other teams.

One of their challenges was to find the best way to process the vast amount of data (millions of database entries) so it can be visualized simply as bar charts. Grab’s engineers had to achieve this while making sure performance is as resilient as possible.

There were two options in doing this:

a) Fetch the data directly from the DB for each API call; or

b) Store the data in an in-memory data structure, refreshed on a timely basis, so that when a user calls the API we no longer have to hit the DB.

Considering that this feature would likely receive a lot of traffic, and therefore a high QPS, we decided that the former option would be too costly. Ultimately, we chose the latter option, since it is more performant and more scalable. A simplified sketch of this approach follows.
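
The snippet below is a simplified Python sketch of this pattern under assumed names (our production service differs): a background job periodically refreshes an in-memory map of precomputed trend data, and API handlers read from memory instead of hitting the DB.

import threading
import time

class TrendCache:
    """Periodically refreshed in-memory store of precomputed surge trends.

    load_trends_from_db is an assumed helper that returns a dict mapping
    (area_id, day_of_week) -> list of hourly surge levels.
    """

    def __init__(self, load_trends_from_db, refresh_seconds=3600):
        self._load = load_trends_from_db
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._trends = {}
        self._refresh()  # warm the cache before serving traffic
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh(self):
        fresh = self._load()  # single DB hit per refresh interval
        with self._lock:
            self._trends = fresh

    def _refresh_loop(self):
        while True:
            time.sleep(self._refresh_seconds)
            self._refresh()

    def get(self, area_id, day_of_week):
        # API handlers read from memory only; no DB access on the hot path.
        with self._lock:
            return self._trends.get((area_id, day_of_week), [])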

At the frontend, the challenge was to cater to the intricate requests from our designers. We use chart libraries to increase our development speed, but not all of the requirements were readily supported by these libraries.

For instance, a library might make visualising charts easy, but not customising them. When designers wanted an average line in dotted form, the library did not support this easily. Likewise, the moving arrow pointers as you move between bars, and the bars changing color when clicked, all required countless CSS tweaks.

CSS tweak on trends widget

Closing the product loop with user feedback and data driven insights

One of the most crucial parts of launching any product is to ensure that customers are engaging with the widget and finding it useful.

To understand what customers think about the widget, whether they find it useful and whether it is helping them to plan better,  we delved into the huge mine of clickstream data.

User feedback on the trends widget

We found that 1 in 3 users who make a booking every day interact with the widget. And of these people, more than 70% have given the widget a positive rating. This validates our initial hypothesis that, given the option, our customers will embrace the freedom to plan their trips, fostering a more transparent ecosystem.

These users also indicate the things they like most about the widget. 61% of users gave a positive rating for usefulness, 20% were impressed by the design (kudos to our fantastic designer Ajmal!), and 13% for usability.

Tweet about the widget

Beyond internal data, our widget made some rounds on social media channels. For example, here is a screenshot of what our users have to say on Twitter.

We closely track these metrics on user engagement and feedback to ensure that we keep improving and coming up with new iterations which helps us to serve our customers in a better way.

Conclusion

We hope you enjoyed reading about how we went from ideation, through iterations to a finished widget in the hands of the user, all in 1 month! Many hands helped along the way. If you are interested in joining this hyper-proactive problem-solving team, please check out Grab’s career site!

And if you have feedback for us, we are here to listen! While we cannot be happier to see some positive reaction from the public, we are also thrilled to hear your suggestions and advice. Please leave us a memo using the Widget’s comment function!

Epilogue

We just released an upgrade to this widget which allows users to set reminders and be notified about availability of good fares in a time period of their choosing. We will keep a watch and come knocking! Go ahead, find the widget on your Grab feed, set a reminder and save on fares on your next ride!

Alerting, monitoring, and reporting for PCI-DSS awareness with Amazon Elasticsearch Service and AWS Lambda

Post Syndicated from Michael Coyne original https://aws.amazon.com/blogs/security/alerting-monitoring-and-reporting-for-pci-dss-awareness-with-amazon-elasticsearch-service-and-aws-lambda/

Logging account activity within your AWS infrastructure is paramount to your security posture and could even be required by compliance standards such as PCI-DSS (Payment Card Industry Data Security Standard). Organizations often analyze these logs to adapt to changes and respond quickly to security events. For example, if users are reporting that their resources are unable to communicate with the public internet, it would be beneficial to know if a network access list had been changed just prior to the incident. Many of our customers ship AWS CloudTrail event logs to an Amazon Elasticsearch Service cluster for this type of analysis. However, security best practices and compliance standards could require additional considerations. Common concerns include how to analyze log data without the data leaving the security constraints of your private VPC.

In this post, I’ll show you not only how to store your logs, but how to put them to work to help you meet your compliance goals. This implementation deploys an Amazon Elasticsearch Service domain with Amazon Virtual Private Cloud (Amazon VPC) support by utilizing VPC endpoints. A VPC endpoint enables you to privately connect your VPC to Amazon Elasticsearch without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. An AWS Lambda function is used to ship AWS CloudTrail event logs to the Elasticsearch cluster. A separate AWS Lambda function performs scheduled queries on log sets to look for patterns of concern. Amazon Simple Notification Service (SNS) generates automated reports based on a sample set of PCI guidelines discussed further in this post and notifies stakeholders when specific events occur. Kibana serves as the command center, providing visualizations of CloudTrail events that need to be logged based on the provided sample set of PCI-DSS compliance guidelines. The automated report and dashboard that are constructed around the sample PCI-DSS guidelines assist in event awareness regarding your security posture and should not be viewed as a de facto means of achieving certification. This solution serves as an additional tool to provide visibility in to the actions and events within your environment. Deployment is made simple with a provided AWS CloudFormation template.

Figure 1: Architectural diagram


The figure above depicts the architecture discussed in this post. An Elasticsearch cluster with VPC support is deployed within an AWS Region and Availability Zone. This creates a VPC endpoint in a private subnet within a VPC. Kibana is an Elasticsearch plugin that resides within the Elasticsearch cluster; it is accessed through a provided endpoint in the output section of the CloudFormation template. CloudTrail is enabled in the VPC and ships CloudTrail events to both an S3 bucket and a CloudWatch Log Group. The CloudWatch Log Group triggers a custom Lambda function that ships the CloudTrail event logs to the Elasticsearch domain through the VPC endpoint. An additional Lambda function is created that performs a periodic set of Elasticsearch queries and produces a report that is sent to an SNS topic. A Windows-based EC2 instance is deployed in a public subnet so that users can view and interact with a Kibana dashboard. Access to the EC2 instance can be restricted to an allowed CIDR range through a parameter set in the CloudFormation deployment. Access to the Elasticsearch cluster and Kibana is restricted by a Security Group that is created and associated with the EC2 instance and the custom Lambda functions.

Sample PCI-DSS Guidelines

This solution provides a sample set of 10 PCI-DSS guidelines for events that need to be logged.

  • All commands and API actions taken by the AWS root user
  • All failed logins at the AWS platform level
  • Action related to RDS (configuration changes)
  • Action related to enabling/disabling/changing of CloudTrail, CloudWatch logs
  • All access to S3 bucket that stores the AWS logs
  • Action related to VPCs (creation, deletion and changes)
  • Action related to changes to SGs/NACLs (creation, deletion and changes)
  • Action related to IAM users, roles, and groups (creation, deletion and changes)
  • Action related to route tables (creation, deletion and changes)
  • Action related to subnets (creation, deletion and changes)
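
To make the mapping more concrete, here is an illustrative sketch of how a couple of these guidelines could be expressed as Elasticsearch queries over the CloudTrail fields that are indexed from CloudWatch Logs. These queries are assumptions for illustration only; the exact queries used by the deployed Lambda function and the Kibana saved searches may differ.

# Illustrative only: possible Elasticsearch queries for two of the sample guidelines.
# Field and event names follow the standard CloudTrail event schema.
SAMPLE_GUIDELINE_QUERIES = {
    # All commands/API actions taken by the AWS root user
    "root_user_activity": {
        "match": {"userIdentity.type": "Root"}
    },
    # Enabling/disabling/changing of CloudTrail or CloudWatch Logs
    "logging_changes": {
        "bool": {
            "should": [
                {"match": {"eventName": "StopLogging"}},
                {"match": {"eventName": "DeleteTrail"}},
                {"match": {"eventName": "UpdateTrail"}},
                {"match": {"eventName": "DeleteLogGroup"}},
            ],
            "minimum_should_match": 1,
        }
    },
}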

Solution overview

In this walkthrough, you’ll create an Elasticsearch cluster within an Amazon VPC environment. You’ll ship AWS CloudTrail logs both to an Amazon S3 bucket (to maintain an immutable copy of the logs) and to a custom AWS Lambda function that streams the logs to the Elasticsearch cluster. You’ll also create an additional Lambda function that runs once a day, builds a report of the number of CloudTrail events that occurred against the sample set of ten PCI-DSS guidelines, and notifies stakeholders via SNS.

To make it easier to get started, I’ve included an AWS CloudFormation template that will automatically deploy the solution. The CloudFormation template along with additional files can be downloaded from this link. You’ll need the following resources to set it up:

  • An S3 bucket to upload and store the sample AWS Lambda code and sample Kibana dashboards. This bucket name will be requested during the CloudFormation template deployment.
  • An Amazon Virtual Private Cloud (Amazon VPC).

If you’re unfamiliar with how CloudFormation templates work, you can find more info in the CloudFormation Getting Started guide.

AWS CloudFormation deployment

The following parameters are available in this template.

Parameter | Default | Description
Elasticsearch Domain Name | (none) | Name of the Amazon Elasticsearch Service domain.
Elasticsearch Version | 6.2 | Version of Elasticsearch to deploy.
Elasticsearch Instance Count | 3 | The number of data nodes to deploy into the Elasticsearch cluster.
Elasticsearch Instance Class | (none) | The instance class to deploy for the Elasticsearch data nodes.
Elasticsearch Instance Volume Size | 10 | The size of the volume for each Elasticsearch data node, in GB.
VPC to launch into | (none) | The VPC to launch the Amazon Elasticsearch Service cluster into.
Availability Zone to launch into | (none) | The Availability Zone to launch the Amazon Elasticsearch Service cluster into.
Private Subnet ID | (none) | The subnet to launch the Amazon Elasticsearch Service cluster into.
Elasticsearch Security Group | (none) | A new security group is created and associated with the Amazon Elasticsearch Service cluster.
Security Group Description | (none) | A description for the security group created above.
Windows EC2 Instance Class | m5.large | Windows instance for interaction with Kibana.
EC2 Key Pair | (none) | EC2 key pair to associate with the Windows EC2 instance.
Public Subnet | (none) | Public subnet to associate with the Windows EC2 instance for access.
Remote Access Allowed CIDR | 0.0.0.0/0 | The CIDR range allowed remote access (port 3389) to the EC2 instance.
S3 Bucket Name—Lambda Functions | (none) | S3 bucket that contains the custom AWS Lambda functions.
Private Subnet | (none) | Private subnet to associate with the AWS Lambda functions that are deployed within a VPC.
CloudWatch Log Group Name | (none) | Creates a CloudWatch Log Group for the AWS CloudTrail event logs.
S3 Bucket Name—CloudTrail logging | (none) | Creates a new Amazon S3 bucket for logging CloudTrail events. The name must be globally unique.
Date range to perform queries | now-1d | The time period for the report queries (examples: now-1d, now-7d, now-90d).
Lambda Subnet CIDR | (none) | A subnet CIDR to create for deploying the AWS Lambda Elasticsearch query function into.
Availability Zone—Lambda | (none) | The Availability Zone to associate with the preceding AWS Lambda subnet.
Email Address | [email protected] | Email address that SNS uses to notify stakeholders with the report. You must accept the subscription by selecting the link sent to this address before alerts will arrive.

It takes 30-45 minutes for this stack to be created. When it’s complete, the CloudFormation console will display the following resource values in the Outputs tab. These values can be referenced at any time and will be needed in the following sections.

  • ElasticsearchDomainEndpoint: Elasticsearch domain endpoint hostname
  • KibanaEndpoint: Kibana endpoint hostname
  • EC2Instance: Windows EC2 instance name used for Kibana access
  • SNSSubscriber: SNS subscriber email address
  • ElasticsearchDomainArn: ARN of the Elasticsearch domain
  • EC2InstancePublicIp: Public IP address of the Windows EC2 instance
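
If you would rather retrieve these values programmatically than copy them from the console, the following is a small sketch using boto3. The output key names are taken from the list above; the stack name is a placeholder for whatever you named your stack.

# Fetch selected stack outputs; the stack name is a placeholder.
import boto3

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="cloudtrail-es-compliance")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}

print(outputs["KibanaEndpoint"])
print(outputs["ElasticsearchDomainEndpoint"])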

Managing and testing the solution

Now that you’ve set up the environment, it’s time to configure the Kibana dashboard.

Kibana configuration

From the AWS CloudFormation output, gather information related to the Windows-based EC2 instance. Once you have retrieved that information, move on to the next steps.

Initial configuration and index pattern

  1. Log into the Windows EC2 instance via Remote Desktop Protocol (RDP) from a resource that is within the allowed CIDR range for remote access to the instance.
  2. Open a browser window and navigate to the Kibana endpoint hostname URL from the output of the AWS CloudFormation stack. Access to the Elasticsearch cluster and Kibana is restricted to the security group that is associated with the EC2 instance and custom Lambda functions during deployment.
  3. In the Kibana dashboard, select Management from the left panel and choose the link for Index Patterns.
  4. Add one index pattern containing the following: cwl-*
     
    Figure 2: Define the index pattern

  5. Select Next Step.
  6. Select the Time Filter Field named @timestamp.
     
    Figure 3: Select "@timestamp"

  7. Select Create index pattern.

At this point we’ve launched our environment and have accessed the Kibana console. Within the Kibana console, we’ve configured the index pattern for the CloudWatch logs that will contain the CloudTrail events. Next, we’ll configure visualizations and a dashboard.

Importing sample PCI DSS queries and Kibana dashboard

  1. Copy export.json from the location where you extracted the downloaded zip file to the Windows EC2 instance that serves as the Kibana bastion.
  2. Select Management on the left panel and choose the link for Saved Objects.
  3. Select Import in upper right corner and navigate to export.json.
  4. Select Yes, overwrite all saved objects, then select Index Pattern cwl-* and confirm all changes.
  5. Once the import completes, select PCI DSS Dashboard to see the sample dashboard and queries.

Note: You might encounter an error during the import that looks like this:
 

Figure 4: Error message

This simply means that the logs streamed since your deployment don't yet contain any login-type events for the queried time period. To correct this, you can index a test document that contains the missing field.

  1. From the left panel, select Dev Tools and copy the following JSON into the left panel of the console:
    
            POST /cwl-/default/
            {
                "userIdentity": {
                    "userName": "test"
                }
            }              
     

  2. Select the green Play triangle to execute the POST of a document with the missing field.
     
    Figure 5: Select the "Play" button

  3. Now reimport the dashboard using the steps in Importing Sample PCI DSS Queries and Kibana Dashboard. You should be able to complete the import with no errors.

At this point, you should have CloudTrail events that have been streamed to the Elasticsearch cluster, with a configured Kibana dashboard that looks similar to the following graphic:
 

Figure 6: A configured Kibana dashboard

Automated Reports

A custom AWS Lambda function was created during the deployment of the AWS CloudFormation stack. This function uses the sample PCI-DSS guidelines from the Kibana dashboard to build a daily report. The Lambda function is triggered every 24 hours and performs a series of time-based Elasticsearch queries over now-1d (the last 24 hours) against the sample guidelines. The results are compiled into a message that is forwarded to Amazon Simple Notification Service (SNS), which sends the report to stakeholders at the email address you provided in the CloudFormation deployment.

The Lambda function will be named <CloudFormation Stack Name>-ES-Query-LambdaFunction. You can adjust its behavior through environment variables, such as the query time window, and you can extend the code with additional Elasticsearch queries. The sample report below lets you monitor events against the sample PCI-DSS guidelines, and anything it flags can be analyzed further in the Kibana dashboard. An illustrative sketch of this kind of reporting function follows the sample report.


    Logging Compliance Report - Wednesday, 11. July 2018 01:06PM
    Violations for time period: 'now-1d'
    
    All Failed login attempts
    - No Alerts Found
    All Commands, API action taken by AWS root user
    - No Alerts Found
    Action related to RDS (configuration changes)
    - No Alerts Found
    Action related to enabling/disabling/changing of CloudTrail CloudWatch logs
    - 3 API calls indicating alteration of log sources detected
    All access to S3 bucket that stores the AWS logs
    - No Alerts Found
    Action related to VPCs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to SGs/NACLs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to IAM roles, users, and groups (creation, deletion and changes)
    - 2 API calls indicating creation, alteration or deletion of IAM roles, users, and groups
    Action related to changes to Route Tables (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to Subnets (creation, deletion and changes)
    - No Alerts Found         
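
For illustration, here is a minimal sketch of how a report like the one above could be produced. This is not the function deployed by the CloudFormation template: the ES_ENDPOINT, SNS_TOPIC_ARN, and QUERY_PERIOD environment variables are assumptions, only one guideline is checked, and it assumes the domain's access policy permits unsigned requests from inside the VPC.

# Minimal sketch of a scheduled reporting function (not the deployed Lambda code).
import json
import os
import urllib.request

import boto3

ES_ENDPOINT = os.environ["ES_ENDPOINT"]      # assumed, e.g. vpc-....es.amazonaws.com
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]  # assumed
QUERY_PERIOD = os.environ.get("QUERY_PERIOD", "now-1d")

def count_events(query):
    # Count matching CloudTrail events in the cwl-* indices for the reporting window.
    body = json.dumps({
        "query": {
            "bool": {
                "must": [query],
                "filter": [{"range": {"@timestamp": {"gte": QUERY_PERIOD}}}],
            }
        }
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://%s/cwl-*/_count" % ES_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]

def handler(event, context):
    # One example guideline: commands/API actions taken by the AWS root user.
    root_calls = count_events({"match": {"userIdentity.type": "Root"}})
    lines = [
        "Violations for time period: '%s'" % QUERY_PERIOD,
        "All Commands, API action taken by AWS root user",
        "- %d API calls detected" % root_calls if root_calls else "- No Alerts Found",
    ]
    boto3.client("sns").publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="Logging Compliance Report",
        Message="\n".join(lines),
    )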

Summary

At this point, you have created a private Elasticsearch cluster with Kibana dashboards that monitors AWS CloudTrail events against a sample set of PCI-DSS guidelines and uses Amazon SNS to send a daily report providing awareness into your environment, all isolated securely within a VPC. In addition to CloudTrail events streaming to the Elasticsearch cluster, events are also shipped to an Amazon S3 bucket to maintain an immutable source of your log files. The provided Lambda functions can be further modified to add additional or more complex search queries and to create more customized reports for your organization. With minimal effort, you could begin sending additional log data from your instances or containers to gain even more insight into the security state of your environment. The more data you retain, the more visibility you have into your resources and the closer you are to achieving Compliance-on-Demand.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Michael Coyne

Michael is a consultant for AWS Professional Services. He enjoys the fast-paced environment of ever-changing technology and assisting customers in solving complex issues. Away from AWS, Michael can typically be found with a guitar and spending time with his wife and two young kiddos. He holds a BS in Computer Science from WGU.

New – Amazon Kinesis Data Analytics for Java

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-kinesis-data-analytics-for-java/

Customers are using Amazon Kinesis to collect, process, and analyze real-time streaming data. In this way, they can react quickly to new information from their business, their infrastructure, or their customers. For example, Epic Games ingests more than 1.5 million game events per second for its popular online game, Fortnite.

With Amazon Kinesis Data Analytics you can process data in real-time using standard SQL. While SQL provides an easy way to quickly query large volumes of streaming data without learning new frameworks or languages, many customers also want to build more sophisticated data processing applications using general-purpose programming languages.

Using Java with Amazon Kinesis Data Analytics

Today, we are introducing support for Java in Amazon Kinesis Data Analytics. Now, developers can use their own Java code to create powerful real-time applications that process streaming data like continuously transforming and loading data into their data lakes, generating metrics to feed real-time gaming leaderboards, applying machine learning models to data streams from connected devices, and more.

To use this new functionality, developers build applications using open-source libraries that include built-in operators for common data processing functions, allowing applications to organize, transform, aggregate, and analyze data at any scale. Both libraries are open source, and you can run them anywhere:

  • Apache Flink, an open source framework and engine for processing data streams.
  • AWS SDK for Java, providing Java APIs for many AWS services.

Developers can use these Java libraries within their Integrated Development Environment (IDE) of choice. Using these libraries, the following AWS services can be integrated with as little as one line of code:

  • Streaming Data Sources: Amazon Kinesis Data Streams
  • Streaming Destinations: Amazon S3, Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose

In addition to the pre-built AWS integrations, the Java libraries include more connectors to tools like Cassandra, ElasticSearch, RabbitMQ, Redis, and more, and the ability to build custom integrations.

Building a Kinesis Data Streams Java Application

I prepared a simple Java application that implements the “mandatory” word-count example for data processing. I send some paragraphs of text as input and, every five seconds, I get as output the number of times each word is used.

First, I create two Kinesis Data Streams:

  • TextInputStream, where I am going to send my input records
  • WordCountOutputStream, where I am going to read the output of the Java application
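
If you prefer to script the stream creation instead of using the console, here is a small boto3 sketch; the single-shard count and the region are assumptions for this small demo.

# Create the input and output streams for the word-count demo (one shard each).
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region assumed

for name in ("TextInputStream", "WordCountOutputStream"):
    kinesis.create_stream(StreamName=name, ShardCount=1)
    kinesis.get_waiter("stream_exists").wait(StreamName=name)
    print(name, "is active")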

 

Here is the code of the word-count Java application. To read and write from Kinesis Data Streams, I am using the Kinesis Connector from the Apache Flink project.

// The imports below were not shown in the original listing; they assume Apache Flink 1.6
// APIs and the flink-connector-kinesis dependency.
import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;
import org.apache.flink.util.Collector;

public class StreamingJob {

    private static final String region = "us-east-1";
    private static final String inputStreamName = "TextInputStream";
    private static final String outputStreamName = "WordCountOutputStream";

    private static DataStream<String> createSourceFromStaticConfig(
            StreamExecutionEnvironment env) {
        Properties inputProperties = new Properties();
        inputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region);
        inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION,
            "LATEST");

        return env.addSource(new FlinkKinesisConsumer<>(inputStreamName,
            new SimpleStringSchema(), inputProperties));
    }

    private static FlinkKinesisProducer<String> createSinkFromStaticConfig() {
        Properties outputProperties = new Properties();
        outputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region);

        FlinkKinesisProducer<String> sink = new FlinkKinesisProducer<>(new
            SimpleStringSchema(), outputProperties);
        sink.setDefaultStream(outputStreamName);
        sink.setDefaultPartition("0");
        return sink;
    }

    public static void main(String[] args) throws Exception {

        final StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = createSourceFromStaticConfig(env);

        input.flatMap(new Tokenizer())
             .keyBy(0)
             .timeWindow(Time.seconds(5))
             .sum(1)
             .map(new MapFunction<Tuple2<String, Integer>, String>() {
                 @Override
                 public String map(Tuple2<String, Integer> value) throws Exception {
                     return value.f0 + "," + value.f1.toString();
                }
             })
             .addSink(createSinkFromStaticConfig());

        env.execute("Word Count");
    }

    public static final class Tokenizer
        implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}

The most important part of the application is the manipulation of the input object, where I apply a few DataStream Transformations:

  1. I start with a DataStream of the Strings coming from the input stream.
  2. I use a Tokenizer in a FlatMap to split the sentence into “words”, each word followed by the number “1”.
  3. I apply the KeyBy operator to logically partition the stream in respect to the “word”.
  4. I use a 5 seconds tumbling window.
  5. I aggregate within the window, summing up for each word the number “1” to count them.
  6. I use a simple Map for each record to join the word and the number into a comma-separated values (CSV) String that I send to the output stream.

One of the most powerful operators shown here is the KeyBy operator. It enables you to re-organize a particular stream by a specified key in real-time. This type of re-keying enables further downstream operations like aggregations, counts, and much more. This enables you to set up streaming map-reduce on different keys within the same application.

I build the Java application using Maven and load the output JAR to an Amazon Simple Storage Service (S3) bucket in the region where I want to deploy the application. In the Kinesis Data Analytics console, I create a new application and select “Flink” as runtime:

I then configure the application to use the code on my S3 bucket. The console updates the IAM role for the application to have permissions to read the code.

You can optionally add key/value properties to the configuration of the application. You can read those properties from within the application, to provide customization at deployment time.

For monitoring, I leave the default metrics. I enable logging to Amazon CloudWatch, for errors only.

Don’t forget to add permissions to the IAM role created by the console to allow the Kinesis Analytics application to read and write from the streams used for input and output, TextInputStream and WordCountOutputStream in my case.
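
As a sketch of what those stream permissions might look like, here is an illustrative inline policy attached with boto3. The role name is a placeholder for the role the console created, the account ID is fake, and this is not necessarily the exact policy shape the service expects.

# Illustrative only: grant the application's role read access to the input stream
# and write access to the output stream. Role name and account ID are placeholders.
import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "kinesis:DescribeStream",
                "kinesis:GetShardIterator",
                "kinesis:GetRecords",
                "kinesis:ListShards",
            ],
            "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/TextInputStream",
        },
        {
            "Effect": "Allow",
            "Action": ["kinesis:DescribeStream", "kinesis:PutRecord", "kinesis:PutRecords"],
            "Resource": "arn:aws:kinesis:us-east-1:123456789012:stream/WordCountOutputStream",
        },
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="kinesis-analytics-wordcount-role",  # placeholder role name
    PolicyName="stream-read-write",
    PolicyDocument=json.dumps(policy),
)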

I can now start the application with the “Run” button, and when it is running, I use a script that I prepared to put some text (I am using a description of the Amazon Kinesis platform) in the input stream:

$ python put_records.py TextInputStream
Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data...
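
The put_records.py script itself is not included in the post; the following is a minimal sketch of what such a script might look like. The sample text, region, and partition key are placeholders.

# put_records.py -- hypothetical sketch of the helper script used above.
import sys

import boto3

SAMPLE_TEXT = (
    "Amazon Kinesis makes it easy to collect, process, and analyze real-time, "
    "streaming data. You can get timely insights and react quickly to new information."
)  # placeholder text

def main(stream_name):
    kinesis = boto3.client("kinesis", region_name="us-east-1")  # region assumed
    for sentence in SAMPLE_TEXT.split("."):
        sentence = sentence.strip()
        if sentence:
            kinesis.put_record(
                StreamName=stream_name,
                Data=sentence.encode("utf-8"),
                PartitionKey="demo",  # a single partition key is fine for this demo
            )

if __name__ == "__main__":
    main(sys.argv[1])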

The behavior of my application is summarized in the console in the Application Graph, a visual representation of the data flow consisting of operators and intermediate results (complex applications, using multiple streams, have a much more interesting graph):

To read the output stream, I am using a Lambda function written in Python, based on the one provided with the Kinesis Record Aggregation & Deaggregation Modules for AWS Lambda, which provide automatic “de-aggregation” of records aggregated with the Amazon Kinesis Producer Library (KPL). A minimal sketch of such a consumer function is shown below.
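
This sketch assumes the aws_kinesis_agg module from the awslabs kinesis-aggregation project is packaged with the Lambda function; it is an illustration of the pattern, not the exact function shipped with those modules.

# Hypothetical sketch of the output-stream consumer Lambda (Python).
import base64

from aws_kinesis_agg.deaggregator import deaggregate_records

def handler(event, context):
    # De-aggregate records that the Flink Kinesis producer batched with the KPL,
    # then print each "word,count" pair so it lands in CloudWatch Logs.
    for record in deaggregate_records(event["Records"]):
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        word, count = payload.split(",", 1)
        print("%s: %s" % (word, count))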

As expected, in the CloudWatch Logs console I get the list of the words and the number of times they were used, updated every 5 seconds by the Lambda function:

Pricing and Availability

With Amazon Kinesis Data Analytics for Java, you pay only for what you use. Pricing is similar to Amazon Kinesis Data Analytics for SQL, but there are a few differences.

For Java applications, you are charged a single additional Amazon Kinesis Processing Unit (KPU) per application, used for application orchestration. Java applications are also charged for running application storage and durable application backups. Running application storage is used for Amazon Kinesis Data Analytics’ stateful processing capabilities and is charged per GB-month. Durable application backups are optional and provide a point-in-time recovery point for applications, charged per GB-month.

For example, pricing is $0.11 per KPU hour in US East (N. Virginia), and you are charged for running application storage ($0.10 per GB-month) and durable application backups ($0.023 per GB-month).
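
As a purely hypothetical example at those rates, an application that runs continuously for a 730-hour month on 2 KPUs of processing plus the 1 additional orchestration KPU would cost roughly 3 × $0.11 × 730 ≈ $240.90 for compute, plus $0.10 per GB-month for whatever running application storage it uses (for example, 50 GB would add $5.00), with any durable application backups billed separately at $0.023 per GB-month.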

Available Now

Amazon Kinesis Data Analytics for Java is available now in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU West (Ireland).

I only scratched the surface of the capabilities for stream processing enabled by the support of Java in Amazon Kinesis Data Analytics. I think this is a powerful tool that can enable new use cases. Let me know what you are going to build with it!