Tag Archives: Social Media

Building a Graph Database on AWS Using Amazon DynamoDB and Titan

Post Syndicated from Nick Corbett original https://blogs.aws.amazon.com/bigdata/post/Tx12NN92B1F5K0C/Building-a-Graph-Database-on-AWS-Using-Amazon-DynamoDB-and-Titan

Nick Corbett is a Big Data Consultant for AWS Professional Services

You might not know it, but a graph has changed your life. A bold claim perhaps, but companies such as Facebook, LinkedIn, and Twitter have revolutionized the way society interacts through their ability to manage a huge network of relationships. However, graphs aren’t just used in social media; they can represent many different systems, including financial transactions for fraud detection, customer purchases for recommendation engines, computer network topologies, or the logistics operations of Amazon.com.

In this post, I would like to introduce you to a technology that makes it easy to manipulate graphs in AWS at massive scale. To do this, let’s imagine that you have decided to build a mobile app to help you and your friends with the simple task of finding a good restaurant. You quickly decide to build a ‘serverless’ infrastructure, using Amazon Cognito for identity management and data synchronization, Amazon API Gateway for your REST API, and AWS Lambda to implement microservices that fulfill your business logic. Your final decision is where to store your data. Because your vision is to build a network of friends and restaurants, the natural choice is a graph database rather than an RDBMS. Titan running on Amazon DynamoDB is a great fit for the job.

DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance together with seamless scalability. Recently, AWS announced a plug-in for Titan that allows it to use DynamoDB as a storage backend. This means you can now build a graph database using Titan and not worry about the performance, scalability, or operational management of storing your data.

Your vision for the network that will power your app is shown below and shows the three major parts of a graph: vertices (or nodes), edges, and properties.

A vertex (or node) represents an entity, such as a person or restaurant. In your graph, you have three types of vertex: customers, restaurants, and the type of cuisine served (called genre in the code examples).

An edge defines a relationship between two vertices. For example, a customer might visit a restaurant or a restaurant may serve food of a particular cuisine. An edge always has direction – it will be outgoing from one vertex and incoming to the other.

A property is a key-value pair that enriches a vertex or an edge. For example, a customer has a name or the customer might rate their experience when they visit a restaurant.
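To make the model concrete, a sketch of how these three element types could be created from the Gremlin shell is shown below. It is an illustration using the Blueprints API rather than the sample-data loader itself; the IDs are made up, and the property names follow the queries used later in this post:

customer = g.addVertex(null)                               // a customer vertex
customer.setProperty('userId', 'U9999')                    // hypothetical ID, for illustration only
restaurant = g.addVertex(null)                             // a restaurant vertex
restaurant.setProperty('restaurant_name', 'Luna Cafe')
visit = g.addEdge(null, customer, restaurant, 'visit')     // outgoing from customer, incoming to restaurant
visit.setProperty('visit_food', 2)                         // a property on the edge: food rated 'excellent'
g.commit()                                                 // Titan transactions must be committed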

After a short time, your app is ready to be released, albeit as a minimum viable product. The initial functionality of your app is very simple: your customer supplies a cuisine, such as ‘Pizza’ or ‘Sushi’, and the app returns a list of restaurants they might like to visit.

To show how this works in Titan, you can follow the instructions in the AWS Big Data Blog’s GitHub repository to load some sample data into your own Titan database, using DynamoDB as the backend store. The data used in this example is based on a data set provided by the UCI Machine Learning Repository [1]. By default, the example uses Amazon DynamoDB Local, a small client-side database and server that mimics the DynamoDB service. This component is intended to support local development and small-scale testing, and lets you save on provisioned throughput, data storage, and transfer fees.

Interaction with Titan is through a graph traversal language called Gremlin, in much the same way as you would use SQL to interact with an RDBMS. However, whereas SQL is declarative, Gremlin is implemented as a functional pipeline; the results of each operation in the query are piped to the next stage. This provides a degree of control on not just what results your query generates but also how it is executed. Gremlin is part of the Open Source Apache TinkerPop stack, which has become the de facto standard framework for graph databases and is supported by products such as Titan, Neo4j, and OrientDB.

Titan is written in Java, and its Java API is used to load the sample data by running Gremlin commands. The same Java API would be used by your microservices running in Lambda, calling through to DynamoDB to store the data. In fact, the data stored in DynamoDB is compressed and not human-readable (for more information about the storage format, see Titan Graph Modeling in DynamoDB).

For the purposes of this post, however, it’s easier to use the Gremlin REPL, which is written in Groovy. The instructions on GitHub show you how to start your Gremlin session.
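As a rough sketch of what that session looks like, a Titan graph backed by the DynamoDB storage plug-in is opened with a configuration along the following lines. Treat the property names and the local endpoint as assumptions taken from the plug-in’s sample configuration; the exact values to use are in the properties file referenced by the GitHub instructions:

import org.apache.commons.configuration.BaseConfiguration

conf = new BaseConfiguration()
conf.setProperty('storage.backend', 'com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager')
conf.setProperty('storage.dynamodb.client.endpoint', 'http://localhost:4567')   // DynamoDB Local endpoint (assumed)
g = TitanFactory.open(conf)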

A simple Gremlin query that finds restaurants based on a type of cuisine is shown below:

gremlin> g.V.has('genreId', 'Pizzeria').in.restaurant_name

==>La Fontana Pizza Restaurante and Cafe
==>Dominos Pizza
==>Little Cesarz
==>pizza clasica
==>Restaurante Tiberius

This introduces the concept of how graph queries work: you select one or more vertices and then use the language to walk (or traverse) across the graph. You can also see the functional pipeline in action as the results of each step are passed to the next step in the query. The query can be read as shown below.

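One way to read the pipeline, stage by stage:

// g.V                           -- start from every vertex in the graph
//  .has('genreId', 'Pizzeria')  -- keep only the genre vertex whose genreId is 'Pizzeria'
//  .in                          -- follow incoming edges to the restaurants that serve this genre
//  .restaurant_name             -- emit each restaurant's restaurant_name property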

The query gives us five restaurants to recommend to our customer. This query would be just as easy to run if your data were stored in an RDBMS, so at this point not much is gained by using a graph database. However, as more customers start using your app and the first feature requests come in, you start to feel the benefit of your decision.

Initial feedback from your customers is good. However, they tell you that although it’s great to get a recommendation based on a cuisine, it would be better if they could receive recommendations based on places their friends have visited. You quickly add a ‘friend’ feature to the app and change the Gremlin query that you use to provide recommendations.

This query assumes that a particular user (‘U1064’) has asked us to find a ‘Cafeteria’ restaurant that their friends have visited. The Gremlin syntax can be read as shown below.

This query uses a pattern called ‘backtrack’. You make a selection of vertices and ‘remember’ them. You then traverse the graph, selecting more nodes. Finally, you ‘backtrack’ to your remembered selection and reduce it to those vertices that have a path through to your current position.
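To make the pattern concrete, a traversal along these lines might look something like the sketch below. It is illustrative only, reusing the userId, friend, visit, and restaurant_genre names from the other queries in this post rather than reproducing the exact query:

gremlin> g.V.has('userId', 'U1064').out('friend').out('visit').as('x').out('restaurant_genre').has('genreId', 'Cafeteria').back('x').restaurant_name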

Again, this query could be executed in an RDBMS, but it would be complex. Because you would keep all customers in a single table, finding friends would involve looping back to join the table to itself. While it’s perfectly possible to do this in SQL, the syntax can become long, especially if you want to loop multiple times; for example, how many of my friends’ friends have visited the same restaurant as me? A more important problem is performance. Each SQL join introduces extra latency to the query, and you may find that, as your database grows, you can’t meet the strict latency requirements of a modern app. In my test system, Titan returned the answer to this query in 38 ms, while the RDBMS where I staged the data took over 0.3 seconds to resolve it, an order of magnitude difference!

Your new recommendations work well, but some customers are still not happy. Just because their friends visited a restaurant doesn’t mean that they enjoyed it; they only want recommendations to restaurants their friends actually liked. You update your app again and ask customers to rate their experience, using ‘0’ for poor, ‘1’ for good, and ‘2’ for excellent. You then modify the query to:

g.V.has('userId','U1101').out('friend').outE('visit').has('visit_food', T.gte, 1).as('x').inV.as('y').out('restaurant_genre').has('genreId', 'Seafood').back('x').transform{e, m -> [food: m.x.visit_food, name: m.y.restaurant_name]}.groupCount{it.name}.cap

==>{Restaurante y Pescaderia Tampico=1, Restaurante Marisco Sam=1, Mariscos El Pescador=2}

This query is based on a user (‘U1101’) asking for a seafood restaurant. The stages of the query are shown below.
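Stage by stage, the traversal above does roughly the following (the transform body is abbreviated):

// g.V.has('userId', 'U1101')      -- start at customer U1101
//  .out('friend')                 -- walk to their friends
//  .outE('visit')                 -- step onto the friends' 'visit' edges
//  .has('visit_food', T.gte, 1)   -- keep visits where the food was rated 1 ('good') or better
//  .as('x')                       -- remember the visit edge as 'x'
//  .inV.as('y')                   -- step to the restaurant that was visited, remembered as 'y'
//  .out('restaurant_genre')       -- walk to the restaurant's genre vertex
//  .has('genreId', 'Seafood')     -- keep only paths that reach the 'Seafood' genre
//  .back('x')                     -- backtrack to the remembered visit edge
//  .transform{...}                -- emit a [food, name] pair for each surviving visit
//  .groupCount{it.name}.cap       -- count the qualifying visits per restaurant name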

This query shows how you can filter on a property of an edge. When you traverse the ‘visit’ edge, you filter for those visits where the food rating was greater than or equal to 1. The query also shows how you can transform results from a pipeline into a new object. You build a simple object, with two properties (food rating and name), for each ‘hit’ against your query criteria. Finally, the query also demonstrates the ‘groupCount’ function. This aggregation provides a count of each unique name.

The net result of this query is that the ‘best’ seafood restaurant to recommend is ‘Mariscos El Pescador’, as your customer’s friends have made two visits in which they rated the food as ‘good’ or better.

The reputation of your app grows and more and more customers sign up. It’s great to take advantage of DynamoDB scalability; there’s no need to re-architect your solution as you gain more users, as your storage backend can scale to deal with millions or even hundreds of millions of customers.

Soon, it becomes apparent that most of your customers are using your app when they are out and about. You need to enhance your app so that it can make recommendations that are close to the customer. Fortunately, Titan comes with built-in geo queries. The query below imagines that customer ‘U1064’ is asking for a ‘Cafeteria’ and that you’ve captured the location of their mobile device as (22.165, -101.0):

g.V.has('userId', 'U1064').out('friend').outE('visit').has('visit_rating', T.gte, 2).has('visit_food', T.gte, 2).inV.as('x').out('restaurant_genre').has('genreId', 'Cafeteria').back('x').has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5)).as('b').transform{e, m -> m.b.restaurant_name + " distance " + m.b.restaurant_place.getPoint().distance(Geoshape.point(22.165, -101.00).getPoint())}

==>Luna Cafe distance 2.774053451453471
==>Cafeteria y Restaurant El Pacifico distance 3.064723519030348

This query is the same as before except that there’s an extra filter:

has('restaurant_place', WITHIN, Geoshape.circle(22.165, -101.00, 5))

Each restaurant vertex has a property called ‘restaurant_place’, which is a geo-point (a latitude and longitude). The filter restricts selection to restaurants whose ‘restaurant_place’ is within 5 km of the customer’s current location. The part of the query that transforms the output from the pipeline is modified to include the distance to the customer. You can use this to order your recommendations so the nearest is shown first.

Your app hits the big time as more and more customers use it to find a good dining experience. You are approached by one of the restaurants, which wants to run a promotion to acquire new customers. Their request is simple: they will pay you to send an in-app advert to your customers who are friends of people who have visited their restaurant, but who haven’t visited the restaurant themselves. Relieved that your app can finally make some money, you set about writing the query. This type of query follows an ‘except’ pattern:

gremlin> x = []
gremlin> g.V.has('RestaurantId','135052').in('visit').aggregate(x).out('friend').except(x).userId.order

The query assumes that RestaurantId 135052 has made the approach. The first line defines a variable ‘x’ as an array. The steps of the query are shown below.
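Step by step, the traversal reads roughly as follows:

// x = []                             -- an empty list to collect vertices into
// g.V.has('RestaurantId', '135052')  -- start at the restaurant running the promotion
//  .in('visit')                      -- customers who have visited it
//  .aggregate(x)                     -- collect those visitors into x before continuing
//  .out('friend')                    -- walk to the visitors' friends
//  .except(x)                        -- drop anyone already in x, i.e. anyone who has already visited
//  .userId.order                     -- return the remaining customers' IDs, sorted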

The ‘except’ pattern used in this query makes it very easy to select elements that have not been selected in a previous step. This makes queries such as the one above, or “who are a customer’s friends’ friends that are not already their friends?”, easy to resolve. Once again, you could write this query in SQL, but the syntax would be far more complex than the simple Gremlin query used above, and the multiple joins needed to resolve the query would affect performance.
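For example, the friends-of-friends query mentioned above could be sketched in the same style. This is illustrative only, using a hypothetical user ID; the first two aggregate steps collect the customer and their existing friends so that both are excluded from the final result:

gremlin> y = []
gremlin> g.V.has('userId', 'U1064').aggregate(y).out('friend').aggregate(y).out('friend').except(y).dedup.userId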

Summary

In this post, I’ve shown you how to build a simple graph database using Titan with DynamoDB for storage. Compared to a more traditional RDBMS approach, a graph database can offer many advantages when you need to model a complex network. Your queries will be easier to understand and you may well get better performance from using a storage engine geared towards graph traversal. Using DynamoDB for your storage gives the added benefit of a fully managed, scalable repository for storing your data. You can concentrate on producing an app that excites your customers rather than managing infrastructure.

If you have any questions or suggestions, please leave a comment below.

References

Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011

——————————————–

Related:

Scaling Writes on Amazon DynamoDB Tables with Global Secondary Indexes

Automating Analytic Workflows on AWS

Post Syndicated from Wangechi Doble original https://blogs.aws.amazon.com/bigdata/post/Tx1DPT262BQB7YF/Automating-Analytic-Workflows-on-AWS

Wangechi Doble is a Solutions Architect with AWS

Organizations are experiencing a proliferation of data. This data includes logs, sensor data, social media data, and transactional data, and resides in the cloud, on premises, or as high-volume, real-time data feeds. It is increasingly important to analyze this data: stakeholders want information that is timely, accurate, and reliable. This analysis ranges from simple batch processing to complex real-time event processing. Automating workflows can ensure that necessary activities take place when required to drive the analytic processes.

With Amazon Simple Workflow (Amazon SWF), AWS Data Pipeline, and AWS Lambda, you can build analytic solutions that are automated, repeatable, scalable, and reliable. In this post, I show you how to use these services to migrate and scale an on-premises data analytics workload.

Workflow basics

A business process can be represented as a workflow. Applications often incorporate a workflow as steps that must take place in a predefined order, with opportunities to adjust the flow of information based on certain decisions or special cases.

The following is an example of an ETL workflow:

A workflow decouples steps within a complex application. In the workflow above, bubbles represent steps or activities, diamonds represent control decisions, and arrows show the control flow through the process. This post shows you how to use Amazon SWF, AWS Data Pipeline, and AWS Lambda to automate this workflow.

Overview

SWF, Data Pipeline, and Lambda are designed for highly reliable execution of tasks, which can be event-driven, on-demand, or scheduled. The following table highlights the key characteristics of each service.

Feature              | Amazon SWF                             | AWS Data Pipeline                | AWS Lambda
Runs in response to  | Anything                               | Schedules                        | Events from AWS services / direct invocation
Execution order      | Orders execution of application steps  | Schedules data movement          | Reacts to event triggers / direct calls
Scheduling           | On-demand                              | Periodic                         | Event-driven / on-demand / periodic
Hosting environment  | Anywhere                               | AWS / on-premises                | AWS
Execution design     | Exactly once                           | Exactly once, configurable retry | At least once
Programming language | Any                                    | JSON                             | Supported languages

Let’s dive deeper into each of the services.  If you are already familiar with the services, skip to the section below titled "Scenario: An ecommerce reporting ETL workflow."

Amazon SWF

SWF allows you to build distributed applications in any programming language with components that are accessible from anywhere. It reduces infrastructure and administration overhead because you don’t need to run orchestration infrastructure. SWF provides durable, distributed-state management that enables resilient, truly distributed applications. Think of SWF as a fully-managed state tracker and task coordinator in the cloud.

SWF key concepts:

Workflows are collections of actions.

Domains are collections of related workflows.

Actions are tasks or workflow steps.

Activity workers implement actions.

Deciders implement a workflow’s coordination logic.

SWF works as follows:

A workflow starter kicks off your workflow execution. For example, this could be a web server frontend.

SWF receives the start workflow execution request and then schedules a decision task.

The decider receives the task from SWF, reviews the history, and applies the coordination logic to determine the activity that needs to be performed.

SWF receives the decision, schedules the activity task, and waits for the activity task to complete or time out.

SWF assigns the activity task to a worker, which performs the task and returns the results to SWF.

SWF receives the results of the activity, adds them to the workflow history, and schedules a decision task.

This process repeats itself for each activity in your workflow.

The graphic below is an overview of how SWF operates.

Source: Amazon Simple Workflow – Cloud-Based Workflow Management

To facilitate a workflow, SWF uses a decider to coordinate the various tasks by assigning them to workers. The tasks are the logical units of computing work, and workers are the functional components of the underlying application. Workers and deciders can be written in the programming language of your choice, and they can run in the cloud (such as on an Amazon EC2 instance), in your data center, or even on your desktop.

In addition, SWF supports Lambda functions as workers. This means that you can use SWF to manage the execution of Lambda functions in the context of a broader workflow. SWF provides the AWS Flow Framework, a programming framework that lets you build distributed SWF-based applications quickly and easily.

AWS Data Pipeline

Data Pipeline allows you to create automated, scheduled workflows to orchestrate data movement from multiple sources, both within AWS and on-premises. Data Pipeline can also run activities periodically; the minimum pipeline is actually just an activity. It is natively integrated with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, and Amazon EC2, and can be easily connected to third-party and on-premises data sources. Data Pipeline’s inputs and outputs are specified as data nodes within a workflow.

Data Pipeline key concepts:

A pipeline contains the definition of the dependent chain of data sources, destinations, and predefined or custom data processing activities required to execute your business logic.

Activities:

Arbitrary Linux applications – anything that you can run from the shell

Copies between different data source combinations

SQL queries

User-defined Amazon EMR jobs

A data node is a Data Pipeline–managed or user-managed resource.

Resources provide compute for activities, such as an Amazon EC2 instance or an Amazon EMR cluster, that perform the work defined by a pipeline activity.

Schedules drive orchestration execution.

A parameterized template lets you provide values for specially marked parameters within the template so that you can launch a customized pipeline.

Data Pipeline works as follows:

Define a task, business logic, and the schedule.

Data Pipeline checks for any preconditions. 

After preconditions are satisfied, the service executes your business logic. 

When a pipeline completes, a message is sent to the Amazon SNS topic of your choice. Data Pipeline also provides failure handling, SNS notifications in case of error, and built-in retry logic.

Below is a high-level diagram.

AWS Lambda

Lambda is an event-driven, zero-administration compute service. It runs your code in response to events from other AWS services or direct invocation from any web or mobile app and automatically manages compute resources for you. It allows you to build applications that respond quickly to new information, and automatically hosts and scales them for you.

Lambda key concepts:

Lambda function – You write a Lambda function, give it permission to access specific AWS resources, and then connect the function to your AWS or non-AWS resources.

Event sources publish events that cause your Lambda function to be invoked. Event sources can be:

An AWS service, such as Amazon S3 or Amazon SNS.

Other Amazon services, such as Amazon Echo.

An event source that you build, such as a mobile application.

Other Lambda functions that you invoke from within Lambda.

Amazon API Gateway, over HTTPS.

Lambda works as follows:

Write a Lambda function.

Upload your code to AWS Lambda.

Specify requirements for the execution environment, including memory, a timeout period, an IAM role, and the function you want to invoke within your code.

Associate your function with your event source.

Lambda executes the functions associated with that event source, either asynchronously or synchronously, depending on the event source.

Scenario: An ecommerce reporting ETL workflow

Here’s an example to illustrate the concepts that I have discussed so far. Transactional data from your company website is stored in an on-premises master database and replicated to a slave for reporting, ad hoc querying, and manual targeting through email campaigns. Your organization has become very successful, is experiencing significant growth, and needs to scale. Reporting is currently done using a read-only copy of the transactional database. Under these circumstances, this design does not scale well to meet the needs of high-volume business analytics.

The following diagram illustrates the current on-premises architecture.

You decide to migrate the transactional reporting data to a data warehouse in the cloud to take advantage of Amazon Redshift, a scalable data warehouse optimized for query processing and analytics.

The reporting ETL workflow for this scenario is similar to the one I introduced earlier. In this example, I move data from the on-premises data store to Amazon Redshift and focus on the following activities:

Incremental data extraction from an on-premises database

Data validation and transformation

Data loading into Amazon Redshift

I am going to decompose this workflow using AWS Data Pipeline, Amazon SWF, and AWS Lambda and highlight key aspects of each approach. I focus on three different approaches, each centered on an individual service. Please note that a solution using all three services together is possible, but it is not covered in this post.

AWS Data Pipeline reporting ETL workflow

Data Pipeline lets you define a dependent chain of data sources and destinations with an option to create data processing activities in a pipeline. You can schedule the tasks within the pipeline to perform various activities of data movement and processing. In addition to scheduling, you can also include failure and retry options in the data pipeline workflows.

The following diagram is an example of a Data Pipeline reporting ETL pipeline that moves data from the replicated slave database on-premises to Amazon Redshift:

The above pipeline performs the following activities:

ShellCommandActivity – Incrementally extracts data from an on-premises data store to Amazon S3 using a custom script hosted on an on-premises server with Task Runner installed.

EmrActivity – Launches a transient cluster that uses the extracted dataset as input, validates, and transforms it, and then outputs to an Amazon S3 bucket as illustrated in the blog post on ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce.

CopyActivity – Performs an Amazon Redshift COPY command on the transformed data and loads it into an Amazon Redshift table for analytics and reporting.

Data Pipeline checks the data for readiness. It also allows you to schedule and orchestrate the data movement while providing you with failure handling, SNS notifications in case of error, and built-in retry logic.  You can also specify preconditions as decision logic. For example, a precondition can check whether data is present in an S3 bucket before a pipeline copy activity attempts to load it into Amazon Redshift.

Data Pipeline is useful for creating and managing periodic, batch-processing, data-driven workflows. It optimizes the data processing experience, especially as it relates to data on AWS. For on-demand or real-time processing, Data Pipeline is not an ideal choice.

Amazon SWF on-demand reporting ETL workflow

You can use SWF as an alternative to Data Pipeline if you prioritize fine-grained, programmatic customization over the control flow and patterns of your workflow logic. SWF provides significant benefits, such as robust retry mechanisms upon failure, centralized application state tracking, and logical separation of application state and units of work.

The following diagram is an example SWF reporting ETL workflow.

The workflow’s decider controls the flow of execution from task to task. At a high level, the following activities take place in the above workflow:

An admin application sends a request to start the reporting ETL workflow.

The decider assigns the first task to on-premises data extraction workers to extract data from a transactional database.

Upon completion, the decider assigns the next task to the EMR Starter to launch an EMR ETL cluster to validate and transform the extracted dataset.

Upon completion, the decider assigns the last task to the Amazon Redshift Data Loader to load the transformed data into Amazon Redshift.

This workflow uses SWF for cron to automate failure handling and scaling in case you want to run your cron job on a pool of machines on-premises. Running on a pool of machines eliminates any single point of failure, which is not possible with the traditional operating system cron.

Because the workers and deciders are both stateless, you can respond to increased traffic by simply adding more workers and deciders as needed. You can do this using the Auto Scaling service for applications that are running on EC2 instances in the AWS cloud.

To eliminate the need to manage infrastructure in your workflow, SWF now provides a Lambda task so that you can run Lambda functions in place of, or alongside, traditional SWF activities. SWF invokes Lambda functions directly, so you don’t need to implement a worker program to execute a Lambda function (as you must with traditional activities).

The following example reporting ETL workflow replaces traditional SWF activity workers with Lambda functions.

In the above workflow, SWF sequences Lambda functions to perform the same tasks described in the first example workflow. It uses the Lambda-based database loader to load data into Amazon Redshift. Implementing activities as Lambda tasks using the AWS Flow Framework simplifies the workflow’s execution model because there are no servers to maintain.

SWF makes it easy to build and manage on-demand, scalable, distributed workflows. For event-driven reporting ETL, turn to Lambda.

AWS Lambda event-driven reporting ETL workflow

With Lambda, you can convert the reporting ETL pipeline from a traditional batch processing or on-demand workflow to a real-time, event processing workflow with zero administration. The following diagram is an example Lambda event-driven reporting ETL workflow.

The above workflow uses Lambda to perform event-driven processing without managing any infrastructure. 

At a high-level, the following activities take place:

An on-premises application uploads data into an S3 bucket.

S3 invokes a Lambda function to verify the data upon detecting an object-created event in that bucket.

Verified data is staged in another S3 bucket. You can batch files in the staging bucket, then trigger a Lambda function to launch an EMR cluster to transform the batched input files.

The Amazon Redshift Database Loader loads transformed data into the database.

In this workflow, Lambda functions perform specific activities in response to event triggers associated with AWS services. With no centralized control logic, workflow execution depends on the completion of an activity, type of event source, and more fine-grained programmatic flow logic within a Lambda function itself.

Summary

In this post, I have shown you how to migrate and scale an on-premises data analytics workload using AWS Data Pipeline, Amazon SWF, or AWS Lambda.  Specifically, you’ve learned how Data Pipeline can drive a reporting ETL pipeline to incrementally refresh an Amazon Redshift database as a batch process; how to use SWF in a hybrid environment for on-demand distributed processing; and finally, how to use Lambda to provide event-driven processing with zero administration. 

You can learn how customers are leveraging Lambda in unique ways to perform event-driven processing in the blog posts Building Scalable and Responsive Big Data Interfaces with AWS Lambda and How Expedia Implemented Near Real-time Analysis of Interdependent Datasets.

To get started quickly with Data Pipeline, you can use the built-in templates discussed in the blog post Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions.

To get started with SWF, you can launch a sample workflow in the AWS console, try a sample in one of the supported programming languages, or use the AWS Flow Framework.

As noted earlier, you can choose to build a reporting ETL solution that uses all three services together to automate your analytic workflow. Data Pipeline, SWF, and Lambda provide you with capability that scales to meet your processing needs. You can easily integrate these services to provide an end-to-end solution. You can build upon these concepts to automate not only your analytics workflows, but also your business processes and different types of applications that can exist anywhere in a scalable and reliable fashion.

If you have questions or suggestions, please leave a comment below.

————————————–

Related:

ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce


Automating Analytic Workflows on AWS

Post Syndicated from Wangechi Doble original https://blogs.aws.amazon.com/bigdata/post/Tx1DPT262BQB7YF/Automating-Analytic-Workflows-on-AWS

Wangechi Doble is a Solutions Architect with AWS

Organizations are experiencing a proliferation of data. This data includes logs, sensor data, social media data, and transactional data, and resides in the cloud, on premises, or as high-volume, real-time data feeds. It is increasingly important to analyze this data: stakeholders want information that is timely, accurate, and reliable. This analysis ranges from simple batch processing to complex real-time event processing. Automating workflows can ensure that necessary activities take place when required to drive the analytic processes.

With Amazon Simple Workflow (Amazon SWF), AWS Data Pipeline, and, AWS Lambda, you can build analytic solutions that are automated, repeatable, scalable, and reliable. In this post, I show you how to use these services to migrate and scale an on-premises data analytics workload.

Workflow basics

A business process can be represented as a workflow. Applications often incorporate a workflow as steps that must take place in a predefined order, with opportunities to adjust the flow of information based on certain decisions or special cases.

The following is an example of an ETL workflow:

A workflow decouples steps within a complex application. In the workflow above, bubbles represent steps or activities, diamonds represent control decisions, and arrows show the control flow through the process. This post shows you how to use Amazon SWF, AWS Data Pipeline, and AWS Lambda to automate this workflow.

Overview

SWF, Data Pipeline, and Lambda are designed for highly reliable execution of tasks, which can be event-driven, on-demand, or scheduled. The following table highlights the key characteristics of each service.

Feature

Amazon SWF

AWS Data Pipeline

AWS Lambda

Runs in response to

Anything

Schedules

Events from AWS services/direct invocation

Execution order

Orders execution of application steps

Schedules data movement

Reacts to event triggers / Direct calls

Scheduling

On-demand

Periodic

Event-driven / on-demand / periodic

Hosting environment

Anywhere

AWS/on-premises

AWS

Execution design

Exactly once

Exactly once, configurable retry

At least once

Programming language

Any

JSON

Supported languages

Let’s dive deeper into each of the services.  If you are already familiar with the services, skip to the section below titled "Scenario: An ecommerce reporting ETL workflow."

Amazon SWF

SWF allows you to build distributed applications in any programming language with components that are accessible from anywhere. It reduces infrastructure and administration overhead because you don’t need to run orchestration infrastructure. SWF provides durable, distributed-state management that enables resilient, truly distributed applications. Think of SWF as a fully-managed state tracker and task coordinator in the cloud.

SWF key concepts:

Workflows are collections of actions.

Domains are collections of related workflows.

Actions are tasks or workflow steps.

Activity workers implement actions.

Deciders implement a workflow’s coordination logic.

SWF works as follows:

A workflow starter kickoffs your workflow execution. For example, this could be a web server frontend.

SWF receives the start workflow execution request and then schedules a decision task.

The decider receives the task from SWF, reviews the history, and applies the coordination logic to determine the activity that needs to be performed.

SWF receives the decision, schedules the activity task, and waits for the activity task to complete or time out.

SWF assigns the activity to a worker that performs the task, and returns the results to Amazon SWF.

SWF receives the results of the activity, adds them to the workflow history, and schedules a decision task.

This process repeats itself for each activity in your workflow.

The graphic below is an overview of how SWF operates.

Source: Amazon Simple Workflow – Cloud-Based Workflow Management

To facilitate a workflow, SWF uses a decider to co-ordinate the various tasks by assigning them to workers.  The tasks are the logical units of computing work, and workers are the functional components of the underlying application. Workers and deciders can be written in the programming language of your choice, and they can run in the cloud (such as on an Amazon EC2 instance), in your data center, or even on your desktop.

In addition, SWF supports Lambda functions as workers. This means that you can use SWF to manage the execution of Lambda functions in the context of a broader workflow. SWF provides the AWS Flow Framework, a programming framework that lets you build distributed SWF-based applications quickly and easily.

AWS Data Pipeline

Data Pipeline allows you to create automated, scheduled workflows to orchestrate data movement from multiple sources, both within AWS and on-premises. Data Pipeline can also run activities periodically:  The minimum pipeline is actually just an activity.  It is natively integrated with Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, and Amazon EC2 and can be easily connected to third-party and on-premises data sources. Data Pipeline’s inputs and outputs are specified as data nodes within a workflow.

Data Pipeline key concepts:

A pipeline contains the definition of the dependent chain of data sources, destinations, and predefined or custom data processing activities required to execute your business logic.

Activities:

Arbitrary Linux applications – anything that you can run from the shell

Copies between different data source combinations

SQL queries

User-defined Amazon EMR jobs

A data node is a Data Pipeline–managed or user-managed resource.

Resources provide compute for activities, such as an Amazon EC2 instance or an Amazon EMR cluster, that perform the work defined by a pipeline activity.

Schedules drive orchestration execution.

A parameterized template lets you provide values for specially marked parameters within the template so that you can launch a customized pipeline.

Data Pipeline works as follows:

Define a task, business logic, and the schedule.

Data Pipeline checks for any preconditions. 

After preconditions are satisfied, the service executes your business logic. 

When a pipeline completes, a message is sent to the Amazon SNS topic of your choice. Data Pipeline also provides failure handling, SNS notifications in case of error, and built-in retry logic.

Below is a high-level diagram.

AWS Lambda

Lambda is an event-driven, zero-administration compute service. It runs your code in response to events from other AWS services or direct invocation from any web or mobile app and automatically manages compute resources for you. It allows you to build applications that respond quickly to new information, and automatically hosts and scales them for you.

Lambda key concepts:

Lambda function – You write a Lambda function, give it permission to access specific AWS resources, and then connect the function to your AWS or non-AWS resources.

Event sources publish events that cause your Lambda function to be invoked. Event sources can be:

AWS service, such as Amazon S3 and Amazon SNS.

Other Amazon services, such as Amazon Echo.

An event source you build, such as a mobile application.

Other Lambda functions that you invoke from within Lambda

Amazon API Gateway – over HTTPS

Lambda works as follows:

Write a Lambda function.

Upload your code to AWS Lambda.

Specify requirements for the information execution environment, including memory requirements, a timeout period, an IAM role, and the function you want to invoke within your code.

Associate your function with your event source.

Lambda executes any functions that are associated with it, either asynchronously or synchronously depending on your event source.

Scenario: An ecommerce reporting ETL workflow

Here’s an example to illustrate the concepts that I have discussed so far. Transactional data from your company website is stored in an on-premises master database and replicated to a slave for reporting, ad hoc querying, and manual targeting through email campaigns. Your organization has become very successful, is experiencing significant growth, and needs to scale. Reporting is currently done using a read-only copy of the transactional database. Under these circumstances, this design does not scale well to meet the needs of high-volume business analytics.

The following diagram illustrates the current on-premises architecture.

You decide to migrate the transactional reporting data to a data warehouse in the cloud to take advantage of Amazon Redshift, a scalable data warehouse optimized for query processing and analytics.

The reporting ETL workflow for this scenario is similar to the one I introduced earlier. In this example, I move data from the on-premises data store to Amazon Redshift and focus on the following activities:

Incremental data extraction from an on-premises database

Data validation and transformation

Data loading into Amazon Redshift

I am going to decompose this workflow using AWS Data Pipeline, Amazon SWF, and AWS Lambda and highlight key aspects of each approach.  I focus on three different approaches, with each approach focusing on an individual service. Please note that a solution using all three services together is possible, but not covered in this post.

AWS Data Pipeline reporting ETL workflow

Data Pipeline lets you define a dependent chain of data sources and destinations with an option to create data processing activities in a pipeline. You can schedule the tasks within the pipeline to perform various activities of data movement and processing. In addition to scheduling, you can also include failure and retry options in the data pipeline workflows.

The following diagram is an example of a Data Pipeline reporting ETL pipeline that moves data from the replicated slave database on-premises to Amazon Redshift:

The above pipeline performs the following activities:

ShellCommandActivity – Incrementally extracts data from an on-premises data store to Amazon S3 using a custom script hosted on an on-premises server with Task Runner installed.

EmrActivity – Launches a transient cluster that uses the extracted dataset as input, validates, and transforms it, and then outputs to an Amazon S3 bucket as illustrated in the blog post on ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce.

CopyActivity – Performs an Amazon Redshift COPY command on the transformed data and loads it into an Amazon Redshift table for analytics and reporting.

Data Pipeline checks the data for readiness. It also allows you to schedule and orchestrate the data movement while providing you with failure handling, SNS notifications in case of error, and built-in retry logic.  You can also specify preconditions as decision logic. For example, a precondition can check whether data is present in an S3 bucket before a pipeline copy activity attempts to load it into Amazon Redshift.

Data Pipeline is useful for creating and managing periodic, batch-processing, data-driven workflows. It optimizes the data processing experience, especially as it relates to data on AWS. For on-demand or real-time processing, Data Pipeline is not an ideal choice.

Amazon SWF on-demand reporting ETL workflow

You can use SWF as an alternative to Data Pipeline if you prioritize fine-grained, programmatic customization over the control flow and patterns of your workflow logic. SWF provides significant benefits, such as robust retry mechanisms upon failure, centralized application state tracking, and logical separation of application state and units of work.

The following diagram is an example SWF reporting ETL workflow.

The workflow’s decider controls the flow of execution from task to task. At a high level, the following activities take place in the above workflow:

An admin application sends a request to start the reporting ETL workflow.

The decider assigns the first task to on-premises data extraction workers to extract data from a transactional database.

Upon completion, the decider assigns the next task to the EMR Starter to launch an EMR ETL cluster to validate and transform the extracted dataset.

Upon completion, the decider assigns the last task to the Amazon Redshift Data Loader to load the transformed data into Amazon Redshift.

This workflow uses SWF for cron to automate failure handling and scaling in case you want to run your cron job on a pool of machines on-premises. In the latter case, this would eliminate any single point of failure, which is not possible with the traditional operating system cron.

Because the workers and deciders are both stateless, you can respond to increased traffic by simply adding more workers and deciders as needed. You can do this using the Auto Scaling service for applications that are running on EC2 instances in the AWS cloud.

To eliminate the need to manage infrastructure in your workflow, SWF now provides a Lambda task so that you can run Lambda functions in place of, or alongside, traditional SWF activities. SWF invokes Lambda functions directly, so you don’t need to implement a worker program to execute a Lambda function (as you must with traditional activities).

The following example reporting ETL workflow replaces traditional SWF activity workers with Lambda functions.

In the above workflow, SWF sequences Lambda functions to perform the same tasks described in the first example workflow. It uses the Lambda-based  database loader to load data into Amazon Redshift. Implementing activities as Lambda tasks using SWF Flow Framework simplifies the workflow’s execution model because there are no servers to maintain.

SWF makes it easy to build and manage on-demand, scalable, distributed workflows. For event-driven reporting ETL, turn to Lambda.

AWS Lambda event-driven reporting ETL workflow

With Lambda, you can convert the reporting ETL pipeline from a traditional batch processing or on-demand workflow to a real-time, event processing workflow with zero administration. The following diagram is an example Lambda event-driven reporting ETL workflow.

The above workflow uses Lambda to perform event-driven processing without managing any infrastructure. 

At a high-level, the following activities take place:

An on-premises application uploads data into an S3 bucket.

Upon detecting an object-created event in that bucket, S3 invokes a Lambda function to verify the data.

Verified data is staged in another S3 bucket. You can batch files in the staging bucket, then trigger a Lambda function to launch an EMR cluster to transform the batched input files.

The Amazon Redshift Database Loader loads transformed data into the database.

In this workflow, Lambda functions perform specific activities in response to event triggers from AWS services. With no centralized control logic, workflow execution depends on the completion of each activity, the type of event source, and the fine-grained programmatic flow logic within the Lambda functions themselves, as the sketch below illustrates.
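As a rough sketch of the verification step, the Python handler below reacts to an S3 object-created event, performs a placeholder validity check, and copies the object into a hypothetical staging bucket. A similar function, subscribed to the staging bucket, could batch keys and call the EMR run_job_flow API to launch the transformation cluster.

import urllib.parse

import boto3

s3 = boto3.client('s3')
STAGING_BUCKET = 'example-staging-bucket'   # hypothetical staging bucket

def handler(event, context):
    """Triggered by an S3 object-created event; verifies the upload and stages it."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Placeholder verification: reject empty objects. A real check might parse
        # the file, validate its schema, or compare row counts against a manifest.
        obj = s3.get_object(Bucket=bucket, Key=key)
        if obj['ContentLength'] == 0:
            raise ValueError('Empty upload rejected: s3://%s/%s' % (bucket, key))

        # Stage the verified object for the downstream EMR transformation.
        s3.copy_object(Bucket=STAGING_BUCKET,
                       Key='verified/' + key,
                       CopySource={'Bucket': bucket, 'Key': key})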

Summary

In this post, I have shown you how to migrate and scale an on-premises data analytics workload using AWS Data Pipeline, Amazon SWF, or AWS Lambda.  Specifically, you’ve learned how Data Pipeline can drive a reporting ETL pipeline to incrementally refresh an Amazon Redshift database as a batch process; how to use SWF in a hybrid environment for on-demand distributed processing; and finally, how to use Lambda to provide event-driven processing with zero administration. 

You can learn how customers are leveraging Lambda in unique ways to perform event-driven processing in the blog posts Building Scalable and Responsive Big Data Interfaces with AWS Lambda and How Expedia Implemented Near Real-time Analysis of Interdependent Datasets.

To get started quickly with Data Pipeline, you can use the built-in templates discussed in the blog post Using AWS Data Pipeline’s Parameterized Templates to Build Your Own Library of ETL Use-case Definitions.

To get started with SWF, you can launch a sample workflow in the AWS Management Console, try a sample in one of the supported programming languages, or use the AWS Flow Framework.

As noted earlier, you can choose to build a reporting ETL solution that uses all three services together to automate your analytics workflow. Data Pipeline, SWF, and Lambda provide capabilities that scale to meet your processing needs, and you can easily integrate these services into an end-to-end solution. You can build on these concepts to automate not only your analytics workflows, but also business processes and other kinds of applications, wherever they run, in a scalable and reliable fashion.

If you have questions or suggestions, please leave a comment below.

————————————–

Related:

ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce

 

Understanding the process of finding serious vulns

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2015/08/understanding-process-of-finding.html

Our industry tends to glamorize vulnerability research, with a growing number of bug reports accompanied by flashy conference presentations, media kits, and exclusive interviews. But for all that grandeur, the public understands relatively little about the effort that goes into identifying and troubleshooting the hundreds of serious vulnerabilities that crop up every year in the software we all depend on. It certainly does not help that many of the commercial security testing products are promoted with truly bombastic claims – and that some of the most vocal security researchers enjoy the image of savant hackers, seldom talking about the processes and toolkits they depend on to get stuff done.

I figured it may make sense to change this. Several weeks ago, I started trawling through the list of public CVE assignments, and then manually compiling a list of genuine, high-impact flaws in commonly used software. I tried to follow three basic principles:

For pragmatic reasons, I focused on problems where the nature of the vulnerability and the identity of the researcher is easy to ascertain. For this reason, I ended up rejecting entries such as CVE-2015-2132 or CVE-2015-3799.

I focused on widespread software – e.g., browsers, operating systems, network services – skipping many categories of niche enterprise products, WordPress add-ons, and so on. Good examples of rejected entries in this category include CVE-2015-5406 and CVE-2015-5681.

I skipped issues that appeared to be low impact, or where the credibility of the report seemed unclear. One example of a rejected submission is CVE-2015-4173.

To ensure that the data isn’t skewed toward more vulnerable software, I tried to focus on research efforts, rather than on individual bugs; where a single reporter was credited for multiple closely related vulnerabilities in the same product within a narrow timeframe, I would use only one sample from the entire series of bugs.

For the qualifying CVE entries, I started sending out anonymous surveys to the researchers who reported the underlying issues. The surveys open with a discussion of the basic method employed to find the bug:

How did you find this issue?

( ) Manual bug hunting
( ) Automated vulnerability discovery
( ) Lucky accident while doing unrelated work

If “manual bug hunting” is selected, several additional options appear:

( ) I was reviewing the source code to check for flaws.
( ) I studied the binary using a disassembler, decompiler, or a tracing tool.
( ) I was doing black-box experimentation to see how the program behaves.
( ) I simply noticed that this bug is being exploited in the wild.
( ) I did something else: ____________________

Selecting “automated discovery” results in a different set of choices:

( ) I used a fuzzer.
( ) I ran a simple vulnerability scanner (e.g., Nessus).
( ) I used a source code analyzer (static analysis).
( ) I relied on symbolic or concolic execution.
( ) I did something else: ____________________

Researchers who relied on automated tools are also asked about the origins of the tool and the computing resources used:

Name of tool used (optional): ____________________

Where does this tool come from?

( ) I created it just for this project.
( ) It’s an existing but non-public utility.
( ) It’s a publicly available framework.

At what scale did you perform the experiments?

( ) I used 16 CPU cores or less.
( ) I employed more than 16 cores.

Regardless of the underlying method, the survey also asks every participant about the use of memory diagnostic tools:

Did you use any additional, automatic error-catching tools – like ASAN
or Valgrind – to investigate this issue?

( ) Yes. ( ) Nope!

…and about the lengths to which the reporter went to demonstrate the bug:

How far did you go to demonstrate the impact of the issue?

( ) I just pointed out the problematic code or functionality.
( ) I submitted a basic proof-of-concept (say, a crashing test case).
( ) I created a fully-fledged, working exploit.

It also touches on the communications with the vendor:

Did you coordinate the disclosure with the vendor of the affected
software?

( ) Yes. ( ) No.

How long have you waited before having the issue disclosed to the
public?

( ) I disclosed right away. ( ) Less than a week. ( ) 1-4 weeks.
( ) 1-3 months. ( ) 4-6 months. ( ) More than 6 months.

In the end, did the vendor address the issue as quickly as you would
have hoped?

( ) Yes. ( ) Nope.

…and the channel used to disclose the bug – an area where we have seen some stark changes over the past five years:

How did you disclose it? Select all options that apply:

[ ] I made a blog post about the bug.
[ ] I posted to a security mailing list (e.g., BUGTRAQ).
[ ] I shared the finding on a web-based discussion forum.
[ ] I announced it at a security conference.
[ ] I shared it on Twitter or other social media.
[ ] We made a press kit or reached out to a journalist.
[ ] Vendor released an advisory.

The survey ends with a question about the motivation and the overall amount of effort that went into this work:

What motivated you to look for this bug?

( ) It’s just a hobby project.
( ) I received a scientific grant.
( ) I wanted to participate in a bounty program.
( ) I was doing contract work.
( ) It’s a part of my full-time job.

How much effort did you end up putting into this project?

( ) Just a couple of hours.
( ) Several days.
( ) Several weeks or more.

So far, the response rate for the survey is approximately 80%; because I only started in August, I currently don’t have enough answers to draw particularly detailed conclusions from the data set – this should change over the next couple of months. Still, I’m already seeing several well-defined if preliminary trends:

The use of fuzzers is ubiquitous (incidentally, of named projects, afl-fuzz leads the fray so far); the use of other automated tools, such as static analysis frameworks or concolic execution, appears to be unheard of – despite the undivided attention that such methods receive in academic settings.

Memory diagnostic tools, such as ASAN and Valgrind, are extremely popular – and are an untold success story of vulnerability research.

Most public vulnerability research appears to be done by people who work on it full-time, employed by vendors; hobby work and bug bounties follow closely.

Only a small minority of serious vulnerabilities appear to be disclosed anywhere outside a vendor advisory, making it extremely dangerous to rely on press coverage (or any other casual source) for evaluating personal risk.

Of course, some security work happens out of public view; for example, some enterprises have well-established and meaningful security assurance programs that likely prevent hundreds of security bugs from ever shipping in the reviewed code. Since it is difficult to collect comprehensive and unbiased data about such programs, there is always some speculation involved when discussing the similarities and differences between this work and public security research.

Well, that’s it! Watch this space for updates – and let me know if there’s anything you’d change or add to the questionnaire.
