Tag Archives: Netflix

Big Data AWS Training Course Gets Big Update

Post Syndicated from Michael Stroh original https://blogs.aws.amazon.com/bigdata/post/Tx3FR6JXY0HVTS3/Big-Data-AWS-Training-Course-Gets-Big-Update

Michael Stroh is Communications Manager for AWS Training & Certification

AWS offers a number of in-depth technical training courses, which we’re regularly updating in response to student feedback and changes to the AWS platform. Today I want to tell you about some exciting changes to Big Data on AWS, our most comprehensive training course on the AWS big data platform.

The 3-day class is primarily aimed at data scientists, analysts, solutions architects, and anybody else who wants to use AWS to handle their big data workloads. The course teaches you how to leverage Amazon Elastic MapReduce (EMR), Amazon Redshift, Amazon Kinesis, Amazon DynamoDB and the rest of the AWS big data platform (as well as several popular third-party tools) to get useful answers from your data at a speed and cost that suits your needs.

What’s new

So what’s different? For starters, the course was completely reorganized to talk about the AWS big data platform like a story—from data ingestion, to storage, to visualization—and make it easier to follow.

Customers also said they really wanted to hear more about Amazon Redshift and understand the differences between Amazon Redshift and Amazon EMR—where these services overlap, and where they’re different. So the new version of the course adds about 150% more Amazon Redshift-related content, including course modules on cluster architecture and optimization, and concepts critical to understanding Amazon Redshift, such as data warehousing and columnar data storage.

Also in response to customer feedback, the AWS Training team beefed up coverage of Hadoop programming frameworks, especially for Hive, Presto, Pig, and Spark. The Spark module, for example, now includes details on MLlib, Spark Streaming, and GraphX. There’s also a new course module on Hue, the popular Hadoop web interface, and a new hands-on lab on running Hue on Amazon EMR.

The course updates also respond to AWS's fast-evolving big data platform. So we added coverage of AWS Import/Export Snowball, Amazon Kinesis Firehose, and Amazon QuickSight—the data ingestion, streaming, and visualization services (respectively) announced at re:Invent 2015.

Other notable highlights of the revised course include:

More explanation of how Amazon Kinesis and Amazon Kinesis Streams work.

More focus on three different types of server-side encryption of data stored in Amazon S3 (SSE-C, SSE-S3, and SSE-KMS).

New hands-on lab featuring TIBCO Spotfire, the popular visualization and analytics tool.

Additional reference architectures and patterns for creating and hosting big data environments on AWS.

The revised course also includes new or improved case studies of The Weather Channel, Nasdaq, Netflix, AdRoll, and Kaiten Sushiro, a conveyor belt sushi chain that uses Amazon Kinesis and Amazon Redshift to help decide in real time which plates chefs should make next.

Taking the class

That’s just a sampling of the changes. To learn more, check out the course description for Big Data on AWS. Here’s a global list of upcoming classes.

If you’re thinking about taking the course, you should already have a basic familiarity with Apache Hadoop, SQL, MapReduce, and other common big data technologies and concepts—plus a working knowledge of core AWS services. Still ramping up? We recommend taking Big Data Technology Fundamentals and AWS Technical Essentials first.

If you have questions or suggestions, please leave a comment below.

Building a Recommendation Engine with Spark ML on Amazon EMR using Zeppelin

Post Syndicated from Guy Ernest original https://blogs.aws.amazon.com/bigdata/post/Tx6J5RM20WPG5V/Building-a-Recommendation-Engine-with-Spark-ML-on-Amazon-EMR-using-Zeppelin

Guy Ernest is a Solutions Architect with AWS

Many developers want to implement the famous Amazon model that was used to power the "People who bought this also bought these items" feature on Amazon.com. This model is based on a method called collaborative filtering: it takes items, such as movies or books, that were rated highly by a set of users and recommends them to other users who also gave them high ratings. This method works well in domains where explicit ratings or implicit user actions can be gathered and analyzed.

There have been many advancements in these collaborative filtering algorithms. In 2006, Netflix introduced the Netflix Prize: it shared a huge dataset (more than 2GB of compressed data about millions of users and thousands of movies) and offered 1 million dollars to whoever could design an algorithm that best improved its existing recommendation engine. The winning team was also asked to publish the algorithm and share it with the community. For several years, many teams across the world competed for the prestigious prize, and many algorithms were developed and published. You can read the official paper from the winning team or a summary of it.

In the previous blog posts about Amazon ML, we built various ML models, such as numeric regression, binary classification, and multi-class classification. Such models can be used for features like recommendation engines. In this post, we are not going to implement the complete set of algorithms that were used in the Amazon solution. Instead, we show you how to use a simpler algorithm that is included out-of-the-box in Spark MLlib for collaborative filtering, called Alternating Least Squares (ALS).
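
Before diving into the details, here is a minimal, hedged sketch of the MLlib ALS API that the rest of this post builds on; the toy ratings below are made up for illustration, and it assumes a SparkContext (sc) is already available, as it is in spark-shell or Zeppelin:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Three toy ratings: Rating(userId, productId, rating)
val toyRatings = sc.parallelize(Seq(
  Rating(1, 10, 5.0),
  Rating(1, 20, 3.0),
  Rating(2, 10, 4.0)
))

// ALS.train(ratings, rank, iterations, lambda)
val toyModel = ALS.train(toyRatings, 8, 10, 0.1)

// Predict the rating that user 2 would give product 20
val toyPrediction = toyModel.predict(2, 20)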

Why Spark?

Apache Spark is an open source project that has gained attention from analytics experts. In contrast to Hadoop MapReduce's two-stage, disk-based paradigm, Spark's multi-stage, in-memory primitives provide performance up to 100 times faster for certain applications. Because it is possible to load data into a cluster's memory and query it repeatedly, Spark is commonly used for iterative machine learning algorithms at scale. Furthermore, Spark includes MLlib, a library of common machine learning algorithms that can be easily leveraged in a Spark application. For an example, see the "Large-Scale Machine Learning with Spark on Amazon EMR" post on the AWS Big Data Blog.
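
To make the in-memory point concrete, here is a small sketch of the caching pattern; the bucket path and data layout are placeholders rather than anything from this post, and it assumes an existing SparkContext (sc):

// Parse the input once and cache it in executor memory
val points = sc.textFile("s3://<YOUR-BUCKET>/points.csv")
  .map(_.split(",").map(_.toDouble))
  .cache()

// Each iteration reuses the cached RDD instead of rereading from S3
var score = 0.0
for (i <- 1 to 10) {
  score += points.map(p => p(0) * p(1)).sum()
}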

Spark succeeds where the traditional MapReduce approach falls short, making it easy to develop and execute iterative algorithms. Many ML algorithms are based on iterative optimization, which makes Spark a great platform for implementing them.

Other open-source alternatives for building ML models are either relatively slow, such as Mahout using Hadoop MapReduce, or limited in their scale, such as Weka or R. Commercial managed-service alternatives, which take most of the complexity out of the process, are also available, such as Dato or Amazon Machine Learning, a service that makes it easy for developers of all skill levels to use machine learning technology. In this post, we explore Spark's MLlib and show why it is a popular tool for data scientists on AWS who are looking for a DIY solution.

Why Apache Zeppelin?

Data scientists and analysts love to interact with their data, share their findings, and collaborate with others. Visualization tools are a popular way to build and share business intelligence insights, and notebooks run in an interactive mode that lets others see the steps that led to the final analysis.

There are a few popular notebooks that can be used with Spark, mainly Apache Zeppelin and IPython. IPython is focused on Python, which is supported in Spark. In addition to Python, Zeppelin supports a few more languages (most importantly, Scala), and also integrates well with the Spark framework (Spark SQL, MLlib, and HiveQL using the HiveContext). Another alternative to explore is Jupyter, which extends IPython to multiple programming languages.
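
For example, a Scala paragraph in Zeppelin can register a temporary table that a following %sql paragraph queries and charts with Zeppelin's built-in visualization; the case class and table name below are purely illustrative, and the snippet assumes the sc and sqlContext objects provided by Zeppelin (Spark 1.x):

import sqlContext.implicits._

case class MovieRating(userId: Int, movieId: Int, rating: Double)

// Turn a small sample RDD into a DataFrame and expose it to Spark SQL
val sampleDF = sc.parallelize(Seq(
  MovieRating(1, 10, 5.0),
  MovieRating(2, 10, 4.0),
  MovieRating(2, 20, 3.5)
)).toDF()

sampleDF.registerTempTable("sample_ratings")
// In a following Zeppelin paragraph:
// %sql select movieId, avg(rating) as avg_rating from sample_ratings group by movieId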

You can also easily add Zeppelin to your Amazon EMR cluster when you launch it by selecting Zeppelin-Sandbox from the Applications to install list.

Why Amazon EMR?

Amazon EMR makes it easy to launch a Hadoop cluster and install many of the Hadoop ecosystem applications, such as Pig, Hive, HBase, and Spark. With EMR, data scientists can create clusters quickly and process vast amounts of data in a few clicks. EMR also integrates with other AWS big data services, such as Amazon S3 for low-cost durable storage, Amazon Kinesis for streaming data, and Amazon DynamoDB as a NoSQL datastore.

EMR gives you the flexibility to choose optimal instance types for different applications. For example, Spark caches data in memory for faster processing, so it is best to use instances with more memory (such as the R3 instance family). Also, EMR's ability to use Amazon EC2 Spot capacity can dramatically reduce the cost of training and retraining ML models. Most of the time, the Spot market price for larger instances such as the r3.4xlarge is around 10%-20% of the On-Demand price.

The combination is a powerful tool for data scientists who have a lot of data to process and are building predictive models for their business: EMR creates the Hadoop cluster and installs Spark and Zeppelin on it, Spark provides a rich language for data manipulation, Zeppelin provides a notebook interface with data visualization, and Spark MLlib provides implementations of some popular ML algorithms.

Launch Spark on EMR

To launch an EMR cluster with Spark and Zeppelin, use the AWS Management Console.

On the EMR console, choose Create cluster.

Choose Go to advanced options.

In the Software configuration section, in the Applications to be installed table, add both Spark and Zeppelin-Sandbox. You can remove Pig and Hue, because you don't need them on this cluster and removing them makes the cluster start faster.

When I showed this example to a team of AWS support engineers, they immediately checked the utilization of the Spark cluster and reviewed the cluster logs and metrics. Based on their experience, they advised me to configure Spark to use dynamic allocation of executors. When I followed their advice, I got a more than 5x improvement in the time to build the recommender model. To do that, add the following JSON to the software settings:

[{"classification":"spark-defaults", "properties":{"spark.serializer":"org.apache.spark.serializer.KryoSerializer", "spark.dynamicAllocation.enabled":"true"}, "configurations":[]}]

In the Hardware configuration section, change the instance type to r3.xlarge to provide additional memory to your cluster over the general purpose m3.xlarge instance types. In future clusters, you can modify the cluster to use larger instances (r3.4xlarge, for example), but for the amount of data you are going to use in this example, the smaller instance type is sufficient.

Keep the default number of instances at 3. A quick look at the EMR pricing page shows that running this cluster costs less than $1.50 per hour (EC2 + EMR price).

 

Give the cluster a name, such as "SparkML". Keep the other default values in the wizard.

In the Security and Access section, choose an EC2 key pair that is already available in the region in which you are running the cluster (see "oregon-key" in the example above), and for which you have access to the PEM file. If you don’t have a key pair in this region, follow the help link to the right of the EC2 key pair field.

After you select the EC2 key pair, complete the wizard and choose Create cluster.

Connect to your EMR cluster

After a few minutes, you can connect to the cluster.

In the cluster list, select the cluster to view the details.

Choose SSH beside the DNS address in the Master Public DNS field.

You may need to correct the location of the PEM file, and verify that you changed the permissions on the file:

chmod 400 ~/<YOUR-KEY-FILE>.pem

Connect to the cluster:

ssh -i ~/<YOUR-KEY-FILE>.pem" [email protected]<MASTER-PUBLIC-IP>.<REGION>.compute.amazonaws.com

On the SSH page are instructions for connecting to the cluster from a Windows machine. For detailed instructions for connecting to the cluster, see Connect to the Master Node Using SSH.

(Optional) Connect using Mosh

SSH is a secure way to connect to an EC2 instance in an EMR cluster, but it is also sensitive to connectivity issues, and you may often need to reconnect. To address this, you can install and use a tool called Mosh (Mobile Shell).

After you have connected to the cluster with the standard SSH connection, install Mosh on the server:

sudo yum install mosh

Proceed with the installation process on the master node. Mosh uses a UDP connection to keep the secure connection stable; therefore, you need to open these ports in the security group.

In the cluster list, select the cluster name to view the details.

On the cluster details page, select the security group of the master node.

On the Inbound tab, choose Edit and Add rule to allow UDP on ports 60000-60010, and choose Save.

You can restrict access to your IP address only, and only to the specific UDP ports that you set in your connection command. For the sake of simplicity in this public example, these settings should suffice.

Now install the Mosh client on your side of the connection from the Mosh website, and connect to your Spark cluster:

mosh –ssh="/usr/bin/ssh -i ~/<YOUR KEY FILE>.pem" [email protected]<MASTER-PUBLIC-IP>.<REGION>.compute.amazonaws.com –server="/usr/bin/mosh-server"

Connect to the Zeppelin notebook

There are several ways to connect to the UI on the master node. One method is to use a proxy extension in the browser, as explained in the pop-up from the management console or in the EMR documentation.

In this example, you can use an SSH tunnel between a local port (8157, for example) and the port that Zeppelin is listening on (8890 by default), as follows:

ssh -i <YOUR-KEY-FILE>.pem -N -L 8157:ec2-<MASTER-PUBLIC-IP>.<REGION>.compute.amazonaws.com:8890 hadoop@ec2-<MASTER-PUBLIC-IP>.<REGION>.compute.amazonaws.com

Or you can use the Mosh (Mobile Shell) connection, which I prefer:

mosh –ssh="/usr/bin/ssh -i <YOUR-KEY-FILE>.pem -N -L 8157:ec2-<MASTER-PUBLIC-IP>.<REGION>.compute.amazonaws.com:8890" [email protected] –server="/usr/bin/mosh-server"

Now, open a browser locally on your machine and point it to http://localhost:8157/, and you should see the home page of Zeppelin on your Spark cluster.

The green light at the top right of the page indicates that the notebook is connected to the Spark cluster. You can now open the preinstalled Tutorial notebook from the Notebook menu and choose Run all the notes. The tutorial demonstrates the interactivity of the notebook and its visualization capabilities and integration with Spark SQL.

Build the Recommender with SparkML

Now it is time to build the recommender model. Use the Spark MLlib tutorial for the dataset and the steps to build the model. The tutorial is based on an interesting and large dataset that was released by GroupLens from the MovieLens website. There are several datasets with 100K, 1M, 10M, and 20M ratings. The datasets can be downloaded from the Spark training site (for example, wget https://raw.githubusercontent.com/databricks/spark-training/master/data/movielens/large/movies.dat) or from the GroupLens site (for example, wget http://files.grouplens.org/datasets/movielens/ml-10m.zip).

After you have downloaded the dataset, upload the files to Amazon S3 in the same region in which your Spark cluster is running.

Note: The code example is part of the Databricks Spark training material (https://github.com/databricks/spark-training/) and AMP Camp (http://ampcamp.berkeley.edu/).

First, import the libraries to be used in building the ML model:

import java.io.File
import scala.io.Source

import org.apache.log4j.Logger
import org.apache.log4j.Level

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import org.apache.spark.mllib.recommendation.{ALS, Rating, MatrixFactorizationModel}

Choose Run this paragraph and you should see the echo of your imports.

Next, load the dataset into the Spark cluster. You need two types of data: movie information (the movies.dat file) and ratings information (the ratings.dat folder). Define the location of the files, and load them by specifying the delimiter (::) and the format of each record ((movieId, movieName) and (timestamp % 10, Rating(userId, movieId, rating)), respectively).

val movieLensHomeDir = "s3://emr.examples/movieLens/"

val movies = sc.textFile(movieLensHomeDir + "movies.dat").map { line =>
  val fields = line.split("::")
  // format: (movieId, movieName)
  (fields(0).toInt, fields(1))
}.collect.toMap

val ratings = sc.textFile(movieLensHomeDir + "ratings.dat").map { line =>
  val fields = line.split("::")
  // format: (timestamp % 10, Rating(userId, movieId, rating))
  (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}

To verify that the files were loaded correctly, you can count the number of ratings using the following:

val numRatings = ratings.count
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count

println("Got " + numRatings + " ratings from "
+ numUsers + " users on " + numMovies + " movies.")

Before building the ML model, it is important to split the dataset into a few parts: one for training (60%), one for validation (20%), and one for testing (20%), as follows:

val training = ratings.filter(x => x._1 < 6)
  .values
  .cache()
val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8)
  .values
  .cache()
val test = ratings.filter(x => x._1 >= 8).values.cache()

val numTraining = training.count()
val numValidation = validation.count()
val numTest = test.count()

println("Training: " + numTraining + ", validation: " + numValidation + ", test: " + numTest)

You should get an output similar to the following:

training: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[2615] at repartition at <console>:63
validation: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[2621] at repartition at <console>:63
test: org.apache.spark.rdd.RDD[org.apache.spark.mllib.recommendation.Rating] = MapPartitionsRDD[2623] at values at <console>:59
numTraining: Long = 3810355
numValidation: Long = 1269312
numTest: Long = 1268409
Training: 3810355, validation: 1269312, test: 1268409

Next, define the function used to evaluate the performance of the model. The common choice is Root Mean Squared Error (RMSE), and here is a Scala version of it:

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating))).values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}

Now you can use the error function to choose the best parameters for the training algorithm. The ALS algorithm takes three parameters: the rank of the matrix factors, the number of iterations, and the regularization parameter lambda. You can try different values for these parameters and measure the RMSE of each combination to select the best one:

val ranks = List(8, 12)
val lambdas = List(0.1, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  println("RMSE (validation) = " + validationRmse + " for the model trained with rank = "
    + rank + ", lambda = " + lambda + ", and numIter = " + numIter + ".")
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}

This stage can take a few minutes to complete.

The best combination seems to be the larger rank (12) and the smaller lambda (0.1). Now, run it on the test data:

// evaluate the best model on the test set
val testRmse = computeRmse(bestModel.get, test, numTest)

println("The best model was trained with rank = " + bestRank + " and lambda = " + bestLambda
+ ", and numIter = " + bestNumIter + ", and its RMSE on the test set is
" + testRmse + ".")

The output should be similar to the following:

testRmse: Double = 0.8157748896668603
The best model was trained with rank = 12 and lambda = 0.1, and numIter = 20, and its RMSE on the test set is 0.8157748896668603.

How well does this model compare to a more naive prediction, based on the average rating?

// create a naive baseline and compare it with the best model
val meanRating = training.union(validation).map(_.rating).mean
val baselineRmse =
  math.sqrt(test.map(x => (meanRating - x.rating) * (meanRating - x.rating)).mean)
val improvement = (baselineRmse - testRmse) / baselineRmse * 100
println("The best model improves the baseline by " + "%1.2f".format(improvement) + "%.")

You should see the following output:

meanRating: Double = 3.507898942981895
baselineRmse: Double = 1.059715654158681
improvement: Double = 23.038148392339163
The best model improves the baseline by 23.04%.

Use the model to make personal recommendations

After you have built the model, you can start using it to make recommendations for various users. To get the top 10 movie recommendations for one of your users (for example, user ID 100), run the following:

val candidates = sc.parallelize(movies.keys.toSeq)
val recommendations = bestModel.get
  .predict(candidates.map((100, _)))
  .collect()
  .sortBy(- _.rating)
  .take(10)

var i = 1
println("Movies recommended for you:")
recommendations.foreach { r =>
  println("%2d".format(i) + ": " + movies(r.product))
  i += 1
}

To get recommendations for "Comedy" movies, you can limit the candidates to this genre only. First, load the genre information, which is part of the movies.dat file as the third column:

val moviesWithGenres = sc.textFile(movieLensHomeDir + "movies.dat").map { line =>
  val fields = line.split("::")
  // format: (movieId, movieName, genre information)
  (fields(0).toInt, fields(2))
}.collect.toMap

Next, filter the movies to include only those tagged "Comedy":

val comedyMovies = moviesWithGenres.filter(_._2.matches(".*Comedy.*")).keys
val candidates = sc.parallelize(comedyMovies.toSeq)
val recommendations = bestModel.get
  .predict(candidates.map((100, _)))
  .collect()
  .sortBy(- _.rating)
  .take(5)

var i = 1
println("Comedy Movies recommended for you:")
recommendations.foreach { r =>
  println("%2d".format(i) + ": " + movies(r.product))
  i += 1
}

You can save the model to Amazon S3 so that you can load it again later and reuse it in your application.

// Save and load model
bestModel.get.save(sc, "s3://emr.examples/movieLens/model/recommendation")
val sameModel = MatrixFactorizationModel.load(sc, "s3://emr.examples/movieLens/model/recommendation")
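
As a side note, MatrixFactorizationModel also exposes a recommendProducts(user, num) helper that returns the top-N items for a user directly, so an application that loads the saved model can skip the manual sort-and-take shown earlier; a brief sketch:

// Top 10 recommendations for user 100, straight from the loaded model
val top10 = sameModel.recommendProducts(100, 10)
top10.foreach(r => println(movies(r.product) + " -> " + r.rating))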

After you are done with the Spark cluster, terminate your EMR cluster to stop paying for the resources.

Summary

In this example, you launched a Spark cluster with a Zeppelin notebook using Amazon EMR. You connected to the notebook from your local browser and built a recommendation engine using Spark MLlib and movie rating information from MovieLens. Finally, you used a machine learning model and made personal recommendations for some of the users in the dataset.

Using the concepts in this example, you can use your own data to build a recommender, evaluate its quality and performance, and apply the recommender to your use case. With a collaborative Zeppelin notebook, you can quickly visualize your data, explore the algorithm, and share the results with others in your organization.

If you have questions or suggestions, please leave a comment below.

——————

Related:

Large-Scale Machine Learning with Spark on Amazon EMR

Videos now available for AWS re:Invent 2015 Big Data Analytics sessions

Post Syndicated from Jonathan Fritz original https://blogs.aws.amazon.com/bigdata/post/Tx3D3UYOXB9XG6Z/Videos-now-available-for-AWS-re-Invent-2015-Big-Data-Analytics-sessions

For those of you who were able to attend AWS re:Invent 2015 last week or watched sessions through our live stream, thanks for participating in the conference. We hope you left feeling inspired to tackle your big data projects with tools in the AWS ecosystem and partner solutions. Also, we were excited for our customers to take the stage to discuss their data processing architectures and use cases.

If you missed a session in your schedule, don’t fret! We have added a large portion of re:Invent content to YouTube, and you can find videos of the big data sessions below.

Deep Dive Customer Use Cases

BDT303 – Running Spark and Presto on the Netflix Big Data Platform
BDT306 – The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS
BDT307 – Zero Infrastructure, Real-Time Data Collection, and Analytics (with Zillow)
BDT312 – Application Monitoring in a Post-Server World: Why Data Context Is Critical (with New Relic)
BDT318 – Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second 
BDT404 – Building and Managing Large-Scale ETL Data Flows with AWS Data Pipeline and Dataduct (with Coursera)
DAT308 – How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
DAT311 – Large-Scale Genomic Analysis with Amazon Redshift (with Human Longevity Bioinformatics)
BDT323 – Amazon EBS and Cassandra: 1 Million Writes Per Second on 60 Nodes (with CrowdStrike)
BDT322 – How Redfin and Twitter Leverage Amazon S3 to Build Their Big Data Platforms
MBL314 – Building World-Class, Cloud-Connected Products: How Sonos Leverages Amazon Kinesis

Services Sessions (Amazon EMR, Amazon Kinesis, Amazon Redshift, AWS Data Pipeline and Amazon DynamoDB)

BDT208 – A Technical Introduction to Amazon Elastic MapReduce (with AOL)
BDT209 – Amazon Elasticsearch Service for Real-time Analytics
BDT305 – Amazon EMR Deep Dive and Best Practices (with FINRA)
BDT313 – Amazon DynamoDB for Big Data
BDT319 – Amazon QuickSight: Cloud-native Business Intelligence
BDT320 – Streaming Data Flows with Amazon Kinesis Firehose
BDT401 – Amazon Redshift Deep Dive: Tuning and Best Practices (with TripAdvisor)
BDT316 – Offloading ETL to Amazon Elastic MapReduce (with Amgen)
BDT403 – Best Practices for Building Realtime Streaming Applications with Amazon Kinesis (with AdRoll)
BDT206 – How to Accelerate Your Projects with AWS Marketplace  (with Boeing)
DAT201 – Introduction to Amazon Redshift (with RetailMeNot)

Architecture and Best Practices

BDT205 – Your First Big Data Application On AWS
BDT317 – Building a Data Lake on AWS 
BDT310 – Big Data Architectural Patterns and Best Practices on AWS
BDT402 – Delivering Business Agility Using AWS (with Wipro)
BDT309 – Data Science & Best Practices for Apache Spark on Amazon EMR
DAT204 – NoSQL? No Worries: Building Scalable Applications on AWS NoSQL Services (with Expedia and Mapbox)
ISM303 – Migrating Your Enterprise Data Warehouse to Amazon Redshift (with Boingo Wireless and Edmunds)

Machine Learning

BDT311 – Deep Learning: Going Beyond Machine Learning (with Day1 Solutions)
BDT302 – Real-World Smart Applications With Amazon Machine Learning
BDT207 – Real-Time Analytics In Service of Self-Healing Ecosystems (with Netflix)

AWS Big Data Analytics Sessions at re:Invent 2015

Post Syndicated from Roy Ben-Alta original https://blogs.aws.amazon.com/bigdata/post/TxSMGKXY5S4VLX/AWS-Big-Data-Analytics-Sessions-at-re-Invent-2015

Roy Ben-Alta is a Business Development Manager – Big Data & Analytics

If you will be attending re:Invent 2015 in Las Vegas next week, you know that you’ll have many opportunities to learn more about Big Data & Analytics on AWS at the conference, and this year we have over 20 sessions! The following breakout sessions make up this year’s Big Data & Analytics track.

Didn’t register before the conference sold out? All sessions will be recorded and made available on YouTube after the conference. Also, all slide decks from the sessions will be made available on SlideShare.net after the conference.  

Click any of the following links to learn more about a breakout session.

Deep Dive Customer Use Cases

BDT303 – Running Spark and Presto on the Netflix Big Data Platform
BDT306 – The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS
BDT307 – Zero Infrastructure, Real-Time Data Collection, and Analytics
BDT312 – Application Monitoring in a Post-Server World: Why Data Context Is Critical
BDT314 – Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security
BDT318 – Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second 
BDT404 – Building and Managing Large-Scale ETL Data Flows with AWS Data Pipeline and Dataduct
DAT308 – How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
DAT311 – Large-Scale Genomic Analysis with Amazon Redshift
BDT323 – Amazon EBS and Cassandra: 1 Million Writes Per Second on 60 Nodes
BDT322 – How Redfin and Twitter Leverage Amazon S3 to Build Their Big Data Platforms
MBL314 – Building World-Class, Cloud-Connected Products: How Sonos Leverages Amazon Kinesis

Services Sessions (Amazon EMR, Amazon Kinesis, Amazon Redshift, AWS Data Pipeline and Amazon DynamoDB)

BDT208 – A Technical Introduction to Amazon Elastic MapReduce
BDT305 – Amazon EMR Deep Dive and Best Practices
BDT313 – Amazon DynamoDB for Big Data
BDT401 – Amazon Redshift Deep Dive: Tuning and Best Practices
BDT316 – Offloading ETL to Amazon Elastic MapReduce
BDT403 – Best Practices for Building Realtime Streaming Applications with Amazon Kinesis
BDT206 – How to Accelerate Your Projects with AWS Marketplace 
DAT201 – Introduction to Amazon Redshift

Architecture and Best Practices

BDT317 – Building a Data Lake on AWS 
BDT310 – Big Data Architectural Patterns and Best Practices on AWS
BDT402 – Delivering Business Agility Using AWS
BDT309 – Data Science & Best Practices for Apache Spark on Amazon EMR
DAT204 – NoSQL? No Worries: Building Scalable Applications on AWS NoSQL Services
ISM303 – Migrating Your Enterprise Data Warehouse to Amazon Redshift

Machine Learning

BDT311 – Deep Learning: Going Beyond Machine Learning
BDT302 – Real-World Smart Applications With Amazon Machine Learning
BDT207 – Real-Time Analytics In Service of Self-Healing Ecosystems

Workshops

BDT205 – Your First Big Data Application on AWS
WRK301 – Implementing Twitter Analytics using Spark Streaming, Scala, and AWS EMR
WRK303 – Real-world Data Warehousing with Redshift, Kinesis and AWS Marketplace
WRK304 – Recommendation Engine using Amazon Machine Learning in Real Time

AWS Big Data Analytics Sessions at re:Invent 2015

Post Syndicated from Roy Ben-Alta original https://blogs.aws.amazon.com/bigdata/post/TxSMGKXY5S4VLX/AWS-Big-Data-Analytics-Sessions-at-re-Invent-2015

Roy Ben-Alta is a Business Development Manager – Big Data & Analytics

If you will be attending re:Invent 2015 in Las Vegas next week, you know that you’ll have many opportunities to learn more about Big Data & Analytics on AWS at the conference–and this year we have over 20 sessions! The following breakout sessions compose this year’s Big Data & Analytics track.

Didn’t register before the conference sold out? All sessions will be recorded and made available on YouTube after the conference. Also, all slide decks from the sessions will be made available on SlideShare.net after the conference.  

Click any of the following links to learn more about a breakout session.

Deep Dive Customers Use cases

BDT303 – Running Spark and Presto on the Netflix Big Data Platform
BDT306 – The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS
BDT307 – Zero Infrastructure, RealTime Data Collection, and Analytics
BDT312 – Application Monitoring in a Post-Server World: Why Data Context Is Critical
BDT314 – Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security
BDT318 – Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second 
BDT404 – Building and Managing LargeScale ETL Data Flows with AWS Data Pipeline and Dataduct
DAT308 – How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
DAT311 – Large-Scale Genomic Analysis with Amazon Redshift
BDT323 – Amazon EBS and Cassandra: 1 Million Writes Per Second on 60 Nodes
BDT322 – How Redfin and Twitter Leverage Amazon S3 to Build Their Big Data Platforms
MBL314 – Building World-Class, Cloud-Connected Products: How Sonos Leverages Amazon Kinesis

Services Sessions (Amazon EMR, Amazon Kinesis, Amazon Redshift, AWS Data Pipeline and Amazon DynamoDB)

BDT208 – A Technical Introduction to Amazon Elastic MapReduce
BDT305 – Amazon EMR Deep Dive and Best Practices
BDT313 – Amazon DynamoDB for Big Data
BDT401 – Amazon Redshift Deep Dive: Tuning and Best Practices
BDT316 – Offloading ETL to Amazon Elastic MapReduce
BDT403 – Best Practices for Building Real-time Streaming Applications with Amazon Kinesis
BDT206 – How to Accelerate Your Projects with AWS Marketplace 
DAT201 – Introduction to Amazon Redshift

Architecture and Best Practices

BDT317 – Building a Data Lake on AWS 
BDT310 – Big Data Architectural Patterns and Best Practices on AWS
BDT402 – Delivering Business Agility Using AWS
BDT309 – Data Science & Best Practices for Apache Spark on Amazon EMR
DAT204 – NoSQL? No Worries: Building Scalable Applications on AWS NoSQL Services
SM303 – Migrating Your Enterprise Data Warehouse to Amazon Redshift

Machine Learning

BDT311 – Deep Learning: Going Beyond Machine Learning
BDT302 – Real-World Smart Applications With Amazon Machine Learning
BDT207 – Real-Time Analytics In Service of Self-Healing Ecosystems

Workshops

BDT205 – Your First Big Data Application on AWS
WRK301 – Implementing Twitter Analytics using Spark Streaming, Scala, and AWS EMR
WRK303 – Real-world Data Warehousing with Redshift, Kinesis and AWS Marketplace
WRK304 – Recommendation Engine using Amazon Machine Learning in Real-time

Be Sure to Comment on FCC’s NPRM 14-28

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2014/06/04/fcc-14-28.html

I remind everyone today, particularly USA Citizens, to be sure to comment
on the FCC’s Notice of Proposed Rulemaking (NPRM) 14-28. They even did a
sane thing and provided an email address you can write to rather than
using their poorly designed web forums, but PC Magazine published
relatively complete instructions for other ways. The deadline isn’t for a
while yet, but it’s worth getting it done so you don’t forget. Below is my
letter in case anyone is interested.

Dear FCC Commissioners,

I am writing in response to NPRM 14-28 — your request for comments regarding
the “Open Internet”.

I am a trained computer scientist and I work in the technology industry.
(I’m a software developer and software freedom activist.) I have subscribed
to home network services since 1989, starting with the Prodigy service, and
switching to Internet service in 1991. Initially, I used a PSTN single-pair
modem and eventually upgraded to DSL in 1999. I still have a DSL line, but
it’s sadly not much faster than the one I had in 1999, and I explain why
below.

In fact, I’ve watched the situation get progressively worse, not better,
since the Telecommunications Act of 1996. While my download speeds are a
little bit faster than they were in the late 1990s, I now pay
substantially more for only small increases in upload speed, even in
major urban markets. In short, it’s become increasingly difficult
to actually purchase true Internet connectivity service anywhere in the
USA. But first, let me explain what I mean by “true Internet
connectivity”.

The Internet was created as a peer-to-peer medium where all nodes were
equal. In the original design of the Internet, every device had its own
IP address and, if the user wanted, that device could be addressed
directly and fully by any other device on the Internet. For its part,
the network in between the two nodes was intended to merely move the
packets between those nodes as quickly as possible — treating all those
packets the same way, and analyzing those packets only with publicly
available algorithms that everyone agreed were correct and fair.

Of course, the companies who typically appeal to (or even fight) the FCC
want the true Internet to simply die. They seek to turn the promise of
a truly peer-to-peer network of equality into a traditional broadcast
medium that they control. They frankly want to manipulate the Internet
into a mere television broadcast system (with the only improvement to
that being “more stations”).

Because of this, the following three features of the Internet —
inherent in its design — are now extremely difficult for individual
home users to purchase at reasonable cost from so-called “Internet
providers” like Time Warner, Verizon, and Comcast:

A static IP address, which allows the user to be a true, equal node on
the Internet. (And, related: IPv6 addresses, which could end the claim
that static IP addresses are a precious resource.)

An unfiltered connection that allows the user to run their own
webserver, email server, and the like. (Most of these companies block
TCP ports 80 and 25 at the least, and usually many more ports, too; see
the sketch after this list.)

Reasonable choices in the upload/download speed tradeoff.
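
To make the port-blocking point concrete, here is a minimal Python 3
sketch, meant to be run from a host outside the home network, that probes
a few TCP ports on a home connection’s public address. HOME_IP is a
placeholder (a documentation address), and a timeout does not by itself
prove ISP filtering: a local firewall or the absence of a listening server
produces the same result, so treat the output only as a starting point.

    #!/usr/bin/env python3
    """Probe a few TCP ports on a home connection from an outside host."""
    import socket

    HOME_IP = "203.0.113.10"  # placeholder (documentation address); use your own public IP
    PORTS = {25: "SMTP", 80: "HTTP", 443: "HTTPS", 5060: "SIP"}
    TIMEOUT_SECONDS = 5

    def probe(host, port):
        """Return a rough verdict for one TCP port on the given host."""
        try:
            with socket.create_connection((host, port), timeout=TIMEOUT_SECONDS):
                return "open"
        except ConnectionRefusedError:
            return "closed (connection refused)"
        except OSError:
            return "filtered or unreachable (timed out or no route)"

    if __name__ == "__main__":
        for port, name in sorted(PORTS.items()):
            print(f"{name:>5} (tcp/{port}): {probe(HOME_IP, port)}")

A dedicated scanner such as nmap gives a far more complete picture; the
sketch only shows how simple the check is in principle.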

For example, in New York, I currently pay nearly $150/month to an
independent ISP just to have a static, unfiltered IP address with 10
Mbps down and 2 Mbps up. I work from home and the 2 Mbps up is
incredibly slow for modern usage. However, I still live in the Slowness
because upload speeds greater than that are prohibitively expensive
from any provider.

In other words, these carriers have designed their networks to
prioritize all downloading over all uploading, and to purposely place
the user behind many levels of Network Address Translation and network
filtering. In this environment, many Internet applications simply do
not work (or require complex work-arounds that disable key features).
As an example: true diversity in VoIP accessibility and service has
almost entirely been superseded by proprietary single-company services
(such as Skype) because SIP, designed by the IETF (in part) for VoIP
applications, did not fully anticipate that nearly every user would be
behind NAT and unable to use SIP without complex work-arounds.
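
As a rough way to see whether a machine sits behind NAT, one can compare
the address the operating system chooses for outbound traffic with the
address the rest of the Internet reports back. The sketch below assumes
Python 3 and some IP-echo endpoint (AWS’s checkip service is used as an
example); a proper STUN client would do the same job more rigorously.

    #!/usr/bin/env python3
    """Rough NAT check: compare the local outbound address with the public one."""
    import socket
    import urllib.request

    # Assumed IP-echo endpoint; any service (or a STUN client) that reports
    # your publicly visible address would work here.
    CHECKIP_URL = "https://checkip.amazonaws.com"

    def local_outbound_ip():
        """Address the OS selects for outbound traffic; no packets are sent."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(("8.8.8.8", 53))  # connect() on a UDP socket only picks a route
            return s.getsockname()[0]

    def public_ip():
        """Address the outside world sees, as reported by the echo endpoint."""
        with urllib.request.urlopen(CHECKIP_URL, timeout=10) as response:
            return response.read().decode().strip()

    if __name__ == "__main__":
        local, public = local_outbound_ip(), public_ip()
        print(f"local outbound address:   {local}")
        print(f"publicly visible address: {public}")
        if local == public:
            print("Addresses match: this host appears to be directly addressable.")
        else:
            print("Addresses differ: traffic passes through at least one layer of NAT.")

Matching addresses do not guarantee that every port is reachable, of
course; upstream filtering and carrier-grade NAT can still interfere,
which is exactly the situation described above.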

I believe this disastrous situation centers around problems with the
Telecommunications Act of 1996. While the ILECs (incumbent local
exchange carriers) are theoretically required to license network
infrastructure fairly at bulk rates to CLECs (competitive local exchange
carriers), I’ve frequently seen — both professionally and personally —
wars waged against CLECs by ILECs. CLECs simply can’t offer their own
types of services that merely “use” the ILECs’ connectivity. The
technical restrictions placed by ILECs force CLECs to offer the same
style of service the ILEC offers, and at a higher price (to cover their
additional overhead in dealing with the CLECs)! It’s no wonder there are
hardly any CLECs left.

Indeed, in my 25-year career as a technologist, I’ve seen many nasty
tricks by Verizon here in NYC, such as purposeful work-slowdowns in
resolution of outages and Verizon technicians outright lying to me and
to CLEC technicians about the state of their network. For my part, I
stick with one of the last independent ISPs in NYC, but I suspect they
won’t be able to keep their business going for long. Verizon either (a)
buys up any CLEC that looks too powerful, or, (b) if Verizon can’t buy
them, Verizon slowly squeezes them out of business with dirty tricks.

The end result is that we don’t have real options for true Internet
connectivity for home or on-site business use. I’m already priced
out of getting a 10 Mbps upload with a static IP and all ports usable.
I suspect within 5 years, I’ll be priced out of my current 2 Mbps upload
with a static IP and all ports usable.

I realize the problems that most users are concerned about on this issue
relate to their ability to download bytes from third-party companies
like Netflix. Therefore, it’s all too easy for Verizon to play out this
argument as if it’s big companies vs. big companies.

However, the real fallout from the current system is that the cost for
personal Internet connectivity that allows individuals equal existence
on the network is so high that few bother. The consequence, thus, is
that only those who are heavily involved in the technology industry even
know what types of applications would be available if everyone had a
static IP with all ports usable and equal upload and download speeds
of 10 Mbps or higher.

Yet, that’s the exact promise of network connectivity that I was taught
about as an undergraduate in Computer Science in the early 1990s. What
I see today is the dystopian version of the promise. My generation of
computer scientists have been forced to constrain their designs of
Internet-enabled applications to fit a model that the network carriers
dictate.

I realize you can’t possibly fix all these social ills in the network
connectivity industry with one rule-making, but I hope my comments have
perhaps given a slightly different perspective of what you’ll hear from
most of the other commenters on this issue. I thank you for reading my
comments and would be delighted to talk further with any of your staff
about these issues at your convenience.

Sincerely,

Bradley M. Kuhn,
a citizen of the USA since birth, currently living in New York, NY.