All posts by Jonathan Fritz

Meet the Amazon EMR Team this Friday at a Tech Talk & Networking Event in Mountain View

Post Syndicated from Jonathan Fritz original https://aws.amazon.com/blogs/big-data/meet-the-amazon-emr-team-this-friday-at-a-tech-talk-networking-event-in-mountain-view/

Want to change the world with Big Data and Analytics? Come join us on the Amazon EMR team in Amazon Web Services!

Meet the Amazon EMR team this Friday, April 7th, from 5:00 – 7:30 PM at Michael’s at Shoreline in Mountain View. We’ll feature short tech talks by EMR leadership on the past, present, and future of the Apache Hadoop and Spark ecosystem and EMR. You’ll also meet EMR engineers who are eager to discuss the challenges and opportunities involved in building the EMR service and running the latest open-source big data frameworks, like Spark and Presto, at massive scale. We’ll give out several door prizes, including an Amazon Echo, an Amazon Dot, a Kindle, and a Fire TV Stick!

Amazon EMR is a web service that enables customers to run massive clusters with distributed big data frameworks like Apache Hadoop, Hive, Tez, Flink, Spark, Presto, HBase, and more, with the ability to effortlessly scale up and down as needed. We run a large number of customer clusters, enabling processing on vast datasets.

We are developing innovative new features including our next-generation cluster management system, improvements for real-time processing of big data, and ways to enable customers to more easily interact with their data. We’re looking for top engineers to build them from the ground up.

We have recently delivered a number of new features across the service.

Interested? We hope you can make it! Please RSVP on Eventbrite.

Respond to State Changes on Amazon EMR Clusters with Amazon CloudWatch Events

Post Syndicated from Jonathan Fritz original https://aws.amazon.com/blogs/big-data/respond-to-state-changes-on-amazon-emr-clusters-with-amazon-cloudwatch-events/

Jonathan Fritz is a Senior Product Manager for Amazon EMR

Customers can take advantage of the Amazon EMR API to create and terminate EMR clusters, scale clusters using Auto Scaling or manual resizing, and submit and run Apache Spark, Apache Hive, or Apache Pig workloads. These decisions are often triggered by cluster state-related information.

Previously, you could use the “describe” and “list” API operations to find the relevant information about your EMR clusters and the associated instance groups, steps, and Auto Scaling policies. However, programmatic applications that check for resource state changes and then post notifications or take actions have to poll these API operations, which results in slower end-to-end reaction times and more management overhead than an event-driven architecture.

With new support for Amazon EMR in Amazon CloudWatch Events, you can now be quickly notified of, and programmatically respond to, state changes in your EMR clusters. These events are also displayed in the Amazon EMR console, on the Cluster Details page in the Events section.

There are four new EMR event types:

  • Cluster State Change
  • Instance Group State Change
  • Step State Change
  • Auto Scaling State Change

CloudWatch Events allows you to create filters and rules to match these events and route them to Amazon SNS topics, AWS Lambda functions, Amazon SQS queues, streams in Amazon Kinesis Streams, or built-in targets. You then have the ability to programmatically act on these events, including sending emails and SMS messages, running retry logic in Lambda, or tracking the state of running steps. For more information about the sample events generated for each event type, see the CloudWatch Events documentation.

For example, you can use the CloudWatch Events console to route EMR step failure events to Lambda for automated retry logic and to SNS to push a notification to an email alias.


You can create rules for EMR event types in the CloudWatch Events console, AWS CLI, or the AWS SDKs using the CloudWatch Events API.
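
For instance, a minimal AWS CLI sketch of the rule described above might look like the following. The event pattern fields and the SNS topic ARN are illustrative assumptions rather than values from this post, so check them against the EMR sample events in the CloudWatch Events documentation:

# Create a rule that matches EMR step state-change events with a FAILED state
# (detail-type and detail fields are assumptions; verify against the EMR sample events)
aws events put-rule \
  --name emr-step-failure \
  --event-pattern '{"source":["aws.emr"],"detail-type":["EMR Step Status Change"],"detail":{"state":["FAILED"]}}'

# Route matching events to a hypothetical SNS topic
aws events put-targets \
  --rule emr-step-failure \
  --targets Id=1,Arn=arn:aws:sns:us-east-1:123456789012:emr-step-failures

A Lambda function can be attached as an additional target on the same rule by passing its function ARN and granting CloudWatch Events permission to invoke it.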

If you have any questions or would like to share an interesting use case about events and notifications with EMR, please leave a comment below.


Related

Dynamically Scale Applications on Amazon EMR with Auto Scaling


Dynamically Scale Applications on Amazon EMR with Auto Scaling

Post Syndicated from Jonathan Fritz original https://aws.amazon.com/blogs/big-data/dynamically-scale-applications-on-amazon-emr-with-auto-scaling/

Jonathan Fritz is a Senior Product Manager for Amazon EMR

Customers running Apache Spark, Presto, and the Apache Hadoop ecosystem take advantage of Amazon EMR’s elasticity to save costs by terminating clusters after workflows are complete and resizing clusters with low-cost Amazon EC2 Spot Instances. For instance, customers can create clusters for daily ETL or machine learning jobs and shut them down when they complete, or scale out a Presto cluster serving BI analysts during business hours for ad hoc, low-latency SQL on Amazon S3.

With new support for Auto Scaling in Amazon EMR releases 4.x and 5.x, customers can now add (scale out) and remove (scale in) nodes on a cluster more easily. Scaling actions are triggered automatically by Amazon CloudWatch metrics provided by EMR at five-minute intervals, including several YARN metrics related to memory utilization, applications pending, and HDFS utilization.

In EMR release 5.1.0, we introduced two new metrics, YARNMemoryAvailablePercentage and ContainerPendingRatio, which serve as useful cluster utilization metrics for scalable, YARN-based frameworks like Apache Spark, Apache Tez, and Apache Hadoop MapReduce. Additionally, customers can use custom CloudWatch metrics in their Auto Scaling policies.

The following is an example Auto Scaling policy on an instance group that scales one instance at a time, between a minimum of 10 and a maximum of 40 instances. The instance group scales out when the memory available in YARN drops below 15%, and scales in when this metric exceeds 75%. The instance group also scales out when the ratio of pending YARN containers to allocated YARN containers reaches 0.75.

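The policy itself appeared in the original post as a console screenshot. The following AWS CLI sketch expresses an equivalent policy; the cluster and instance group IDs are placeholders, and the JSON field names are based on the EMR Auto Scaling API and should be verified against the Auto Scaling with Amazon EMR documentation:

# Attach an Auto Scaling policy to an existing instance group (IDs are placeholders)
aws emr put-auto-scaling-policy \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-group-id ig-XXXXXXXXXXXX \
  --auto-scaling-policy '{
    "Constraints": {"MinCapacity": 10, "MaxCapacity": 40},
    "Rules": [
      {
        "Name": "ScaleOutOnLowYARNMemory",
        "Action": {"SimpleScalingPolicyConfiguration":
          {"AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": 1, "CoolDown": 300}},
        "Trigger": {"CloudWatchAlarmDefinition":
          {"MetricName": "YARNMemoryAvailablePercentage", "ComparisonOperator": "LESS_THAN",
           "Threshold": 15, "Statistic": "AVERAGE", "Period": 300, "EvaluationPeriods": 1, "Unit": "PERCENT"}}
      },
      {
        "Name": "ScaleInOnHighYARNMemory",
        "Action": {"SimpleScalingPolicyConfiguration":
          {"AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": -1, "CoolDown": 300}},
        "Trigger": {"CloudWatchAlarmDefinition":
          {"MetricName": "YARNMemoryAvailablePercentage", "ComparisonOperator": "GREATER_THAN",
           "Threshold": 75, "Statistic": "AVERAGE", "Period": 300, "EvaluationPeriods": 1, "Unit": "PERCENT"}}
      },
      {
        "Name": "ScaleOutOnPendingContainers",
        "Action": {"SimpleScalingPolicyConfiguration":
          {"AdjustmentType": "CHANGE_IN_CAPACITY", "ScalingAdjustment": 1, "CoolDown": 300}},
        "Trigger": {"CloudWatchAlarmDefinition":
          {"MetricName": "ContainerPendingRatio", "ComparisonOperator": "GREATER_THAN_OR_EQUAL",
           "Threshold": 0.75, "Statistic": "AVERAGE", "Period": 300, "EvaluationPeriods": 1}}
      }
    ]
  }'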

Additionally, customers can now configure the scale-down behavior when nodes are terminated from their cluster on EMR 5.1.0. By default, EMR now terminates nodes only during a scale-in event at the instance hour boundary, regardless of when the request was submitted. Because EC2 charges per full hour regardless of when the instance is terminated, this behavior enables applications running on your cluster to use instances in a dynamically scaling environment more cost effectively.

Alternatively, customers can select the behavior that was the default for EMR releases earlier than 5.1.0, which blacklists nodes and drains tasks from them before terminating, regardless of proximity to the instance hour boundary. With either behavior, EMR removes the least active nodes first and blocks termination if it could lead to HDFS corruption.
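
As a brief sketch, the scale-down behavior is chosen when the cluster is created; the flag name and values shown below are assumptions to verify against the EMR documentation:

# Drain tasks from nodes before terminating them, rather than waiting for the
# instance hour boundary (the behavior that was the default before release 5.1.0)
aws emr create-cluster \
  --name MyAutoScalingCluster \
  --release-label emr-5.1.0 \
  --instance-type m3.xlarge --instance-count 10 \
  --use-default-roles \
  --scale-down-behavior TERMINATE_AT_TASK_COMPLETION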

You can create or modify Auto Scaling policies using the EMR console, AWS CLI, or the AWS SDKs with the EMR API. To enable Auto Scaling, EMR also requires an additional IAM role to grant permission for Auto Scaling to add and terminate capacity. For more information, see Auto Scaling with Amazon EMR. If you have any questions or would like to share an interesting use case about Auto Scaling on EMR, please leave a comment below.


Related

Use Apache Flink on Amazon EMR



Encrypt Data At-Rest and In-Flight on Amazon EMR with Security Configurations

Post Syndicated from Jonathan Fritz original https://aws.amazon.com/blogs/big-data/encrypt-data-at-rest-and-in-flight-on-amazon-emr-with-security-configurations/

Customers running analytics, stream processing, machine learning, and ETL workloads on personally identifiable information, health information, and financial data have strict requirements for encryption of data at-rest and in-transit. The Apache Spark and Hadoop ecosystems lend themselves to these big data use cases, and customers have asked us to provide a quick and easy way to encrypt data at-rest and data in-transit between nodes in each execution framework.

With the release of security configurations for Amazon EMR release 5.0.0 and 4.8.0, customers can now easily enable encryption for data at-rest in Amazon S3, HDFS, and local disk, and enable encryption for data in-flight in the Apache Spark, Apache Tez, and Apache Hadoop MapReduce frameworks.

Security configurations make it easy to specify the encryption keys and certificates to use, ranging from AWS Key Management Service to supplying your own custom encryption materials provider (for an example of custom providers, see the Nasdaq post about EMRFS and Amazon S3 client-side encryption). Additionally, you can apply a security configuration to multiple clusters, making it easy to standardize your security settings. For instance, this makes it easy for customers to encrypt data across their HIPAA-compliant Amazon EMR workloads.

The following is an example security configuration specifying SSE-KMS for Amazon S3 encryption (using EMRFS), an AWS KMS key for local disk encryption (which will also encrypt HDFS blocks), and a set of TLS certificates in Amazon S3 for applications that require them for encryption in-transit:
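
The configuration itself was shown as an image in the original post. The following AWS CLI sketch illustrates what such a security configuration might look like; the configuration name, KMS key ARN, and certificate location are placeholders, and the JSON schema should be confirmed against the EMR security configuration documentation:

# Create a reusable security configuration (ARNs and S3 paths are placeholders)
aws emr create-security-configuration \
  --name MyEncryptionConfig \
  --security-configuration '{
    "EncryptionConfiguration": {
      "EnableAtRestEncryption": true,
      "EnableInTransitEncryption": true,
      "AtRestEncryptionConfiguration": {
        "S3EncryptionConfiguration": {
          "EncryptionMode": "SSE-KMS",
          "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
        },
        "LocalDiskEncryptionConfiguration": {
          "EncryptionKeyProviderType": "AwsKms",
          "AwsKmsKey": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"
        }
      },
      "InTransitEncryptionConfiguration": {
        "TLSCertificateConfiguration": {
          "CertificateProviderType": "PEM",
          "S3Object": "s3://my-bucket/my-certs.zip"
        }
      }
    }
  }'

The configuration can then be referenced by name (for example, with the create-cluster --security-configuration option) when launching clusters.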

After you create a security configuration, you can specify it when creating a cluster and apply the settings. Security configurations can also be created using the AWS CLI or SDK. For more information, see Encrypting Data with Amazon EMR. If you have any questions or would like to share an interesting use case about encryption on Amazon EMR, please leave a comment below.

 


Use Spark 2.0, Hive 2.1 on Tez, and the latest from the Hadoop ecosystem on Amazon EMR release 5.0

Post Syndicated from Jonathan Fritz original https://blogs.aws.amazon.com/bigdata/post/Tx3KG7STXIZV5QZ/Use-Spark-2-0-Hive-2-1-on-Tez-and-the-latest-from-the-Hadoop-ecosystem-on-Amazon

Jonathan Fritz is a Senior Product Manager for Amazon EMR

We are excited to launch Amazon EMR release 5.0 today, giving customers the latest versions of 16 supported open-source applications in the big data ecosystem, including new major versions of Spark and Hive.

Almost exactly a year ago, we shipped release 4.0, which brought significant improvements to EMR. We based our build and packaging system on Apache Bigtop, moved to standard ports and paths, and streamlined application configuration with configuration objects. Our initial 4.0 release consolidated our set of supported Apache big data applications to Apache Hadoop, Apache Spark, Apache Hive, Apache Pig, and Apache Mahout.

Over the subsequent months, EMR added support for additional open-source projects, unlocking various use cases such as low-latency SQL over datasets in Amazon S3 with Presto, real-time data access and SQL analytics with Apache HBase and Phoenix, collaborative analysis for data science with notebooks in Apache Zeppelin, and designing complex processing workflows with Apache Oozie.

Also, we kept versions of most major projects up-to-date with each EMR release, such as offering the latest version of Spark just a few weeks after the open source release. Each new version of a project had many performance improvements, new features, and bug fixes, and customers demanded these improvements quickly to support their big data architectures.

EMR release 5.0 is a milestone in delivering the most up-to-date, complete selection of open-source applications in the Hadoop ecosystem to our customers:

  • Upgrade to Spark 2.0 a week after the Apache release, giving customers access to improved SQL support, significant performance increases, the new Structured Streaming API, and enhanced SparkR support. We have also compiled it with Scala 2.11.
  • Upgrade from Hive 1.x to Hive 2.1, which includes a variety of performance enhancements, better Parquet file format support, and bug fixes.
  • Trade Hadoop MapReduce for Tez as the default execution engine for Hive and Pig, signaling a greater move from traditional Hadoop MapReduce to newer frameworks like Tez and Spark.
  • Add the newest versions of Hue and Zeppelin, notebook and query UIs for Hadoop ecosystem applications, enabling data scientists and business intelligence analysts to interact with data even more easily and efficiently.
  • Upgrade all former sandbox applications to fully released status on EMR.
  • Use the latest versions of all supported applications: Hadoop 2.7.2, Spark 2.0, Presto 0.150, Hive 2.1, Tez 0.8.4, Pig 0.16, HBase 1.2.2, Phoenix 4.7.0, Zeppelin 0.6.1 (Snapshot), Hue 3.10, Oozie 4.2.0, Sqoop 1.4.6, Ganglia 3.7.2, HCatalog 2.1.0, Mahout 0.12.2, and ZooKeeper 3.4.8.

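For reference, here is a minimal sketch of launching a release 5.0 cluster with several of these applications using the AWS CLI; the key pair name is a placeholder:

aws emr create-cluster \
  --name EMR-5.0-Demo \
  --release-label emr-5.0.0 \
  --applications Name=Hadoop Name=Spark Name=Hive Name=Pig Name=Tez Name=Hue Name=Zeppelin \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=MyKeyName \
  --use-default-roles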

If you have any questions about release 5.0, feedback, or would like to share an interesting use case that leverages these applications, please leave a comment below.

You can also join our live webinar, Introducing Amazon EMR Release 5.0, at 9AM PDT on Tuesday, August 23.

 

Supercharge SQL on Your Data in Apache HBase with Apache Phoenix

Post Syndicated from Jonathan Fritz original https://blogs.aws.amazon.com/bigdata/post/Tx2ZF1NDQYDJFGT/Supercharge-SQL-on-Your-Data-in-Apache-HBase-with-Apache-Phoenix

With today’s launch of Amazon EMR release 4.7, you can now create clusters with Apache Phoenix 4.7.0 for low-latency SQL and OLTP workloads. Phoenix uses Apache HBase as its backing store (HBase 1.2.1 is included on Amazon EMR release 4.7.0), using HBase scan operations and coprocessors for fast performance. Additionally, you can map Phoenix tables and views to existing HBase tables, giving you SQL access over data already stored in HBase.

Let’s run through a quick demo to explore how to connect to Phoenix using JDBC, create a view over an existing HBase table, and create a secondary index for increased read performance.

Create an Amazon EMR cluster and an HBase table

First, using the Amazon EMR console or AWS CLI, launch a new Amazon EMR cluster using release 4.7 and choose Phoenix as an application. Here’s an example AWS CLI command:

aws emr create-cluster --name PhoenixDemo --release-label emr-4.7.0 --instance-type m3.xlarge --instance-count 3 --applications Name=Phoenix --ec2-attributes KeyName=MyKeyName --use-default-roles

Selecting the Phoenix application also includes HBase and Hadoop (YARN, HDFS, and MapReduce), giving you all the components needed for a fully operational cluster.

Next, create a table in HBase to use with Phoenix. You will copy an HBase snapshot from Amazon S3 and restore it on your cluster. Go to this HBase post on the AWS Big Data Blog and follow the instructions under the “HBase shell query walkthrough” section to restore a table named customer (3,448,682 rows).

Finally, run a get request example from that post to verify that your table has been restored correctly.

Connect to Phoenix using JDBC and create a table

Once your HBase table is ready, it’s time to map a table in Phoenix to your data in HBase. You use a JDBC connection to access Phoenix, and there are two drivers included on your cluster under /usr/lib/phoenix/bin. First, the Phoenix client connects directly to HBase processes to execute queries, which requires several ports to be open in your Amazon EC2 Security Group (for ZooKeeper, HBase Master, and RegionServers on your cluster) if your client is off-cluster.

Second, the Phoenix thin client connects to the Phoenix Query Server, which runs on port 8765 on the master node of your EMR cluster. This allows you to use a local client without adjusting your Amazon EC2 Security Groups, by creating an SSH tunnel to the master node and using port forwarding for port 8765. The Phoenix Query Server is still a new component, and not all SQL clients can support the Phoenix thin client.
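
For reference, a tunnel like the one described above could be opened with a command along these lines; the key file and master public DNS name are placeholders:

# Forward local port 8765 to the Phoenix Query Server on the master node
ssh -i MyKeyName.pem -N -L 8765:localhost:8765 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com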

In this example, you will use the SQLLine client included with Phoenix on the master node to connect to the Phoenix Query Server. Return to the terminal on the master node of your cluster. If you closed your SSH tunnel after creating your HBase table, create another SSH tunnel. Connect to Phoenix using this command:

/usr/lib/phoenix/bin/sqlline-thin.py http://localhost:8765

Once the SQLLine client has connected, let’s create a SQL view over the customer table in HBase. We will create a view instead of a table, because dropping a view does not also delete the underlying data in HBase (the behavior for deleting underlying data in HBase for Phoenix tables is configurable, but is true by default). To map a pre-existing table in HBase, you use a ‘column_family’.’column_prefix’ format for each column you want to include in your Phoenix view (note that you must use quotation marks around column and table names that are lowercase). Also, identify the column that is the HBase primary key with PRIMARY KEY, and give the view the same name as the underlying HBase table. Now, create a view over the customer table:

CREATE VIEW "customer" (
pk VARCHAR PRIMARY KEY, 
"address"."state" VARCHAR,
"address"."street" VARCHAR,
"address"."city" VARCHAR,
"address"."zip" VARCHAR,
"cc"."number" VARCHAR,
"cc"."expire" VARCHAR,
"cc"."type" VARCHAR,
"contact"."phone" VARCHAR);

Use SQLLine’s !tables command to list the available Phoenix tables and confirm that your newly created view is in the list. Make sure your terminal window is wide enough to show the output before starting the SQLLine client. Otherwise, the complete output will not appear.

Speeding up queries with secondary indexes

First, run a SQL query counting the number of people with each credit card type in California:

SELECT "customer"."type" AS credit_card_type, count(*) AS num_customers FROM "customer" WHERE "customer"."state" = 'CA' GROUP BY "customer"."type";

However, because we aren’t including the primary key of the HBase table in the WHERE clause, Phoenix must scan all HBase rows to ensure that all rows with the state ‘CA’ are included. If we anticipate that our read patterns will filter by state, we can create a secondary index on that column to give Phoenix the ability to scan along that axis. For a more in-depth view of the secondary indexing feature set, see the Apache Phoenix documentation. Now create a covered secondary index on state and include the HBase primary key (the customer ID), city, expire date, and type:

CREATE INDEX my_index ON "customer" ("customer"."state") INCLUDE("PK", "customer"."city", "customer"."expire", "customer"."type");

Phoenix will use a Hadoop MapReduce job to create this index and load it into HBase in parallel as another table (this takes around 2 minutes). Now, rerun the SQL query from earlier and compare the performance. It should be at least 10x faster!

Conclusion

In this post, you learned how to connect to Phoenix using JDBC, create Phoenix views over data in HBase, create secondary indexes for faster performance, and query data. You can use Phoenix as a performant SQL interface over existing HBase tables or use Phoenix directly to populate and manage tables using HBase behind the scenes as an underlying data store. To learn more about Phoenix, see the Amazon EMR documentation or the Apache documentation.

If you have any questions about using Phoenix on Amazon EMR or would like to share interesting use cases that leverage Phoenix, please leave a comment below.

——————————-

Related

Combine NoSQL and Massively Parallel Analytics Using Apache HBase and Apache Hive on Amazon EMR

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

 

Import Zeppelin notes from GitHub or JSON in Zeppelin 0.5.6 on Amazon EMR

Post Syndicated from Jonathan Fritz original https://blogs.aws.amazon.com/bigdata/post/Tx1Y66KB4QZTVJL/Import-Zeppelin-notes-from-GitHub-or-JSON-in-Zeppelin-0-5-6-on-Amazon-EMR

Jonathan Fritz is a Senior Product Manager for Amazon EMR

Many Amazon EMR customers use Zeppelin to create interactive notebooks to run workloads with Spark using Scala, Python, and SQL. These customers have found Amazon EMR to be a great platform for running Zeppelin because of strong integration with other AWS services and the ability to quickly create a fully configured Spark environment. Many customers have already discovered Amazon S3 to be a useful way to durably store and move their notebook files between EMR clusters. 

With the latest Zeppelin release (0.5.6) included on Amazon EMR release 4.4.0, you can now import notes using links to JSON files in S3, raw file URLs in GitHub, or local files, and you can also download a note as a JSON file. This new functionality makes it easier to save and share Zeppelin notes, and it allows you to version your notes during development. The import feature is located on the Zeppelin home screen, and the export feature is located on the toolbar for each note. Additionally, you can still configure Zeppelin to store its entire notebook file in S3 by adding the following configuration for zeppelin-env when creating your cluster (just make sure you have already created the S3 bucket before creating your cluster):

[
  {
    "Classification": "zeppelin-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "my-zeppelin-bucket-name",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": [
        ]
      }
    ]
  }
]
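
To apply that configuration at cluster creation, one approach is to save the JSON to a file and pass it with the --configurations option; a brief sketch follows, with the file name and cluster parameters as placeholders:

# Launch the cluster with the zeppelin-env configuration saved as a local JSON file (name is a placeholder)
aws emr create-cluster \
  --name ZeppelinCluster \
  --release-label emr-4.4.0 \
  --applications Name=Zeppelin Name=Spark \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=MyKeyName \
  --use-default-roles \
  --configurations file://./zeppelin-env-config.json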

On the import note screen, you can specify the URL for a JSON file in S3 or a raw file in GitHub.


Videos now available for AWS re:Invent 2015 Big Data Analytics sessions

Post Syndicated from Jonathan Fritz original https://blogs.aws.amazon.com/bigdata/post/Tx3D3UYOXB9XG6Z/Videos-now-available-for-AWS-re-Invent-2015-Big-Data-Analytics-sessions

For those of you who were able to attend AWS re:Invent 2015 last week or watched sessions through our live stream, thanks for participating in the conference. We hope you left feeling inspired to tackle your big data projects with tools in the AWS ecosystem and partner solutions. Also, we were excited for our customers to take the stage to discuss their data processing architectures and use cases.

If you missed a session in your schedule, don’t fret! We have added a large portion of re:Invent content to YouTube, and you can find videos of the big data sessions below.

Deep Dive Customer Use Cases

BDT303 – Running Spark and Presto on the Netflix Big Data Platform
BDT306 – The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS
BDT307 – Zero Infrastructure, Real-Time Data Collection, and Analytics (with Zillow)
BDT312 – Application Monitoring in a Post-Server World: Why Data Context Is Critical (with New Relic)
BDT318 – Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second 
BDT404 – Building and Managing Large-Scale ETL Data Flows with AWS Data Pipeline and Dataduct (with Coursera)
DAT308 – How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift
DAT311 – Large-Scale Genomic Analysis with Amazon Redshift (with Human Longevity Bioinformatics)
BDT323 – Amazon EBS and Cassandra: 1 Million Writes Per Second on 60 Nodes (with CrowdStrike)
BDT322 – How Redfin and Twitter Leverage Amazon S3 to Build Their Big Data Platforms
MBL314 – Building World-Class, Cloud-Connected Products: How Sonos Leverages Amazon Kinesis

Services Sessions (Amazon EMR, Amazon Kinesis, Amazon Redshift, AWS Data Pipeline and Amazon DynamoDB)

BDT208 – A Technical Introduction to Amazon Elastic MapReduce (with AOL)
BDT209 – Amazon Elasticsearch Service for Real-time Analytics
BDT305 – Amazon EMR Deep Dive and Best Practices (with FINRA)
BDT313 – Amazon DynamoDB for Big Data
BDT319 – Amazon QuickSight: Cloud-native Business Intelligence
BDT320 – Streaming Data Flows with Amazon Kinesis Firehose
BDT401 – Amazon Redshift Deep Dive: Tuning and Best Practices (with TripAdvisor)
BDT316 – Offloading ETL to Amazon Elastic MapReduce (with Amgen)
BDT403 – Best Practices for Building Realtime Streaming Applications with Amazon Kinesis (with AdRoll)
BDT206 – How to Accelerate Your Projects with AWS Marketplace  (with Boeing)
DAT201 – Introduction to Amazon Redshift (with RetailMeNot)

Architecture and Best Practices

BDT205 – Your First Big Data Application On AWS
BDT317 – Building a Data Lake on AWS 
BDT310 – Big Data Architectural Patterns and Best Practices on AWS
BDT402 – Delivering Business Agility Using AWS (with Wipro)
BDT309 – Data Science & Best Practices for Apache Spark on Amazon EMR
DAT204 – NoSQL? No Worries: Building Scalable Applications on AWS NoSQL Services (with Expedia and Mapbox)
ISM303 – Migrating Your Enterprise Data Warehouse to Amazon Redshift (with Boingo Wireless and Edmunds)

Machine Learning

BDT311 – Deep Learning: Going Beyond Machine Learning (with Day1 Solutions)
BDT302 – Real-World Smart Applications With Amazon Machine Learning
BDT207 – Real-Time Analytics In Service of Self-Healing Ecosystems (with Netflix)
