Tag Archives: artificial intelligence

Rock, paper, scissors, lizard, Spock, fire, water balloon!

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/rock-paper-scissors-lizard-spock-fire-water-balloon/

Use a Raspberry Pi and a Pi Camera Module to build your own machine learning–powered rock paper scissors game!

Rock-Paper-Scissors game using computer vision and machine learning on Raspberry Pi

A Rock-Paper-Scissors game using computer vision and machine learning on the Raspberry Pi. Project GitHub page: https://github.com/DrGFreeman/rps-cv PROJECT ORIGIN: This project results from a challenge my son gave me when I was teaching him the basics of computer programming, making a simple text-based Rock-Paper-Scissors game in Python.

Virtual rock paper scissors

Here’s why you should always leave comments on our blog: this project from Julien de la Bruère-Terreault instantly had our attention when he shared it on our recent Android Things post.

Julien and his son were building a text-based version of rock paper scissors in Python when his son asked him: “Could you make a rock paper scissors game that uses the camera to detect hand gestures?” Obviously, Julien really had no choice but to accept the challenge.

“The game uses a Raspberry Pi computer and Raspberry Pi Camera Module installed on a 3D-printed support with LED strips to achieve consistent images,” Julien explains in the tutorial for the build. “The pictures taken by the camera are processed and fed to an image classifier that determines whether the gesture corresponds to ‘Rock’, ‘Paper’, or ‘Scissors’ gestures.”

How does it work?

Physically, the build uses a Pi 3 Model B and a Camera Module V2 alongside 3D-printed parts. The parts are all green, since a consistent colour allows easy subtraction of background from the captured images. You can download the files for the setup from Thingiverse.
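
To give a flavour of the idea (this isn’t Julien’s actual code), a uniformly green background can be removed with a simple colour threshold in OpenCV. The HSV range below is an assumption you would tune for your own lighting:

import cv2

# Illustrative sketch only: mask out a green background so only the hand remains.
img = cv2.imread("capture.jpg")                            # a frame captured by the Pi Camera
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
green = cv2.inRange(hsv, (40, 60, 60), (80, 255, 255))     # rough bounds for "green" pixels
hand = cv2.bitwise_and(img, img, mask=cv2.bitwise_not(green))
cv2.imwrite("hand_only.jpg", hand)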

rock paper scissors raspberry pi

To illustrate how the software works, Julien has created a rather delightful pipeline demonstrating where computer vision and machine learning come in.

rock paper scissors using raspberry pi

The way the software works means the game doesn’t need to be limited to the standard three hand signs. If you wanted to, you could add other signs such as ‘lizard’ and ‘Spock’! Or ‘fire’ and ‘water balloon’. Or any other variation on the game from your pop culture favourites.

rock paper scissors lizard spock

Check out Julien’s full tutorial on his GitHub to build your own AI-powered rock paper scissors game. Massive kudos to Julien for spending a year learning the skills required to make it happen. And a massive thank you to Julien’s son for inspiring him! This is why it’s great to do coding and digital making with kids — they have the best project ideas!

Sharing is caring

If you’ve built your own project using Raspberry Pi, please share it with us in the comments below, or via social media. As you can tell from today’s blog post, we love to see them and share them with the whole community!

The post Rock, paper, scissors, lizard, Spock, fire, water balloon! appeared first on Raspberry Pi.

Amazon SageMaker Adds Batch Transform Feature and Pipe Input Mode for TensorFlow Containers

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/sagemaker-nysummit2018/

At the New York Summit a few days ago we launched two new Amazon SageMaker features: a new batch inference feature called Batch Transform, which allows customers to make predictions in non-real-time scenarios across petabytes of data, and Pipe Input Mode support for TensorFlow containers. SageMaker remains one of my favorite services and we’ve covered it extensively on this blog and the machine learning blog. In fact, the rapid pace of innovation from the SageMaker team is a bit hard to keep up with. Since our last post on SageMaker’s Automatic Model Tuning with Hyperparameter Optimization, the team has launched four new built-in algorithms and tons of new features. Let’s take a look at the new Batch Transform feature.

Batch Transform

The Batch Transform feature is a high-performance and high-throughput method for transforming data and generating inferences. It’s ideal for scenarios where you’re dealing with large batches of data, don’t need sub-second latency, or need to both preprocess and transform the training data. The best part? You don’t have to write a single additional line of code to make use of this feature. You can take all of your existing models and start batch transform jobs based on them. This feature is available at no additional charge and you pay only for the underlying resources.

Let’s take a look at how we would do this for the built-in Object Detection algorithm. I followed the example notebook to train my object detection model. Now I’ll go to the SageMaker console and open the Batch Transform sub-console.

From there I can start a new batch transform job.

Here I can name my transform job, select which of my models I want to use, and choose the number and type of instances to use. Additionally, I can configure how many records to send to the model concurrently and the maximum size of each payload. If I don’t specify these manually, SageMaker will select some sensible defaults.

Next I need to specify my input location. I can either use a manifest file or just load all the files in an S3 location. Since I’m dealing with images here I’ve manually specified my input content-type.

Finally, I’ll configure my output location and start the job!

Once the job is running, I can open the job detail page and follow the links to the metrics and the logs in Amazon CloudWatch.

I can see the job is running and if I look at my results in S3 I can see the predicted labels for each image.

The transform generated one output JSON file per input file containing the detected objects.

From here it would be easy to create a table for the bucket in AWS Glue and either query the results with Amazon Athena or visualize them with Amazon QuickSight.

Of course it’s also possible to start these jobs programmatically from the SageMaker API.
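
As a rough sketch of what that looks like (the model name, S3 paths, and instance type below are placeholders, not values from this walkthrough), a batch transform job can be started with a single boto3 call:

import boto3

sm = boto3.client("sagemaker")

# Hypothetical names and S3 locations, for illustration only.
sm.create_transform_job(
    TransformJobName="object-detection-batch-demo",
    ModelName="my-object-detection-model",              # an existing SageMaker model
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/batch-input/"
            }
        },
        "ContentType": "image/jpeg"                      # matches the manually specified content-type
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
    TransformResources={"InstanceType": "ml.p2.xlarge", "InstanceCount": 1}
)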

You can find a lot more detail on how to use batch transforms in your own containers in the documentation.

Pipe Input Mode for TensorFlow

Pipe input mode allows customers to stream their training dataset directly from Amazon Simple Storage Service (S3) into Amazon SageMaker using a highly optimized multi-threaded background process. This mode offers significantly better read throughput than the File input mode that must first download the data to the local Amazon Elastic Block Store (EBS) volume. This means your training jobs start sooner, finish faster, and use less disk space, lowering the costs associated with training your models. It has the added benefit of letting you train on datasets beyond the 16 TB EBS volume size limit.

Earlier this year, we ran some experiments with Pipe Input Mode and found that startup times were reduced up to 87% on a 78 GB dataset, with throughput twice as fast in some benchmarks, ultimately resulting in up to a 35% reduction in total training time.

By adding support for Pipe Input Mode to TensorFlow we’re making it easier for customers to take advantage of the same increased speed available to the built-in algorithms. Let’s look at how this works in practice.

First, I need to make sure I have the sagemaker-tensorflow-extensions available for my training job. This gives us the new PipeModeDataset class which takes a channel and a record format as inputs and returns a TensorFlow dataset. We can use this in our input_fn for the TensorFlow estimator and read from the channel. The code sample below shows a simple example.

import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

def input_fn(channel):
    # Simple example data - a labeled vector.
    features = {
        'data': tf.FixedLenFeature([], tf.string),
        'labels': tf.FixedLenFeature([], tf.int64),
    }
    
    # A function to parse record bytes to a labeled vector record
    def parse(record):
        parsed = tf.parse_single_example(record, features)
        return ({
            'data': tf.decode_raw(parsed['data'], tf.float64)
        }, parsed['labels'])

    # Construct a PipeModeDataset reading from a 'training' channel, using
    # the TF Record encoding.
    ds = PipeModeDataset(channel=channel, record_format='TFRecord')

    # The PipeModeDataset is a TensorFlow Dataset and provides standard Dataset methods
    ds = ds.repeat(20)
    ds = ds.prefetch(10)
    ds = ds.map(parse, num_parallel_calls=10)
    ds = ds.batch(64)
    
    return ds

Then you can define your model the same way you would for a normal TensorFlow estimator. At estimator creation time, you just need to pass in input_mode='Pipe' as one of the parameters, as shown in the sketch below.
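
Here’s a minimal sketch of that estimator creation; the script name, role, and instance settings are placeholders, and only input_mode='Pipe' is the important part:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='pipe_mode_train.py',    # a script whose input_fn uses PipeModeDataset
                       role='MySageMakerExecutionRole',
                       train_instance_count=1,
                       train_instance_type='ml.p3.2xlarge',
                       input_mode='Pipe')                    # stream from S3 rather than downloading to EBS
estimator.fit({'training': 's3://my-bucket/train-data/'})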

Available Now

Both of these new features are available now at no additional charge, and I’m looking forward to seeing what customers can build with the batch transform feature. I can already tell you that it will help us with some of our internal ML workloads here in AWS Marketing.

As always, let us know what you think of this feature in the comments or on Twitter!

Randall

Amazon Translate Adds Support for Japanese, Russian, Italian, Traditional Chinese, Turkish, and Czech

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-translate-adds-support-for-new-languages/

Today I’m excited to announce that Amazon Translate has added support for Japanese, Russian, Italian, Traditional Chinese, Turkish, and Czech! Amazon Translate is a translation API that delivers fast, high-quality, and affordable language translation. Translate was originally released in preview at AWS re:Invent in 2017 and my colleague, Tara, wrote about the service in depth.

Since the initial preview, we’ve continued to iterate on customer feedback by adding features like automatic source language inference via Amazon Comprehend, Amazon CloudWatch metrics, and support for larger pieces of text in each TranslateText call. In April, we made the service generally available and have continued to collect feature requests and feedback from customers.

Working with Amazon Translate

You can use the API explorer in the Amazon Translate Console to try out the new languages immediately.

You can also use any of the SDKs. I wrote a quick Python sample below.

import boto3
translate = boto3.client("translate")
lang_flag_pairs = [("ja", "🇯🇵"), ("ru", "🇷🇺"),
                   ("it", "🇮🇹"), ("zh-TW", "🇹🇼"),
                   ("tr", "🇹🇷"), ("cs", "🇨🇿")]
for lang, flag in lang_flag_pairs:
    print(flag)
    print(translate.translate_text(
        Text="Hello, World!",
        SourceLanguageCode="en",
        TargetLanguageCode=lang
    )['TranslatedText'])

Which gave me a cool result:

🇯🇵
ハローワールド!
🇷🇺
Привет, Мир!
🇮🇹
Ciao, Mondo!
🇹🇼
你好,世界杯!
🇹🇷
Merhaba, Dünya!
🇨🇿
Ahoj, světe!

You can check out the examples page in the documentation for more interesting ways to use Amazon Translate.

I had the chance to chat with a few of our customers about how they’re using Amazon Translate, and found that it’s enabling a lot of innovative use cases: real-time chat translation in games, capturing and translating tweets for analytics, and, as you can see on this blog, translating blog posts and creating spoken versions of them with Amazon Polly.

Additional Info

Amazon Translate is available in US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), and AWS GovCloud (US). It has a useful monthly free tier of 2 million characters for the first 12 months, and costs $15 per million characters after that.

The team wanted me to let you know that they’re hard at work on some additional functionality and that they’re hoping, later this year, to add support for Danish, Dutch, Finnish, Hebrew, Polish, and Swedish.

As always, let us know if you have any feedback on this service in the comments or on Twitter!

Randall

Amazon Kinesis Video Streams Adds Support For HLS Output Streams

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-kinesis-video-streams-adds-support-for-hls-output-streams/

Today I’m excited to announce and demonstrate the new HTTP Live Streaming (HLS) output feature for Amazon Kinesis Video Streams (KVS). If you’re not already familiar with KVS, Jeff covered the release at AWS re:Invent in 2017. In short, Amazon Kinesis Video Streams is a service for securely capturing, processing, and storing video for analytics and machine learning – from one device or millions. Customers are using Kinesis Video with machine learning algorithms to power everything from home automation and smart cities to industrial automation and security.

After iterating on customer feedback, we’ve launched a number of features in the past few months, including a plugin for GStreamer, the popular open source multimedia framework, and Docker containers that make it easy to start streaming video to Kinesis. We could talk about each of those features at length, but today is all about the new HLS output feature! Fair warning: there are a few pictures of my incredibly messy office in this post.

HLS output is a new feature that allows customers to create HLS endpoints for their Kinesis Video Streams, which is convenient for building custom UIs and tools that can play back live and on-demand video. The HLS-based playback capability is fully managed, so you don’t have to build any infrastructure to transmux the incoming media. You simply create a new streaming session (up to 5, for now) with the new GetHLSStreamingSessionURL API and you’re off to the races. The great thing about HLS is that it’s already an industry standard and really easy to use in existing web players like JW Player, hls.js, Video.js, Google’s Shaka Player, or even natively in mobile apps with Android’s ExoPlayer and iOS’s AVFoundation. Let’s take a quick look at the API; feel free to skip to the walk-through below as well.

Kinesis Video HLS Output API

The documentation covers this in more detail than we can go over in this blog post, but I’ll cover the broad components.

  1. Get an endpoint with the GetDataEndpoint API
  2. Use that endpoint to get an HLS streaming URL with the GetHLSStreamingSessionURL API
  3. Render the content in the HLS URL with whatever tools you want!

This is pretty easy in a Jupyter notebook with a quick bit of Python and boto3.

import boto3
STREAM_NAME = "RandallDeepLens"
kvs = boto3.client("kinesisvideo")
# Grab the endpoint from GetDataEndpoint
endpoint = kvs.get_data_endpoint(
    APIName="GET_HLS_STREAMING_SESSION_URL",
    StreamName=STREAM_NAME
)['DataEndpoint']
# Grab the HLS Stream URL from the endpoint
kvam = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)
url = kvam.get_hls_streaming_session_url(
    StreamName=STREAM_NAME,
    PlaybackMode="LIVE"
)['HLSStreamingSessionURL']

You can even visualize everything right away in Safari which can render HLS streams natively.

from IPython.display import HTML
HTML(data='<video src="{0}" autoplay="autoplay" controls="controls" width="300" height="400"></video>'.format(url)) 

We can also stream directly from an AWS DeepLens with just a bit of code:

import DeepLens_Kinesis_Video as dkv
import time
aws_access_key = "super_fake"
aws_secret_key = "even_more_fake"
region = "us-east-1"
stream_name ="RandallDeepLens"
retention = 1  # retention period, in minutes
wait_time_sec = 60 * 300  # the number of seconds to stream the data
# Creates the stream if it does not already exist
producer = dkv.createProducer(aws_access_key, aws_secret_key, "", region)
my_stream = producer.createStream(stream_name, retention)
my_stream.start()
time.sleep(wait_time_sec)
my_stream.stop()

How to use Kinesis Video Streams HLS Output Streams

We definitely need a Kinesis Video Stream, which we can create easily in the Kinesis Video Streams Console.

Now, we need to get some content into the stream. We have a few options here. Perhaps the easiest is the Docker container. I decided to take the more adventurous route and compile the GStreamer plugin locally on my Mac, following the scripts on GitHub. Be warned, compiling this plugin takes a while and can cause your computer to transform into a space heater.

With our freshly compiled GStreamer binaries, like gst-launch-1.0 and the kvssink plugin, we can stream directly from my MacBook’s webcam, or any other GStreamer source, into Kinesis Video Streams. I just use the kvssink output plugin and my data will wind up in the video stream. There are a few parameters to configure around this, so pay attention.

Here’s an example command that I ran to stream my MacBook’s webcam to Kinesis Video Streams:

gst-launch-1.0 autovideosrc ! videoconvert \
! video/x-raw,format=I420,width=640,height=480,framerate=30/1 \
! vtenc_h264_hw allow-frame-reordering=FALSE realtime=TRUE max-keyframe-interval=45 bitrate=500 \
! h264parse \
! video/x-h264,stream-format=avc,alignment=au,width=640,height=480,framerate=30/1 \
! kvssink stream-name="BlogStream" storage-size=1024 aws-region=us-west-2 log-config=kvslog

Now that we’re streaming some data into Kinesis, I can use the getting started sample static website to test my HLS stream with a few different video players. I just fill in my AWS credentials and ask it to start playing. The GetHLSStreamingSessionURL API supports a number of parameters so you can play both on-demand segments and live streams from various timestamps.
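
As a hedged sketch (parameter values here are illustrative, not taken from my session above), an on-demand session for the last ten minutes of video might be requested like this:

import boto3
from datetime import datetime, timedelta

kvs = boto3.client("kinesisvideo")
endpoint = kvs.get_data_endpoint(
    APIName="GET_HLS_STREAMING_SESSION_URL",
    StreamName="BlogStream"
)['DataEndpoint']

kvam = boto3.client("kinesis-video-archived-media", endpoint_url=endpoint)
url = kvam.get_hls_streaming_session_url(
    StreamName="BlogStream",
    PlaybackMode="ON_DEMAND",
    HLSFragmentSelector={
        "FragmentSelectorType": "SERVER_TIMESTAMP",
        "TimestampRange": {
            "StartTimestamp": datetime.utcnow() - timedelta(minutes=10),
            "EndTimestamp": datetime.utcnow()
        }
    },
    Expires=3600  # keep the session URL valid for an hour
)['HLSStreamingSessionURL']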

Additional Info

Data Consumed from Kinesis Video Streams using HLS is charged $0.0119 per GB in US East (N. Virginia) and US West (Oregon) and pricing for other regions is available on the service pricing page. This feature is available now, in all regions where Kinesis Video Streams is available.

The Kinesis Video team told me they’re working hard on getting more integration with the AWS Media services, like MediaLive, which will make it easier to serve Kinesis Video Stream content to larger audiences.

As always, let us know what you think on Twitter or in the comments. I’ve had a ton of fun playing around with this feature over the past few days and I’m excited to see customers build some new tools with it!

Randall

Amazon SageMaker Automatic Model Tuning: Using Machine Learning for Machine Learning

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/sagemaker-automatic-model-tuning/

Today I’m excited to announce the general availability of Amazon SageMaker Automatic Model Tuning. Automatic Model Tuning eliminates the undifferentiated heavy lifting required to search the hyperparameter space for more accurate models. This feature allows developers and data scientists to save significant time and effort in training and tuning their machine learning models. A Hyperparameter Tuning job launches multiple training jobs, with different hyperparameter combinations, based on the results of completed training jobs. SageMaker trains a “meta” machine learning model, based on Bayesian Optimization, to infer hyperparameter combinations for our training jobs. Let’s dive a little deeper.

Model Tuning in the Machine Learning Process

A developer’s typical machine learning process comprises 4 steps: exploratory data analysis (EDA), model design, model training, and model evaluation. SageMaker already makes each of those steps easy with access to powerful Jupyter notebook instances, built-in algorithms, and model training within the service. Focusing on the training portion of the process, we typically work with data and feed it into a model where we evaluate the model’s prediction against our expected result. We keep a portion of our overall input data, the evaluation data, away from the training data used to train the model. We can use the evaluation data to examine the behavior of our model on data it has never seen. In many cases after we’ve chosen an algorithm or built a custom model, we will need to search the space of possible hyperparameter configurations of that algorithm for the best results for our input data.

Hyperparameters control how our underlying algorithms operate and influence the performance of the model. They can be things like: the number of epochs to train for, the number of layers in the network, the learning rate, the optimization algorithms, and more. Typically, you start with random values, or common values for other problems, and iterate through adjustments as you begin to see what effect the changes have. In the past this was a painstakingly manual process. However, thanks to the work of some very talented researchers we can use SageMaker to eliminate almost all of the manual overhead. A user only needs to select the hyperparameters to tune, a range for each parameter to explore, and the total number of training jobs to budget. Let’s see how this works in practice.

Hyperparameter Tuning

To demonstrate this feature we’ll work with the standard MNIST dataset, the Apache MXNet framework, and the SageMaker Python SDK. Everything you see below is available in the SageMaker example notebooks.

First, I’ll create a traditional MXNet estimator using the SageMaker Python SDK on a Notebook Instance:


import boto3
import sagemaker
from sagemaker.mxnet import MXNet
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
train_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/train'.format(region)
test_data_location = 's3://sagemaker-sample-data-{}/mxnet/mnist/test'.format(region)
estimator = MXNet(entry_point='mnist.py',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='ml.m4.xlarge',
                  sagemaker_session=sagemaker.Session(),
                  base_job_name='HPO-mxnet',
                  hyperparameters={'batch_size': 100})

This is probably quite similar to what you’ve seen in other SageMaker examples.

Now, we can import some tools for Automatic Model Tuning and create our hyperparameter ranges.


from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter, ContinuousParameter
hyperparameter_ranges = {'optimizer': CategoricalParameter(['sgd', 'Adam']),
                         'learning_rate': ContinuousParameter(0.01, 0.2),
                         'num_epoch': IntegerParameter(10, 50)}

The tuning job will select parameters from these ranges and use those to determine the best place to focus training efforts. There are a few types of parameters:

  • Categorical parameters use one value from a discrete set.
  • Continuous parameters can use any real number value between the minimum and maximum value.
  • Integer parameters can use any integer within the bounds specified.

Now that we have our ranges defined we want to define our success metric and a regular expression for finding that metric in the training job logs.


objective_metric_name = 'Validation-accuracy'
metric_definitions = [{'Name': 'Validation-accuracy',
                       'Regex': 'Validation-accuracy=([0-9\\.]+)'}]

Now, with just these few things defined we can start our tuning job!


tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=9,
                            max_parallel_jobs=3)
tuner.fit({'train': train_data_location, 'test': test_data_location})

Now, we can open up the SageMaker console, select the Hyperparameter tuning jobs sub-console and check out all our tuning jobs.

We can click on the job we just created to get some more detail and explore the results of the tuning.

By default the console will show us the best job and the parameters used but we can also check out each of the other jobs.

Hopping back over to our notebook instance, we have a handy analytics object from tuner.analytics() that we can use to visualize the results of the training with some bokeh plots. Some examples of this are provided in the SageMaker example notebooks.
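
For example, a quick way to peek at the results (a minimal sketch; it assumes the tuning job above has finished and that pandas is available on the notebook instance):

# One row per training job, with its hyperparameters and final objective value.
tuner_metrics = tuner.analytics()
df = tuner_metrics.dataframe()
print(df.sort_values('FinalObjectiveValue', ascending=False).head())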

This feature works for built-in algorithms, jobs created with the SageMaker Python SDK, or even bring-your-own training jobs in Docker containers.

We can even create tuning jobs right in the console by clicking Create hyperparameter tuning job.

First we select a name for our job, an IAM role and which VPC it should run in, if any.

Next, we configure the training job. We can use built-in algorithms or a custom Docker image. If we’re using a custom image, this is where we would define the regex to find the objective metric in the logs. For now we’ll just select XGBoost and click next.

Now we’ll configure our tuning job parameters just like in the notebook example. I’ll select the area under the curve (AUC) as the objective metric to optimize. Since this is a built-in algorithm, the regex for that metric was already filled in by the previous step. I’ll set the minimum and maximum number of rounds and click next.

In the next screen we can configure the input channels that our algorithm is expecting as well as the location to output the models. We’d typically have more than just the “train” channel and would have an “eval” channel as well.

Finally, we can configure the resource limits for this tuning job.

Now we’re off to the races tuning!

Additional Resources

To take advantage of automatic model tuning there are really only a few things users have to define: the hyperparameter ranges, the success metric and a regex to find it, the number of jobs to run in parallel, and the maximum number of jobs to run. For the built-in algorithms we don’t even need to define the regex. There’s a small trade-off between the number of parallel jobs used and the accuracy of the final model. Increasing max_parallel_jobs will cause the tuning job to finish much faster but a lower parallelism will generally provide a slightly better final result.

Amazon SageMaker Automatic Model Tuning is provided at no additional charge; you pay only for the underlying resources used by the training jobs that the tuning job launches. This feature is available now in all regions where SageMaker is available. It is accessible through the API, and training jobs launched by automatic model tuning are visible in the console. You can find out more by reading the documentation.

I really think this feature will save developers a lot of time and effort and I’m excited to see what customers do with it. As always, we welcome your feedback in the comments or on Twitter!

Randall

Some quick thoughts on the public discussion regarding facial recognition and Amazon Rekognition this past week

Post Syndicated from Dr. Matt Wood original https://aws.amazon.com/blogs/aws/some-quick-thoughts-on-the-public-discussion-regarding-facial-recognition-and-amazon-rekognition-this-past-week/

We have seen a lot of discussion this past week about the role of Amazon Rekognition in facial recognition, surveillance, and civil liberties, and we wanted to share some thoughts.

Amazon Rekognition is a service we announced in 2016. It makes use of new technologies – such as deep learning – and puts them in the hands of developers in an easy-to-use, low-cost way. Since then, we have seen customers use the image and video analysis capabilities of Amazon Rekognition in ways that materially benefit both society (e.g. preventing human trafficking, inhibiting child exploitation, reuniting missing children with their families, and building educational apps for children), and organizations (enhancing security through multi-factor authentication, finding images more easily, or preventing package theft). Amazon Web Services (AWS) is not the only provider of services like these, and we remain excited about how image and video analysis can be a driver for good in the world, including in the public sector and law enforcement.

There have always been and will always be risks with new technology capabilities. Each organization choosing to employ technology must act responsibly or risk legal penalties and public condemnation. AWS takes its responsibilities seriously. But we believe it is the wrong approach to impose a ban on promising new technologies because they might be used by bad actors for nefarious purposes in the future. The world would be a very different place if we had restricted people from buying computers because it was possible to use that computer to do harm. The same can be said of thousands of technologies upon which we all rely each day. Through responsible use, the benefits have far outweighed the risks.

Customers are off to a great start with Amazon Rekognition; the evidence of the positive impact this new technology can provide is strong (and growing by the week), and we’re excited to continue to support our customers in its responsible use.

-Dr. Matt Wood, general manager of artificial intelligence at AWS

Sending Inaudible Commands to Voice Assistants

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/sending_inaudib.html

Researchers have demonstrated the ability to send inaudible commands to voice assistants like Alexa, Siri, and Google Assistant.

Over the last two years, researchers in China and the United States have begun demonstrating that they can send hidden commands that are undetectable to the human ear to Apple’s Siri, Amazon’s Alexa and Google’s Assistant. Inside university labs, the researchers have been able to secretly activate the artificial intelligence systems on smartphones and smart speakers, making them dial phone numbers or open websites. In the wrong hands, the technology could be used to unlock doors, wire money or buy stuff online – simply with music playing over the radio.

A group of students from University of California, Berkeley, and Georgetown University showed in 2016 that they could hide commands in white noise played over loudspeakers and through YouTube videos to get smart devices to turn on airplane mode or open a website.

This month, some of those Berkeley researchers published a research paper that went further, saying they could embed commands directly into recordings of music or spoken text. So while a human listener hears someone talking or an orchestra playing, Amazon’s Echo speaker might hear an instruction to add something to your shopping list.

Introducing the AWS Machine Learning Competency for Consulting Partners

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/introducing-the-aws-machine-learning-competency-for-consulting-partners/

Today I’m excited to announce a new Machine Learning Competency for Consulting Partners in the Amazon Partner Network (APN). This AWS Competency program allows APN Consulting Partners to demonstrate deep expertise in machine learning on AWS by providing solutions that enable machine learning and data science workflows for their customers. This new AWS Competency is in addition to the Machine Learning Competency for our APN Technology Partners, which we launched at the re:Invent 2017 partner summit.

These APN Consulting Partners help organizations solve their machine learning and data challenges through:

  • Data services that help data scientists and machine learning practitioners prepare their enterprise data for training.
  • Platform solutions that provide data scientists and machine learning practitioners with tools to take their data, train models, and make predictions on new data.
  • SaaS and API solutions that enable predictive capabilities within customer applications.

Why work with an AWS Machine Learning Competency Partner?

The AWS Competency Program helps customers find the most qualified partners with deep expertise. AWS Machine Learning Competency Partners undergo a strict validation of their capabilities to demonstrate technical proficiency and proven customer success with AWS machine learning tools.

If you’re an AWS customer interested in machine learning workloads on AWS, check out our AWS Machine Learning launch partners below:


Interested in becoming an AWS Machine Learning Competency Partner?

APN Partners with experience in Machine Learning can learn more about becoming an AWS Machine Learning Competency Partner here. To learn more about the benefits of joining the AWS Partner Network, see our APN Partner website.

Thanks to the AWS Partner Team for their help with this post!
Randall

Own your own working Pokémon Pokédex!

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/deep-learning-pokedex/

Squeal with delight as your inner Pokémon trainer witnesses the wonder of Adrian Rosebrock’s deep learning Pokédex.

Creating a real-life Pokedex with a Raspberry Pi, Python, and Deep Learning

This video demos a real-life Pokedex, complete with visual recognition, that I created using a Raspberry Pi, Python, and Deep Learning. You can find the entire blog post, including code, using this link: https://www.pyimagesearch.com/2018/04/30/a-fun-hands-on-deep-learning-project-for-beginners-students-and-hobbyists/ Music credit to YouTube user “No Copyright” for providing royalty free music: https://www.youtube.com/watch?v=PXpjqURczn8

The history of Pokémon in 30 seconds

The Pokémon franchise was created by video game designer Satoshi Tajiri in 1995. In the fictional world of Pokémon, Pokémon Trainers explore the vast landscape, catching and training small creatures called Pokémon. To date, there are 802 different types of Pokémon. They range from the ever recognisable Pikachu, a bright yellow electric Pokémon, to the highly sought-after Shiny Charizard, a metallic, playing-card-shaped Pokémon that your mate Alex claims she has in mint condition, but refuses to show you.

Pokemon GIF

In the world of Pokémon, children as young as ten-year-old protagonist and all-round annoyance Ash Ketchum are allowed to leave home and wander the wilderness. There, they hunt vicious, deadly creatures in the hope of becoming a Pokémon Master.

Adrian’s deep learning Pokédex

Adrian is a bit of a deep learning pro, as demonstrated by his Santa/Not Santa detector, which we wrote about last year. For that project, he also provided a great explanation of what deep learning actually is. In a nutshell:

…a subfield of machine learning, which is, in turn, a subfield of artificial intelligence (AI). While AI embodies a large, diverse set of techniques and algorithms related to automatic reasoning (inference, planning, heuristics, etc), the machine learning subfields are specifically interested in pattern recognition and learning from data.

As with his earlier Raspberry Pi project, Adrian uses the Keras deep learning model and the TensorFlow backend, plus a few other packages such as Adrian’s own imutils functions and OpenCV.

Adrian trained a Convolutional Neural Network using Keras on a dataset of 1191 Pokémon images, obtaining 96.84% accuracy. As Adrian explains, this model is able to identify Pokémon via still image and video. It’s perfect for creating a Pokédex – an interactive Pokémon catalogue that should, according to the franchise, be able to identify and read out information on any known Pokémon when captured by camera. More information on model training can be found on Adrian’s blog.
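
Adrian’s actual network and training pipeline are described on his blog. Purely to illustrate the shape of such an image classifier (this is not Adrian’s code, and the input size and class count are assumptions), a minimal Keras CNN might look like this:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 5  # placeholder: the number of Pokémon classes in your dataset

# A tiny CNN over 96x96 RGB images with one-hot encoded labels.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=100, batch_size=32)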

Adrian Rosebeck deep learning pokemon pokedex

For the physical build, a Raspberry Pi 3 with camera module is paired with the Raspberry Pi 7″ touch display to create a portable Pokédex. And while Adrian comments that the same result can be achieved using your home computer and a webcam, that’s not how Adrian rolls as a Raspberry Pi fan.

Adrian Rosebeck deep learning pokemon pokedex

Plus, the smaller size of the Pi is perfect for one of you to incorporate this deep learning model into a 3D-printed Pokédex for ultimate Pokémon glory, pretty please, thank you.

Adrian Rosebeck deep learning pokemon pokedex

Adrian has gone into impressive detail about how the project works and how you can create your own on his blog, pyimagesearch. So if you’re interested in learning more about deep learning, and making your own Pokédex, be sure to visit.

The post Own your own working Pokémon Pokédex! appeared first on Raspberry Pi.

Implement continuous integration and delivery of serverless AWS Glue ETL applications using AWS Developer Tools

Post Syndicated from Prasad Alle original https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-serverless-aws-glue-etl-applications-using-aws-developer-tools/

AWS Glue is an increasingly popular way to develop serverless ETL (extract, transform, and load) applications for big data and data lake workloads. Organizations that transform their ETL applications to cloud-based, serverless ETL architectures need a seamless, end-to-end continuous integration and continuous delivery (CI/CD) pipeline: from source code, to build, to deployment, to product delivery. Having a good CI/CD pipeline can help your organization discover bugs before they reach production and deliver updates more frequently. It can also help developers write quality code and automate the ETL job release management process, mitigate risk, and more.

AWS Glue is a fully managed data catalog and ETL service. It simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.

When you are developing ETL applications using AWS Glue, you might come across some of the following CI/CD challenges:

  • Iterative development with unit tests
  • Continuous integration and build
  • Pushing the ETL pipeline to a test environment
  • Pushing the ETL pipeline to a production environment
  • Testing ETL applications using real data (live test)
  • Exploring and validating data

In this post, I walk you through a solution that implements a CI/CD pipeline for serverless AWS Glue ETL applications supported by AWS Developer Tools (including AWS CodePipeline, AWS CodeCommit, and AWS CodeBuild) and AWS CloudFormation.

Solution overview

The following diagram shows the pipeline workflow:

This solution uses AWS CodePipeline, which lets you orchestrate and automate the test and deploy stages for ETL application source code. The solution consists of a pipeline that contains the following stages:

1.) Source Control: In this stage, the AWS Glue ETL job source code and the AWS CloudFormation template file for deploying the ETL jobs are both committed to version control. I chose to use AWS CodeCommit for version control.

To get the ETL job source code and AWS CloudFormation template, download the gluedemoetl.zip file. This solution is developed based on a previous post, Build a Data Lake Foundation with AWS Glue and Amazon S3.

2.) LiveTest: In this stage, all resources—including AWS Glue crawlers, jobs, S3 buckets, roles, and other resources that are required for the solution—are provisioned, deployed, live tested, and cleaned up.

The LiveTest stage includes the following actions:

  • Deploy: In this action, all the resources that are required for this solution (crawlers, jobs, buckets, roles, and so on) are provisioned and deployed using an AWS CloudFormation template.
  • AutomatedLiveTest: In this action, all the AWS Glue crawlers and jobs are executed and data exploration and validation tests are performed. These validation tests include, but are not limited to, record counts in both raw tables and transformed tables in the data lake and any other business validations. I used AWS CodeBuild for this action.
  • LiveTestApproval: This action is included for the cases in which a pipeline administrator approval is required to deploy/promote the ETL applications to the next stage. The pipeline pauses in this action until an administrator manually approves the release.
  • LiveTestCleanup: In this action, all the LiveTest stage resources, including test crawlers, jobs, roles, and so on, are deleted using the AWS CloudFormation template. This action helps minimize cost by ensuring that the test resources exist only for the duration of the AutomatedLiveTest and LiveTestApproval actions.

3.) DeployToProduction: In this stage, all the resources are deployed using the AWS CloudFormation template to the production environment.

Try it out

This pipeline takes approximately 20 minutes to complete the LiveTest stage (up to the LiveTestApproval action, where manual approval is required).

To get started with this solution, choose Launch Stack:

This creates the CI/CD pipeline with all of its stages, as described earlier. It performs an initial commit of the sample AWS Glue ETL job source code to trigger the first release change.

In the AWS CloudFormation console, choose Create. After the template finishes creating resources, you see the pipeline name on the stack Outputs tab.

After that, open the CodePipeline console and select the newly created pipeline. Initially, your pipeline’s CodeCommit stage shows that the source action failed.

Allow a few minutes for your new pipeline to detect the initial commit applied by the CloudFormation stack creation. As soon as the commit is detected, your pipeline starts. You will see the successful stage completion status as soon as the CodeCommit source stage runs.

In the CodeCommit console, choose Code in the navigation pane to view the solution files.

Next, you can watch how the pipeline goes through the LiveTest stage of the deploy and AutomatedLiveTest actions, until it finally reaches the LiveTestApproval action.

At this point, if you check the AWS CloudFormation console, you can see that a new template has been deployed as part of the LiveTest deploy action.

At this point, make sure that the AWS Glue crawlers and the AWS Glue job ran successfully. Also check whether the corresponding databases and external tables have been created in the AWS Glue Data Catalog. Then verify that the data is validated using Amazon Athena, as shown following.

Open the AWS Glue console, and choose Databases in the navigation pane. You will see the following databases in the Data Catalog:

Open the Amazon Athena console, and run the following queries. Verify that the record counts are matching.

SELECT count(*) FROM "nycitytaxi_gluedemocicdtest"."data";
SELECT count(*) FROM "nytaxiparquet_gluedemocicdtest"."datalake";

The following shows the raw data:

The following shows the transformed data:

The pipeline pauses the action until the release is approved. After validating the data, manually approve the revision on the LiveTestApproval action on the CodePipeline console.

Add comments as needed, and choose Approve.

The LiveTestApproval stage now appears as Approved on the console.

After the revision is approved, the pipeline proceeds to use the AWS CloudFormation template to destroy the resources that were deployed in the LiveTest deploy action. This helps reduce cost and ensures a clean test environment on every deployment.

Production deployment is the final stage. In this stage, all the resources—AWS Glue crawlers, AWS Glue jobs, Amazon S3 buckets, roles, and so on—are provisioned and deployed to the production environment using the AWS CloudFormation template.

After successfully running the whole pipeline, feel free to experiment with it by changing the source code stored on AWS CodeCommit. For example, if you modify the AWS Glue ETL job to generate an error, it should make the AutomatedLiveTest action fail. Or if you change the AWS CloudFormation template to make its creation fail, it should affect the LiveTest deploy action. The objective of the pipeline is to ensure that all changes deployed to production work as expected.

Conclusion

In this post, you learned how easy it is to implement CI/CD at scale for serverless AWS Glue ETL solutions using AWS developer tools like AWS CodePipeline and AWS CodeBuild. Implementing such a pipeline can help you accelerate ETL development and testing at your organization.

If you have questions or suggestions, please comment below.

Additional Reading

If you found this post useful, be sure to check out Implement Continuous Integration and Delivery of Apache Spark Applications using AWS and Build a Data Lake Foundation with AWS Glue and Amazon S3.

About the Authors

Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable big data, machine learning, artificial intelligence, and IoT solutions for AWS enterprise and strategic customers. His interests extend to technologies such as advanced edge computing and machine learning at the edge. In his spare time, he enjoys spending time with his family.

Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.


Amazon Transcribe Now Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-transcribe-now-generally-available/


At AWS re:Invent 2017 we launched Amazon Transcribe in private preview. Today we’re excited to make Amazon Transcribe generally available for all developers. Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech-to-text capabilities to their applications. We’ve iterated on customer feedback during the preview to make a number of enhancements to Amazon Transcribe.

New Amazon Transcribe Features in GA

To start off, we’ve made the SampleRate parameter optional, which means you only need to know the file type of your media and the input language. We’ve added two new features: the ability to differentiate multiple speakers in the audio to provide more intelligible transcripts (“who spoke when”), and a custom vocabulary to improve the accuracy of speech recognition for product names, industry-specific terminology, or names of individuals. To refresh our memories on how Amazon Transcribe works, let’s look at a quick example. I’ll convert this audio in my S3 bucket.

import boto3
transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="TranscribeDemo",
    LanguageCode="en-US",
    MediaFormat="mp3",
    Media={"MediaFileUri": "https://s3.amazonaws.com/randhunt-transcribe-demo-us-east-1/out.mp3"},
    # Enable the new speaker differentiation ("who spoke when") shown in the output below
    Settings={"ShowSpeakerLabels": True, "MaxSpeakerLabels": 2}
)

This will output JSON similar to the following (I’ve stripped out most of the response), with individual speakers identified:

{
  "jobName": "reinvent",
  "accountId": "1234",
  "results": {
    "transcripts": [
      {
        "transcript": "Hi, everybody, i'm randall ..."
      }
    ],
    "speaker_labels": {
      "speakers": 2,
      "segments": [
        {
          "start_time": "0.000000",
          "speaker_label": "spk_0",
          "end_time": "0.010",
          "items": []
        },
        {
          "start_time": "0.010000",
          "speaker_label": "spk_1",
          "end_time": "4.990",
          "items": [
            {
              "start_time": "1.000",
              "speaker_label": "spk_1",
              "end_time": "1.190"
            },
            {
              "start_time": "1.190",
              "speaker_label": "spk_1",
              "end_time": "1.700"
            }
          ]
        }
      ]
    },
    "items": [
      {
        "start_time": "1.000",
        "end_time": "1.190",
        "alternatives": [
          {
            "confidence": "0.9971",
            "content": "Hi"
          }
        ],
        "type": "pronunciation"
      },
      {
        "alternatives": [
          {
            "content": ","
          }
        ],
        "type": "punctuation"
      },
      {
        "start_time": "1.190",
        "end_time": "1.700",
        "alternatives": [
          {
            "confidence": "1.0000",
            "content": "everybody"
          }
        ],
        "type": "pronunciation"
      }
    ]
  },
  "status": "COMPLETED"
}

Custom Vocabulary

Now if I needed to have a more complex technical discussion with a colleague I could create a custom vocabulary. A custom vocabulary is specified as an array of strings passed to the CreateVocabulary API, and you can include your custom vocabulary in a transcription job by passing in its name as part of the Settings in a StartTranscriptionJob API call. An individual vocabulary can be as large as 50 KB and each phrase must be less than 256 characters. If I wanted to transcribe the recordings of my high school AP Biology class I could create a custom vocabulary in Python like this:

import boto3
transcribe = boto3.client("transcribe")
transcribe.create_vocabulary(
    LanguageCode="en-US",
    VocabularyName="APBiology",
    Phrases=[
        "endoplasmic-reticulum",
        "organelle",
        "cisternae",
        "eukaryotic",
        "ribosomes",
        "hepatocytes",
        "cell-membrane"
    ]
)

I can refer to this vocabulary later on by the name APBiology and update it programmatically based on any errors I may find in the transcriptions.
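
Using the vocabulary is then just a matter of naming it in the job settings. A minimal sketch (the job name and media URI are placeholders):

import boto3

transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="APBiologyLecture01",
    LanguageCode="en-US",
    MediaFormat="mp3",
    Media={"MediaFileUri": "https://s3.amazonaws.com/my-bucket/lectures/cell-biology.mp3"},
    Settings={"VocabularyName": "APBiology"}  # apply the custom vocabulary created above
)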

Available Now

Amazon Transcribe is available now in US East (N. Virginia), US West (Oregon), US East (Ohio), and EU (Ireland). Transcribe’s free tier gives you 60 minutes of transcription per month for the first 12 months, with a pay-as-you-go price of $0.0004 per second of transcribed audio after that and a minimum charge of 15 seconds.

When combined with other tools and services, I think Transcribe opens up entirely new opportunities for application development. I’m excited to see what technologies developers build with this new service.

Randall

Amazon Translate Now Generally Available

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-translate-now-generally-available/


Today we’re excited to make Amazon Translate generally available. Late last year at AWS re:Invent my colleague Tara Walker wrote about a preview of a new AI service, Amazon Translate. Starting today you can access Amazon Translate in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland) with a 2 million character monthly free tier for the first 12 months and $15 per million characters after that. There are a number of new features available in GA: automatic source language inference, Amazon CloudWatch support, and up to 5000 characters in a single TranslateText call. Let’s take a quick look at the service in general availability.

Amazon Translate New Features

Since Tara’s post already covered the basics of the service, I want to point out some of the new features released today. Let’s start with a code sample:

import boto3
translate = boto3.client("translate")
resp = translate.translate_text(
    Text="🇫🇷Je suis très excité pour Amazon Traduire🇫🇷",
    SourceLanguageCode="auto",
    TargetLanguageCode="en"
)
print(resp['TranslatedText'])

Since I have specified my source language as auto, Amazon Translate will call Amazon Comprehend on my behalf to determine the source language used in this text. If you couldn’t guess it, we’re writing some French and the output is 🇫🇷I'm very excited about Amazon Translate 🇫🇷. You’ll notice that our emojis are preserved in the output text which is definitely a bonus feature for Millennials like me.

The Translate console is a great way to get started and see some sample responses.

Translate is extremely easy to use in AWS Lambda functions, which allows you to use it with almost any AWS service. There are a number of examples in the Translate documentation showing how to do everything from translating a web page to translating an Amazon DynamoDB table. Paired with other ML services like Amazon Comprehend and Amazon Transcribe, you can build everything from closed captioning to real-time chat translation to a robust text analysis pipeline for call center transcriptions and other textual data.
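
As a hedged sketch of the Lambda pattern (the event shape and default target language here are my own assumptions):

import boto3

translate = boto3.client("translate")

def handler(event, context):
    # Expects an event like {"text": "Bonjour", "target": "en"}.
    resp = translate.translate_text(
        Text=event["text"],
        SourceLanguageCode="auto",                    # let Amazon Comprehend infer the source language
        TargetLanguageCode=event.get("target", "en")
    )
    return {"translated": resp["TranslatedText"]}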

New Languages Coming Soon

Today, Amazon Translate allows you to translate text to or from English, to any of the following languages: Arabic, Chinese (Simplified), French, German, Portuguese, and Spanish. We’ve announced support for additional languages coming soon: Japanese (go JAWSUG), Russian, Italian, Chinese (Traditional), Turkish, and Czech.

Amazon Translate can also be used to increase professional translator efficiency, and reduce costs and turnaround times for their clients. We’ve already partnered with a number of Language Service Providers (LSPs) to offer their customers end-to-end translation services at a lower cost by allowing Amazon Translate to produce a high-quality draft translation that’s then edited by the LSP for a guaranteed human quality result.

I’m excited to see what applications our customers are able to build with high quality machine translation just one API call away.

Randall

Amazon SageMaker Now Supports Additional Instance Types, Local Mode, Open Sourced Containers, MXNet and TensorFlow Updates

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-sagemaker-roundup-sf/

Amazon SageMaker continues to iterate quickly and release new features on behalf of customers. Starting today, SageMaker adds support for many new instance types, local testing with the SDK, and Apache MXNet 1.1.0 and TensorFlow 1.6.0. Let’s take a quick look at each of these updates.

New Instance Types

Amazon SageMaker customers now have additional options for right-sizing their workloads for notebooks, training, and hosting. Notebook instances now support almost all T2, M4, P2, and P3 instance types with the exception of t2.micro, t2.small, and m4.large instances. Model training now supports nearly all M4, M5, C4, C5, P2, and P3 instances with the exception of m4.large, c4.large, and c5.large instances. Finally, model hosting now supports nearly all T2, M4, M5, C4, C5, P2, and P3 instances with the exception of m4.large instances. Many customers can take advantage of the newest P3, C5, and M5 instances to get the best price/performance for their workloads. Customers also take advantage of the burstable compute model on T2 instances for endpoints or notebooks that are used less frequently.

Open Sourced Containers, Local Mode, and TensorFlow 1.6.0 and MXNet 1.1.0

Today Amazon SageMaker has open sourced the MXNet and TensorFlow deep learning containers that power the MXNet and TensorFlow estimators in the SageMaker SDK. The ability to write Python scripts that conform to a simple interface is still one of my favorite SageMaker features, and now those containers can be further customized to include any additional libraries. You can download these containers locally to iterate and experiment, which can accelerate your debugging cycle. When you’re ready to go from local testing to production training and hosting, you just change one line of code.
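
That one line is the instance type. A minimal sketch of the idea (the script name and role are placeholders):

from sagemaker.mxnet import MXNet

estimator = MXNet(entry_point='train.py',                 # your training script
                  role='MySageMakerExecutionRole',
                  train_instance_count=1,
                  train_instance_type='local')            # runs in the open sourced container on this machine
# When you're ready for production training, change only the instance type, e.g.:
# train_instance_type='ml.p3.2xlarge'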

These containers launch with support for TensorFlow 1.6.0 and MXNet 1.1.0 as well. TensorFlow 1.6.0 includes support for CUDA 9.0, cuDNN 7, and AVX instructions, which allows for significant speedups in many training applications. MXNet 1.1.0 adds a number of new features, including a text API (mxnet.text) with support for text processing, indexing, glossaries, and more. Two of the really cool pre-trained embeddings included are GloVe and fastText.

Available Now

All of the features mentioned above are available today. As always, please let us know on Twitter or in the comments below if you have any questions or if you’re building something interesting. Now, if you’ll excuse me, I’m going to go experiment with some of those new MXNet APIs!

Randall

Facebook and Cambridge Analytica

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/03/facebook_and_ca.html

In the wake of the Cambridge Analytica scandal, news articles and commentators have focused on what Facebook knows about us. A lot, it turns out. It collects data from our posts, our likes, our photos, things we type and delete without posting, and things we do while not on Facebook and even when we’re offline. It buys data about us from others. And it can infer even more: our sexual orientation, political beliefs, relationship status, drug use, and other personality traits — even if we didn’t take the personality test that Cambridge Analytica developed.

But for every article about Facebook’s creepy stalker behavior, thousands of other companies are breathing a collective sigh of relief that it’s Facebook and not them in the spotlight. Because while Facebook is one of the biggest players in this space, there are thousands of other companies that spy on and manipulate us for profit.

Harvard Business School professor Shoshana Zuboff calls it “surveillance capitalism.” And as creepy as Facebook is turning out to be, the entire industry is far creepier. It has existed in secret far too long, and it’s up to lawmakers to force these companies into the public spotlight, where we can all decide if this is how we want society to operate and — if not — what to do about it.

There are 2,500 to 4,000 data brokers in the United States whose business is buying and selling our personal data. Last year, Equifax was in the news when hackers stole personal information on 150 million people, including Social Security numbers, birth dates, addresses, and driver’s license numbers.

You certainly didn’t give it permission to collect any of that information. Equifax is one of those thousands of data brokers, most of them you’ve never heard of, selling your personal information without your knowledge or consent to pretty much anyone who will pay for it.

Surveillance capitalism takes this one step further. Companies like Facebook and Google offer you free services in exchange for your data. Google’s surveillance isn’t in the news, but it’s startlingly intimate. We never lie to our search engines. Our interests and curiosities, hopes and fears, desires and sexual proclivities, are all collected and saved. Add to that the websites we visit that Google tracks through its advertising network, our Gmail accounts, our movements via Google Maps, and what it can collect from our smartphones.

That phone is probably the most intimate surveillance device ever invented. It tracks our location continuously, so it knows where we live, where we work, and where we spend our time. It’s the first and last thing we check in a day, so it knows when we wake up and when we go to sleep. We all have one, so it knows who we sleep with. Uber used just some of that information to detect one-night stands; your smartphone provider and any app you allow to collect location data knows a lot more.

Surveillance capitalism drives much of the internet. It’s behind most of the “free” services, and many of the paid ones as well. Its goal is psychological manipulation, in the form of personalized advertising to persuade you to buy something or do something, like vote for a candidate. And while the individualized profile-driven manipulation exposed by Cambridge Analytica feels abhorrent, it’s really no different from what every company wants in the end. This is why all your personal information is collected, and this is why it is so valuable. Companies that can understand it can use it against you.

None of this is new. The media has been reporting on surveillance capitalism for years. In 2015, I wrote a book about it. Back in 2010, the Wall Street Journal published an award-winning two-year series about how people are tracked both online and offline, titled “What They Know.”

Surveillance capitalism is deeply embedded in our increasingly computerized society, and if the extent of it came to light there would be broad demands for limits and regulation. But because this industry can largely operate in secret, only occasionally exposed after a data breach or investigative report, we remain mostly ignorant of its reach.

This might change soon. In 2016, the European Union passed the comprehensive General Data Protection Regulation, or GDPR. The details of the law are far too complex to explain here, but some of the things it mandates are that personal data of EU citizens can only be collected and saved for “specific, explicit, and legitimate purposes,” and only with explicit consent of the user. Consent can’t be buried in the terms and conditions, nor can it be assumed unless the user opts in. This law will take effect in May, and companies worldwide are bracing for its enforcement.

Because pretty much all surveillance capitalism companies collect data on Europeans, this will expose the industry like nothing else. Here’s just one example. In preparation for this law, PayPal quietly published a list of over 600 companies it might share your personal data with. What will it be like when every company has to publish this sort of information, and explicitly explain how it’s using your personal data? We’re about to find out.

In the wake of this scandal, even Mark Zuckerberg said that his industry probably should be regulated, although he’s certainly not wishing for the sorts of comprehensive regulation the GDPR is bringing to Europe.

He’s right. Surveillance capitalism has operated without constraints for far too long. And advances in both big data analysis and artificial intelligence will make tomorrow’s applications far creepier than today’s. Regulation is the only answer.

The first step to any regulation is transparency. Who has our data? Is it accurate? What are they doing with it? Who are they selling it to? How are they securing it? Can we delete it? I don’t see any hope of Congress passing a GDPR-like data protection law anytime soon, but it’s not too far-fetched to demand laws requiring these companies to be more transparent in what they’re doing.

One of the responses to the Cambridge Analytica scandal is that people are deleting their Facebook accounts. It’s hard to do right, and doesn’t do anything about the data that Facebook collects about people who don’t use Facebook. But it’s a start. The market can put pressure on these companies to reduce their spying on us, but it can only do that if we force the industry out of its secret shadows.

This essay previously appeared on CNN.com.

EDITED TO ADD (4/2): Slashdot thread.

SoFi, the underwater robotic fish

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/robotic-fish/

With the Greenland shark finally caught on video for the very first time, scientists and engineers are discussing the limitations of current marine monitoring technology. One significant advance comes from the CSAIL team at Massachusetts Institute of Technology (MIT): SoFi, the robotic fish.

A Robotic Fish Swims in the Ocean

More info: http://bit.ly/SoFiRobot Paper: http://robert.katzschmann.eu/wp-content/uploads/2018/03/katzschmann2018exploration.pdf

The untethered SoFi robot

Last week, the Computer Science and Artificial Intelligence Laboratory (CSAIL) team at MIT unveiled SoFi, “a soft robotic fish that can independently swim alongside real fish in the ocean.”

MIT CSAIL underwater fish SoFi using Raspberry Pi

Directed by a Super Nintendo controller and acoustic signals, SoFi can dive untethered to a maximum of 18 feet for a total of 40 minutes. A Raspberry Pi receives input from the controller and amplifies the ultrasound signals for SoFi via a HiFiBerry. The controller, Raspberry Pi, and HiFiBerry are sealed within a waterproof, cast-moulded silicone membrane filled with non-conductive mineral oil, allowing for underwater equalisation.


The ultrasound signals, received by a modem within SoFi’s head, control everything from direction, tail oscillation, pitch, and depth to the onboard camera.

As explained on MIT’s news blog, “to make the robot swim, the motor pumps water into two balloon-like chambers in the fish’s tail that operate like a set of pistons in an engine. As one chamber expands, it bends and flexes to one side; when the actuators push water to the other channel, that one bends and flexes in the other direction.”


Ocean exploration

While we’ve seen many autonomous underwater vehicles (AUVs) using onboard Raspberry Pis, SoFi’s ability to roam untethered with a wireless waterproof controller is an exciting achievement.

“To our knowledge, this is the first robotic fish that can swim untethered in three dimensions for extended periods of time. We are excited about the possibility of being able to use a system like this to get closer to marine life than humans can get on their own.” – CSAIL PhD candidate Robert Katzschmann

As the MIT news post notes, SoFi's simple, lightweight setup of a single camera, a motor, and a smartphone lithium polymer battery sets it apart from existing bulky AUVs that require large motors or support from boats.

For more in-depth information on SoFi and the onboard tech that controls it, find the CSAIL team’s paper here.

The post SoFi, the underwater robotic fish appeared first on Raspberry Pi.

Real-Time Hotspot Detection in Amazon Kinesis Analytics

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/real-time-hotspot-detection-in-amazon-kinesis-analytics/

Today we’re releasing a new machine learning feature in Amazon Kinesis Data Analytics for detecting “hotspots” in your streaming data. We launched Kinesis Data Analytics in August of 2016 and we’ve continued to add features since. As you may already know, Kinesis Data Analytics is a fully managed real-time processing engine for streaming data that lets you write SQL queries to derive meaning from your data and output the results to Kinesis Data Firehose, Kinesis Data Streams, or even an AWS Lambda function. The new HOTSPOT function adds to the existing machine learning capabilities in Kinesis that allow customers to leverage unsupervised streaming based machine learning algorithms. Customers don’t need to be experts in data science or machine learning to take advantage of these capabilities.

Hotspots

The HOTSPOTS function is a new Kinesis Data Analytics SQL function you can use to identify relatively dense regions in your data without having to explicitly build and train complicated machine learning models. You can identify subsections of your data that need immediate attention and take action programmatically by streaming the hotspots out to a Kinesis data stream, to a Firehose delivery stream, or by invoking an AWS Lambda function.

There are a ton of really cool scenarios where this could make your operations easier. Imagine a ride-share program or autonomous vehicle fleet communicating spatiotemporal data about traffic jams and congestion, or a datacenter where a number of servers start to overheat indicating an HVAC issue. HOTSPOTS is not limited to spatiotemporal data and you could apply it across many problem domains.

The function follows some simple syntax and accepts the DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT data types.

The HOTSPOTS function takes a cursor as input and returns a JSON string describing the hotspot. This will be easier to understand with an example.

Using Kinesis Data Analytics to Detect Hotspots

Let’s take a simple data set from NY Taxi and Limousine Commission that tracks yellow cab pickup and dropoff locations. Most of this data is already on S3 and publicly accessible at s3://nyc-tlc/. We will create a small python script to load our Kinesis Data Stream with Taxi records which will feed our Kinesis Data Analytics. Finally we’ll output all of this to a Kinesis Data Firehose connected to an Amazon Elasticsearch Service cluster for visualization with Kibana. I know from living in New York for 5 years that we’ll probably find a hotspot or two in this data.

First, we’ll create an input Kinesis stream and start sending our NYC Taxi Ride data into it. I just wrote a quick python script to read from one of the CSV files and used boto3 to push the records into Kinesis. You can put the record in whatever way works for you.


import csv
import json
import boto3

def chunkit(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

kinesis = boto3.client("kinesis")
with open("taxidata2.csv") as f:
    reader = csv.DictReader(f)
    # PutRecords accepts at most 500 records per call, so send the rides in batches.
    records = chunkit([{"PartitionKey": "taxis", "Data": json.dumps(row)} for row in reader], 500)
    for chunk in records:
        kinesis.put_records(StreamName="TaxiData", Records=chunk)

Next, we’ll create the Kinesis Data Analytics application and add our input stream with our taxi data as the source.

Next we’ll automatically detect the schema.

Now we’ll create a quick SQL Script to detect our hotspots and add that to the Real Time Analytics section of our application.

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "pickup_longitude" DOUBLE,
    "pickup_latitude" DOUBLE,
    HOTSPOTS_RESULT VARCHAR(10000)
);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT "pickup_longitude", "pickup_latitude", "HOTSPOTS_RESULT" FROM
        TABLE(HOTSPOTS(
            CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001"),
            1000,   -- window size (number of records)
            0.013,  -- scan radius
            20      -- minimum number of points to count as a hotspot
        )
    );


Our HOTSPOTS function takes an input stream, a window size, a scan radius, and a minimum number of points to count as a hotspot. The values for these are application-dependent, but you can tinker with them in the console easily until you get the results you want. There are more details about the parameters themselves in the documentation. The HOTSPOTS_RESULT returns some useful JSON that would let us plot bounding boxes around our hotspots:

{
  "hotspots": [
    {
      "density": "elided",
      "minValues": [40.7915039, -74.0077401],
      "maxValues": [40.7915041, -74.0078001]
    }
  ]
}


When we have our desired results, we can save the script and connect our application to our Amazon Elasticsearch Service Firehose delivery stream. We can run an intermediate Lambda function in the Firehose delivery stream to transform our records into a format more suitable for geographic work. Then we can update our mapping in Elasticsearch to index the hotspot objects as geo_shapes.
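
As a rough sketch of that intermediate step, assuming each record reaching the Lambda function is the JSON emitted by our SQL above (with a HOTSPOTS_RESULT field) and that the coordinates arrive in (latitude, longitude) order as in the example output, a Firehose transformation function might reshape each hotspot into a geo_shape envelope like this. The output field names are placeholders; use whatever your Elasticsearch mapping expects:

import base64
import json

def handler(event, context):
    """Kinesis Data Firehose transformation Lambda (sketch)."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        result = json.loads(payload["HOTSPOTS_RESULT"])
        shapes = []
        for spot in result.get("hotspots", []):
            # Assumes [lat, lon] ordering; swap if your columns differ.
            min_lat, min_lon = spot["minValues"]
            max_lat, max_lon = spot["maxValues"]
            shapes.append({
                "density": spot["density"],
                # geo_shape "envelope" is [[minLon, maxLat], [maxLon, minLat]].
                "bounds": {
                    "type": "envelope",
                    "coordinates": [[min_lon, max_lat], [max_lon, min_lat]],
                },
            })
        transformed = json.dumps({"hotspots": shapes}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}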

Finally, we can connect to Kibana and visualize the results.

Looks like Manhattan is pretty busy!

Available Now
This feature is available now in all regions where Kinesis Data Analytics is offered. I think this is a really interesting new feature of Kinesis Data Analytics that can bring immediate value to many applications. Let us know what you build with it on Twitter or in the comments!

Randall

Artificial Intelligence and the Attack/Defense Balance

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/03/artificial_inte.html

Artificial intelligence technologies have the potential to upend the longstanding advantage that attack has over defense on the Internet. This has to do with the relative strengths and weaknesses of people and computers, how those all interplay in Internet security, and where AI technologies might change things.

You can divide Internet security tasks into two sets: what humans do well and what computers do well. Traditionally, computers excel at speed, scale, and scope. They can launch attacks in milliseconds and infect millions of computers. They can scan computer code to look for particular kinds of vulnerabilities, and data packets to identify particular kinds of attacks.

Humans, conversely, excel at thinking and reasoning. They can look at the data and distinguish a real attack from a false alarm, understand the attack as it’s happening, and respond to it. They can find new sorts of vulnerabilities in systems. Humans are creative and adaptive, and can understand context.

Computers — so far, at least — are bad at what humans do well. They’re not creative or adaptive. They don’t understand context. They can behave irrationally because of those things.

Humans are slow, and get bored at repetitive tasks. They’re terrible at big data analysis. They use cognitive shortcuts, and can only keep a few data points in their head at a time. They can also behave irrationally because of those things.

AI will allow computers to take over Internet security tasks from humans, and then do them faster and at scale. Here are possible AI capabilities:

  • Discovering new vulnerabilities — and, more importantly, new types of vulnerabilities — in systems, both by the offense to exploit and by the defense to patch, and then automatically exploiting or patching them.
  • Reacting and adapting to an adversary’s actions, again both on the offense and defense sides. This includes reasoning about those actions and what they mean in the context of the attack and the environment.
  • Abstracting lessons from individual incidents, generalizing them across systems and networks, and applying those lessons to increase attack and defense effectiveness elsewhere.
  • Identifying strategic and tactical trends from large datasets and using those trends to adapt attack and defense tactics.

That’s an incomplete list. I don’t think anyone can predict what AI technologies will be capable of. But it’s not unreasonable to look at what humans do today and imagine a future where AIs are doing the same things, only at computer speeds, scale, and scope.

Both attack and defense will benefit from AI technologies, but I believe that AI has the capability to tip the scales more toward defense. There will be better offensive and defensive AI techniques. But here’s the thing: defense is currently in a worse position than offense precisely because of the human components. Present-day attacks pit the relative advantages of computers and humans against the relative weaknesses of computers and humans. Computers moving into what are traditionally human areas will rebalance that equation.

Roy Amara famously said that we overestimate the short-term effects of new technologies, but underestimate their long-term effects. AI is notoriously hard to predict, so many of the details I speculate about are likely to be wrong — and AI is likely to introduce new asymmetries that we can’t foresee. But AI is the most promising technology I’ve seen for bringing defense up to par with offense. For Internet security, that will change everything.

This essay previously appeared in the March/April 2018 issue of IEEE Security & Privacy.

Auto Scaling is now available for Amazon SageMaker

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/auto-scaling-is-now-available-for-amazon-sagemaker/

Kumar Venkateswar, Product Manager on the AWS ML Platforms Team, shares details on the announcement of Auto Scaling with Amazon SageMaker.


With Amazon SageMaker, thousands of customers have been able to easily build, train and deploy their machine learning (ML) models. Today, we’re making it even easier to manage production ML models, with Auto Scaling for Amazon SageMaker. Instead of having to manually manage the number of instances to match the scale that you need for your inferences, you can now have SageMaker automatically scale the number of instances based on an AWS Auto Scaling Policy.

SageMaker has made managing the ML process easier for many customers. We’ve seen customers take advantage of managed Jupyter notebooks and managed distributed training. We’ve seen customers deploying their models to SageMaker hosting for inferences, as they integrate machine learning with their applications. SageMaker makes this easy: you don’t have to think about patching the operating system (OS) or frameworks on your inference hosts, and you don’t have to configure inference hosts across Availability Zones. You just deploy your models to SageMaker, and it handles the rest.

Until now, you have needed to specify the number and type of instances per endpoint (or production variant) to provide the scale that you need for your inferences. If your inference volume changes, you can change the number and/or type of instances that back each endpoint to accommodate that change, without incurring any downtime. In addition to making it easy to change provisioning, customers have asked us how we can make managing capacity for SageMaker even easier.

Auto Scaling for Amazon SageMaker, available in the SageMaker console, the AWS Auto Scaling API, and the AWS SDK, makes this much easier. Now, instead of having to closely monitor inference volume and change the endpoint configuration in response, customers can configure a scaling policy to be used by AWS Auto Scaling. Auto Scaling adjusts the number of instances up or down in response to actual workloads, determined by using Amazon CloudWatch metrics and target values defined in the policy. In this way, customers can automatically adjust their inference capacity to maintain predictable performance at a low cost. You simply specify the target inference throughput per instance and provide upper and lower bounds for the number of instances for each production variant. SageMaker will then monitor throughput per instance using Amazon CloudWatch alarms, and then it will adjust provisioned capacity up or down as needed.

After you configure the endpoint with Auto Scaling, SageMaker will continue to monitor your deployed models to automatically adjust the instance count. SageMaker will keep throughput within desired levels, in response to changes in application traffic. This makes it easier to manage models in production, and it can help reduce the cost of deployed models, as you no longer have to provision sufficient capacity in order to manage your peak load. Instead, you configure the limits to accommodate your minimum expected traffic and the maximum peak, and Amazon SageMaker will work within those limits to minimize cost.

How do you get started? Open the SageMaker console. For existing endpoints, you first access the endpoint to modify the settings.


Then, scroll to the Endpoint runtime settings section, select the variant, and choose Configure auto scaling.


First, configure the minimum and maximum number of instances.

Next, choose the throughput per instance at which you want to add an additional instance, given previous load testing.

You can optionally set cooldown periods for scaling in or out, to avoid oscillation during periods of wide fluctuation in workload. If you don't, SageMaker will assume default values.

And that’s it! You now have an endpoint that will automatically scale with increasing inferences.

You pay for the capacity used at regular SageMaker pay-as-you-go pricing, so you no longer have to pay for unused capacity during relative idle periods!

Auto Scaling in Amazon SageMaker is available today in the US East (N. Virginia), US East (Ohio), EU (Ireland), and US West (Oregon) AWS Regions. To learn more, see the Amazon SageMaker Auto Scaling documentation.


Kumar Venkateswar is a Product Manager in the AWS ML Platforms team, which includes Amazon SageMaker, Amazon Machine Learning, and the AWS Deep Learning AMIs. When not working, Kumar plays the violin and Magic: The Gathering.

Best Practices for Running Apache Kafka on AWS

Post Syndicated from Prasad Alle original https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-kafka-on-aws/

This post was written in partnership with Intuit to share learnings, best practices, and recommendations for running an Apache Kafka cluster on AWS. Thanks to Vaishak Suresh and his colleagues at Intuit for their contribution and support.

Intuit, in their own words: Intuit, a leading enterprise customer for AWS, is a creator of business and financial management solutions. For more information on how Intuit partners with AWS, see our previous blog post, Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS. Apache Kafka is an open-source, distributed streaming platform that enables you to build real-time streaming applications.

The best practices described in this post are based on our experience in running and operating large-scale Kafka clusters on AWS for more than two years. Our intent for this post is to help AWS customers who are currently running Kafka on AWS, and also customers who are considering migrating on-premises Kafka deployments to AWS.

AWS offers Amazon Kinesis Data Streams, a Kafka alternative that is fully managed.

Running your Kafka deployment on Amazon EC2 provides a high performance, scalable solution for ingesting streaming data. AWS offers many different instance types and storage option combinations for Kafka deployments. However, given the number of possible deployment topologies, it’s not always trivial to select the most appropriate strategy suitable for your use case.

In this blog post, we cover the following aspects of running Kafka clusters on AWS:

  • Deployment considerations and patterns
  • Storage options
  • Instance types
  • Networking
  • Upgrades
  • Performance tuning
  • Monitoring
  • Security
  • Backup and restore

Note: While implementing Kafka clusters in a production environment, make sure also to consider factors like your number of messages, message size, monitoring, failure handling, and any operational issues.

Deployment considerations and patterns

In this section, we discuss various deployment options available for Kafka on AWS, along with pros and cons of each option. A successful deployment starts with thoughtful consideration of these options. Considering availability, consistency, and operational overhead of the deployment helps when choosing the right option.

Single AWS Region, Three Availability Zones, All Active

One typical deployment pattern (all active) is in a single AWS Region with three Availability Zones (AZs). One Kafka cluster is deployed in each AZ along with Apache ZooKeeper and Kafka producer and consumer instances as shown in the illustration following.

In this pattern, this is the Kafka cluster deployment:

  • Kafka producers and Kafka cluster are deployed on each AZ.
  • Data is distributed evenly across three Kafka clusters by using Elastic Load Balancer.
  • Kafka consumers aggregate data from all three Kafka clusters.

Kafka cluster failover occurs this way:

  • Mark down all Kafka producers
  • Stop consumers
  • Debug and restack Kafka
  • Restart consumers
  • Restart Kafka producers

Following are the pros and cons of this pattern.

Pros:
  • Highly available
  • Can sustain the failure of two AZs
  • No message loss during failover
  • Simple deployment

Cons:
  • Very high operational overhead:
    • All changes need to be deployed three times, one for each Kafka cluster
    • Maintaining and monitoring three Kafka clusters
    • Maintaining and monitoring three consumer clusters

A restart is required for patching and upgrading brokers in a Kafka cluster. In this approach, a rolling upgrade is done separately for each cluster.

Single Region, Three Availability Zones, Active-Standby

Another typical deployment pattern (active-standby) is in a single AWS Region with a single Kafka cluster and Kafka brokers and Zookeepers distributed across three AZs. Another similar Kafka cluster acts as a standby as shown in the illustration following. You can use Kafka mirroring with MirrorMaker to replicate messages between any two clusters.

In this pattern, this is the Kafka cluster deployment:

  • Kafka producers are deployed on all three AZs.
  • Only one Kafka cluster is deployed across three AZs (active).
  • ZooKeeper instances are deployed on each AZ.
  • Brokers are spread evenly across all three AZs.
  • Kafka consumers can be deployed across all three AZs.
  • Standby Kafka producers and a Multi-AZ Kafka cluster are part of the deployment.

Kafka cluster failover occurs this way:

  • Switch traffic to standby Kafka producers cluster and Kafka cluster.
  • Restart consumers to consume from standby Kafka cluster.

Following are the pros and cons of this pattern.

Pros:
  • Less operational overhead when compared to the first option
  • Only one Kafka cluster to manage and consume data from
  • Can handle single AZ failures without activating a standby Kafka cluster

Cons:
  • Added latency due to cross-AZ data transfer among Kafka brokers
  • For Kafka versions before 0.10, replicas for topic partitions have to be assigned so they’re distributed to the brokers on different AZs (rack-awareness)
  • The cluster can become unavailable in case of a network glitch, where ZooKeeper does not see Kafka brokers
  • Possibility of in-transit message loss during failover

Intuit recommends using a single Kafka cluster in one AWS Region, with brokers distributed across three AZs (single region, three AZs). This approach offers strong fault tolerance, because the failure of a single AZ won’t cause Kafka downtime.

Storage options

There are two options for file storage in Amazon EC2: instance (ephemeral) storage and Amazon Elastic Block Store (Amazon EBS).

Ephemeral storage is local to the Amazon EC2 instance. It can provide high IOPS based on the instance type. On the other hand, Amazon EBS volumes offer higher resiliency and you can configure IOPS based on your storage needs. EBS volumes also offer some distinct advantages in terms of recovery time. Your choice of storage is closely related to the type of workload supported by your Kafka cluster.

Kafka provides built-in fault tolerance by replicating data partitions across a configurable number of instances. If a broker fails, you can recover it by fetching all the data from other brokers in the cluster that host the other replicas. Depending on the size of the data transfer, it can affect recovery process and network traffic. These in turn eventually affect the cluster’s performance.

The following contrasts the benefits of using an instance store versus using EBS for storage.

Instance store:
  • Instance storage is recommended for large- and medium-sized Kafka clusters. For a large cluster, read/write traffic is distributed across a high number of brokers, so the loss of a broker has less of an impact. For smaller clusters, however, recovering a failed broker takes longer and requires more network traffic.
  • Storage-optimized instances like h1, i3, and d2 are an ideal choice for distributed applications like Kafka.

Amazon EBS:
  • The primary advantage of using EBS in a Kafka deployment is that it significantly reduces data-transfer traffic when a broker fails or must be replaced. The replacement broker joins the cluster much faster.
  • Data stored on EBS is persisted in case of an instance failure or termination. The broker’s data stored on an EBS volume remains intact, and you can mount the EBS volume to a new EC2 instance. Most of the replicated data for the replacement broker is already available in the EBS volume and need not be copied over the network from another broker. Only the changes made after the original broker failure need to be transferred across the network, which makes this process much faster.

Intuit chose EBS because of their frequent instance restacking requirements and also other benefits provided by EBS.

Generally, Kafka deployments use a replication factor of three. However, EBS offers replication within its own service, so Intuit chose a replication factor of two instead of three.

Instance types

The choice of instance types is generally driven by the type of storage required for your streaming applications on a Kafka cluster. If your application requires ephemeral storage, h1, i3, and d2 instances are your best option.

Intuit used r3.xlarge instances for their brokers and r3.large for ZooKeeper, with ST1 (throughput optimized HDD) EBS for their Kafka cluster.

Here are sample benchmark numbers from Intuit tests.

Configuration: r3.xlarge instances, ST1 EBS, 12 brokers, 12 partitions

Aggregate broker throughput: 346.9 MB/s

If you need EBS storage, then AWS has the newer-generation R4 instance. The R4 instance is superior to R3 in many ways:

  • It has a faster processor (Broadwell).
  • EBS is optimized by default.
  • It features networking based on Elastic Network Adapter (ENA), with up to 10 Gbps on smaller sizes.
  • It costs 20 percent less than R3.

Note: It’s always best practice to check for the latest changes in instance types.

Networking

The network plays a very important role in a distributed system like Kafka. A fast and reliable network ensures that nodes can communicate with each other easily. The available network throughput controls the maximum amount of traffic that Kafka can handle. Network throughput, combined with disk storage, is often the governing factor for cluster sizing.

If you expect your cluster to receive high read/write traffic, select an instance type that offers 10-Gb/s performance.

In addition, choose an option that keeps interbroker network traffic on the private subnet, because this approach allows clients to connect to the brokers. Communication between brokers and clients uses the same network interface and port. For more details, see the documentation about IP addressing for EC2 instances.

If you are deploying in more than one AWS Region, you can connect the two VPCs in the two AWS Regions using cross-region VPC peering. However, be aware of the data transfer costs associated with cross-Region and cross-AZ deployments.
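
As a sketch, a cross-region peering connection can be requested with boto3. The VPC IDs and Regions below are placeholders; the peer account or Region still has to accept the request, and both sides need route table entries and security group rules before brokers can talk to each other:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Request peering from a VPC in us-east-1 to a VPC in us-west-2 (placeholder IDs).
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-11111111",
    PeerVpcId="vpc-22222222",
    PeerRegion="us-west-2",
)
print(peering["VpcPeeringConnection"]["VpcPeeringConnectionId"])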

Upgrades

Kafka has a history of not being backward compatible, but its support of backward compatibility is getting better. During a Kafka upgrade, you should keep your producer and consumer clients on a version equal to or lower than the version you are upgrading from. After the upgrade is finished, you can start using a new protocol version and any new features it supports. There are three upgrade approaches available, discussed following.

Rolling or in-place upgrade

In a rolling or in-place upgrade scenario, upgrade one Kafka broker at a time. Take into consideration the recommendations for doing rolling restarts to avoid downtime for end users.

Downtime upgrade

If you can afford the downtime, you can take your entire cluster down, upgrade each Kafka broker, and then restart the cluster.

Blue/green upgrade

Intuit followed the blue/green deployment model for their workloads, as described following.

If you can afford to create a separate Kafka cluster and upgrade it, we highly recommend the blue/green upgrade scenario. In this scenario, we recommend that you keep your clusters up-to-date with the latest Kafka version. For additional details on Kafka version upgrades, see the Kafka upgrade documentation.

The following illustration shows a blue/green upgrade.

In this scenario, the upgrade plan works like this:

  • Create a new Kafka cluster on AWS.
  • Create a new Kafka producers stack to point to the new Kafka cluster.
  • Create topics on the new Kafka cluster.
  • Test the green deployment end to end (sanity check).
  • Using Amazon Route 53, change the new Kafka producers stack on AWS to point to the new green Kafka environment that you have created.

The roll-back plan works like this:

  • Switch Amazon Route 53 to the old Kafka producers stack on AWS to point to the old Kafka environment.

For additional details on blue/green deployment architecture using Kafka, see the re:Invent presentation Leveraging the Cloud with a Blue-Green Deployment Architecture.
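
The Route 53 switch in the upgrade and roll-back plans above is just a record update. Here's a minimal sketch with boto3, assuming a CNAME such as kafka.example.com in a hosted zone you control (the zone ID and record values are placeholders):

import boto3

route53 = boto3.client("route53")

def point_kafka_dns(target):
    """Repoint the (placeholder) kafka.example.com CNAME at a blue or green stack."""
    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",  # placeholder hosted zone ID
        ChangeBatch={
            "Comment": "Blue/green Kafka cutover",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "kafka.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

# Cut over to green, or call again with the blue endpoint to roll back.
point_kafka_dns("green-kafka-producers.example.internal")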

Performance tuning

You can tune Kafka performance in multiple dimensions. Following are some best practices for performance tuning.

These are some general performance-tuning techniques (a producer-side sketch follows the list):

  • If throughput is less than network capacity, try the following:
    • Add more threads
    • Increase batch size
    • Add more producer instances
    • Add more partitions
  • To improve latency when acks = -1, increase your num.replica.fetchers value.
  • For cross-AZ data transfer, tune your buffer settings for sockets and for OS TCP.
  • Make sure that num.io.threads is greater than the number of disks dedicated for Kafka.
  • Adjust num.network.threads based on the number of producers plus the number of consumers plus the replication factor.
  • Your message size affects your network bandwidth. To get higher performance from a Kafka cluster, select an instance type that offers 10 Gb/s performance.
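
To make the producer-side knobs concrete, here's a sketch using the kafka-python client (one client library among several; the constructor arguments map directly onto Kafka's producer configs, and the broker addresses and values shown are placeholders and starting points, not recommendations):

from kafka import KafkaProducer

# Placeholder brokers; batch_size and linger_ms trade latency for throughput.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"],
    acks="all",                # equivalent to acks=-1: wait for all in-sync replicas
    compression_type="gzip",   # smaller messages, less network bandwidth
    batch_size=64 * 1024,      # larger batches improve throughput
    linger_ms=20,              # wait briefly so batches can fill
    retries=3,
)

for i in range(1000):
    producer.send("my-topic", value=("message-%d" % i).encode("utf-8"))
producer.flush()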

For Java and JVM tuning, try the following:

  • Minimize GC pauses by using the Oracle JDK, which uses the new G1 garbage-first collector.
  • Try to keep the Kafka heap size below 4 GB.

Monitoring

Knowing whether a Kafka cluster is working correctly in a production environment is critical. Sometimes, just knowing that the cluster is up is enough, but Kafka applications have many moving parts to monitor. In fact, it can easily become confusing to understand what’s important to watch and what you can set aside. Items to monitor range from simple metrics about the overall rate of traffic, to producers, consumers, brokers, controller, ZooKeeper, topics, partitions, messages, and so on.

For monitoring, Intuit used several tools, including New Relic, Wavefront, Amazon CloudWatch, and AWS CloudTrail. Our recommended monitoring approach follows; a sample CloudWatch alarm sketch appears after the system metrics list.

For system metrics, we recommend that you monitor:

  • CPU load
  • Network metrics
  • File handle usage
  • Disk space
  • Disk I/O performance
  • Garbage collection
  • ZooKeeper
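
As one example of wiring a system metric into an alert, a CloudWatch alarm on broker CPU might look like this sketch (the instance ID, threshold, and SNS topic are placeholders):

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="kafka-broker-1-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:kafka-alerts"],  # placeholder topic
)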

For producers, we recommend that you monitor:

  • Batch-size-avg
  • Compression-rate-avg
  • Waiting-threads
  • Buffer-available-bytes
  • Record-queue-time-max
  • Record-send-rate
  • Records-per-request-avg

For consumers, we recommend that you monitor:

  • Records-lag-max
  • Bytes-consumed-rate
  • Records-consumed-rate
  • Fetch-rate
  • Fetch-latency-avg
  • Fetch-latency-max

Security

Like most distributed systems, Kafka provides the mechanisms to transfer data with relatively high security across the components involved. Depending on your setup, security might involve different services such as encryption, Kerberos, Transport Layer Security (TLS) certificates, and advanced access control list (ACL) setup in brokers and ZooKeeper. The following tells you more about the Intuit approach. For details on Kafka security not covered in this section, see the Kafka documentation.

Encryption at rest

For EBS-backed EC2 instances, you can enable encryption at rest by using Amazon EBS volumes with encryption enabled. Amazon EBS uses AWS Key Management Service (AWS KMS) for encryption. For more details, see Amazon EBS Encryption in the EBS documentation. For instance store–backed EC2 instances, you can enable encryption at rest by using Amazon EC2 instance store encryption.
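
For example, a new encrypted ST1 volume for a broker can be created with boto3. The Availability Zone and size are placeholders, and omitting KmsKeyId uses the default EBS KMS key for the account:

import boto3

ec2 = boto3.client("ec2")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",  # placeholder AZ
    Size=1000,                      # GiB, sized for your retention needs
    VolumeType="st1",               # throughput-optimized HDD, as in the Intuit setup
    Encrypted=True,
    # KmsKeyId="arn:aws:kms:...:key/...",  # optional customer-managed key
)
print(volume["VolumeId"])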

Encryption in transit

Kafka uses TLS for client and internode communications.

Authentication

Authentication of connections to brokers from clients (producers and consumers), from other brokers, and from tools uses either Secure Sockets Layer (SSL) or Simple Authentication and Security Layer (SASL).

Kafka supports Kerberos authentication. If you already have a Kerberos server, you can add Kafka to your current configuration.

Authorization

In Kafka, authorization is pluggable and integration with external authorization services is supported.

Backup and restore

The type of storage used in your deployment dictates your backup and restore strategy.

The best way to back up a Kafka cluster based on instance storage is to set up a second cluster and replicate messages using MirrorMaker. Kafka’s mirroring feature makes it possible to maintain a replica of an existing Kafka cluster. Depending on your setup and requirements, your backup cluster might be in the same AWS Region as your main cluster or in a different one.

For EBS-based deployments, you can enable automatic snapshots of EBS volumes to back up volumes, and you can easily create new EBS volumes from these snapshots to restore. We recommend storing backup files in Amazon S3.
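
A minimal sketch of snapshotting a broker's data volume with boto3 (the volume ID is a placeholder; in practice you would run this on a schedule, for example from a scheduled Lambda function):

import boto3

ec2 = boto3.client("ec2")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # placeholder broker data volume
    Description="Nightly backup of Kafka broker data volume",
)
print(snapshot["SnapshotId"])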

For more information on how to back up in Kafka, see the Kafka documentation.

Conclusion

In this post, we discussed several patterns for running Kafka in the AWS Cloud. AWS also provides Amazon Kinesis Data Streams as a fully managed alternative: there are no servers to manage or scaling cliffs to worry about, you can scale the size of your streaming pipeline in seconds without downtime, data replication across Availability Zones is automatic, and you benefit from security out of the box. Kinesis Data Streams is tightly integrated with a wide variety of AWS services, such as Lambda, Amazon Redshift, and Amazon Elasticsearch Service, and it supports open-source frameworks like Storm, Spark, and Flink. If you need to bridge the two, you may refer to the Kafka-Kinesis Connector.

If you have questions or suggestions, please comment below.


Additional Reading

If you found this post useful, be sure to check out Implement Serverless Log Analytics Using Amazon Kinesis Analytics and Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics.


About the Author

Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable big data, machine learning, artificial intelligence, and IoT solutions for AWS enterprise and strategic customers. His interests extend to various technologies, such as advanced edge computing and machine learning at the edge. In his spare time, he enjoys spending time with his family.