Tag Archives: computer vision

OpenVX API for Raspberry Pi

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/openvx-api-for-raspberry-pi/

Raspberry Pi is excited to bring the Khronos OpenVX 1.3 API to our line of single-board computers. Here’s Kiriti Nagesh Gowda, AMD’s MTS Software Development Engineer, to tell you more.

OpenVX for computer vision

OpenVX™ is an open, royalty-free API standard for cross-platform acceleration of computer vision applications developed by The Khronos Group. The Khronos Group is an open industry consortium of more than 150 leading hardware and software companies creating advanced, royalty-free acceleration standards for 3D graphics, augmented and virtual reality, vision, and machine learning. Khronos standards include Vulkan®, OpenCL™, SYCL™, OpenVX™, NNEF™, and many others.

Now with added Raspberry Pi

The Khronos Group and Raspberry Pi have come together to work on an open-source implementation of OpenVX™ 1.3 that passes conformance on Raspberry Pi. The implementation passes the Vision, Enhanced Vision, and Neural Net conformance profiles specified in OpenVX 1.3.

Application developers may always freely use Khronos standards when they are available on the target system. To enable companies to test their products for conformance, Khronos has established an Adopters Program for each standard. This helps to ensure that Khronos standards are consistently implemented by multiple vendors to create a reliable platform for developers. Conformant products also enjoy protection from the Khronos IP Framework, ensuring that Khronos members will not assert their IP essential to the specification against the implementation.

OpenVX enables performance- and power-optimized computer vision processing, which is especially important in embedded and real-time use cases such as face, body, and gesture tracking, smart video surveillance, advanced driver assistance systems (ADAS), object and scene reconstruction, augmented reality, visual inspection, robotics, and more. Developers can use this robust API in their applications, knowing that those applications are portable across all conformant hardware.

Below, we will go over how to build and install the open-source OpenVX 1.3 library on Raspberry Pi 4 Model B. We will run the conformance for the Vision, Enhanced Vision, & Neural Net conformance profiles and create a simple computer vision application to get started with OpenVX on Raspberry Pi.

OpenVX 1.3 implementation for Raspberry Pi

The OpenVX 1.3 implementation is available on GitHub. To build and install the library, follow the instructions below.

Build OpenVX 1.3 on Raspberry Pi

Git clone the project with the recursive flag to get submodules:

git clone --recursive https://github.com/KhronosGroup/OpenVX-sample-impl.git

Note: The API Documents and Conformance Test Suite are set as submodules in the sample implementation project.

Use the Build.py script to build and install OpenVX 1.3:

cd OpenVX-sample-impl/
python Build.py --os=Linux --venum --conf=Debug --conf_vision --enh_vision --conf_nn

Build and run the conformance:

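export OPENVX_DIR=$(pwd)/install/Linux/x32/Debug
export VX_TEST_DATA_PATH=$(pwd)/cts/test_data/
mkdir build-cts
cd build-cts
cmake -DOPENVX_INCLUDES=$OPENVX_DIR/include -DOPENVX_LIBRARIES=$OPENVX_DIR/bin/libopenvx.so\;$OPENVX_DIR/bin/libvxu.so\;pthread\;dl\;m\;rt -DOPENVX_CONFORMANCE_VISION=ON -DOPENVX_USE_ENHANCED_VISION=ON -DOPENVX_CONFORMANCE_NEURAL_NETWORKS=ON ../cts/
cmake --build .
LD_LIBRARY_PATH=./lib ./bin/vx_test_conformance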

Sample application

Use the open-source samples on GitHub to test the installation.

The post OpenVX API for Raspberry Pi appeared first on Raspberry Pi.

Essential Suite — Artwork Producer Assistant

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/essential-suite-artwork-producer-assistant-8f2a760bc150

By: Hamid Shahid & Syed Haq


Netflix continues to invest in content for a global audience with a diverse range of unique tastes and interests. Correspondingly, the member experience must also evolve to connect this global audience to the content that most appeals to each of them. Images that represent titles on Netflix (what we at Netflix call “artwork”) have proven to be one of the most effective ways to help our members discover the content they love to watch. We thus need to have a rich and diverse set of artwork that is tailored for different parts of the Netflix experience (what we call product canvases). We also need to source multiple images for each title representing different themes so we can present an image that is relevant to each member’s taste.

Manual curation and review of these high quality images from scratch for a growing catalog of titles can be particularly challenging for our Product Creative Strategy Producers (referred to as producers in the rest of the article). Below, we discuss how we’ve built upon our previous work of harvesting static images directly from video source files and our computer vision algorithms to produce a set of artwork candidates that covers the major product canvases for the entire content catalog. The artwork generated by this pipeline is used to augment the artwork typically sourced from design agencies. We call this suite of assisted artwork “The Essential Suite”.

Supplement, not replace

Producers from our Creative Production team are the ultimate decision makers when it comes to the selection of artwork that gets published for each title. Our use of computer vision to generate artwork candidates from video sources is thus focused on alleviating the workload for our Creative Production team. The team would rather spend its time on creative and strategic tasks than sift through thousands of frames of a show looking for the most compelling ones. With the “Essential Suite”, we are providing an additional tool in the producers’ toolkit. Through testing, we have learned that with proper checks and human curation in place, assisted artwork candidates can perform on par with agency-designed artwork.

Design Agencies

Netflix uses best-in-class design agencies to provide artwork that can be used to promote titles on and off the Netflix service. Netflix producers work closely with design agencies to request, review and approve artwork. All artwork is delivered through a web application made available to the design agencies.

The computer-generated artwork can be considered artwork provided by an “internal agency”. The idea is to generate artwork candidates from video source files and “bubble them up” to the producers on the same artwork portal where they review all other artwork, ideally without their knowing whether a given piece is agency-produced or internally curated, so that they select what goes on product purely based on the creative quality of the image.

Assisted Artwork Generation Workflow

The artwork generation process involves several steps, starting with the arrival of the video source files and culminating in generated artwork being made available to producers. We use Netflix Conductor, an open source workflow engine, to run the orchestration. The whole process can be divided into two parts:

  1. Generation
  2. Review

1. Generation

This article on AVA provides a good explanation on our technology to extract interesting images from video source files. The artwork generation workflow takes it a step further. For a given product canvas, it selects a handful of images from the hundreds of video stills most suitable for that particular product canvas. The workflow then crops and color-corrects the selected image, picks out the best spot to place the movie’s title based on negative space, selects and resizes the movie title and places it onto the image.

Here is an illustration of what the process would involve if we had to do it manually:

a. Image selection
b. Identify areas of interest
c. Cropped, color-corrected & title placed in the negative space

Image Selection / Analyze Image

Selecting the right still image is essential to generating good quality artwork. A lot of work has already been done in AVA to extract a few hundred frames from the hundreds of thousands of frames present in a typical video source. Broadly speaking, we use two methods to extract movie stills from a video source.

  1. AVA: AVA is primarily a character-based algorithm. It picks up frames with a clear facial shot, taking into account actors, facial expression and shot detection.
  2. Cinematics: Cinematics picks up aesthetically pleasing cinematic shots.

The combination of these two approaches produces a few hundred movie stills from a typical video source. For a season, this would be a few hundred shots for each episode. Our work here is to pick the stills that work best for the desired canvas.

Both of the above algorithms use a few heatmaps which define what kind of images have proven to work best in different canvases. The heatmaps are designed by internal artists who are experienced in designing promotional artwork and posters.

Heatmap for a Billboard

We make use of meta-information such as the size of the desired canvas, the “unsafe regions” and the “regions of interest” to identify which image would serve best. “Unsafe regions” are areas in the image where badges such as the Netflix logo, new episodes, etc., are placed. “Regions of interest” are areas that are always displayed in multi-purpose canvases. These details are stored as metadata for each canvas type and passed to the algorithm by the workflow. Some of our canvases are cropped dynamically for different user interfaces. For such images, the “region of interest” is the area that is always displayed in each crop.

Unsafe regions

This data-driven approach allows for fast turnaround for additional canvases. While selecting images, the algorithms also return suggested coordinates within each image for cropping and title placement. Finally, each selected image is assigned a “score”: the algorithm’s “confidence”, based on previously collected stats, in how well the candidate image could perform on the service.
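
To make the canvas metadata idea concrete, here is a minimal sketch of how a check against “unsafe regions” might look. The `Rect` helper, the `billboard` dictionary, and all pixel values are hypothetical and chosen only for illustration; the article does not describe the actual data model or scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper)."""
    left: int
    top: int
    right: int
    bottom: int

    def area(self) -> int:
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)

    def intersection_area(self, other: "Rect") -> int:
        width = min(self.right, other.right) - max(self.left, other.left)
        height = min(self.bottom, other.bottom) - max(self.top, other.top)
        if width <= 0 or height <= 0:
            return 0
        return width * height


def unsafe_overlap_fraction(subject: Rect, unsafe_regions: list) -> float:
    """Fraction of the subject box that is covered by unsafe regions.

    Assumes the unsafe regions do not overlap one another.
    """
    if subject.area() == 0:
        return 0.0
    covered = sum(subject.intersection_area(region) for region in unsafe_regions)
    return covered / subject.area()


# Hypothetical metadata for one canvas type; every number here is made up.
billboard = {
    "size": (1280, 720),
    "unsafe_regions": [Rect(0, 0, 200, 100)],       # e.g. where a logo badge sits
    "region_of_interest": Rect(640, 0, 1280, 720),  # area shown in every crop
}

face_box = Rect(100, 50, 300, 150)
fraction = unsafe_overlap_fraction(face_box, billboard["unsafe_regions"])
```

A selection step could, for example, penalise or discard stills whose main subject overlaps an unsafe region by more than some threshold. That policy is an assumption here, not something the article specifies.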

Image Creation

The artwork generation workflow collates image selection results from each video source and picks up the top “n” images based on confidence score.
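
The collation step described above can be sketched in a few lines. The `Candidate` record and its field names are assumptions made for this example; only the idea of merging per-source results and keeping the top “n” by confidence comes from the article.

```python
import heapq
from typing import NamedTuple


class Candidate(NamedTuple):
    source_id: str      # hypothetical identifier of the video source
    frame_index: int    # position of the still within that source
    confidence: float   # score returned by the selection algorithm


def top_candidates(results_by_source, n):
    """Merge per-source results and keep the n highest-confidence stills."""
    merged = [c for results in results_by_source.values() for c in results]
    return heapq.nlargest(n, merged, key=lambda c: c.confidence)


# Made-up selection results for two episodes of a season.
results_by_source = {
    "episode_1": [Candidate("episode_1", 1200, 0.91), Candidate("episode_1", 5400, 0.42)],
    "episode_2": [Candidate("episode_2", 300, 0.77), Candidate("episode_2", 8800, 0.88)],
}
best = top_candidates(results_by_source, n=2)
```

`heapq.nlargest` avoids sorting the full list, which matters little for a few hundred stills but keeps the intent ("top n") explicit.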

The selected image is then cropped and color-corrected based on coordinates passed by the algorithm. Some canvases also need the movie title to be placed on the image. The process makes use of the heatmap provided by our designers to perform cropping and title placement. As an example, the “Billboard” canvas shown on a movie’s landing page is right aligned, with the title and synopsis shown on the left.

Billboard Canvas

The workers that crop and color-correct images are made available as separate Titus jobs. The workflow invokes the jobs, stores each output in the artwork asset management system and passes it on for review.

2. Review

For each artwork candidate generated by the workflow, we want to get as much feedback as possible from the Creative Production team because they have the most context about the title. However, getting producers to provide feedback on hundreds of generated images is not scalable. For this reason, we have split the review process into two rounds.

Technical Quality Control (QC)

This round of review filters out images that look obviously wrong to the human eye. Images with features such as human actors with an open mouth, inappropriate facial expressions, an incorrect body position, etc., are filtered out in this round.

For the purpose of reviewing these images, we use a video/image annotation application that provides a simple interface to add tags for a given list of videos or images. For our purposes, for each image, we ask the very basic question “Should this image be used for artwork?”

The team reviewing these assets treats each image individually and looks only at the technical aspects of the image, regardless of the theme or genre of the title, or the quantity of images presented for a given title.

When an image is rejected, a few follow up questions are asked to ascertain why the image is not suitable to be used as artwork.

All this review data is fed back to the image selection, cropping and color correction algorithms to train and improve them.

Editorial QC

Unlike technical QC, which is title agnostic, editorial QC is done by producers who are deeply familiar with the themes, storylines and characters in the title, to select artwork that will represent the title best on the Netflix service.

The application used to review generated artwork is the same application that producers use to place and review artwork requests fulfilled by design agencies. A screenshot of how generated artwork is presented to producers is shown below.

Similar to technical QC, the option here for each artwork is whether to approve or reject the artwork. The producers are encouraged to provide reasons why they are rejecting an artwork.

Approved artwork makes its way to the artwork asset management system, where it resides alongside agency-fulfilled artwork. From here, producers have the ability to publish it to the Netflix service.


We have learned a lot from our work on generating artwork. Artwork that looks good might not be the best depiction of the title’s story, and a very clear character image might be a content spoiler. All of these decisions are best made by humans and we intend to keep it that way.

However, assisted artwork generation has a place in supporting our creative team by providing them with another avenue to pick up their assets from, and with careful supervision will help in their challenge of sourcing artwork at scale.

Essential Suite — Artwork Producer Assistant was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Netflix Hack Day — November 2019

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/netflix-hack-day-november-2019-c9b31d95d134?source=rss----2615bd06b42e---4

Netflix Hack Day — Fall 2019

By Tom Richards, Carenina Garcia Motion, and Leslie Posada

Hack Day at Netflix is an opportunity to build and show off a feature, tool, or quirky app. The goal is simple: experiment with new ideas/technologies, engage with colleagues across different disciplines, and have fun!

We know even the silliest idea can spur something more.

The most important value of our Hack Days is that they support a culture of innovation. We believe in this work, even if it never ships, and enjoy sharing the creativity and thought put into these ideas.

Below, you can find videos made by the hackers of some of our favorite hacks from this event.


Nostalgiflix is a Chrome extension that transforms your Netflix web browser into an interactive TV time machine covering three decades (the ’80s, ’90s, and ’00s). By dragging the UI slider around, you can view titles originally released within the selected year (based on their historic box office and episode air dates). More importantly, you can also adjust the video filters in real time to creatively downgrade the viewing experience, further enhancing the nostalgic effect. We think this feature could encourage our users to watch more of our older content while having fun reliving those moments of cinematic history.

By Joey Cato, Nazanin Delam, Sumana Mohan, Jeff Shi, Lily Dwyer, and Vishal Mishra

World of CS

This is a real-time visualization of all contacts around the world. Each square on the map represents one of our global contact centers, spanning from Salt Lake City to Brazil, India, and Japan. The heatmap in the background is a historical trend of calls over the last hour, showing which countries are currently most active in contacting customer service. Every line you see is a live customer contact, starting at the customer’s country and ending at the contact center it was routed to. Four different types of contacts are represented in this visualization: white for regular phone calls, light blue for chats, green for calls initiated through our mobile apps on Android and iOS, and red for contacts escalated from one representative to another.

By Sushruth Puttaswamy and Adam Krasny

Bird Box — Automatic AD

Audio Descriptive tracks provide descriptive narration in addition to dialog, helping visually impaired and blind members enjoy our shows. For the Hack Day project, we explored using recent research¹ to automatically generate descriptions, then used our own internal authoring tools to refine the output. We then used synthetic audio and automated mixing techniques to deliver a final audio description track.

By Adam Wang, Andy Swan, Raja Senapati, Shilpa Jois, Anjali Chablani, Deepa Krishnan, Vidya Sundaram, and Casey Wilms

You can also check out highlights from our past events: May 2019, November 2018, March 2018, August 2017, January 2017, May 2016, November 2015, March 2015, February 2014 & August 2014.

Thanks to all the teams who put together a great round of hacks in 24 hours.


  1. Duan, Xuguang; Huang, Wenbing; Gan, Chuang; Wang, Jingdong; Zhu, Wenwu; Huang, Junzhou. Weakly Supervised Dense Event Captioning in Videos. Advances in Neural Information Processing Systems 31, Curran Associates, Inc., pp. 3062–3072, 2018.

Netflix Hack Day — November 2019 was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Netflix Studio Hack Day — May 2019

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/netflix-studio-hack-day-may-2019-b4a0ecc629eb?source=rss----2615bd06b42e---4

By Tom Richards, Carenina Garcia Motion, and Marlee Tart

Hack Days are a big deal at Netflix. They’re a chance to bring together employees from all our different disciplines to explore new ideas and experiment with emerging technologies.

For the most recent hack day, we channeled our creative energy towards our studio efforts. The goal remained the same: team up with new colleagues and have fun while learning, creating, and experimenting. We know even the silliest idea can spur something more.

The most important value of hack days is that they support a culture of innovation. We believe in this work, even if it never ships, and love to share the creativity and thought put into these ideas.

Below, you can find videos made by the hackers of some of our favorite hacks from this event.

Project Rumble Pak

You’re watching your favorite episode of Voltron when, after a suspenseful pause, there’s a huge explosion — and your phone starts to vibrate in your hands.

The Project Rumble Pak hack day project explores how haptics can enhance the content you’re watching. With every explosion, sword clank, and laser blast, you get force feedback to amp up the excitement.

For this project, we synchronized Netflix content with haptic effects using Immersion Corporation technology.

By Hans van de Bruggen and Ed Barker

The Voice of Netflix

Introducing The Voice of Netflix. We trained a neural net to spot words in Netflix content and reassemble them into new sentences on demand. For our stage demonstration, we hooked this up to a speech recognition engine to respond to our verbal questions in the voice of Netflix’s favorite characters. Try it out yourself at blogofsomeguy.com/v!

By Guy Cirino and Carenina Garcia Motion


TerraVision re-envisions the creative process and revolutionizes the way our filmmakers can search and discover filming locations. Filmmakers can drop a photo of a look they like into an interface and find the closest visual matches from our centralized library of location photos. We are using a computer vision model trained to recognize places to build reverse image search functionality. The model converts each image into a low-dimensional vector, and the matches are obtained by computing the nearest neighbors of the query.

By Noessa Higa, Ben Klein, Jonathan Huang, Tyler Childs, Tie Zhong, and Kenna Hasson
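
As a rough illustration of the nearest-neighbor lookup that TerraVision's description mentions, the sketch below ranks a tiny library of image vectors against a query vector. Cosine similarity, the three-dimensional vectors, and the location names are all assumptions for the example; the post does not say which distance measure or vector size the team used, and a real system would get its vectors from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_locations(query, library, k=3):
    """Return the names of the k library entries most similar to the query."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up embeddings standing in for the output of a place-recognition model.
library = {
    "desert_road": [0.9, 0.1, 0.0],
    "pine_forest": [0.1, 0.9, 0.2],
    "city_rooftop": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
matches = nearest_locations(query, library, k=2)
```

A brute-force scan like this is fine for a small library; larger collections usually rely on an approximate nearest-neighbor index instead.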

Get Out!

Have you ever found yourself needing to give the Evil Eye™ to colleagues who are hogging your conference room after their meeting has ended?

Our hack is a simple web application that allows employees to select a Netflix meeting room anywhere in the world, and press a button to kick people out of their meeting room if they have overstayed their meeting. First, the app looks up calendar events associated with the room and finds the latest meeting in the room that should have already ended. It then automatically calls in to that meeting and plays walk-off music similar to the Oscars’ to not-so-subtly encourage your colleagues to Get Out! We built this hack using Java (Spring Boot framework), the Google OAuth and Calendar APIs (for finding rooms) and the Twilio API (for calling into the meeting), and deployed it on AWS.

By Abi Seshadri and Rachel Rivera
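
The lookup step in Get Out! ("the latest meeting in the room that should have already ended") can be sketched as follows. The hack itself was written in Java against the Google Calendar API; this Python version uses a hypothetical `Meeting` record and hard-coded events purely to show the selection logic.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Meeting(NamedTuple):
    title: str
    start: datetime
    end: datetime


def latest_overrun_meeting(meetings, now) -> Optional[Meeting]:
    """Return the most recently ended meeting, or None if none has ended yet."""
    finished = [m for m in meetings if m.end <= now]
    return max(finished, key=lambda m: m.end, default=None)


# Made-up calendar events for a single room.
room_calendar = [
    Meeting("Standup", datetime(2019, 5, 1, 9, 0), datetime(2019, 5, 1, 9, 30)),
    Meeting("Design review", datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 11, 0)),
    Meeting("Planning", datetime(2019, 5, 1, 13, 0), datetime(2019, 5, 1, 14, 0)),
]
target = latest_overrun_meeting(room_calendar, now=datetime(2019, 5, 1, 11, 10))
```

With these sample events and a current time of 11:10, the function picks the design review, which is the meeting whose attendees would get the walk-off music.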

You can also check out highlights from our past events: November 2018, March 2018, August 2017, January 2017, May 2016, November 2015, March 2015, February 2014 & August 2014.

Thanks to all the teams who put together a great round of hacks in 24 hours.

Netflix Studio Hack Day — May 2019 was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Rock, paper, scissors, lizard, Spock, fire, water balloon!

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/rock-paper-scissors-lizard-spock-fire-water-balloon/

Use a Raspberry Pi and a Pi Camera Module to build your own machine learning–powered rock paper scissors game!

Video: Rock-Paper-Scissors game using computer vision and machine learning on Raspberry Pi

Project GitHub page: https://github.com/DrGFreeman/rps-cv

Virtual rock paper scissors

Here’s why you should always leave comments on our blog: this project from Julien de la Bruère-Terreault instantly had our attention when he shared it on our recent Android Things post.

Julien and his son were building a text-based version of rock paper scissors in Python when his son asked him: “Could you make a rock paper scissors game that uses the camera to detect hand gestures?” Obviously, Julien really had no choice but to accept the challenge.

“The game uses a Raspberry Pi computer and Raspberry Pi Camera Module installed on a 3D-printed support with LED strips to achieve consistent images,” Julien explains in the tutorial for the build. “The pictures taken by the camera are processed and fed to an image classifier that determines whether the gesture corresponds to ‘Rock’, ‘Paper’, or ‘Scissors’ gestures.”

How does it work?

Physically, the build uses a Pi 3 Model B and a Camera Module V2 alongside 3D-printed parts. The parts are all green, since a consistent colour allows easy subtraction of background from the captured images. You can download the files for the setup from Thingiverse.
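
To show why a consistently green background makes subtraction easy, here is a minimal sketch that marks every pixel not dominated by green as foreground. The "green exceeds red and blue by a margin" rule and the `margin` value are assumptions for illustration; Julien's project may well use a different method, such as thresholding in another colour space.

```python
def foreground_mask(image, margin=40):
    """Return a grid of booleans: True where a pixel is NOT background green.

    `image` is a list of rows, each row a list of (r, g, b) tuples (0-255).
    A pixel counts as background when its green channel exceeds both the
    red and blue channels by more than `margin` (an assumed rule).
    """
    mask = []
    for row in image:
        mask.append([not (g > r + margin and g > b + margin) for r, g, b in row])
    return mask


# A tiny made-up 2x2 image: three green background pixels and one skin-toned pixel.
image = [
    [(20, 200, 30), (180, 150, 140)],
    [(25, 210, 35), (30, 190, 20)],
]
mask = foreground_mask(image)
```

The resulting mask isolates the hand, so a classifier only has to look at the gesture's shape rather than at the whole frame.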

To illustrate how the software works, Julien has created a rather delightful pipeline demonstrating where computer vision and machine learning come in.

The way the software works means the game doesn’t need to be limited to the standard three hand signs. If you wanted to, you could add other signs such as ‘lizard’ and ‘Spock’! Or ‘fire’ and ‘water balloon’. Or any other alterations made to the game in your pop culture favourites.
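
Once the classifier can recognise extra signs, the game logic only needs a larger rules table. The sketch below uses the widely known "rock, paper, scissors, lizard, Spock" rules; it is not taken from Julien's code, and it leaves out "fire" and "water balloon".

```python
# Each gesture maps to the set of gestures it defeats.
BEATS = {
    "rock": {"scissors", "lizard"},
    "paper": {"rock", "spock"},
    "scissors": {"paper", "lizard"},
    "lizard": {"spock", "paper"},
    "spock": {"scissors", "rock"},
}


def outcome(player, opponent):
    """Return 'win', 'lose' or 'draw' from the player's point of view."""
    if player not in BEATS or opponent not in BEATS:
        raise ValueError("unknown gesture")
    if player == opponent:
        return "draw"
    return "win" if opponent in BEATS[player] else "lose"


result = outcome("lizard", "spock")
```

Adding another sign means adding one entry to `BEATS` and updating the existing sets, with no change to `outcome`.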

Check out Julien’s full tutorial to build your own AI-powered rock paper scissors game here on Julien’s GitHub. Massive kudos to Julien for spending a year learning the skills required to make it happen. And a massive thank you to Julien’s son for inspiring him! This is why it’s great to do coding and digital making with kids — they have the best project ideas!

Sharing is caring

If you’ve built your own project using Raspberry Pi, please share it with us in the comments below, or via social media. As you can tell from today’s blog post, we love to see them and share them with the whole community!

The post Rock, paper, scissors, lizard, Spock, fire, water balloon! appeared first on Raspberry Pi.

Take a photo of yourself as an unreliable cartoon

Post Syndicated from Helen Lynn original https://www.raspberrypi.org/blog/take-a-photo-of-yourself-unreliable-cartoon/

Take a selfie, wait for the image to appear, and behold a cartoon version of yourself. Or, at least, behold a cartoon version of whatever the camera thought it saw. Welcome to Draw This by maker Dan Macnish.

Dan has made code, instructions, and wiring diagrams available to help you bring this beguiling weirdery into your own life.

Neural networks, object recognition, and cartoons

One of the fun things about this re-imagined Polaroid is that you never get to see the original image. You point and shoot, and out pops a cartoon: the camera’s best interpretation of what it saw. The result is always a surprise. A food selfie of a healthy salad might turn into an enormous hot dog, or a photo with friends might be photobombed by a goat.

OK. Let’s take this one step at a time.

Pi + camera + button + LED

Draw This uses a Raspberry Pi 3 and a Camera Module, with a button and a useful status LED connected to the GPIO pins via a breadboard. You press the button, and the camera captures a still image while the LED comes on and stays lit for a couple of seconds while the Pi processes the image. So far, so standard Pi camera build.

Interpreting and re-interpreting the camera image

Dan uses Python to process the captured photograph, employing a pre-trained machine learning model from Google to recognise multiple objects in the image. Now he brings the strangeness. The Pi matches the things it sees in the photo with doodles from Google’s huge open-source Quick, Draw! dataset, and generates a new image that represents the objects in the original image as doodles. Then a thermal printer connected to the Pi’s GPIO pins prints the results.
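
The matching step can be pictured as a lookup from detected labels to doodle categories. This is only a sketch under assumptions: the `(label, score)` detection format, the `min_score` cut-off, and the file names are hypothetical, and Dan's cartoonify code may work differently.

```python
import random


def pick_doodles(detections, doodles_by_category, min_score=0.5, rng=None):
    """Choose one doodle for each confidently detected object.

    `detections` is a list of (label, score) pairs from an object detector;
    `doodles_by_category` maps a label to a list of available doodles.
    Labels with no matching doodle category are skipped.
    """
    rng = rng or random.Random()
    chosen = []
    for label, score in detections:
        options = doodles_by_category.get(label)
        if score >= min_score and options:
            chosen.append((label, rng.choice(options)))
    return chosen


# Made-up detector output and doodle library.
detections = [("hot dog", 0.81), ("goat", 0.64), ("salad", 0.30), ("lamp", 0.90)]
doodles_by_category = {
    "hot dog": ["hot_dog_001", "hot_dog_002"],
    "goat": ["goat_001"],
    "salad": ["salad_001"],
}
drawn = pick_doodles(detections, doodles_by_category, rng=random.Random(0))
```

Here the low-scoring salad is dropped and the lamp has no doodle category, which is the kind of mismatch that gives the printed cartoon its "unreliable" charm.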

A 28 x 14 grid of kangaroo doodles in dark grey on a white background

Kangaroos from the Quick, Draw! dataset (I got distracted)

Potential for peculiar

Reading about this build leaves me yearning to see its oddest interpretation of a scene, so if you make this and you find it really does turn you or your friend into a goat, please do share that with us.

And as you can see from my kangaroo digression above, there is a ton of potential for bizarro makes that use the Quick, Draw! dataset, object recognition models, or both; it’s not just the marsupials that are inexplicably compelling (I dare you to go and look and see how long it takes you to get back to whatever you were in the middle of). If you’re planning to make this, or something inspired by this, check out Dan’s cartoonify GitHub repo. And tell us all about it in the comments.

The post Take a photo of yourself as an unreliable cartoon appeared first on Raspberry Pi.

Journey into Deep Learning with AWS

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/journey-into-deep-learning-with-aws/

If you are anything like me, Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning are completely fascinating and exciting topics. As AI, ML, and Deep Learning become more widely used, for me it means that the science fiction written by Dr. Isaac Asimov, the robotics and medical advancements in Star Wars, and the technologies that enabled Captain Kirk and his Star Trek crew “to boldly go where no man has gone before” can become achievable realities.


Most people interested in the aforementioned topics are familiar with the AI and ML solutions enabled by Deep Learning, such as Convolutional Neural Networks for Image and Video Classification, Speech Recognition, Natural Language interfaces, and Recommendation Engines. However, it is not always an easy task setting up the infrastructure, environment, and tools to enable data scientists, machine learning practitioners, research scientists, and deep learning hobbyists/advocates to dive into these technologies. Most developers desire to go quickly from getting started with deep learning to training models and developing solutions using deep learning technologies.

For these reasons, I would like to share some resources that will help to quickly build deep learning solutions whether you are an experienced data scientist or a curious developer wanting to get started.

Deep Learning Resources

Apache MXNet is Amazon’s deep learning framework of choice. With the power of the Apache MXNet framework and NVIDIA GPU computing, you can launch your scalable deep learning projects and solutions easily on the AWS Cloud. As you get started on your MXNet deep learning quest, there are a variety of self-service tutorials and datasets available to you:

  • Launch an AWS Deep Learning AMI: This guide walks you through the steps to launch the AWS Deep Learning AMI with Ubuntu.
  • MXNet – Create a computer vision application: This hands-on tutorial uses a pre-built notebook to walk you through using neural networks to build a computer vision application to identify handwritten digits.
  • AWS Machine Learning Datasets: AWS hosts datasets for Machine Learning on the AWS Marketplace that you can access for free. These large datasets are available for anyone to analyze without having to download or store the data.
  • Predict and Extract – Learn to use pre-trained models for predictions: This hands-on tutorial will walk you through how to use a pre-trained model for prediction and feature extraction using the full ImageNet dataset.


AWS Deep Learning AMIs

AWS offers Amazon Machine Images (AMIs) for use on Amazon EC2 for quick deployment of the infrastructure needed to start your deep learning journey. The AWS Deep Learning AMIs come pre-configured with popular deep learning frameworks, are built for Amazon EC2 instances on Amazon Linux and Ubuntu, and can be launched for AI-targeted solutions and models. The deep learning frameworks supported and pre-configured on the deep learning AMI are:

  • Apache MXNet
  • TensorFlow
  • Microsoft Cognitive Toolkit (CNTK)
  • Caffe
  • Caffe2
  • Theano
  • Torch
  • Keras

Additionally, the AWS Deep Learning AMIs install preconfigured libraries for Jupyter notebooks with Python 2.7/3.4, the AWS SDK for Python, and other data science related Python packages and dependencies. The AMIs also come with NVIDIA CUDA and NVIDIA CUDA Deep Neural Network (cuDNN) libraries preinstalled with all the supported deep learning frameworks, and the Intel Math Kernel Library is installed for the Apache MXNet framework. You can launch any of the Deep Learning AMIs by visiting the AWS Marketplace using the Try the Deep Learning AMIs link.


It is a great time to dive into Deep Learning. You can accelerate your work in deep learning by using the AWS Deep Learning AMIs running on the AWS cloud to get your deep learning environment running quickly, or get started learning more about Deep Learning on AWS with MXNet using the AWS self-service resources. Of course, you can learn even more about Deep Learning, Machine Learning, and Artificial Intelligence on AWS by reviewing the AWS Deep Learning page, the Amazon AI product page, and the AWS AI Blog.

May the Deep Learning Force be with you all.


Amazon Rekognition Update – Image Moderation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rekognition-update-image-moderation/

We launched Amazon Rekognition late last year and I told you about it in my post Amazon Rekognition – Image Detection and Recognition Powered by Deep Learning. As I explained at the time, this service was built by our Computer Vision team over the course of many years and analyzes billions of images daily.

Today we are adding image moderation to Rekognition. If your web site or application allows users to upload profile photos or other imagery, you will love this new Rekognition feature.

Rekognition can now identify images that contain suggestive or explicit content that may not be appropriate for your site. The moderation labels provide detailed sub-categories, allowing you to fine-tune the filters that you use to determine what kinds of images you deem acceptable or objectionable. You can use this feature to improve photo sharing sites, forums, dating apps, content platforms for children, e-commerce platforms and marketplaces, and more.

To access this feature, call the DetectModerationLabels function from your code. The response will include a set of moderation labels drawn from a built-in taxonomy:

"ModerationLabels": [
    {
        "Confidence": 83.55088806152344,
        "Name": "Suggestive",
        "ParentName": ""
    },
    {
        "Confidence": 83.55088806152344,
        "Name": "Female Swimwear Or Underwear",
        "ParentName": "Suggestive"
    }
]

You can use the Image Moderation Demo in the AWS Management Console to experiment with this feature:

Image moderation is available now and you can start using it today!
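
As a rough sketch of how the DetectModerationLabels call described above might look from Python, the snippet below uses the AWS SDK for Python (boto3) and its `detect_moderation_labels` client method. The helper names, the blocked category list, and the 80% threshold are assumptions made for illustration; they are not part of the original post.

```python
def fetch_moderation_labels(bucket, key, min_confidence=50.0):
    """Call Rekognition DetectModerationLabels on an image stored in S3."""
    import boto3  # imported lazily so the helper below works without the SDK

    client = boto3.client("rekognition")
    response = client.detect_moderation_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    return response["ModerationLabels"]


def flagged_names(labels, blocked=("Suggestive",), threshold=80.0):
    """Return names of labels that match, or are children of, a blocked category."""
    return [
        label["Name"]
        for label in labels
        if label["Confidence"] >= threshold
        and (label["Name"] in blocked or label["ParentName"] in blocked)
    ]


# Sample labels shaped like the response shown earlier (no AWS call needed).
sample = [
    {"Confidence": 83.55088806152344, "Name": "Suggestive", "ParentName": ""},
    {"Confidence": 83.55088806152344, "Name": "Female Swimwear Or Underwear",
     "ParentName": "Suggestive"},
]
print(flagged_names(sample))
```

Filtering on both the label name and its parent is one way to use the sub-categories: blocking a top-level category also catches everything beneath it, while the threshold lets you decide how cautious to be.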



Raspberry Turk: a chess-playing robot

Post Syndicated from Lorna Lynch original https://www.raspberrypi.org/blog/raspberry-turk-chess-playing-robot/

Computers and chess have been a potent combination ever since the appearance of the first chess-playing computers in the 1970s. You might even be able to play a game of chess on the device you are using to read this blog post! For digital makers, though, adding a Raspberry Pi into the mix can be the first step to building something a little more exciting. Allow us to introduce you to Joey Meyer‘s chess-playing robot, the Raspberry Turk.

The Raspberry Turk chess-playing robot

Image credit: Joey Meyer

Being both an experienced software engineer with an interest in machine learning, and a skilled chess player, it’s not surprising that Joey was interested in tinkering with chess programs. What is really stunning, though, is the scale and complexity of the build he came up with. Fascinated by a famous historical hoax, Joey used his skills in programming and robotics to build an open-source Raspberry Pi-powered recreation of the celebrated Mechanical Turk automaton.

You can see the Raspberry Turk in action on Joey’s YouTube channel:

Chess Playing Robot Powered by Raspberry Pi – Raspberry Turk

The Raspberry Turk is a robot that can play chess: it’s entirely open source, based on Raspberry Pi, and inspired by the 18th century chess playing machine, the Mechanical Turk. Website: http://www.raspberryturk.com Source Code: https://github.com/joeymeyer/raspberryturk

A historical hoax

Joey explains that he first encountered the Mechanical Turk through a book by Tom Standage. A famous example of mechanical trickery, the original Turk was advertised as a chess-playing automaton, capable of defeating human opponents and solving complex puzzles.

Image of the Mechanical Turk Automaton

A modern reconstruction of the Mechanical Turk 
Image from Wikimedia Commons

Its inner workings a secret, the Turk toured Europe for the best part of a century, confounding everyone who encountered it. Unfortunately, it turned out not to be a fabulous example of early robotic engineering after all. Instead, it was just an elaborate illusion. The awesome chess moves were not being worked out by the clockwork brain of the automaton, but rather by a human chess master who was cunningly concealed inside the casing.

Building a modern Turk

A modern version of the Mechanical Turk was constructed in the 1980s. However, the build cost $120,000. At that price, it would have been impossible for most makers to create their own version. Impossible, that is, until now: Joey uses a Raspberry Pi 3 to drive the Raspberry Turk, while a Raspberry Pi Camera Module handles computer vision.

Image of chess board and Raspberry Turk robot

The Raspberry Turk in the middle of a game 
Image credit: Joey Meyer

Joey’s Raspberry Turk is built into a neat wooden table. All of the electronics are housed in a box on one side. The chessboard is painted directly onto the table’s surface. In order for the robot to play, a Camera Module located in a 3D-printed housing above the table takes an image of the chessboard. The image is then analysed to determine which pieces are in which positions at that point. By tracking changes in the positions of the pieces, the Raspberry Turk can determine which moves have been made, and which piece should move next. To train the system, Joey had to build a large dataset to validate a computer vision model. This involved painstakingly moving pieces by hand and collecting multiple images of each possible position.
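
To make the idea of tracking changes between board images concrete, here is a minimal sketch that compares two board snapshots and infers the move. It is not Joey's code: the snapshot format (a dictionary from square name to piece colour) and the function name are assumptions for illustration, and only simple moves and captures are handled.

```python
def infer_move(before, after):
    """Infer (from_square, to_square) from two board snapshots.

    Each snapshot maps occupied squares such as "e2" to a piece colour
    ("w" or "b"). Castling, en passant and promotion are out of scope
    for this sketch and raise ValueError.
    """
    # Squares that were occupied and are now empty.
    vacated = [sq for sq in before if sq not in after]
    # Squares that are newly occupied or changed colour (a capture).
    arrived = [sq for sq in after if before.get(sq) != after[sq]]
    if len(vacated) != 1 or len(arrived) != 1:
        raise ValueError("snapshots do not differ by exactly one simple move")
    return vacated[0], arrived[0]


before = {"e2": "w", "d7": "b"}
after = {"e4": "w", "d7": "b"}
print(infer_move(before, after))
```

A capture shows up as one vacated square plus one square whose colour has flipped, which is why the second list compares colours rather than just occupancy.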

Look, no hands!

A key feature of the Mechanical Turk was that the automaton appeared to move the chess pieces entirely by itself. Of course, its movements were actually being controlled by a person hidden inside the machine. The Raspberry Turk, by contrast, does move the chess pieces itself. To achieve this, Joey used a robotic arm attached to the table. The arm is made primarily out of Actobotics components. Joey explains:

The motion is controlled by the rotation of two servos which are attached to gears at the base of each link of the arm. At the end of the arm is another servo which moves a beam up and down. At the bottom of the beam is an electromagnet that can be dynamically activated to lift the chess pieces.

Joey individually fitted the chess pieces with tiny sections of metal dowel so that the magnet on the arm could pick them up.
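
For readers curious how two servo rotations can place the electromagnet over a given square, the sketch below solves the standard inverse kinematics of a planar two-link arm. Treating the arm as an ideal planar two-link mechanism, the link lengths, and the function names are all assumptions; the project's own code may work quite differently.

```python
import math


def two_link_angles(x, y, l1, l2):
    """Return (shoulder, elbow) angles in radians that place the arm tip at (x, y)."""
    d2 = x * x + y * y
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(cos_elbow) > 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def arm_tip(shoulder, elbow, l1, l2):
    """Forward kinematics: where the tip ends up for the given joint angles."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


# Hypothetical 15 cm links reaching for a square at (20, 10) cm.
shoulder, elbow = two_link_angles(20.0, 10.0, 15.0, 15.0)
print(arm_tip(shoulder, elbow, 15.0, 15.0))
```

Running the forward calculation on the solved angles is a cheap sanity check that the tip really lands on the requested point.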

Programming the Raspberry Turk

The Raspberry Turk is controlled by a daemon process that runs a perception/action sequence, and the status updates automatically as the pieces are moved. The code is written almost entirely in Python. It is all available on Joey’s GitHub repo for the project, together with his notebooks on the project.

Image of Raspberry Turk chessboard with Python script alongside

Image credit: Joey Meyer

The AI backend that gives the robot its chess-playing ability is currently Stockfish, a strong open-source chess engine. Joey says he would like to build his own engine when he has time. For the moment, though, he’s confident that this AI will prove a worthy opponent.

The project website goes into much more detail than we are able to give here. We’d definitely recommend checking it out. If you have been experimenting with any robotics or computer vision projects like this, please do let us know in the comments!

The post Raspberry Turk: a chess-playing robot appeared first on Raspberry Pi.


Robocod

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/robocod/

Fishbowl existence is tough. There you are, bobbing up and down in the same dull old environment, day in, day out; your view unchanging, your breakfast boringly identical every morning; that clam thing in the bottom of the tank opening and closing monotonously – goldfish can live for up to 20 years. That’s a hell of a long time to watch a clam thing for.

fishbowl on wheels

Two fish are in a tank. One says “How do you drive this thing?”

Indeed, fishbowl existence is so tough that several countries have banned the boring round bowls altogether. (There’s a reason that your childhood goldfish didn’t live for 20 years. You put it in an environment that bored it to death.) So this build comes with a caveat – we are worried that this particular fish is being driven from understimulus to overstimulus and back again, and that she might be prevented from making it to the full 20 years as a result. Please be kind to your fish.

What’s going on here? Over in Pittsburgh, at Carnegie Mellon University, Alex Kent and friends have widened the goldfish’s horizons, by giving it wheels. Meet the free-range fish.

Just Keep Swimming

Build18 @CMU

Alex K, negligent fishparent, says that the speed and direction of the build is determined by the position of the fish relative to the centre of the tank. The battery lasts for five hours, and by all accounts the fish is still alive. Things are a bit jerky in this prototype build. Alex explains:

The jerking is actually caused by the Computer Vision algorithm losing track of the fish because of the reflection off of the lid, condensation on the lid, water ripples, etc.
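
A minimal sketch of the control idea described above, mapping the fish's position relative to the tank centre to a speed and heading, might look like this. The function name, the dead zone, and the normalisation by tank radius are assumptions for illustration, not details taken from the Build18 code.

```python
import math


def drive_command(fish_x, fish_y, center_x, center_y, tank_radius, dead_zone=0.1):
    """Map the fish's offset from the tank centre to (speed, heading_degrees).

    speed is a fraction in [0, 1]; heading is measured counter-clockwise
    from the +x axis of whatever coordinate frame the inputs use.
    """
    dx, dy = fish_x - center_x, fish_y - center_y
    distance = math.hypot(dx, dy) / tank_radius
    if distance < dead_zone:
        return 0.0, 0.0  # fish near the centre: stay put
    speed = min(distance, 1.0)
    heading = math.degrees(math.atan2(dy, dx)) % 360.0
    return speed, heading


# Fish halfway between the centre and the right-hand wall of the tank.
print(drive_command(150, 100, 100, 100, 100))
```

A dead zone around the centre keeps small tracking errors from turning into motion, which is relevant given how easily the tracker can lose the fish.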

Alex and co: before you look at more expensive solutions, try fixing a polarising filter to the camera you’re using.

All the code you’ll need to torture your own fish is available at GitHub.

Of course, Far Side fans will observe that there is nothing new under the sun.

Fishbowl on wheels by Gary Larson

Image from Gary Larson, The Far Side.

If you’ve got any good fish puns, let minnow in the comments.


The post Robocod appeared first on Raspberry Pi.

Capturing Pattern-Lock Authentication

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/01/capturing_patte.html

Interesting research — “Cracking Android Pattern Lock in Five Attempts“:

Abstract: Pattern lock is widely used as a mechanism for authentication and authorization on Android devices. In this paper, we demonstrate a novel video-based attack to reconstruct Android lock patterns from video footage filmed using a mobile phone camera. Unlike prior attacks on pattern lock, our approach does not require the video to capture any content displayed on the screen. Instead, we employ a computer vision algorithm to track the fingertip movements to infer the pattern. Using the geometry information extracted from the tracked fingertip motions, our approach is able to accurately identify a small number of (often one) candidate patterns to be tested by an adversary. We thoroughly evaluated our approach using 120 unique patterns collected from 215 independent users, by applying it to reconstruct patterns from video footage filmed using smartphone cameras. Experimental results show that our approach can break over 95% of the patterns in five attempts before the device is automatically locked by the Android system. We discovered that, in contrast to many people’s belief, complex patterns do not offer stronger protection under our attacking scenarios. This is demonstrated by the fact that we are able to break all but one complex patterns (with a 97.5% success rate) as opposed to 60% of the simple patterns in the first attempt. Since our threat model is common in day-to-day lives, our work calls for the community to revisit the risks of using Android pattern lock to protect sensitive information.

News article.

Amazon Rekognition – Image Detection and Recognition Powered by Deep Learning

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rekognition-image-detection-and-recognition-powered-by-deep-learning/

What do you see when you look at this picture?

You might simply see an animal. Maybe you see a pet, a dog, or a Golden Retriever. The association between the image and these labels is not hard-wired in to your brain. Instead, you learned the labels after seeing hundreds or thousands of examples. Operating on a number of different levels, you learned to distinguish an animal from a plant, a dog from a cat, and a Golden Retriever from other dog breeds.

Deep Learning for Image Detection
Giving computers the same level of comprehension has proven to be a very difficult task. Over the course of decades, computer scientists have taken many different approaches to the problem. Today, a broad consensus has emerged that the best way to tackle this problem is via deep learning. Deep learning uses a combination of feature abstraction and neural networks to produce results that can be (as Arthur C. Clarke once said) indistinguishable from magic. However, it comes at a considerable cost. First, you need to put a lot of work into the training phase. In essence, you present the learning network with a broad spectrum of labeled examples (“this is a dog”, “this is a pet”, and so forth) so that it can correlate features in the image with the labels. This phase is computationally expensive due to the size and the multi-layered nature of the neural networks. After the training phase is complete, evaluating new images against the trained network is far easier. The results are traditionally expressed in confidence levels (0 to 100%) rather than as cold, hard facts. This allows you to decide just how much precision is appropriate for your applications.

Introducing Amazon Rekognition
Today I would like to tell you about Amazon Rekognition. Powered by deep learning and built by our Computer Vision team over the course of many years, this fully-managed service already analyzes billions of images daily. It has been trained on thousands of objects and scenes, and is now available for you to use in your own applications. You can use the Rekognition Demos to put the service through its paces before you dive in and start writing code that uses the Rekognition API.

Rekognition was designed from the get-go to run at scale. It comprehends scenes, objects, and faces. Given an image, it will return a list of labels. Given an image with one or more faces, it will return bounding boxes for each face, along with attributes. Let’s see what it has to say about the picture of my dog (her name is Luna, by the way):

As you can see, Rekognition labeled Luna as an animal, a dog, a pet, and as a golden retriever with a high degree of confidence. It is important to note that these labels are independent, in the sense that the deep learning model does not explicitly understand the relationship between, for example, dogs and animals. It just so happens that both of these labels were simultaneously present on the dog-centric training material presented to Rekognition.

Let’s see how it does with a picture of my wife and me:

Amazon Rekognition found our faces, set up bounding boxes, and let me know that my wife was happy (the picture was taken on her birthday, so I certainly hope she was).

You can also use Rekognition to compare faces and to see if a given image contains any one of a number of faces that you have asked it to recognize.

All of this power is accessible from a set of API functions (the console is great for quick demos). For example, you can call DetectLabels to programmatically reproduce my first example, or DetectFaces to reproduce my second one. You can make multiple calls to IndexFaces to prepare Rekognition to recognize some faces. Each time you do this, Rekognition extracts some features (known as face vectors) from the image, stores the vectors, and discards the image. You can create one or more Rekognition collections and store related groups of face vectors in each one.
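
As an illustration of these API functions, the sketch below uses the boto3 `detect_labels` and `index_faces` client methods. The bucket, key, and collection names are left as parameters, and the helper names and the 90% threshold are assumptions; treat this as a starting point rather than a reference implementation.

```python
def confident_labels(response, threshold=90.0):
    """Pick label names at or above a confidence threshold from a DetectLabels response."""
    return [
        label["Name"]
        for label in response["Labels"]
        if label["Confidence"] >= threshold
    ]


def label_and_index(bucket, key, collection_id):
    """Label an S3 image, then add its faces to a collection (makes real AWS calls)."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("rekognition")
    image = {"S3Object": {"Bucket": bucket, "Name": key}}
    labels = client.detect_labels(Image=image, MaxLabels=10, MinConfidence=50.0)
    # The collection must already exist, e.g. via client.create_collection(CollectionId=...).
    indexed = client.index_faces(CollectionId=collection_id, Image=image)
    return confident_labels(labels), len(indexed["FaceRecords"])


# Offline example with a made-up response in the DetectLabels shape.
fake = {"Labels": [{"Name": "Dog", "Confidence": 98.1},
                   {"Name": "Pet", "Confidence": 98.1},
                   {"Name": "Couch", "Confidence": 61.0}]}
print(confident_labels(fake))
```

Because results come back as confidence levels rather than hard facts, the threshold is the knob that trades recall for precision in your own application.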

Rekognition can directly process images stored in Amazon Simple Storage Service (S3). In fact, you can use AWS Lambda functions to process newly uploaded photos at any desired scale. You can use AWS Identity and Access Management (IAM) to control access to the Rekognition APIs.

Applications for Rekognition
So, what can you use this for? I’ve got plenty of ideas to get you started!

If you have a large collection of photos, you can tag and index them using Amazon Rekognition. Because Rekognition is a service, you can process millions of photos per day without having to worry about setting up, running, or scaling any infrastructure. You can implement visual search, tag-based browsing, and all sorts of interactive discovery models.

You can use Rekognition in several different authentication and security contexts. You can compare a face on a webcam to a badge photo before allowing an employee to enter a secure zone. You can perform visual surveillance, inspecting photos for objects or people of interest or concern.

You can build “smart” marketing billboards that collect demographic data about viewers.

Now Available
Rekognition is now available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions and you can start using it today. As part of the AWS Free Tier, you can analyze up to 5,000 images per month and store up to 1,000 face vectors each month for an entire year. After that (and at higher volume), you will pay tiered pricing based on the number of images that you analyze and the number of face vectors that you store.



Bringing the Viewer In: The Video Opportunity in Virtual Reality

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/151940036881

By Satender Saroha, Video Engineering

Virtual reality (VR) 360° videos are the next frontier of how we engage with and consume content. Unlike a traditional scenario in which a person views a screen in front of them, VR places the user inside an immersive experience. A viewer is “in” the story, and not on the sidelines as an observer.

Ivan Sutherland, widely regarded as the father of computer graphics, laid out the vision for virtual reality in his famous speech, “Ultimate Display”, in 1965 [1]. In it, he said, “You shouldn’t think of a computer screen as a way to display information, but rather as a window into a virtual world that could eventually look real, sound real, move real, interact real, and feel real.”

Over the years, significant advancements have been made to bring reality closer to that vision. With the advent of headgear capable of rendering 3D spatial audio and video, realistic sound and visuals can be virtually reproduced, delivering immersive experiences to consumers.

When it comes to entertainment and sports, streaming in VR has become the new 4K HEVC/UHD of 2016. This has been accelerated by the release of new camera capture hardware like GoPro and streaming capabilities such as 360° video streaming from Facebook and YouTube. Yahoo streams lots of engaging sports, finance, news, and entertainment video content to tens of millions of users. Producing and streaming such content in 360° VR gives Yahoo a unique opportunity to offer new types of engagement and to bring users a sense of depth and visceral presence.

While this is not an experience that is live in product, it is an area we are actively exploring. In this blog post, we take a look at what’s involved in building an end-to-end VR streaming workflow for both Live and Video on Demand (VOD). Our experiments and research go from camera rig setup, to video stitching, to encoding, to the eventual rendering of videos on video players on desktop and VR headsets. We also discuss challenges yet to be solved and the opportunities they present in streaming VR.

1. The Workflow

Yahoo’s video platform has a workflow that is used internally to enable streaming to an audience of tens of millions with the click of a few buttons. During experimentation, we enhanced this same proven platform and set of APIs to build a complete 360°/VR experience. The diagram below shows the end-to-end workflow for streaming 360°/VR that we built on Yahoo’s video platform.

Figure 1: VR Streaming Workflow at Yahoo

1.1. Capturing 360° video

In order to capture a virtual reality video, you need access to a 360°-capable video camera. Such a camera uses either fish-eye lenses or an array of wide-angle lenses to collectively cover a 360° (θ) by 180° (ϕ) sphere as shown below.

Though it sounds simple, there is a real challenge in capturing a scene in 3D 360° as most of the 360° video cameras offer only 2D 360° video capture.

In initial experiments, we tried capturing 3D video using two cameras side-by-side, for left and right eyes, and arranging them in a spherical shape. However, this required too many cameras – instead we use view interpolation in the stitching step to create virtual cameras.

Another important consideration with 360° video is the number of axes along which the camera captures video. In traditional 360° video that is captured using only a single axis (what we refer to as horizontal video), a user can turn their head from left to right. But this camera setup does not support a user tilting their head by 90°.

To achieve true 3D in our setup, we went with 6-12 GoPro cameras having 120° field of view (FOV) arranged in a ring, and an additional camera each on top and bottom, with each one outputting 2.7K at 30 FPS.

1.2. Stitching 360° video

Projection Layouts

Because a 360° view is a spherical video, the surface of this sphere needs to be projected onto a planar surface in 2D so that video encoders can process it. There are two popular layouts:

Equirectangular layout: This is the most widely-used format in computer graphics to represent spherical surfaces in a rectangular form with an aspect ratio of 2:1. This format has redundant information at the poles which means some pixels are over-represented, introducing distortions at the poles compared to the equator (as can be seen in the equirectangular mapping of the sphere below).

Figure 2: Equirectangular Layout [2]

CubeMap layout: CubeMap layout is a format that has also been used in computer graphics. It contains six individual 2D textures that map to six sides of a cube. The figure below is a typical cubemap representation. In a cubemap layout, the sphere is projected onto six faces and the images are folded out into a 2D image, so pieces of a video frame map to different parts of a cube, which leads to extremely efficient compact packing. Cubemap layouts require about 25% fewer pixels compared to equirectangular layouts.

Figure 3: CubeMap Layout [3]
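
To make the two layouts a little more concrete, here is a small sketch of the mappings involved: a viewing direction to an equirectangular pixel, a direction vector to a cube face, and the pixel saving of a cubemap. The function names are illustrative, and the saving calculation assumes each cube face has an edge of one quarter of the equirectangular width (one face per 90° of longitude), which is an assumption rather than something stated in the post.

```python
def equirect_pixel(lon_deg, lat_deg, width, height):
    """Map longitude [-180, 180) and latitude [-90, 90] to (column, row)."""
    u = (lon_deg + 180.0) / 360.0 * width
    v = (90.0 - lat_deg) / 180.0 * height
    return min(int(u), width - 1), min(int(v), height - 1)


def cube_face(x, y, z):
    """Return the cube face (+x, -x, +y, -y, +z, -z) hit by direction (x, y, z)."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return "+x" if x > 0 else "-x"
    if ay >= az:
        return "+y" if y > 0 else "-y"
    return "+z" if z > 0 else "-z"


def cubemap_saving(width):
    """Fraction of pixels saved by six (width/4)-sized faces versus a 2:1 frame."""
    equirect = width * (width // 2)
    cubemap = 6 * (width // 4) ** 2
    return 1.0 - cubemap / equirect


print(equirect_pixel(0.0, 0.0, 4096, 2048))
print(cube_face(0.0, 0.0, -1.0))
print(cubemap_saving(4096))
```

Under that face-size assumption the saving works out to one quarter, which lines up with the roughly 25% figure quoted above.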

Stitching Videos

In our setup, we experimented with two stitching software packages. One was from Vahana VR [4], and the other was a modified version of the open-source Surround360 technology that works with a GoPro rig [5]. Both packages output equirectangular panoramas for the left and the right eye. Here are the steps involved in stitching together a 360° image:

Raw frame image processing: Converts uncompressed raw video data to RGB, which involves several steps, from black-level adjustment to applying demosaicing algorithms that work out the RGB color components for each pixel based on the surrounding pixels. This also involves gamma correction, color correction, and anti-vignetting (undoing the reduction in brightness on the image periphery). Finally, this stage applies sharpening and noise-reduction algorithms to enhance the image and suppress the noise.

Calibration: During the calibration step, stitching software takes steps to avoid vertical parallax while stitching overlapping portions in adjacent cameras in the rig. The purpose is to align everything in the scene, so that both eyes see every point at the same vertical coordinate. This step essentially matches the key points in images among adjacent camera pairs. It uses computer vision algorithms for feature detection like Binary Robust Invariant Scalable Keypoints (BRISK) [6] and AKAZE [7].

Optical Flow: During stitching, to cover the gaps between adjacent real cameras and provide an interpolated view, optical flow is used to create virtual cameras. The optical flow algorithm finds the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or camera. The stitching software uses OpenCV algorithms to find the optical flow [8].
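
As a small illustration of the raw frame processing step listed above, the sketch below applies black-level subtraction, an anti-vignetting gain, and gamma encoding to a single sensor value. The black and white levels, the quadratic vignetting model, and the gamma of 2.2 are assumptions chosen for the example; real stitching software works on whole frames and uses calibrated values.

```python
def correct_pixel(raw, black_level=64, white_level=1023, gamma=2.2,
                  radius=0.0, vignette_k=0.3):
    """Apply black-level subtraction, anti-vignetting gain and gamma encoding.

    raw is a sensor value and radius is the pixel's distance from the image
    centre, normalised to [0, 1]. Returns an 8-bit value.
    """
    # Black-level adjustment: remove the sensor's offset and normalise to [0, 1].
    linear = max(raw - black_level, 0) / (white_level - black_level)
    # Anti-vignetting: brighten pixels towards the periphery.
    linear *= 1.0 + vignette_k * radius * radius
    linear = min(linear, 1.0)
    # Gamma correction: encode linear light for display.
    return round(255 * linear ** (1.0 / gamma))


print(correct_pixel(500, radius=0.0), correct_pixel(500, radius=1.0))
```

The same raw value comes out brighter at the edge of the frame than at the centre, which is the effect anti-vignetting is meant to have.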

Below are the frames produced by the GoPro camera rig:

Figure 4: Individual frames from 12-camera rig

Figure 5: Stitched frame output with PtGui

Figure 6: Stitched frame with barrel distortion using Surround360

Figure 7: Stitched frame after removing barrel distortion using Surround360

To get the full depth in stereo, the rig is set up so that i = r * sin(FOV/2 – 360/n), where:

  • i = IPD/2, where IPD is the inter-pupillary distance between the eyes.
  • r = Radius of the rig.
  • FOV = Field of view of GoPro cameras, 120 degrees.
  • n = Number of cameras which is 12 in our setup.

Given that IPD is normally 6.4 cm, i should be greater than 3.2 cm. This implies that with a 12-camera setup, the radius of the rig comes to 14 cm. Usually, if there are more cameras it is easier to avoid black stripes.
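
Plugging the stated values into the formula is a quick way to check the geometry. The sketch below evaluates i for a given radius and, conversely, the smallest radius at which i reaches IPD/2. The function names are illustrative, and the formula is taken exactly as written above, with angles in degrees.

```python
import math


def half_ipd(radius_cm, fov_deg, n_cameras):
    """i = r * sin(FOV/2 - 360/n), with angles in degrees."""
    return radius_cm * math.sin(math.radians(fov_deg / 2 - 360 / n_cameras))


def min_radius(ipd_cm, fov_deg, n_cameras):
    """Smallest radius for which i reaches IPD/2 under the same formula."""
    return (ipd_cm / 2) / math.sin(math.radians(fov_deg / 2 - 360 / n_cameras))


print(round(half_ipd(14, 120, 12), 2))
print(round(min_radius(6.4, 120, 12), 2))
```

With 12 cameras and a 120° FOV the angle is 30°, so i is half the radius: a 14 cm rig gives i of about 7 cm, comfortably above the 3.2 cm target, while the bound itself would already be met at a radius of about 6.4 cm.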

Reducing Bandwidth – FOV-based adaptive transcoding

For a truly immersive experience, users expect 4K (3840 x 2160) quality resolution at 60 frames per second (FPS) or higher. Given that typical HMDs have a FOV of 120 degrees, a full 360° video needs a resolution of at least 12K (11520 x 6480). 4K streaming needs a bandwidth of 25 Mbps [9]. So for 12K resolution, this effectively translates to > 75 Mbps, and even more for higher frame rates. However, the average Wi-Fi connection in the US has a bandwidth of 15 Mbps [10].
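
The arithmetic behind these figures can be reproduced in a few lines. The sketch below scales the 4K viewport up to the full sphere the way the paragraph does and then shows two rough bandwidth estimates. How bitrate grows with resolution depends heavily on the codec and content, so both scaling rules here are assumptions, not measurements.

```python
def full_sphere_resolution(view_w, view_h, hmd_fov_deg=120):
    """Scale a per-viewport resolution up to 360 degrees, as the text above does."""
    scale = 360 // hmd_fov_deg
    return view_w * scale, view_h * scale


BASE_W, BASE_H, BASE_MBPS = 3840, 2160, 25

w, h = full_sphere_resolution(BASE_W, BASE_H)
linear_scale = w // BASE_W                    # 3x in each dimension
pixel_scale = (w * h) // (BASE_W * BASE_H)    # 9x in pixel count

print((w, h))
print(BASE_MBPS * linear_scale, BASE_MBPS * pixel_scale)
```

The 75 Mbps figure corresponds to scaling the 4K rate by the 3x linear factor; if bitrate instead tracked pixel count, the estimate would be considerably higher, which is consistent with the "greater than" in the text.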

One way to address the bandwidth issue is by reducing the resolution of areas that are out of the field of view. Spatial sub-sampling is used during transcoding to produce multiple viewport-specific streams. Each viewport-specific stream has high resolution in a given viewport and low resolution in the rest of the sphere.

On the player side, we can modify traditional adaptive streaming logic to take the field of view into account. Depending on the video, if the user moves their head around a lot, the player may need multiple buffer fetches, which could result in rebuffering. Ideally, this will work best in videos where the excessive motion happens in one field of view at a time and does not span multiple fields of view at the same time. This work is still in an experimental stage.
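
One simple way to picture the player-side logic is a function that maps the viewer's head yaw to the viewport-specific stream centred closest to it. The sketch below assumes evenly spaced viewports around the horizon and uses hypothetical function names; it illustrates the idea and is not a description of the Yahoo player.

```python
def pick_viewport(yaw_deg, n_viewports=6):
    """Return the index of the viewport stream whose centre is closest to the yaw.

    Viewport k is assumed to be centred at k * 360 / n_viewports degrees.
    """
    span = 360.0 / n_viewports
    return int(((yaw_deg % 360.0) + span / 2) // span) % n_viewports


def needs_switch(current, yaw_deg, n_viewports=6):
    """True when the head has turned far enough that another stream fits better."""
    return pick_viewport(yaw_deg, n_viewports) != current


for yaw in (0, 29, 31, 185, 350, -10):
    print(yaw, pick_viewport(yaw))
```

Every switch implies fetching a different stream, so frequent head movement translates directly into extra buffer fetches, which is the rebuffering risk described above.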

The default output format of both stitching packages, Surround360 and Vahana VR, is equirectangular. In order to reduce the size further, we pass it through a cubemap filter transform integrated into ffmpeg to get an additional pixel reduction of ~25% [11] [12].

At the end of the above steps, the stitching pipeline produces high-resolution stereo 3D panoramas, which are then ingested into the existing Yahoo Video transcoding pipeline to produce multi-bitrate HLS streams.

1.3. Adding a stitching step to the encoding pipeline

Live – In order to prepare for multi-bitrate streaming over the Internet, a live 360° video-stitched stream in RTMP is ingested into Yahoo’s video platform. A live Elemental encoder was used to re-encode and package the live input into multiple bitrates for adaptive streaming on any device (iOS, Android, Browser, Windows, Mac, etc.).

Video on Demand – The existing Yahoo video transcoding pipeline was used to package multi-bitrate HLS streams from raw equirectangular mp4 source videos.

1.4. Rendering 360° video into the player

The spherical video stream is delivered to the Yahoo player in multiple bitrates. As a user changes their viewing angle, different portions of the frame are shown, presenting a 360° immersive experience. There are two types of VR players currently supported at Yahoo:

WebVR-based JavaScript player – The Web community has been very active in enabling VR experiences natively, without plugins, from within browsers. The W3C has a JavaScript API proposal [13], which describes support for accessing virtual reality (VR) devices, including sensors and head-mounted displays, on the Web. VRDisplay is the main starting point for all the device APIs supported. Some of the key interfaces and attributes exposed are:

  • VRDisplayCapabilities: Has attributes that indicate position support, orientation support, and the presence of an external display.
  • VRLayer: Contains the HTML5 canvas element that VRDisplay presents when its submitFrame method is called. It also contains attributes defining the left and right bounds of the textures within the source canvas to present to each eye.
  • VREyeParameters: Has the information required to correctly render a scene for a given eye. For each eye, it has an offset, the distance from the midpoint between the user’s eyes to the center of that eye, which is half of the interpupillary distance (IPD). In addition, it holds the current FOV of the eye and the recommended renderWidth and renderHeight of each eye viewport.
  • getVRDisplays: Returns a list of the VRDisplay HMDs accessible to the browser.

We implemented a subset of the WebVR spec in the Yahoo player (not in production yet) that lets you watch monoscopic and stereoscopic 3D video on supported web browsers (Chrome, Firefox, Samsung), including Oculus Gear VR-enabled phones. The Yahoo player takes the equirectangular video and maps its individual frames onto the HTML5 canvas element. It uses the WebGL and three.js libraries to do the computations for detecting the orientation and extracting the corresponding frames to display.

For web devices that support only monoscopic rendering, like desktop browsers without an HMD, it creates a single PerspectiveCamera object specifying the FOV and aspect ratio. Each time the requestAnimationFrame callback fires, it renders a new frame. As part of rendering the frame, it first calculates the projection matrix for the FOV and sets the X (user’s right), Y (up), and Z (behind the user) coordinates of the camera position.

For devices that support stereoscopic rendering, like Samsung phones in a Gear VR headset, the WebVR player creates two PerspectiveCamera objects, one for the left eye and one for the right eye. Each PerspectiveCamera queries the VR device capabilities to get eye parameters like FOV, renderWidth, and renderHeight every time a frame needs to be rendered at the native refresh rate of the HMD. The key difference between stereoscopic and monoscopic rendering is the perceived sense of depth that the user experiences, as video frames separated by an offset are rendered by separate canvas elements to each individual eye.
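
The projection matrix and the per-eye camera offsets mentioned above can be sketched independently of any rendering library. The example below builds a conventional OpenGL-style perspective matrix from a vertical FOV and aspect ratio, and places two cameras half an IPD either side of the midpoint. The near and far planes, the 64 mm IPD, and the function names are assumptions for illustration; the player itself relies on three.js for this.

```python
import math


def perspective(fov_deg, aspect, near=0.1, far=1000.0):
    """OpenGL-style projection matrix (row-major) for a vertical FOV in degrees."""
    f = 1.0 / math.tan(math.radians(fov_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def eye_positions(ipd_m=0.064):
    """Left and right camera positions: each eye sits half the IPD from the midpoint.

    Axes follow the convention above: X to the user's right, Y up, Z behind the user.
    """
    half = ipd_m / 2.0
    return (-half, 0.0, 0.0), (half, 0.0, 0.0)


matrix = perspective(90.0, 2.0)
left, right = eye_positions()
print(matrix[0][0], matrix[1][1])
print(left, right)
```

In the monoscopic case a single matrix and a single camera at the origin are enough; the stereoscopic case renders the scene twice, once from each eye position.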

Cardboard VR – Google provides a VR SDK for both iOS and Android [14]. This simplifies common VR tasks like lens distortion correction, spatial audio, head tracking, and stereoscopic side-by-side rendering. For iOS, we integrated Cardboard VR functionality into our Yahoo Video SDK, so that users can watch stereoscopic 3D videos on iOS using Google Cardboard.

2. Results

With all the pieces in place, and experimentation done, we were able to successfully live stream an internal company-wide event in 360°.

Figure 8: 360° Live streaming of Yahoo internal event

In addition to demonstrating our live streaming capabilities, we are also experimenting with showing 360° VOD videos produced with a GoPro-based camera rig. Here is a screenshot of one of the 360° videos being played in the Yahoo player.

Figure 9: Yahoo Studios produced 360° VOD content in the Yahoo Player

3. Challenges and Opportunities

3.1. Enormous amounts of data

As we alluded to in the video processing section of this post, delivering 4K resolution videos for each eye for each FOV at a high frame-rate remains a challenge. While FOV-adaptive streaming does reduce the size by providing high resolution streams separately for each FOV, providing an impeccable 60 FPS or more viewing experience still requires a lot more data than the current internet pipes can handle. Some of the other possible options which we are closely paying attention to are:

Compression efficiency with HEVC and VP9 – New codecs like HEVC and VP9 have the potential to provide significant compression gains. HEVC open source codecs like x265 have shown a 40% compression performance gain compared to the currently ubiquitous H.264/AVC codec. Likewise, Google’s VP9 codec has shown similar 40% compression performance gains. The key challenges are hardware decoding support and browser support. But with Apple and Microsoft very much behind HEVC, and Firefox and Chrome already supporting VP9, we believe most browsers would support HEVC or VP9 within a year.

Using 10 bit color depth vs 8 bit color depth – Traditional monitors support 8 bpc (bits per channel) for displaying images. Given each pixel has 3 channels (RGB), 8 bpc maps to 256x256x256 color/luminosity combinations to represent 16 million colors. With 10 bit color depth, you have the potential to represent even more colors. But the biggest stated advantage of using 10 bit color depth is with respect to compression during encoding even if the source only uses 8 bits per channel. Both x264 and x265 codecs support 10 bit color depth, with ffmpeg already supporting encoding at 10 bit color depth.
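
The colour counts quoted above follow directly from the bit depth, as this short calculation shows. The function name is illustrative.

```python
def color_count(bits_per_channel, channels=3):
    """Number of distinct colours representable with the given bits per channel."""
    return (2 ** bits_per_channel) ** channels


print(color_count(8))                       # 16777216, the "16 million colors"
print(color_count(10))                      # 1073741824
print(color_count(10) // color_count(8))    # 64
```

Two extra bits per channel give four times as many levels per channel, and therefore 64 times as many colour combinations across the three RGB channels.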

3.2. Six degrees of freedom

With current camera rig workflows, users viewing the streams through an HMD get three degrees of freedom (DoF), i.e., the ability to rotate the view up/down, clockwise/anti-clockwise, and swivel. But you still can’t get a different perspective when you move inside the scene, i.e., move forward/backward. Until now, this true six-DoF immersive VR experience has only been possible in CG VR games. In video streaming, the light field video cameras produced by Lytro are the first to capture light field volume data from all directions [15]. But light field videos require an order of magnitude more data than traditional fixed-FOV, fixed-IPD, fixed-lens camera rigs like GoPro. As bandwidth problems get resolved via better compression and better networks, achieving true immersion should be possible.

4. Conclusion

VR streaming is an emerging medium, and with the addition of 360° VR playback capability, Yahoo’s video platform gives us a great starting point to explore the opportunities in video with regard to virtual reality. As we continue to work to delight our users by showing immersive video content, we remain focused on optimizing the rendering of high-quality 4K content in our players. We’re looking at building FOV-based adaptive streaming capabilities and better compression during delivery. These capabilities, and the enhancement of our WebVR player to play on more HMDs like HTC Vive and Oculus Rift, will set us on track to offer streaming capabilities across the entire spectrum. At the same time, we are keeping a close watch on advancements in supporting spatial audio experiences, as well as advancements in the ability to stream volumetric light field videos to achieve true six degrees of freedom, with the aim of realizing the true potential of VR.

Glossary – VR concepts:

VR – Virtual reality, commonly referred to as VR, is an immersive computer-simulated reality experience that places viewers inside an experience. It “transports” viewers from their physical reality into a closed virtual reality. VR usually requires a headset device that takes care of sights and sounds, while the most-involved experiences can include external motion tracking, and sensory inputs like touch and smell. For example, when you put on VR headgear you suddenly start feeling immersed in the sounds and sights of another universe, like the deck of the Star Trek Enterprise. Though you remain physically at your place, VR technology is designed to manipulate your senses in a manner that makes you truly feel as if you are on that ship, moving through the virtual environment and interacting with the crew.

360 degree video – A 360° video is created with a camera system that simultaneously records all 360 degrees of a scene. It is a flat equirectangular video projection that is morphed into a sphere for playback on a VR headset. A standard world map is an example of equirectangular projection, which maps the surface of the world (sphere) onto orthogonal coordinates.
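To make the equirectangular mapping concrete, here is a minimal sketch that converts a viewing direction into a pixel position in an equirectangular frame. The function name and the angle conventions (yaw 0 at the frame center, pitch +90 at the top row) are our own assumptions for illustration; players and stitching tools may use different conventions.

```python
def direction_to_equirect_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to (x, y) in an equirectangular frame.

    Assumed conventions (illustrative only):
      yaw_deg   - longitude in [-180, 180], 0 is the horizontal center
      pitch_deg - latitude in [-90, 90], +90 is straight up (top row)
    """
    u = (yaw_deg + 180.0) / 360.0    # 0..1, left to right
    v = (90.0 - pitch_deg) / 180.0   # 0..1, top to bottom
    # Clamp so that yaw = 180 or pitch = -90 stays inside the frame.
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return x, y


# Looking straight ahead lands in the middle of a 3840x1920 frame.
print(direction_to_equirect_pixel(0, 0, 3840, 1920))
```

Longitude and latitude map linearly to the horizontal and vertical pixel axes, which is why the frame is twice as wide as it is tall.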

Spatial Audio – Spatial audio gives the creator the ability to place sound around the user. Unlike traditional mono/stereo/surround audio, it responds to head rotation in sync with video. While listening to spatial audio content, the user receives a real-time binaural rendering of an audio stream [17].

FOV – A human can naturally see 170 degrees of viewable area (field of view). Most consumer-grade head-mounted displays (HMDs) like Oculus Rift and HTC Vive now display 90 to 120 degrees.

Monoscopic video – In a monoscopic video, both eyes see a single flat image or video file. A common camera setup involves six cameras filming six different fields of view. Stitching software is used to form a single equirectangular video. Max output resolution for monoscopic (2D) videos on Gear VR is 3480×1920 at 30 frames per second.

Presence – Presence is a kind of immersion where the low-level systems of the brain are tricked to such an extent that they react just as they would to non-virtual stimuli.

Latency – The time between when you move your head and when you see the corresponding update on the screen. An acceptable latency is anywhere from 11 ms (for games) to 20 ms (for watching 360° VR videos).

Head Tracking – There are two forms:

  • Positional tracking – movements and related translations of your body, e.g., swaying side to side.
  • Traditional head tracking – left, right, up, down, and roll (like the rotation of a clock hand).


[1] Ultimate Display Speech as reminisced by Fred Brooks: http://www.roadtovr.com/fred-brooks-ivan-sutherlands-1965-ultimate-display-speech/

[2] Equirectangular Layout Image: https://www.flickr.com/photos/[email protected]/10111691364/

[3] CubeMap Layout: http://learnopengl.com/img/advanced/cubemaps_skybox.png

[4] Vahana VR: http://www.video-stitch.com/

[5] Surround360 Stitching software: https://github.com/facebook/Surround360

[6] Computer Vision Algorithm BRISK: https://www.robots.ox.ac.uk/~vgg/rg/papers/brisk.pdf

[7] Computer Vision Algorithm AKAZE: http://docs.opencv.org/3.0-beta/doc/tutorials/features2d/akaze_matching/akaze_matching.html

[8] Optical Flow: http://docs.opencv.org/trunk/d7/d8b/tutorial_py_lucas_kanade.html

[9] 4K connection speeds: https://help.netflix.com/en/node/306

[10] Average connection speeds in US: https://www.akamai.com/us/en/about/news/press/2016-press/akamai-releases-fourth-quarter-2015-state-of-the-internet-report.jsp

[11] CubeMap transform filter for ffmpeg: https://github.com/facebook/transform

[12] FFMPEG software: https://ffmpeg.org/

[13] WebVR Spec: https://w3c.github.io/webvr/

[14] Google Daydream SDK: https://vr.google.com/cardboard/developers/

[15] Lytro LightField Volume for six DoF: https://www.lytro.com/press/releases/lytro-immerge-the-worlds-first-professional-light-field-solution-for-cinematic-vr

[16] 10 bit color depth: https://gist.github.com/l4n9th4n9/4459997

Open Sourcing a Deep Learning Solution for Detecting NSFW Images

Post Syndicated from davglass original https://yahooeng.tumblr.com/post/151148689421

By Jay Mahadeokar and Gerry Pesavento

Automatically identifying that an image is not suitable/safe for work (NSFW), including offensive and adult images, is an important problem which researchers have been trying to tackle for decades. Since images and user-generated content dominate the Internet today, filtering NSFW images becomes an essential component of Web and mobile applications. With the evolution of computer vision, improved training data, and deep learning algorithms, computers are now able to automatically classify NSFW image content with greater precision.

Defining NSFW material is subjective and the task of identifying these images is non-trivial. Moreover, what may be objectionable in one context can be suitable in another. For this reason, the model we describe below focuses only on one type of NSFW content: pornographic images. The identification of NSFW sketches, cartoons, text, images of graphic violence, or other types of unsuitable content is not addressed with this model.

To the best of our knowledge, there is no open source model or algorithm for identifying NSFW images. In the spirit of collaboration and with the hope of advancing this endeavor, we are releasing our deep learning model that will allow developers to experiment with a classifier for NSFW detection, and provide feedback to us on ways to improve the classifier.

Our general purpose Caffe deep neural network model (GitHub code) takes an image as input and outputs a probability (i.e., a score between 0 and 1) which can be used to detect and filter NSFW images. Developers can use this score to filter images below a certain suitable threshold based on a ROC curve for specific use-cases, or use this signal to rank images in search results.

Convolutional Neural Network (CNN) architectures and tradeoffs

In recent years, CNNs have become very successful in image classification problems [1] [5] [6]. Since 2012, new CNN architectures have continuously improved the accuracy on the standard ImageNet classification challenge. Some of the major breakthroughs include AlexNet (2012) [6], GoogLeNet [5], VGG (2014) [2] and Residual Networks (2015) [1]. These networks have different tradeoffs in terms of runtime, memory requirements, and accuracy. The main indicators for runtime and memory requirements are:

  1. Flops or connections – The number of connections in a neural network determines the number of compute operations during a forward pass, which is proportional to the runtime of the network while classifying an image.
  2. Parameters – The number of parameters in a neural network determines the amount of memory needed to load the network.

Ideally we want a network with minimum flops and minimum parameters, which would achieve maximum accuracy.
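As a rough illustration of these two indicators, the sketch below counts parameters and multiply-accumulate operations (a common proxy for flops) for a single convolutional or fully-connected layer. The helper names and the example layer shapes are hypothetical and chosen only for illustration; they are not taken from the released model.

```python
def conv_layer_cost(in_ch, out_ch, kernel, out_h, out_w):
    """Return (parameters, multiply-accumulates) for one square-kernel conv layer."""
    weights_per_filter = in_ch * kernel * kernel
    params = out_ch * (weights_per_filter + 1)           # +1 for each filter's bias
    macs = out_ch * weights_per_filter * out_h * out_w   # one MAC per weight per output pixel
    return params, macs


def fc_layer_cost(in_features, out_features):
    """Return (parameters, multiply-accumulates) for one fully-connected layer."""
    params = out_features * (in_features + 1)            # +1 for each output's bias
    macs = in_features * out_features
    return params, macs


# Hypothetical first layer: 3 input channels, 96 filters of 11x11, 55x55 output map.
params, macs = conv_layer_cost(3, 96, 11, 55, 55)
print(f"conv: {params:,} params, {macs:,} MACs, ~{params * 4 / 1e6:.2f} MB as float32")

# A 2-node classifier on top of a 2048-dimensional feature vector.
print("fc:", fc_layer_cost(2048, 2))
```

Note how the convolution is cheap in parameters but expensive in compute, because every weight is reused at each output position, while a fully-connected layer uses each weight exactly once.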

Training a deep neural network for NSFW classification

We train the models using a dataset of positive (i.e. NSFW) images and negative (i.e. SFW – suitable/safe for work) images. We are not releasing the training images or other details due to the nature of the data, but instead we open source the output model which can be used for classification by a developer.

We use the Caffe deep learning library and CaffeOnSpark; the latter is a powerful open source framework for distributed learning that brings Caffe deep learning to Hadoop and Spark clusters for training models (Big shout out to Yahoo’s CaffeOnSpark team!).

While training, the images were resized to 256×256 pixels, horizontally flipped for data augmentation, and randomly cropped to 224×224 pixels, and were then fed to the network. For training residual networks, we used scale augmentation as described in the ResNet paper [1], to avoid overfitting. We evaluated various architectures to experiment with tradeoffs of runtime vs accuracy.

  1. MS_CTC [4] – This architecture was proposed in Microsoft’s constrained time cost paper. It improves on AlexNet in speed and accuracy while maintaining a combination of convolutional and fully-connected layers.
  2. Squeezenet [3] – This architecture introduces the fire module, which contains layers to squeeze and then expand the input data blob. This reduces the number of parameters while keeping ImageNet accuracy as good as AlexNet’s, and the memory requirement is only 6MB.
  3. VGG [2] – This architecture has 13 conv layers and 3 FC layers.
  4. GoogLeNet [5] – GoogLeNet introduces inception modules and has 20 convolutional layer stages. It also uses hanging loss functions in intermediate layers to tackle the problem of diminishing gradients for deep networks.
  5. ResNet-50 [1] – ResNets use shortcut connections to solve the problem of diminishing gradients. We used the 50-layer residual network released by the authors.
  6. ResNet-50-thin – The model was generated using our pynetbuilder tool and replicates the Residual Network paper’s 50-layer network (with half the number of filters in each layer). You can find more details on how the model was generated and trained here.

Tradeoffs of different architectures: accuracy vs number of flops vs number of params in network.
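The training-time augmentation described before the architecture list (horizontal flip plus a random 224×224 crop of a 256×256 image) can be sketched in plain Python as follows. This is not the Caffe data layer used for training; the function names are ours, the image is modeled as a list of pixel rows, and applying the flip with probability 0.5 is an assumption, since the post does not state how often images were flipped.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, rng, crop=224, flip_prob=0.5):
    """Augment an image that has already been resized to 256x256."""
    if rng.random() < flip_prob:             # assumed flip probability
        image = horizontal_flip(image)
    return random_crop(image, crop, crop, rng)


rng = random.Random(0)
resized = [[(y, x) for x in range(256)] for y in range(256)]  # dummy 256x256 image
sample = augment(resized, rng)
print(len(sample), len(sample[0]))
```

Each call produces a slightly different view of the same image, which is what helps the network generalize.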

The deep models were first pre-trained on the ImageNet 1000 class dataset. For each network, we replace the last layer (FC1000) with a 2-node fully-connected layer. Then we fine-tune the weights on the NSFW dataset. Note that we keep the learning rate multiplier for the last FC layer 5 times the multiplier of other layers, which are being fine-tuned. We also tune the hyper parameters (step size, base learning rate) to optimize the performance.
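To illustrate how the per-layer multiplier and the tuned hyperparameters interact, here is a small sketch. The layer names, base learning rate, decay factor and step size are all made-up values for illustration, not the settings used for the released model. The schedule follows the usual "step" policy, where the rate is multiplied by a factor gamma every `step_size` iterations.

```python
def effective_learning_rates(base_lr, lr_mults):
    """Per-layer learning rate = global base rate x that layer's multiplier."""
    return {name: base_lr * mult for name, mult in lr_mults.items()}


def step_lr(base_lr, gamma, step_size, iteration):
    """Step schedule: multiply the base rate by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


# Hypothetical layers: the new 2-node FC layer learns 5x faster than the
# pre-trained layers that are being fine-tuned.
lr_mults = {"conv1": 1.0, "res5c": 1.0, "fc_nsfw": 5.0}
rates = effective_learning_rates(base_lr=0.001, lr_mults=lr_mults)
print(rates)

# Hypothetical schedule: decay by 10x every 10,000 iterations.
for it in (0, 10000, 25000):
    print(it, step_lr(base_lr=0.001, gamma=0.1, step_size=10000, iteration=it))
```

The higher multiplier lets the freshly initialized classifier layer adapt quickly while the pre-trained features change only gradually.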

We observe that the performance of the models on NSFW classification tasks is related to the performance of the pre-trained model on ImageNet classification tasks, so if we have a better pretrained model, it helps in fine-tuned classification tasks. The graph below shows the relative performance on our held-out NSFW evaluation set. Please note that the false positive rate (FPR) at a fixed false negative rate (FNR) shown in the graph is specific to our evaluation dataset, and is shown here for illustrative purposes. To use the models for NSFW filtering, we suggest that you plot the ROC curve using your dataset and pick a suitable threshold.
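One way to pick a threshold from your own labeled data is sketched below: compute the false positive and false negative rates at every candidate threshold, then choose the highest threshold that keeps the FNR within a budget. The function names, scores and labels are invented for illustration, and we assume an image is flagged as NSFW when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, fnr) for every distinct score used as a threshold.

    labels: 1 = NSFW, 0 = SFW. An image is flagged when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        points.append((t, fp / negatives, fn / positives))
    return points


def pick_threshold(scores, labels, max_fnr):
    """Highest threshold whose FNR stays within max_fnr.

    FPR only falls as the threshold rises, so this also gives the lowest
    FPR achievable under the FNR budget.
    """
    candidates = [p for p in roc_points(scores, labels) if p[2] <= max_fnr]
    return max(candidates, key=lambda p: p[0]) if candidates else None


# Made-up evaluation set: model scores and ground-truth labels.
scores = [0.05, 0.10, 0.30, 0.40, 0.60, 0.70, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

print(pick_threshold(scores, labels, max_fnr=0.25))
```

Tightening the FNR budget pushes the threshold down and the FPR up, which is the tradeoff the ROC curve makes visible.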

Comparison of performance of models on Imagenet and their counterparts fine-tuned on NSFW dataset.

We are releasing the thin ResNet-50 model, since it provides a good tradeoff in terms of accuracy, and the model is lightweight in terms of runtime (takes < 0.5 sec on CPU) and memory (~23 MB). Please refer to our git repository for instructions and usage of our model. We encourage developers to try the model for their NSFW filtering use cases. For any questions or feedback about the performance of the model, we encourage creating an issue and we will respond ASAP.

Results can be improved by fine-tuning the model for your dataset or use case. If you achieve improved performance or you have trained an NSFW model with a different architecture, we encourage contributing to the model or sharing the link on our description page.

Disclaimer: The definition of NSFW is subjective and contextual. This model is a general purpose reference model, which can be used for the preliminary filtering of pornographic images. We do not provide guarantees of accuracy of output, rather we make this available for developers to explore and enhance as an open source project.

We would like to thank Sachin Farfade, Amar Ramesh Kamat, Armin Kappeler, and Shraddha Advani for their contributions in this work.


[1] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition.” arXiv preprint arXiv:1512.03385 (2015).

[2] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

[3] Iandola, Forrest N., Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 1MB model size.” arXiv preprint arXiv:1602.07360 (2016).

[4] He, Kaiming, and Jian Sun. “Convolutional neural networks at constrained time cost.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5353-5360. 2015.

[5] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9. 2015.

[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” In Advances in neural information processing systems, pp. 1097-1105. 2012.

Raspberry Pi with cloud vision at Google I/O

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/raspberry-pi-cloud-vision-google-io/

Matt visited Google I/O yesterday, and sent back some pretty incredible pictures. This event looks more like a music festival than a tech conference.

Google I/O

He was sending pictures and excited snippets of text back to Pi Towers all through the event, and then, when he got home, shared this video. I’ve been so excited about it that I’ve had it playing on repeat, and we all thought you’d like to see it too.

This is a demo of a Raspberry Pi robot working with Google’s Cloud Vision API – and it’s got such potential for your projects.

What is Cloud Vision API?

Cloud Vision API provides powerful Image Analytics capabilities as easy to use APIs. It enables application developers to build the next generation of applications that can see and understand the content within the images. The service enables customers to detect a broad set of entities within an image from everyday objects to faces and product logos.

The robot is taking pictures and sending them to the cloud, where they’re analysed and the results are sent back in real time. There’s facial detection – along with detection of what emotion is showing on those faces. And Cloud Vision offers you image recognition, so you should be able to get your robot to distinguish limes from green apples. You can then get the robot to act on that data – so you could set it to gather apples and not limes, for example.
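Here is a rough sketch of what the "send a picture, act on the answer" loop might look like from the Pi’s side. It only builds the JSON body and interprets a reply; it does not contact the service. The request and response shapes follow our understanding of the public REST `images:annotate` method, so please check them against Google’s documentation before relying on them. The function names, the score cut-off and the sample response are made up for illustration.

```python
import base64
import json


def build_annotate_request(image_bytes, max_labels=5):
    """Build a JSON-serializable body asking for labels and faces in one image."""
    return {
        "requests": [{
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [
                {"type": "LABEL_DETECTION", "maxResults": max_labels},
                {"type": "FACE_DETECTION"},
            ],
        }]
    }


def should_pick_up(response, wanted="apple", min_score=0.7):
    """Decide whether the robot should gather the object in view.

    min_score is an arbitrary illustrative cut-off, not a recommended value.
    """
    labels = response["responses"][0].get("labelAnnotations", [])
    return any(
        wanted in label["description"].lower() and label["score"] >= min_score
        for label in labels
    )


body = build_annotate_request(b"fake image bytes")
print(json.dumps(body)[:60], "...")

# Hand-written example reply, not real output from the service.
sample_response = {"responses": [{"labelAnnotations": [
    {"description": "Apple", "score": 0.93},
    {"description": "Fruit", "score": 0.90},
]}]}
print(should_pick_up(sample_response))
```

In a real project the body would be POSTed to the API with your credentials, and the robot’s motor code would run whenever `should_pick_up` returns true.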

Cloud vision on a Pi robot

We’re pretty excited about the opportunities this API offers makers of all kinds of Raspberry Pi devices. You can learn more here – please let us know if you start integrating it into your own projects!


The post Raspberry Pi with cloud vision at Google I/O appeared first on Raspberry Pi.