Tag Archives: computer vision

Journey into Deep Learning with AWS

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/journey-into-deep-learning-with-aws/

If you are anything like me, Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning are completely fascinating and exciting topics. As AI, ML, and Deep Learning become more widely used, it means that the science fiction written by Dr. Isaac Asimov, the robotics and medical advancements in Star Wars, and the technologies that enabled Captain Kirk and his Star Trek crew “to boldly go where no man has gone before” can become achievable realities.

 

Most people interested in the aforementioned topics are familiar with the AI and ML solutions enabled by Deep Learning, such as Convolutional Neural Networks for image and video classification, speech recognition, natural language interfaces, and recommendation engines. However, setting up the infrastructure, environment, and tools to enable data scientists, machine learning practitioners, research scientists, and deep learning hobbyists/advocates to dive into these technologies is not always an easy task. Most developers want to move quickly from getting started with deep learning to training models and developing solutions using deep learning technologies.

For these reasons, I would like to share some resources that will help you quickly build deep learning solutions, whether you are an experienced data scientist or a curious developer wanting to get started.

Deep Learning Resources

Apache MXNet is Amazon’s deep learning framework of choice. With the power of the Apache MXNet framework and NVIDIA GPU computing, you can launch your scalable deep learning projects and solutions easily on the AWS Cloud. As you get started on your MXNet deep learning quest, there are a variety of self-service tutorials and datasets available to you (a short MXNet code sketch follows the list below):

  • Launch an AWS Deep Learning AMI: This guide walks you through the steps to launch the AWS Deep Learning AMI with Ubuntu
  • MXNet – Create a computer vision application: This hands-on tutorial uses a pre-built notebook to walk you through using neural networks to build a computer vision application to identify handwritten digits
  • AWS Machine Learning Datasets: AWS hosts datasets for Machine Learning on the AWS Marketplace that you can access for free. These large datasets are available for anyone to analyze without needing to download or store the data.
  • Predict and Extract – Learn to use pre-trained models for predictions: This hands-on tutorial walks you through using a pre-trained model for prediction and feature extraction with the full ImageNet dataset.
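To give a flavor of the MXNet Python API used in the tutorials above, here is a minimal sketch of defining and running a small network with MXNet’s Gluon interface. The layer sizes and the dummy input are illustrative only and are not taken from the tutorial notebook.

import mxnet as mx
from mxnet import gluon, nd

# A small fully-connected network for 28x28 handwritten-digit images
# (784 inputs, 10 output classes) -- the sizes here are illustrative.
net = gluon.nn.Sequential()
net.add(gluon.nn.Dense(128, activation='relu'),
        gluon.nn.Dense(64, activation='relu'),
        gluon.nn.Dense(10))
net.initialize(mx.init.Xavier())

# Run a forward pass on a dummy batch to confirm the network is wired up.
dummy_batch = nd.random.uniform(shape=(4, 784))
print(net(dummy_batch).shape)   # -> (4, 10)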

 

AWS Deep Learning AMIs

AWS offers Amazon Machine Images (AMIs) for use on Amazon EC2 for quick deployment of the infrastructure needed to start your deep learning journey. The AWS Deep Learning AMIs come pre-configured with popular deep learning frameworks, are built on Amazon Linux and Ubuntu, and can be launched for AI-targeted solutions and models. The deep learning frameworks supported and pre-configured on the Deep Learning AMIs are:

  • Apache MXNet
  • TensorFlow
  • Microsoft Cognitive Toolkit (CNTK)
  • Caffe
  • Caffe2
  • Theano
  • Torch
  • Keras

Additionally, the AWS Deep Learning AMIs come preconfigured with Jupyter notebooks for Python 2.7/3.4, the AWS SDK for Python, and other data-science-related Python packages and dependencies. The AMIs also have the NVIDIA CUDA and NVIDIA CUDA Deep Neural Network (cuDNN) libraries preinstalled for all the supported deep learning frameworks, and the Intel Math Kernel Library is installed for the Apache MXNet framework. You can launch any of the Deep Learning AMIs by visiting the AWS Marketplace using the Try the Deep Learning AMIs link.

Summary

It is a great time to dive into Deep Learning. You can accelerate your work by using the AWS Deep Learning AMIs running on the AWS Cloud to get your deep learning environment up and running quickly, or get started learning more about Deep Learning on AWS with MXNet using the AWS self-service resources. Of course, you can learn even more about Deep Learning, Machine Learning, and Artificial Intelligence on AWS by reviewing the AWS Deep Learning page, the Amazon AI product page, and the AWS AI Blog.

May the Deep Learning Force be with you all.

Tara

Amazon Rekognition Update – Image Moderation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rekognition-update-image-moderation/

We launched Amazon Rekognition late last year and I told you about it in my post Amazon Rekognition – Image Detection and Recognition Powered by Deep Learning. As I explained at the time, this service was built by our Computer Vision team over the course of many years and analyzes billions of images daily.

Today we are adding image moderation to Rekognition. If your web site or application allows users to upload profile photos or other imagery, you will love this new Rekognition feature.

Rekognition can now identify images that contain suggestive or explicit content that may not be appropriate for your site. The moderation labels provide detailed sub-categories, allowing you to fine-tune the filters that you use to determine what kinds of images you deem acceptable or objectionable. You can use this feature to improve photo sharing sites, forums, dating apps, content platforms for children, e-commerce platforms and marketplaces, and more.

To access this feature, call the DetectModerationLabels function from your code. The response will include a set of moderation labels drawn from a built-in taxonomy:

"ModerationLabels": [ 
  {
    "Confidence": 83.55088806152344, 
    "Name": "Suggestive",
    "ParentName": ""
   },
   {
    "Confidence": 83.55088806152344, 
    "Name": "Female Swimwear Or Underwear", 
    "ParentName": "Suggestive" 
   }
 ]
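If you are working in Python, a call to DetectModerationLabels through the AWS SDK for Python (boto3) looks roughly like the following sketch; the bucket and object names are placeholders, and MinConfidence is an optional threshold you can tune.

import boto3

rekognition = boto3.client('rekognition')

# Ask Rekognition to moderate an image already stored in S3.
# 'my-bucket' and 'uploads/profile.jpg' are placeholder names.
response = rekognition.detect_moderation_labels(
    Image={'S3Object': {'Bucket': 'my-bucket', 'Name': 'uploads/profile.jpg'}},
    MinConfidence=60,
)

for label in response['ModerationLabels']:
    print(label['Name'], label['ParentName'], round(label['Confidence'], 2))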

You can use the Image Moderation Demo in the AWS Management Console to experiment with this feature:

Image moderation is available now and you can start using it today!

Jeff;

 

Raspberry Turk: a chess-playing robot

Post Syndicated from Lorna Lynch original https://www.raspberrypi.org/blog/raspberry-turk-chess-playing-robot/

Computers and chess have been a potent combination ever since the appearance of the first chess-playing computers in the 1970s. You might even be able to play a game of chess on the device you are using to read this blog post! For digital makers, though, adding a Raspberry Pi into the mix can be the first step to building something a little more exciting. Allow us to introduce you to Joey Meyer‘s chess-playing robot, the Raspberry Turk.

The Raspberry Turk chess-playing robot

Image credit: Joey Meyer

As both an experienced software engineer with an interest in machine learning and a skilled chess player, Joey was naturally drawn to tinkering with chess programs. What is really stunning, though, is the scale and complexity of the build he came up with. Fascinated by a famous historical hoax, Joey used his skills in programming and robotics to build an open-source, Raspberry Pi-powered recreation of the celebrated Mechanical Turk automaton.

You can see the Raspberry Turk in action on Joey’s YouTube channel:

Chess Playing Robot Powered by Raspberry Pi – Raspberry Turk

The Raspberry Turk is a robot that can play chess. It’s entirely open source, based on Raspberry Pi, and inspired by the 18th-century chess-playing machine, the Mechanical Turk. Website: http://www.raspberryturk.com Source code: https://github.com/joeymeyer/raspberryturk

A historical hoax

Joey explains that he first encountered the Mechanical Turk through a book by Tom Standage. A famous example of mechanical trickery, the original Turk was advertised as a chess-playing automaton, capable of defeating human opponents and solving complex puzzles.

Image of the Mechanical Turk Automaton

A modern reconstruction of the Mechanical Turk 
Image from Wikimedia Commons

Its inner workings a secret, the Turk toured Europe for the best part of a century, confounding everyone who encountered it. Unfortunately, it turned out not to be a fabulous example of early robotic engineering after all. Instead, it was just an elaborate illusion. The awesome chess moves were not being worked out by the clockwork brain of the automaton, but rather by a human chess master who was cunningly concealed inside the casing.

Building a modern Turk

A modern version of the Mechanical Turk was constructed in the 1980s. However, the build cost $120,000. At that price, it would have been impossible for most makers to create their own version. Impossible, that is, until now: Joey uses a Raspberry Pi 3 to drive the Raspberry Turk, while a Raspberry Pi Camera Module handles computer vision.

Image of chess board and Raspberry Turk robot

The Raspberry Turk in the middle of a game 
Image credit: Joey Meyer

Joey’s Raspberry Turk is built into a neat wooden table. All of the electronics are housed in a box on one side. The chessboard is painted directly onto the table’s surface. In order for the robot to play, a Camera Module located in a 3D-printed housing above the table takes an image of the chessboard. The image is then analysed to determine which pieces are in which positions at that point. By tracking changes in the positions of the pieces, the Raspberry Turk can determine which moves have been made, and which piece should move next. To train the system, Joey had to build a large dataset to validate a computer vision model. This involved painstakingly moving pieces by hand and collecting multiple images of each possible position.
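As a rough illustration of the move-detection idea (not Joey’s actual code), once the vision model has classified each of the 64 squares as empty or occupied by a particular side, inferring the move is largely a matter of diffing two board states:

import numpy as np

# 8x8 board state: 0 = empty, 1 = white piece, 2 = black piece.
# In the real project these arrays would come from the camera and the
# computer vision model; here they are hand-written placeholders.
before = np.zeros((8, 8), dtype=int)
after = before.copy()
before[6, 4], after[4, 4] = 1, 1   # e.g. a white pawn moving e2 -> e4

vacated = np.argwhere((before != 0) & (after == 0))
occupied = np.argwhere((before == 0) & (after != 0))
print("moved from", vacated, "to", occupied)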

Look, no hands!

A key feature of the Mechanical Turk was that the automaton appeared to move the chess pieces entirely by itself. Of course, its movements were actually being controlled by a person hidden inside the machine. The Raspberry Turk, by contrast, does move the chess pieces itself. To achieve this, Joey used a robotic arm attached to the table. The arm is made primarily out of Actobotics components. Joey explains:

The motion is controlled by the rotation of two servos which are attached to gears at the base of each link of the arm. At the end of the arm is another servo which moves a beam up and down. At the bottom of the beam is an electromagnet that can be dynamically activated to lift the chess pieces.

Joey individually fitted the chess pieces with tiny sections of metal dowel so that the magnet on the arm could pick them up.
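For a sense of the geometry involved, here is a generic two-link planar arm sketch (not the Raspberry Turk’s actual control code): the two base servo angles needed to place the electromagnet over a given board coordinate follow from standard inverse kinematics.

import math

def two_link_ik(x, y, l1, l2):
    """Return the two joint angles (radians) that place the end of a
    two-link planar arm with segment lengths l1, l2 at point (x, y)."""
    d = math.hypot(x, y)
    if d > l1 + l2:
        raise ValueError("target out of reach")
    # Law of cosines gives the elbow angle, then the shoulder angle follows.
    elbow = math.acos((d**2 - l1**2 - l2**2) / (2 * l1 * l2))
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

# Example: 20 cm links reaching a square 25 cm out and 10 cm across.
print(two_link_ik(25, 10, 20, 20))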

Programming the Raspberry Turk

The Raspberry Turk is controlled by a daemon process that runs a perception/action sequence, and the status updates automatically as the pieces are moved. The code is written almost entirely in Python. It is all available on Joey’s GitHub repo for the project, together with his notebooks on the project.
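The GitHub repo has the real implementation; conceptually, though, a perception/action daemon of this kind boils down to a loop like the following sketch. The object and method names here are hypothetical stand-ins, not the project’s actual module names.

import time

def perception_action_loop(camera, vision_model, engine, arm):
    """Hypothetical outline of a perceive -> decide -> act cycle."""
    last_board = None
    while True:
        frame = camera.capture()                 # perception: grab an image
        board = vision_model.board_state(frame)  # classify the 64 squares
        if last_board is not None and board != last_board:
            move = engine.best_move(board)       # decide: ask the chess engine
            arm.play(move)                       # act: move the piece
        last_board = board
        time.sleep(0.5)                          # poll at a modest rate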

Image of Raspberry Turk chessboard with Python script alongside

Image credit: Joey Meyer

The AI backend that gives the robot its chess-playing ability is currently Stockfish, a strong open-source chess engine. Joey says he would like to build his own engine when he has time. For the moment, though, he’s confident that this AI will prove a worthy opponent.
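Stockfish speaks the standard UCI protocol, so driving it from Python is straightforward. Here is a minimal sketch using the python-chess library; this is a generic example rather than the Raspberry Turk’s own engine integration, and it assumes a stockfish binary is on your PATH.

import chess
import chess.engine

board = chess.Board()
engine = chess.engine.SimpleEngine.popen_uci("stockfish")

# Ask Stockfish for its move in the starting position, thinking for 100 ms.
result = engine.play(board, chess.engine.Limit(time=0.1))
print("Stockfish plays:", result.move)

engine.quit()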

The project website goes into much more detail than we are able to give here. We’d definitely recommend checking it out. If you have been experimenting with any robotics or computer vision projects like this, please do let us know in the comments!

The post Raspberry Turk: a chess-playing robot appeared first on Raspberry Pi.

Robocod

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/robocod/

Fishbowl existence is tough. There you are, bobbing up and down in the same dull old environment, day in, day out; your view unchanging, your breakfast boringly identical every morning; that clam thing in the bottom of the tank opening and closing monotonously – goldfish can live for up to 20 years. That’s a hell of a long time to watch a clam thing for.

fishbowl on wheels

Two fish are in a tank. One says “How do you drive this thing?”

Indeed, fishbowl existence is so tough that several countries have banned the boring round bowls altogether. (There’s a reason that your childhood goldfish didn’t live for 20 years. You put it in an environment that bored it to death.) So this build comes with a caveat – we are worried that this particular fish is being driven from understimulus to overstimulus and back again, and that she might be prevented from making it to the full 20 years as a result. Please be kind to your fish.

What’s going on here? Over in Pittsburgh, at Carnegie Mellon University, Alex Kent and friends have widened the goldfish’s horizons, by giving it wheels. Meet the free-range fish.

Just Keep Swimming

Build18 @ CMU

Alex K, negligent fishparent, says that the speed and direction of the build are determined by the position of the fish relative to the centre of the tank. The battery lasts for five hours, and by all accounts the fish is still alive. Things are a bit jerky in this prototype build. Alex explains:

The jerking is actually caused by the Computer Vision algorithm losing track of the fish because of the reflection off of the lid, condensation on the lid, water ripples, etc.

Alex and co: before you look at more expensive solutions, try fixing a polarising filter to the camera you’re using.
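For anyone curious about how the vision side of a build like this typically works, here is a minimal OpenCV sketch of the usual approach: threshold the fish’s colour, find its centroid, and turn the offset from the centre of the frame into a drive command. The colour range and camera index are placeholders, not Alex’s actual code.

import cv2
import numpy as np

def fish_offset(frame, lower_hsv=(5, 100, 100), upper_hsv=(25, 255, 255)):
    """Return the fish's offset from the centre of the frame, or None."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower_hsv), np.array(upper_hsv))
    moments = cv2.moments(mask)
    if moments["m00"] == 0:          # fish lost (reflections, ripples, etc.)
        return None
    cx = moments["m10"] / moments["m00"]
    cy = moments["m01"] / moments["m00"]
    h, w = frame.shape[:2]
    return cx - w / 2, cy - h / 2    # positive x means the fish swam right

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    print(fish_offset(frame))        # feed this offset into the motor controller
cap.release()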

All the code you’ll need to torture your own fish is available at GitHub.

Of course, Far Side fans will observe that there is nothing new under the sun.

Fishbowl on wheels by Gary Larson

Image from Gary Larson, The Far Side.

If you’ve got any good fish puns, let minnow in the comments.

 

The post Robocod appeared first on Raspberry Pi.

Capturing Pattern-Lock Authentication

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/01/capturing_patte.html

Interesting research — “Cracking Android Pattern Lock in Five Attempts“:

Abstract: Pattern lock is widely used as a mechanism for authentication and authorization on Android devices. In this paper, we demonstrate a novel video-based attack to reconstruct Android lock patterns from video footage filmed using a mobile phone camera. Unlike prior attacks on pattern lock, our approach does not require the video to capture any content displayed on the screen. Instead, we employ a computer vision algorithm to track the fingertip movements to infer the pattern. Using the geometry information extracted from the tracked fingertip motions, our approach is able to accurately identify a small number of (often one) candidate patterns to be tested by an adversary. We thoroughly evaluated our approach using 120 unique patterns collected from 215 independent users, by applying it to reconstruct patterns from video footage filmed using smartphone cameras. Experimental results show that our approach can break over 95% of the patterns in five attempts before the device is automatically locked by the Android system. We discovered that, in contrast to many people’s belief, complex patterns do not offer stronger protection under our attacking scenarios. This is demonstrated by the fact that we are able to break all but one complex patterns (with a 97.5% success rate) as opposed to 60% of the simple patterns in the first attempt. Since our threat model is common in day-to-day lives, our work calls for the community to revisit the risks of using Android pattern lock to protect sensitive information.

News article.

Amazon Rekognition – Image Detection and Recognition Powered by Deep Learning

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rekognition-image-detection-and-recognition-powered-by-deep-learning/

What do you see when you look at this picture?

You might simply see an animal. Maybe you see a pet, a dog, or a Golden Retriever. The association between the image and these labels is not hard-wired into your brain. Instead, you learned the labels after seeing hundreds or thousands of examples. Operating on a number of different levels, you learned to distinguish an animal from a plant, a dog from a cat, and a Golden Retriever from other dog breeds.

Deep Learning for Image Detection
Giving computers the same level of comprehension has proven to be a very difficult task. Over the course of decades, computer scientists have taken many different approaches to the problem. Today, a broad consensus has emerged that the best way to tackle this problem is via deep learning. Deep learning uses a combination of feature abstraction and neural networks to produce results that can be (as Arthur C. Clarke once said) indistinguishable from magic. However, it comes at a considerable cost. First, you need to put a lot of work into the training phase. In essence, you present the learning network with a broad spectrum of labeled examples (“this is a dog”, “this is a pet”, and so forth) so that it can correlate features in the image with the labels. This phase is computationally expensive due to the size and the multi-layered nature of the neural networks. After the training phase is complete, evaluating new images against the trained network is far easier. The results are traditionally expressed in confidence levels (0 to 100%) rather than as cold, hard facts. This allows you to decide just how much precision is appropriate for your applications.

Introducing Amazon Rekognition
Today I would like to tell you about Amazon Rekognition. Powered by deep learning and built by our Computer Vision team over the course of many years, this fully-managed service already analyzes billions of images daily. It has been trained on thousands of objects and scenes, and is now available for you to use in your own applications. You can use the Rekognition Demos to put the service through its paces before you dive in and start writing code that uses the Rekognition API.

Rekognition was designed from the get-go to run at scale. It comprehends scenes, objects, and faces. Given an image, it will return a list of labels. Given an image with one or more faces, it will return bounding boxes for each face, along with attributes. Let’s see what it has to say about the picture of my dog (her name is Luna, by the way):

As you can see, Rekognition labeled Luna as an animal, a dog, a pet, and as a golden retriever with a high degree of confidence. It is important to note that these labels are independent, in the sense that the deep learning model does not explicitly understand the relationship between, for example, dogs and animals. It just so happens that both of these labels were simultaneously present on the dog-centric training material presented to Rekognition.

Let’s see how it does with a picture of my wife and me:

Amazon Rekognition found our faces, set up bounding boxes, and let me know that my wife was happy (the picture was taken on her birthday, so I certainly hope she was).

You can also use Rekognition to compare faces and to see if a given image contains any one of a number of faces that you have asked it to recognize.

All of this power is accessible from a set of API functions (the console is great for quick demos). For example, you can call DetectLabels to programmatically reproduce my first example, or DetectFaces to reproduce my second one. You can make multiple calls to IndexFaces to prepare Rekognition to recognize some faces. Each time you do this, Rekognition extracts some features (known as face vectors) from the image, stores the vectors, and discards the image. You can create one or more Rekognition collections and store related groups of face vectors in each one.
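In Python, the boto3 SDK exposes these as methods on the Rekognition client. A minimal sketch of the label- and face-detection calls looks like this; the bucket and key names are placeholders.

import boto3

rekognition = boto3.client('rekognition')
image = {'S3Object': {'Bucket': 'my-bucket', 'Name': 'photos/luna.jpg'}}

# Scene and object labels, along the lines of the first example above.
labels = rekognition.detect_labels(Image=image, MaxLabels=10, MinConfidence=75)
for label in labels['Labels']:
    print(label['Name'], round(label['Confidence'], 1))

# Faces, bounding boxes, and attributes, along the lines of the second example.
faces = rekognition.detect_faces(Image=image, Attributes=['ALL'])
for face in faces['FaceDetails']:
    print(face['BoundingBox'], [e['Type'] for e in face['Emotions'][:2]])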

Rekognition can directly process images stored in Amazon Simple Storage Service (S3). In fact, you can use AWS Lambda functions to process newly uploaded photos at any desired scale. You can use AWS Identity and Access Management (IAM) to control access to the Rekognition APIs.
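For example, a Lambda function triggered by new S3 uploads could label each photo as it arrives. This is a hedged sketch of the usual S3-event wiring, not code from the post.

import boto3

rekognition = boto3.client('rekognition')

def handler(event, context):
    """Label every newly uploaded S3 object that triggered this invocation."""
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        labels = rekognition.detect_labels(
            Image={'S3Object': {'Bucket': bucket, 'Name': key}},
            MaxLabels=10,
        )
        print(key, [l['Name'] for l in labels['Labels']])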

Applications for Rekognition
So, what can you use this for? I’ve got plenty of ideas to get you started!

If you have a large collection of photos, you can tag and index them using Amazon Rekognition. Because Rekognition is a service, you can process millions of photos per day without having to worry about setting up, running, or scaling any infrastructure. You can implement visual search, tag-based browsing, and all sorts of interactive discovery models.

You can use Rekognition in several different authentication and security contexts. You can compare a face on a webcam to a badge photo before allowing an employee to enter a secure zone. You can perform visual surveillance, inspecting photos for objects or people of interest or concern.

You can build “smart” marketing billboards that collect demographic data about viewers.

Now Available
Rekognition is now available in the US East (Northern Virginia), US West (Oregon), and EU (Ireland) Regions and you can start using it today. As part of the AWS Free Tier, you can analyze up to 5,000 images per month and store up to 1,000 face vectors each month for an entire year. After that (and at higher volume), you will pay tiered pricing based on the number of images that you analyze and the number of face vectors that you store.

Jeff;

 

Bringing the Viewer In: The Video Opportunity in Virtual Reality

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/151940036881

By Satender Saroha, Video Engineering

Virtual reality (VR) 360° videos are the next frontier of how we engage with and consume content. Unlike a traditional scenario in which a person views a screen in front of them, VR places the user inside an immersive experience. A viewer is “in” the story, and not on the sidelines as an observer.

Ivan Sutherland, widely regarded as the father of computer graphics, laid out the vision for virtual reality in his famous 1965 speech, “Ultimate Display” [1]. In it, he said, “You shouldn’t think of a computer screen as a way to display information, but rather as a window into a virtual world that could eventually look real, sound real, move real, interact real, and feel real.”

Over the years, significant advancements have been made to bring reality closer to that vision. With the advent of headgear capable of rendering 3D spatial audio and video, realistic sound and visuals can be virtually reproduced, delivering immersive experiences to consumers.

When it comes to entertainment and sports, streaming in VR has become the new 4K HEVC/UHD of 2016. This has been accelerated by the release of new camera capture hardware like GoPro and streaming capabilities such as 360° video streaming from Facebook and YouTube. Yahoo streams lots of engaging sports, finance, news, and entertainment video content to tens of millions of users. The opportunity to produce and stream such content in 360° VR opens a unique opportunity to Yahoo to offer new types of engagement, and bring the users a sense of depth and visceral presence.

While this is not an experience that is live in product, it is an area we are actively exploring. In this blog post, we take a look at what’s involved in building an end-to-end VR streaming workflow for both Live and Video on Demand (VOD). Our experiments and research go from camera rig setup, to video stitching, to encoding, to the eventual rendering of videos in video players on desktop and VR headsets. We also discuss challenges yet to be solved and the opportunities they present in streaming VR.

1. The Workflow

Yahoo’s video platform has a workflow that is used internally to enable streaming to an audience of tens of millions with the click of a few buttons. During experimentation, we enhanced this same proven platform and set of APIs to build a complete 360°/VR experience. The diagram below shows the end-to-end workflow for streaming 360°/VR that we built on Yahoo’s video platform.

Figure 1: VR Streaming Workflow at Yahoo

1.1. Capturing 360° video

In order to capture a virtual reality video, you need access to a 360°-capable video camera. Such a camera uses either fish-eye lenses or has an array of wide-angle lenses to collectively cover a 360 (θ) by 180 (ϕ) sphere as shown below.

Though it sounds simple, there is a real challenge in capturing a scene in 3D 360° as most of the 360° video cameras offer only 2D 360° video capture.

In initial experiments, we tried capturing 3D video using two cameras side by side for the left and right eyes, arranging the pairs in a spherical shape. However, this required too many cameras – instead, we use view interpolation in the stitching step to create virtual cameras.

Another important consideration with 360° video is the number of axes the camera captures video on. In traditional 360° video captured using only a single axis (what we refer to as horizontal video), a user can turn their head from left to right. But this camera setup does not support a user tilting their head at 90°.

To achieve true 3D in our setup, we went with 6-12 GoPro cameras having 120° field of view (FOV) arranged in a ring, and an additional camera each on top and bottom, with each one outputting 2.7K at 30 FPS.

1.2. Stitching 360° video

Projection Layouts

Because a 360° view is a spherical video, the surface of this sphere needs to be projected onto a planar surface in 2D so that video encoders can process it. There are two popular layouts:

Equirectangular layout: This is the most widely-used format in computer graphics to represent spherical surfaces in a rectangular form with an aspect ratio of 2:1. This format has redundant information at the poles which means some pixels are over-represented, introducing distortions at the poles compared to the equator (as can be seen in the equirectangular mapping of the sphere below).

Figure 2: Equirectangular Layout [2]

CubeMap layout: CubeMap layout is a format that has also been used in computer graphics. It contains six individual 2D textures that map to six sides of a cube. The figure below is a typical cubemap representation. In a cubemap layout, the sphere is projected onto six faces and the images are folded out into a 2D image, so pieces of a video frame map to different parts of a cube, which leads to extremely efficient compact packing. Cubemap layouts require about 25% fewer pixels compared to equirectangular layouts.

Figure 3: CubeMap Layout [3]
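The ~25% figure follows directly from the layout geometry. Under the common convention that each cube face is a quarter of the equirectangular width on a side, a quick back-of-the-envelope check confirms it (the 3840-pixel width below is just an example, not a value from this post):

width = 3840                                    # example equirectangular width (2:1 layout)
equirect_pixels = width * (width // 2)          # W x W/2
cube_pixels = 6 * (width // 4) ** 2             # six faces of (W/4) x (W/4)

saving = 1 - cube_pixels / equirect_pixels
print(equirect_pixels, cube_pixels, f"{saving:.0%} fewer pixels")  # -> 25%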

Stitching Videos

In our setup, we experimented with a couple of stitching software packages. One was Vahana VR [4], and the other was a modified version of the open-source Surround360 technology that works with a GoPro rig [5]. Both packages output equirectangular panoramas for the left and the right eye. Here are the steps involved in stitching together a 360° image:

Raw frame image processing: Converts uncompressed raw video data to RGB, which involves several steps, starting from black-level adjustment, to applying demosaicing algorithms in order to figure out the RGB color components for each pixel based on the surrounding pixels. This also involves gamma correction, color correction, and anti-vignetting (undoing the reduction in brightness at the image periphery). Finally, this stage applies sharpening and noise-reduction algorithms to enhance the image and suppress the noise.

Calibration: During the calibration step, stitching software takes steps to avoid vertical parallax while stitching overlapping portions in adjacent cameras in the rig. The purpose is to align everything in the scene, so that both eyes see every point at the same vertical coordinate. This step essentially matches the key points in images among adjacent camera pairs. It uses computer vision algorithms for feature detection like Binary Robust Invariant Scalable Keypoints (BRISK) [6] and AKAZE [7].

Optical Flow: During stitching, to cover the gaps between adjacent real cameras and provide interpolated view, optical flow is used to create virtual cameras. The optical flow algorithm finds the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or camera. It uses OpenCV algorithms to find the optical flow [8].
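Both of these steps lean on standard OpenCV building blocks. As a minimal sketch (generic OpenCV usage, not the Surround360 or Vahana VR pipelines themselves), matching key points between two adjacent cameras and computing dense optical flow between them looks like this; the file names are placeholders.

import cv2

left = cv2.imread('cam_left.jpg', cv2.IMREAD_GRAYSCALE)    # placeholder files
right = cv2.imread('cam_right.jpg', cv2.IMREAD_GRAYSCALE)

# Calibration-style step: detect and match AKAZE key points between
# the overlapping regions of two adjacent cameras.
akaze = cv2.AKAZE_create()
kp1, desc1 = akaze.detectAndCompute(left, None)
kp2, desc2 = akaze.detectAndCompute(right, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(desc1, desc2)
print(len(matches), "matched key points")

# Optical-flow step: dense Farneback flow between the two views,
# which can be used to interpolate a virtual camera in between.
flow = cv2.calcOpticalFlowFarneback(left, right, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)   # (H, W, 2) displacement vectors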

Below are the frames produced by the GoPro camera rig:

Figure 4: Individual frames from 12-camera rig

Figure 5: Stitched frame output with PtGui

Figure 6: Stitched frame with barrel distortion using Surround360

Figure 7: Stitched frame after removing barrel distortion using Surround360

To get the full depth in stereo, the rig is set up so that i = r * sin(FOV/2 – 360/n), where:

  • i = IPD/2, where IPD is the inter-pupillary distance between the eyes.
  • r = Radius of the rig.
  • FOV = Field of view of GoPro cameras, 120 degrees.
  • n = Number of cameras which is 12 in our setup.

Given that IPD is normally 6.4 cm, i should be greater than 3.2 cm. This implies that with a 12-camera setup, the radius of the rig comes to 14 cm. Usually, if there are more cameras it is easier to avoid black stripes.

Reducing Bandwidth – FOV-based adaptive transcoding

For a truly immersive experience, users expect 4K (3840 x 2160) resolution at 60 frames per second (FPS) or higher. Given that typical HMDs have a FOV of 120 degrees, a full 360° video needs a resolution of at least 12K (11520 x 6480). 4K streaming needs a bandwidth of 25 Mbps [9], so 12K resolution effectively translates to more than 75 Mbps, and even more for higher framerates. However, the average wifi connection in the US has a bandwidth of 15 Mbps [10].

One way to address the bandwidth issue is by reducing the resolution of areas that are out of the field of view. Spatial sub-sampling is used during transcoding to produce multiple viewport-specific streams. Each viewport-specific stream has high resolution in a given viewport and low resolution in the rest of the sphere.

On the player side, we can modify traditional adaptive streaming logic to take the field of view into account. Depending on the video, if the user moves their head around a lot, this could trigger multiple buffer fetches and result in rebuffering. Ideally, this works best in videos where the excessive motion happens in one field of view at a time and does not span multiple fields of view at once. This work is still in an experimental stage.

The default output format from stitching software of both Surround360 and Vahana VR is equirectangular format. In order to reduce the size further, we pass it through a cubemap filter transform integrated into ffmpeg to get an additional pixel reduction of ~25%  [11] [12].

At the end of the above steps, the stitching pipeline produces high-resolution stereo 3D panoramas, which are then ingested into the existing Yahoo Video transcoding pipeline to produce multi-bitrate HLS streams.

1.3. Adding a stitching step to the encoding pipeline

Live – In order to prepare for multi-bitrate streaming over the Internet, a live 360° video-stitched stream in RTMP is ingested into Yahoo’s video platform. A live Elemental encoder was used to re-encode and package the live input into multiple bit-rates for adaptive streaming on any device (iOS, Android, Browser, Windows, Mac, etc.)

Video on Demand – The existing Yahoo video transcoding pipeline was used to package multi-bitrate HLS streams from raw equirectangular mp4 source videos.

1.4. Rendering 360° video into the player

The spherical video stream is delivered to the Yahoo player in multiple bit rates. As a user changes their viewing angle, different portions of the frame are shown, presenting a 360° immersive experience. There are two types of VR players currently supported at Yahoo:

WebVR based Javascript Player – The Web community has been very active in enabling VR experiences natively, without plugins, from within browsers. The W3C has a Javascript proposal [13] which describes support for accessing virtual reality (VR) devices, including sensors and head-mounted displays, on the Web. VRDisplay is the main starting point for all the device APIs supported. Some of the key interfaces and attributes exposed are:

  • VRDisplayCapabilities: Has attributes indicating position support, orientation support, and whether an external display is present.
  • VRLayer: Contains the HTML5 canvas element which is presented by the VRDisplay when its submitFrame method is called. It also contains attributes defining the left and right bounds of the texture within the source canvas for presenting to each eye.
  • VREyeParameters: Has the information required to correctly render a scene for a given eye. For each eye, it has an offset (the distance from the middle of the user’s eyes to the center point of one eye, which is half of the interpupillary distance, or IPD). In addition, it maintains the current FOV of the eye, and the recommended renderWidth and renderHeight of each eye’s viewport.
  • getVRDisplays: Returns a list of the VRDisplay devices (HMDs) accessible to the browser.

We implemented a subset of the WebVR spec in the Yahoo player (not in production yet) that lets you watch monoscopic and stereoscopic 3D video on supported web browsers (Chrome, Firefox, Samsung), including Oculus Gear VR-enabled phones. The Yahoo player takes the equirectangular video and maps its individual frames onto an HTML5 canvas element. It uses the WebGL and Three.js libraries to compute the orientation and extract the corresponding frames to display.

For web devices that support only monoscopic rendering, like desktop browsers without an HMD, it creates a single PerspectiveCamera object specifying the FOV and aspect ratio. As the device’s requestAnimationFrame is called, it renders the new frames. As part of rendering a frame, it first calculates the projection matrix for the FOV and sets the X (user’s right), Y (up), and Z (behind the user) coordinates of the camera position.

For devices that support stereoscopic rendering, like Samsung Gear VR phones, the WebVR player creates two PerspectiveCamera objects, one for the left eye and one for the right eye. Each PerspectiveCamera queries the VR device capabilities to get the eye parameters (FOV, renderWidth, and renderHeight) every time a frame needs to be rendered at the native refresh rate of the HMD. The key difference between stereoscopic and monoscopic rendering is the perceived sense of depth that the user experiences, as video frames separated by an offset are rendered by separate canvas elements to each individual eye.

Cardboard VR – Google provides a VR SDK for both iOS and Android [14]. This simplifies common VR tasks like lens distortion correction, spatial audio, head tracking, and stereoscopic side-by-side rendering. For iOS, we integrated Cardboard VR functionality into our Yahoo Video SDK, so that users can watch stereoscopic 3D videos on iOS using Google Cardboard.

2. Results

With all the pieces in place, and experimentation done, we were able to successfully do a 360° live streaming of an internal company-wide event.

Figure 8: 360° Live streaming of Yahoo internal event

In addition to demonstrating our live streaming capabilities, we are also experimenting with showing 360° VOD videos produced with a GoPro-based camera rig. Here is a screenshot of one of the 360° videos being played in the Yahoo player.

Figure 9: Yahoo Studios produced 360° VOD content in the Yahoo Player

3. Challenges and Opportunities

3.1. Enormous amounts of data

As we alluded to in the video processing section of this post, delivering 4K resolution videos for each eye for each FOV at a high frame-rate remains a challenge. While FOV-adaptive streaming does reduce the size by providing high resolution streams separately for each FOV, providing an impeccable 60 FPS or more viewing experience still requires a lot more data than the current internet pipes can handle. Some of the other possible options which we are closely paying attention to are:

Compression efficiency with HEVC and VP9 – New codecs like HEVC and VP9 have the potential to provide significant compression gains. HEVC open source codecs like x265 have shown a 40% compression performance gain compared to the currently ubiquitous H.264/AVC codec. Likewise, the VP9 codec from Google has shown similar 40% compression performance gains. The key challenges are hardware decoding support and browser support. But with Apple and Microsoft very much behind HEVC, and Firefox and Chrome already supporting VP9, we believe most browsers will support HEVC or VP9 within a year.

Using 10 bit color depth vs 8 bit color depth – Traditional monitors support 8 bpc (bits per channel) for displaying images. Given each pixel has 3 channels (RGB), 8 bpc maps to 256x256x256 color/luminosity combinations, or about 16 million colors. With 10 bit color depth, each channel has 1,024 levels, giving over a billion possible combinations. But the biggest stated advantage of using 10 bit color depth is better compression during encoding, even if the source only uses 8 bits per channel [16]. Both the x264 and x265 codecs support 10 bit color depth, and ffmpeg already supports encoding at 10 bit color depth.

3.2. Six degrees of freedom

With current camera rig workflows, users viewing the streams through an HMD can achieve three degrees of freedom (DoF), i.e., the ability to look up/down, turn left/right, and tilt their head. But you still can’t get a different perspective when you move within the scene, i.e., move forward/backward. Until now, this true six-DoF immersive VR experience has only been possible in CG VR games. In video streaming, LightField technology-based video cameras produced by Lytro are the first ones to capture light field volume data from all directions [15]. But Lightfield-based videos require an order of magnitude more data than traditional fixed-FOV, fixed-IPD, fixed-lens camera rigs like GoPro. As bandwidth problems get resolved via better compression and better networks, achieving true immersion should be possible.

4. Conclusion

VR streaming is an emerging medium and with the addition of 360° VR playback capability, Yahoo’s video platform provides us a great starting point to explore the opportunities in video with regard to virtual reality. As we continue to work to delight our users by showing immersive video content, we remain focused on optimizing the rendering of high-quality 4K content in our players. We’re looking at building FOV-based adaptive streaming capabilities and better compression during delivery. These capabilities, and the enhancement of our webvr player to play on more HMDs like HTC Vive and Oculus Rift, will set us on track to offer streaming capabilities across the entire spectrum. At the same time, we are keeping a close watch on advancements in supporting spatial audio experiences, as well as advancements in the ability to stream volumetric lightfield videos to achieve true six degrees of freedom, with the aim of realizing the true potential of VR.

Glossary – VR concepts:

VR – Virtual reality, commonly referred to as VR, is an immersive computer-simulated reality experience that places viewers inside an experience. It “transports” viewers from their physical reality into a closed virtual reality. VR usually requires a headset device that takes care of sights and sounds, while the most-involved experiences can include external motion tracking, and sensory inputs like touch and smell. For example, when you put on VR headgear you suddenly start feeling immersed in the sounds and sights of another universe, like the deck of the Star Trek Enterprise. Though you remain physically at your place, VR technology is designed to manipulate your senses in a manner that makes you truly feel as if you are on that ship, moving through the virtual environment and interacting with the crew.

360 degree video – A 360° video is created with a camera system that simultaneously records all 360 degrees of a scene. It is a flat equirectangular video projection that is morphed into a sphere for playback on a VR headset. A standard world map is an example of equirectangular projection, which maps the surface of the world (sphere) onto orthogonal coordinates.

Spatial Audio – Spatial audio gives the creator the ability to place sound around the user. Unlike traditional mono/stereo/surround audio, it responds to head rotation in sync with video. While listening to spatial audio content, the user receives a real-time binaural rendering of an audio stream [17].

FOV – A human can naturally see about 170 degrees of viewable area (field of view). Most consumer-grade head-mounted displays (HMDs) like the Oculus Rift and HTC Vive currently display 90 to 120 degrees.

Monoscopic video – A monoscopic video means that both eyes see a single flat image or video file. A common camera setup involves six cameras filming six different fields of view. Stitching software is used to form a single equirectangular video. The maximum output resolution for monoscopic videos on Gear VR is 3840×1920 at 30 frames per second.

Presence – Presence is a kind of immersion where the low-level systems of the brain are tricked to such an extent that they react just as they would to non-virtual stimuli.

Latency – The time between when you move your head and when you see the corresponding update on the screen. Acceptable latency ranges from 11 ms (for games) to 20 ms (for watching 360° VR videos).

Head Tracking – There are two forms:

  • Positional tracking – movements and related translations of your body, eg: sway side to side.
  • Traditional head tracking – left, right, up, down, roll like clock rotation.

References:

[1] Ultimate Display Speech as reminisced by Fred Brooks: http://www.roadtovr.com/fred-brooks-ivan-sutherlands-1965-ultimate-display-speech/

[2] Equirectangular Layout Image: https://www.flickr.com/photos/[email protected]/10111691364/

[3] CubeMap Layout: http://learnopengl.com/img/advanced/cubemaps_skybox.png

[4] Vahana VR: http://www.video-stitch.com/

[5] Surround360 Stitching software: https://github.com/facebook/Surround360

[6] Computer Vision Algorithm BRISK: https://www.robots.ox.ac.uk/~vgg/rg/papers/brisk.pdf

[7] Computer Vision Algorithm AKAZE: http://docs.opencv.org/3.0-beta/doc/tutorials/features2d/akaze_matching/akaze_matching.html

[8] Optical Flow: http://docs.opencv.org/trunk/d7/d8b/tutorial_py_lucas_kanade.html

[9] 4K connection speeds: https://help.netflix.com/en/node/306

[10] Average connection speeds in US: https://www.akamai.com/us/en/about/news/press/2016-press/akamai-releases-fourth-quarter-2015-state-of-the-internet-report.jsp

[11] CubeMap transform filter for ffmpeg: https://github.com/facebook/transform

[12] FFMPEG software: https://ffmpeg.org/

[13] WebVR Spec: https://w3c.github.io/webvr/

[14] Google Daydream SDK: https://vr.google.com/cardboard/developers/

[15] Lytro LightField Volume for six DoF: https://www.lytro.com/press/releases/lytro-immerge-the-worlds-first-professional-light-field-solution-for-cinematic-vr

[16] 10 bit color depth: https://gist.github.com/l4n9th4n9/4459997

Open Sourcing a Deep Learning Solution for Detecting NSFW Images

Post Syndicated from davglass original https://yahooeng.tumblr.com/post/151148689421

By Jay Mahadeokar and Gerry Pesavento

Automatically identifying that an image is not suitable/safe for work (NSFW), including offensive and adult images, is an important problem which researchers have been trying to tackle for decades. Since images and user-generated content dominate the Internet today, filtering NSFW images becomes an essential component of Web and mobile applications. With the evolution of computer vision, improved training data, and deep learning algorithms, computers are now able to automatically classify NSFW image content with greater precision.

Defining NSFW material is subjective and the task of identifying these images is non-trivial. Moreover, what may be objectionable in one context can be suitable in another. For this reason, the model we describe below focuses only on one type of NSFW content: pornographic images. The identification of NSFW sketches, cartoons, text, images of graphic violence, or other types of unsuitable content is not addressed with this model.

To the best of our knowledge, there is no open source model or algorithm for identifying NSFW images. In the spirit of collaboration and with the hope of advancing this endeavor, we are releasing our deep learning model that will allow developers to experiment with a classifier for NSFW detection, and provide feedback to us on ways to improve the classifier.

Our general purpose Caffe deep neural network model (Github code) takes an image as input and outputs a probability (i.e., a score between 0 and 1) which can be used to detect and filter NSFW images. Developers can use this score to filter out images whose score exceeds a threshold chosen from an ROC curve for their specific use case, or use the signal to rank images in search results.
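As a rough sketch of how the released model can be scored from pycaffe: the file names below are placeholders for the deploy prototxt and weights shipped in the repo, the output blob name is an assumption, and mean subtraction is omitted for brevity.

import caffe

# Placeholder paths; use the prototxt and .caffemodel from the GitHub repo.
net = caffe.Net('deploy.prototxt', 'nsfw_model.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))      # HWC -> CHW
transformer.set_raw_scale('data', 255)            # [0,1] floats -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))   # RGB -> BGR

image = caffe.io.load_image('photo.jpg')          # placeholder input image
net.blobs['data'].data[...] = transformer.preprocess('data', image)
output = net.forward()

nsfw_score = output['prob'][0][1]   # assumes a 2-class softmax blob named 'prob'
print('NSFW score:', nsfw_score)
if nsfw_score > 0.8:                # threshold chosen from your own ROC curve
    print('filtering this image')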

Convolutional Neural Network (CNN) architectures and tradeoffs

In recent years, CNNs have become very successful in image classification problems [1] [5] [6]. Since 2012, new CNN architectures have continuously improved the accuracy of the standard ImageNet classification challenge. Some of the major breakthroughs include AlexNet (2012) [6], GoogLeNet [5], VGG (2013) [2] and Residual Networks (2015) [1]. These networks have different tradeoffs in terms of runtime, memory requirements, and accuracy. The main indicators for runtime and memory requirements are:

  1. Flops or connections – The number of connections in a neural network determines the number of compute operations during a forward pass, which is proportional to the runtime of the network while classifying an image.
  2. Parameters – The number of parameters in a neural network determines the amount of memory needed to load the network.

Ideally we want a network with minimum flops and minimum parameters, which would achieve maximum accuracy.

Training a deep neural network for NSFW classification

We train the models using a dataset of positive (i.e. NSFW) images and negative (i.e. SFW – suitable/safe for work) images. We are not releasing the training images or other details due to the nature of the data, but instead we open source the output model which can be used for classification by a developer.

We use the Caffe deep learning library and CaffeOnSpark; the latter is a powerful open source framework for distributed learning that brings Caffe deep learning to Hadoop and Spark clusters for training models (Big shout out to Yahoo’s CaffeOnSpark team!).

While training, the images were resized to 256×256 pixels, horizontally flipped for data augmentation, and randomly cropped to 224×224 pixels, and were then fed to the network. For training residual networks, we used scale augmentation as described in the ResNet paper [1], to avoid overfitting. We evaluated various architectures to experiment with tradeoffs of runtime vs accuracy.

  1. MS_CTC [4] – This architecture was proposed in Microsoft’s constrained time cost paper. It improves on AlexNet in terms of speed and accuracy while maintaining a combination of convolutional and fully-connected layers.
  2. Squeezenet [3] – This architecture introduces the fire module, which contains layers to squeeze and then expand the input data blob. This helps reduce the number of parameters while keeping ImageNet accuracy as good as AlexNet’s, with a memory requirement of only 6MB.
  3. VGG [2] – This architecture has 13 conv layers and 3 FC layers.
  4. GoogLeNet [5] – GoogLeNet introduces inception modules and has 20 convolutional layer stages. It also attaches auxiliary loss functions to intermediate layers to tackle the problem of diminishing gradients in deep networks.
  5. ResNet-50 [1] – ResNets use shortcut connections to solve the problem of diminishing gradients. We used the 50-layer residual network released by the authors.
  6. ResNet-50-thin – The model was generated using our pynetbuilder tool and replicates the Residual Network paper’s 50-layer network (with half the number of filters in each layer). You can find more details on how the model was generated and trained here.

Tradeoffs of different architectures: accuracy vs number of flops vs number of params in network.

The deep models were first pre-trained on the ImageNet 1000-class dataset. For each network, we replace the last layer (FC1000) with a 2-node fully-connected layer, then fine-tune the weights on the NSFW dataset. Note that we keep the learning rate multiplier for the last FC layer at 5 times the multiplier of the other layers being fine-tuned. We also tune the hyperparameters (step size, base learning rate) to optimize the performance.
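The same recipe (replace the 1000-way classifier with a 2-way head and give the new layer a larger learning rate) is framework-agnostic. Purely for illustration, here it is sketched in PyTorch; the team’s actual setup lives in Caffe prototxt and solver files, so treat this as an analogy rather than their code.

import torch
import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained ResNet-50 and replace FC1000 with a
# 2-node layer (NSFW vs SFW).
model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)

# Give the freshly initialized head 5x the learning rate of the layers
# being fine-tuned, mirroring the lr_mult trick described above.
base_lr = 0.001
backbone_params = [p for name, p in model.named_parameters()
                   if not name.startswith('fc.')]
optimizer = torch.optim.SGD([
    {'params': backbone_params, 'lr': base_lr},
    {'params': model.fc.parameters(), 'lr': 5 * base_lr},
], momentum=0.9)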

We observe that the performance of the models on the NSFW classification task is related to their performance on the ImageNet classification task: a better pretrained model helps the fine-tuned classifier. The graph below shows the relative performance on our held-out NSFW evaluation set. Please note that the false positive rate (FPR) at a fixed false negative rate (FNR) shown in the graph is specific to our evaluation dataset, and is shown here for illustrative purposes. To use the models for NSFW filtering, we suggest that you plot the ROC curve using your own dataset and pick a suitable threshold.

Comparison of performance of models on Imagenet and their counterparts fine-tuned on NSFW dataset.
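If you follow that suggestion with scikit-learn, computing the ROC curve from your own labeled validation set and picking an operating point takes only a few lines; the arrays below are placeholders for your own ground-truth labels and model scores.

import numpy as np
from sklearn.metrics import roc_curve

# y_true: 1 = NSFW, 0 = SFW ground truth; scores: model outputs in [0, 1].
y_true = np.array([1, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.9, 0.2, 0.7, 0.4, 0.95, 0.1, 0.6, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Pick the lowest threshold whose false positive rate stays under 10%.
candidates = thresholds[fpr <= 0.10]
print('chosen threshold:', candidates[-1] if len(candidates) else None)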

We are releasing the thin ResNet-50 model, since it provides a good tradeoff in terms of accuracy, and is lightweight in terms of runtime (< 0.5 sec on CPU) and memory (~23 MB). Please refer to our git repository for instructions and usage of our model. We encourage developers to try the model for their NSFW filtering use cases. For any questions or feedback about the performance of the model, we encourage creating an issue and we will respond ASAP.

Results can be improved by fine-tuning the model for your dataset or use case. If you achieve improved performance or you have trained an NSFW model with a different architecture, we encourage you to contribute to the model or share the link on our description page.

Disclaimer: The definition of NSFW is subjective and contextual. This model is a general-purpose reference model, which can be used for the preliminary filtering of pornographic images. We do not provide guarantees of accuracy of output; rather, we make this available for developers to explore and enhance as an open source project.

We would like to thank Sachin Farfade, Amar Ramesh Kamat, Armin Kappeler, and Shraddha Advani for their contributions in this work.

References:

[1] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition.” arXiv preprint arXiv:1512.03385 (2015).

[2] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).

[3] Iandola, Forrest N., Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 1MB model size.” arXiv preprint arXiv:1602.07360 (2016).

[4] He, Kaiming, and Jian Sun. “Convolutional neural networks at constrained time cost.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5353-5360. 2015.

[5] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9. 2015.

[6] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.

Raspberry Pi with cloud vision at Google I/O

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/raspberry-pi-cloud-vision-google-io/

Matt visited Google I/O yesterday, and sent back some pretty incredible pictures. This event looks more like a music festival than a tech conference.

Google I/O

He was sending pictures and excited snippets of text back to Pi Towers all through the event, and then, when he got home, shared this video. I’ve been so excited about it that I’ve had it playing on repeat, and we all thought you’d like to see it too.

This is a demo of a Raspberry Pi robot working with Google’s Cloud Vision API – and it’s got such potential for your projects.

What is Cloud Vision API?

Cloud Vision API provides powerful Image Analytics capabilities as easy to use APIs. It enables application developers to build the next generation of applications that can see and understand the content within the images. The service enables customers to detect a broad set of entities within an image from everyday objects to faces and product logos.

The robot is taking pictures and sending them to the cloud, where they’re analysed and sent back in real time. There’s facial detection – along with detection of what emotion is showing on those faces. And Cloud Vision offers image recognition, so you should be able to get your robot to distinguish limes from green apples. You can then get the robot to act on that data – so you could set it to gather apples and not limes, for example.
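If you want to try the same round trip from a Pi in Python, the Cloud Vision client library keeps it short. This is a generic sketch: it assumes you have a Google Cloud project and credentials set up, the file name is a placeholder, and API details vary slightly across client-library versions.

from google.cloud import vision

client = vision.ImageAnnotatorClient()

# Read an image captured by the Pi camera (placeholder file name).
with open('capture.jpg', 'rb') as f:
    image = vision.Image(content=f.read())

# Label detection: what objects are in the frame?
labels = client.label_detection(image=image).label_annotations
for label in labels:
    print(label.description, round(label.score, 2))

# Face detection, including the likelihood of emotions such as joy.
faces = client.face_detection(image=image).face_annotations
for face in faces:
    print('joy:', face.joy_likelihood)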

Cloud vision on a Pi robot

We’re pretty excited about the opportunities this API offers makers of all kinds of Raspberry Pi devices. You can learn more here – please let us know if you start integrating it into your own projects!

 

The post Raspberry Pi with cloud vision at Google I/O appeared first on Raspberry Pi.