Tag Archives: machine learning

Architecture Monthly Magazine for July: Machine Learning

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-for-july-machine-learning/

Every month, AWS publishes the AWS Architecture Monthly Magazine (available for free on Kindle and Flipboard) that curates some of the best technical and video content from around AWS.

In the June edition, we offered several pieces of content related to Internet of Things (IoT). This month we’re talking about artificial intelligence (AI), namely machine learning.

Machine Learning: Let’s Get it Started

Alan Turing, the British mathematician whose life and work was documented in the movie The Imitation Game, was a pioneer of theoretical computer science and AI. He was the first to put forth the idea that machines can think.

Jump ahead 80 years to this month when researchers asked four-time World Poker Tour title holder Darren Elias to play Texas Hold’em with Pluribus, a poker-playing bot (actually, five of these bots were at the table). Pluribus learns by playing against itself over and over and remembering which strategies worked best. The bot became world-class-level poker player in a matter of days. Read about it in the journal Science.

If AI is making a machine more human, AI’s subset, machine learning, involves the techniques that allow these machines to make sense of the data we feed them. Machine learning is mimicking how humans learn, and Pluribus is actually learning from itself.

From self-driving cars, medical diagnostics, and facial recognition to our helpful (and sometimes nosy) pals Siri, Alexa, and Cortana, all these smart machines are constantly improving from the moment we unbox them. We humans are teaching the machines to think like us.

For July’s magazine, we assembled architectural best practices about machine learning from all over AWS, and we’ve made sure that a broad audience can appreciate it.

  • Interview: Mahendra Bairagi, Solutions Architect, Artificial Intelligence
  • Training: Getting in the Voice Mindset
  • Quick Start: Predictive Data Science with Amazon SageMaker and a Data Lake on AWS
  • Blog post: Amazon SageMaker Neo Helps Detect Objects and Classify Images on Edge Devices
  • Solution: Fraud Detection Using Machine Learning
  • Video: Viz.ai Uses Deep Learning to Analyze CT Scans and Save Lives
  • Whitepaper: Power Machine Learning at Scale

We hope you find this edition of Architecture Monthly useful, and we’d like your feedback. Please give us a star rating and your comments on Amazon. You can also reach out to [email protected] anytime. Check back in a month to discover what the August magazine will offer.

Catwalk: Serving Machine Learning Models at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-serving-machine-learning-models-at-scale

Introduction

Grab’s unwavering ambition is to be the best Super App in Southeast Asia that adds value to the everyday for our customers. In order to achieve that, the customer experience must be flawless for each and every Grab service. Let’s take our frequently used ride-hailing service as an example. We want fair pricing for both drivers and passengers, accurate estimation of ETAs, effective detection of fraudulent activities, and ensured ride safety for our customers. The key to perfecting these customer journeys is artificial intelligence (AI).

Grab has a tremendous amount of data that we can leverage to solve complex problems such as fraudulent user activity, and to provide our customers personalized experiences on our products. One of the tools we are using to make sense of this data is machine learning (ML).

As Grab made giant strides towards increasingly using machine learning across the organization, more and more teams were organically building model serving solutions for their own use cases. Unfortunately, these model serving solutions required data scientists to understand the infrastructure underlying them. Moreover, there was a lot of overlap in the effort it took to build these model serving solutions.

That’s why we came up with Catwalk: an easy-to-use, self-serve, machine learning model serving platform for everyone at Grab.

Goals

To determine what we wanted Catwalk to do, we first looked at the typical workflow of our target audience – data scientists at Grab:

  • Build a trained model to solve a problem.
  • Deploy the model to their project’s particular serving solution. If this involves writing to a database, then the data scientists need to programmatically obtain the outputs, and write them to the database. If this involves running the model on a server, the data scientists require a deep understanding of how the server scales and works internally to ensure that the model behaves as expected.
  • Use the deployed model to serve users, and obtain feedback such as user interaction data. Retrain the model using this data to make it more accurate.
  • Deploy the retrained model as a new version.
  • Use monitoring and logging to check the performance of the new version. If the new version is misbehaving, revert back to the old version so that production traffic is not affected. Otherwise run an AB test between the new version and the previous one.

We discovered an obvious pain point – the process of deploying models requires additional effort and attention, which results in data scientists being distracted from their problem at hand. Apart from that, having many data scientists build and maintain their own serving solutions meant there was a lot of duplicated effort. With Grab increasingly adopting machine learning, this was a state of affairs that could not be allowed to continue.

To address the problems, we came up with Catwalk with goals to:

  1. Abstract away the complexities and expose a minimal interface for data scientists
  2. Prevent duplication of effort by creating an ML model serving platform for everyone in Grab
  3. Create a highly performant, highly available, model versioning supported ML model serving platform and integrate it with existing monitoring systems at Grab
  4. Shorten time to market by making model deployment self-service

What is Catwalk?

In a nutshell, Catwalk is a platform where we run Tensorflow Serving containers on a Kubernetes cluster integrated with the observability stack used at Grab.

In the next sections, we are going to explain the two main components in Catwalk – Tensorflow Serving and Kubernetes, and how they help us obtain our outlined goals.

What is Tensorflow Serving?

Tensorflow Serving is an open-source ML model serving project by Google. In Google’s own words, “Tensorflow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Tensorflow Serving provides out-of-the-box integration with Tensorflow models, but can be easily extended to serve other types of models and data.”

Why Tensorflow Serving?

There are a number of ML model serving platforms in the market right now. We chose Tensorflow Serving because of these three reasons, ordered by priority:

  1. Highly performant. It has proven performance handling tens of millions of inferences per second at Google according to their website.
  2. Highly available. It has a model versioning system to make sure there is always a healthy version being served while loading a new version into its memory
  3. Actively maintained by the developer community and backed by Google

Even though, by default, Tensorflow Serving only supports models built with Tensorflow, this is not a constraint, though, because Grab is actively moving toward using Tensorflow.

How are we using Tensorflow Serving?

In this section, we will explain how we are using Tensorflow Serving and how it helps abstract away complexities for data scientists.

Here are the steps showing how we are using Tensorflow Serving to serve a trained model:

  1. Data scientists export the model using tf.saved_model API and drop it to an S3 models bucket. The exported model is a folder containing model files that can be loaded to Tensorflow Serving.
  2. Data scientists are granted permission to manage their folder.
  3. We run Tensorflow Serving and point it to load the model files directly from the S3 models bucket. Tensorflow Serving supports loading models directly from S3 out of the box. The model is served!
  4. Data scientists come up with a retrained model. They export and upload it to their model folder.
  5. As Tensorflow Serving keeps watching the S3 models bucket for new models, it automatically loads the retrained model and serves. Depending on the model configuration, it can either gracefully replace the running model version with a newer version or serve multiple versions at the same time.
Tensorflow Serving Diagram

The only interface to data scientists is a path to their model folder in the S3 models bucket. To update their model, they upload exported models to their folder and the models will automatically be served. The complexities are gone. We’ve achieved one of the goals!

Well, not really…

Imagine youare going to run Tensorflow Serving to serve one model in a cloud provider, which means you  need a compute resource from a cloud provider to run it. Running it on one box doesn’t provide high availability, so you need another box running the same model. Auto scaling is also needed in order to scale out based on the traffic. On top of these many boxes lies a load balancer. The load balancer evenly spreads incoming traffic to all the boxes, thus ensuring that there is a single point of entry for any clients, which can be abstracted away from the horizontal scaling. The load balancer also exposes an HTTP endpoint to external users. As a result, we form a Tensorflow Serving cluster that is ready to serve.

Next, imagine you have more models to deploy. You have three options

  1. Load the models into the existing cluster – having one cluster serve all models.
  2. Spin up a new cluster to serve each model – having multiple clusters, one cluster serves one model.
  3. Combination of 1 and 2 – having multiple clusters, one cluster serves a few models.

The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.

The second option will definitely work but it doesn’t sound like an effective process, as you need to create a set of resources every time you have a new model to deploy. Additionally, how do you optimize the usage of resources, e.g., there might be unutilized resources in your clusters that could potentially be shared by the rest.

The third option looks promising, you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is you have to manuallymanage it. Managing 100 models using 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause a problem as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other model won’t be able to serve anymore.

Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that  is exactly what Kubernetes is meant to do!

So what is Kubernetes?

Kubernetes abstracts a cluster of physical/virtual hosts (such as EC2) into a cluster of logical hosts (pods in Kubernetes terms). It provides a container-centric management environment. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads.

Let’s look at some of the definitions of Kubernetes resources

Tensorflow Serving Diagram
  • Cluster – a cluster of nodes running Kubernetes.
  • Node – a node inside a cluster.
  • Deployment – a configuration to instruct Kubernetes the desired state of an application. It also takes care of rolling out an update (canary, percentage rollout, etc), rolling back and horizontal scaling.
  • Pod – a single processing unit. In our case, Tensorflow Serving will be running as a container in a pod. Pod can have CPU/memory limits defined.
  • Service – an abstraction layer that abstracts out a group of pods and exposes the application to clients.
  • Ingress – a collection of routing rules that govern how external users access services running in a cluster.
  • Ingress Controller – a controller responsible for reading the ingress information and processing that data accordingly such as creating a cloud-provider load balancer or spinning up a new pod as a load balancer using the rules defined in the ingress resource.

Essentially, we deploy resources to instruct Kubernetes the desired state of our application and Kubernetes will make sure that it is always the case.

How are we using Kubernetes?

In this section, we will walk you through how we deploy Tensorflow Serving in Kubernetes cluster and how it makes managing model deployments very convenient.

We used a managed Kubernetes service, to create a Kubernetes cluster and manually provisioned compute resources as nodes. As a result, we have a Kubernetes cluster with nodes that are ready to run applications.

An application to serve one model consists of

  1. Two or more Tensorflow Serving pods that serves a model with an autoscaler to scale pods based on resource consumption
  2. A load balancer to evenly spread incoming traffic to pods
  3. An exposed HTTP endpoint to external users

In order to deploy the application, we need to

  1. Deploy a deployment resource specifying
  2. Number of pods of Tensorflow Serving
  3. An S3 url for Tensorflow Serving to load model files
  4. Deploy a service resource to expose it
  5. Deploy an ingress resource to define an HTTP endpoint url

Kubernetes then allocates Tensorflow Serving pods to the cluster with the number of pods according to the value defined in deployment resource. Pods can be allocated to any node inside the cluster, Kubernetes makes sure that the node it allocates a pod into has sufficient resources that the pod needs. In case there is no node that has sufficient resources, we can easily scale out the cluster by adding new nodes into it.

In order for the rules defined inthe ingressresource to work, the cluster must have an ingress controller running, which is what guided our choice of the load balancer. What an ingress controller does is simple: it keeps checking the ingressresource, creates a load balancer and defines rules based on rules in the ingressresource. Once the load balancer is configured, it will be able to redirect incoming requests to the Tensorflow Serving pods.

That’s it! We have a scalable Tensorflow Serving application that serves a model through a load balancer! In order to serve another model, all we need to do is to deploy the same set of resources but with the model’s S3 url and HTTP endpoint.

To illustrate what is running inside the cluster, let’s see how it looks like when we deploy two applications: one for serving pricing model another one for serving fraud-check model. Each application is configured to have two Tensorflow Serving pods and exposed at /v1/models/model

Tensorflow Serving Diagram

There are two Tensorflow Serving pods that serve fraud-check model and exposed through a load balancer. Same for the pricing model, the only differences are the model it is serving and the exposed HTTP endpoint url. The load balancer rules for pricing and fraud-check model look like this

IfThen forward to
Path is /v1/models/pricingpricing pod ip-1
pricing pod ip-2
Path is /v1/models/fraud-checkfraud-check pod ip-1
fraud-check pod ip-2

Stats and Logs

The last piece is how stats and logs work. Before getting to that, we need to introduce DaemonSet. According to the document, DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created.

We deployed datadog-agent and filebeat as a DaemonSet. As a result, we always have one datadog-agent pod and one filebeat pod in all nodes and they are accessible from Tensorflow Serving pods in the same node. Tensorflow Serving pods emit a stats event for every request to datadog-agent pod in the node it is running in.

Here is a sample of DataDog stats:

DataDog stats

And logs that we put in place:

Logs

Benefits Gained from Catwalk

Catwalk has become the go-to, centralized system to serve machine learning models. Data scientists are not required to take care of the serving infrastructure hence they can focus on what matters the most: come up with models to solve customer problems. They are only required to provide exported model files and estimation of expected traffic in order to prepare sufficient resources to run their model. In return, they are presented with an endpoint to make inference calls to their model, along with all necessary tools for monitoring and debugging. Updating the model version is self-service, and the model improvement cycle is much shorter than before. We used to count in days, we now count in minutes.

Future Plans

Improvement on Automation

Currently, the first deployment of any model will still need some manual task from the platform team. We aim to automate this processentirely. We’ll work with our awesome CI/CD team who is making the best use of Spinnaker.

Model serving on mobile devices

As a platform, we are looking at setting standards for model serving across Grab. This includes model serving on mobile devices as well. Tensorflow Serving also provides a Lite version to be used on mobile devices. It is a whole new paradigm with vastly different tradeoffs for machine learning practitioners. We are quite excited to set some best practices in this area.

gRPC support

Catwalk currently supports HTTP/1.1. We’ll hook Grab’s service discovery mechanism to open gRPC traffic, which TFS already supports.

If you are interested in building pipelines for machine learning related topics, and you share our vision of driving South East Asia forward, come join us!

Predictive CPU isolation of containers at Netflix

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/predictive-cpu-isolation-of-containers-at-netflix-91f014d856c7?source=rss----2615bd06b42e---4

By Benoit Rostykus, Gabriel Hartmann

Noisy Neighbors

We’ve all had noisy neighbors at one point in our life. Whether it’s at a cafe or through a wall of an apartment, it is always disruptive. The need for good manners in shared spaces turns out to be important not just for people, but for your Docker containers too.

When you’re running in the cloud your containers are in a shared space; in particular they share the CPU’s memory hierarchy of the host instance.

Because microprocessors are so fast, computer architecture design has evolved towards adding various levels of caching between compute units and the main memory, in order to hide the latency of bringing the bits to the brains. However, the key insight here is that these caches are partially shared among the CPUs, which means that perfect performance isolation of co-hosted containers is not possible. If the container running on the core next to your container suddenly decides to fetch a lot of data from the RAM, it will inevitably result in more cache misses for you (and hence a potential performance degradation).

Linux to the rescue?

Traditionally it has been the responsibility of the operating system’s task scheduler to mitigate this performance isolation problem. In Linux, the current mainstream solution is CFS (Completely Fair Scheduler). Its goal is to assign running processes to time slices of the CPU in a “fair” way.

CFS is widely used and therefore well tested and Linux machines around the world run with reasonable performance. So why mess with it? As it turns out, for the large majority of Netflix use cases, its performance is far from optimal. Titus is Netflix’s container platform. Every month, we run millions of containers on thousands of machines on Titus, serving hundreds of internal applications and customers. These applications range from critical low-latency services powering our customer-facing video streaming service, to batch jobs for encoding or machine learning. Maintaining performance isolation between these different applications is critical to ensuring a good experience for internal and external customers.

We were able to meaningfully improve both the predictability and performance of these containers by taking some of the CPU isolation responsibility away from the operating system and moving towards a data driven solution involving combinatorial optimization and machine learning.

The idea

CFS operates by very frequently (every few microseconds) applying a set of heuristics which encapsulate a general concept of best practices around CPU hardware use.

Instead, what if we reduced the frequency of interventions (to every few seconds) but made better data-driven decisions regarding the allocation of processes to compute resources in order to minimize collocation noise?

One traditional way of mitigating CFS performance issues is for application owners to manually cooperate through the use of core pinning or nice values. However, we can automatically make better global decisions by detecting collocation opportunities based on actual usage information. For example if we predict that container A is going to become very CPU intensive soon, then maybe we should run it on a different NUMA socket than container B which is very latency-sensitive. This avoids thrashing caches too much for B and evens out the pressure on the L3 caches of the machine.

Optimizing placements through combinatorial optimization

What the OS task scheduler is doing is essentially solving a resource allocation problem: I have X threads to run but only Y CPUs available, how do I allocate the threads to the CPUs to give the illusion of concurrency?

As an illustrative example, let’s consider a toy instance of 16 hyperthreads. It has 8 physical hyperthreaded cores, split on 2 NUMA sockets. Each hyperthread shares its L1 and L2 caches with its neighbor, and shares its L3 cache with the 7 other hyperthreads on the socket:

If we want to run container A on 4 threads and container B on 2 threads on this instance, we can look at what “bad” and “good” placement decisions look like:

The first placement is intuitively bad because we potentially create collocation noise between A and B on the first 2 cores through their L1/L2 caches, and on the socket through the L3 cache while leaving a whole socket empty. The second placement looks better as each CPU is given its own L1/L2 caches, and we make better use of the two L3 caches available.

Resource allocation problems can be efficiently solved through a branch of mathematics called combinatorial optimization, used for example for airline scheduling or logistics problems.

We formulate the problem as a Mixed Integer Program (MIP). Given a set of K containers each requesting a specific number of CPUs on an instance possessing d threads, the goal is to find a binary assignment matrix M of size (d, K) such that each container gets the number of CPUs it requested. The loss function and constraints contain various terms expressing a priori good placement decisions such as:

  • avoid spreading a container across multiple NUMA sockets (to avoid potentially slow cross-sockets memory accesses or page migrations)
  • don’t use hyper-threads unless you need to (to reduce L1/L2 thrashing)
  • try to even out pressure on the L3 caches (based on potential measurements of the container’s hardware usage)
  • don’t shuffle things too much between placement decisions

Given the low-latency and low-compute requirements of the system (we certainly don’t want to spend too many CPU cycles figuring out how containers should use CPU cycles!), can we actually make this work in practice?

Implementation

We decided to implement the strategy through Linux cgroups since they are fully supported by CFS, by modifying each container’s cpuset cgroup based on the desired mapping of containers to hyper-threads. In this way a user-space process defines a “fence” within which CFS operates for each container. In effect we remove the impact of CFS heuristics on performance isolation while retaining its core scheduling capabilities.

This user-space process is a Titus subsystem called titus-isolate which works as follows. On each instance, we define three events that trigger a placement optimization:

  • add: A new container was allocated by the Titus scheduler to this instance and needs to be run
  • remove: A running container just finished
  • rebalance: CPU usage may have changed in the containers so we should reevaluate our placement decisions

We periodically enqueue rebalance events when no other event has recently triggered a placement decision.

Every time a placement event is triggered, titus-isolate queries a remote optimization service (running as a Titus service, hence also isolating itself… turtles all the way down) which solves the container-to-threads placement problem.

This service then queries a local GBRT model (retrained every couple of hours on weeks of data collected from the whole Titus platform) predicting the P95 CPU usage of each container in the coming 10 minutes (conditional quantile regression). The model contains both contextual features (metadata associated with the container: who launched it, image, memory and network configuration, app name…) as well as time-series features extracted from the last hour of historical CPU usage of the container collected regularly by the host from the kernel CPU accounting controller.

The predictions are then fed into a MIP which is solved on the fly. We’re using cvxpy as a nice generic symbolic front-end to represent the problem which can then be fed into various open-source or proprietary MIP solver backends. Since MIPs are NP-hard, some care needs to be taken. We impose a hard time budget to the solver to drive the branch-and-cut strategy into a low-latency regime, with guardrails around the MIP gap to control overall quality of the solution found.

The service then returns the placement decision to the host, which executes it by modifying the cpusets of the containers.

For example, at any moment in time, an r4.16xlarge with 64 logical CPUs might look like this (the color scale represents CPU usage):

Results

The first version of the system led to surprisingly good results. We reduced overall runtime of batch jobs by multiple percent on average while most importantly reducing job runtime variance (a reasonable proxy for isolation), as illustrated below. Here we see a real-world batch job runtime distribution with and without improved isolation:

Notice how we mostly made the problem of long-running outliers disappear. The right-tail of unlucky noisy-neighbors runs is now gone.

For services, the gains were even more impressive. One specific Titus middleware service serving the Netflix streaming service saw a capacity reduction of 13% (a decrease of more than 1000 containers) needed at peak traffic to serve the same load with the required P99 latency SLA! We also noticed a sharp reduction of the CPU usage on the machines, since far less time was spent by the kernel in cache invalidation logic. Our containers are now more predictable, faster and the machine is less used! It’s not often that you can have your cake and eat it too.

Next Steps

We are excited with the strides made so far in this area. We are working on multiple fronts to extend the solution presented here.

We want to extend the system to support CPU oversubscription. Most of our users have challenges knowing how to properly size the numbers of CPUs their app needs. And in fact, this number varies during the lifetime of their containers. Since we already predict future CPU usage of the containers, we want to automatically detect and reclaim unused resources. For example, one could decide to auto-assign a specific container to a shared cgroup of underutilized CPUs, to better improve overall isolation and machine utilization, if we can detect the sensitivity threshold of our users along the various axes of the following graph.

We also want to leverage kernel PMC events to more directly optimize for minimal cache noise. One possible avenue is to use the Intel based bare metal instances recently introduced by Amazon that allow deep access to performance analysis tools. We could then feed this information directly into the optimization engine to move towards a more supervised learning approach. This would require a proper continuous randomization of the placements to collect unbiased counterfactuals, so we could build some sort of interference model (“what would be the performance of container A in the next minute, if I were to colocate one of its threads on the same core as container B, knowing that there’s also C running on the same socket right now?”).

Conclusion

If any of this piques your interest, reach out to us! We’re looking for ML engineers to help us push the boundary of containers performance and “machine learning for systems” and systems engineers for our core infrastructure and compute platform.


Predictive CPU isolation of containers at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Lerner — using RL agents for test case scheduling

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/lerner-using-rl-agents-for-test-case-scheduling-3e0686211198?source=rss----2615bd06b42e---4

Lerner — using RL agents for test case scheduling

By: Stanislav Kirdey, Kevin Cureton, Scott Rick, Sankar Ramanathan

Introduction

Netflix brings delightful customer experiences to homes on a variety of devices that continues to grow each day. The device ecosystem is rich with partners ranging from Silicon-on-Chip (SoC) manufacturers, Original Design Manufacturer (ODM) and Original Equipment Manufacturer (OEM) vendors.

Partners across the globe leverage Netflix device certification process on a continual basis to ensure that quality products and experiences are delivered to their customers. The certification process involves the verification of partner’s implementation of features provided by the Netflix SDK.

The Partner Device Ecosystem organization in Netflix is responsible for ensuring successful integration and testing of the Netflix application on all partner devices. Netflix engineers run a series of tests and benchmarks to validate the device across multiple dimensions including compatibility of the device with the Netflix SDK, device performance, audio-video playback quality, license handling, encryption and security. All this leads to a plethora of test cases, most of them automated, that need to be executed to validate the functionality of a device running Netflix.

Problem

With a collection of tests that, by nature, are time consuming to run and sometimes require manual intervention, we need to prioritize and schedule test executions in a way that will expedite detection of test failures. There are several problems efficient test scheduling could help us solve:

  1. Quickly detect a regression in the integration of the Netflix SDK on a consumer electronic or MVPD (multichannel video programming distributor) device.
  2. Detect a regression in a test case. Using the Netflix Reference Application and known good devices, ensure the test case continues to function and tests what is expected.
  3. When code many test cases are dependent on has changed, choose the right test cases among thousands of affected tests to quickly validate the change before committing it and running extensive, and expensive, tests.
  4. Choose the most promising subset of tests out of thousands of test cases available when running continuous integration against a device.
  5. Recommend a set of test cases to execute against the device that would increase the probability of failing the device in real-time.

Solving the above problems could help Netflix and our Partners save time and money during the entire lifecycle of device design, build, test, and certification.

These problems could be solved in several different ways. In our quest to be objective, scientific, and inline with the Netflix philosophy of using data to drive solutions for intriguing problems, we proceeded by leveraging machine learning.

Our inspiration was the findings in a research paper “Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration” by Helge Spieker, et. al. We thought that reinforcement learning would be a promising approach that could provide great flexibility in the training process. Likewise it has very low requirements on the initial amount of training data.

In the case of continuously testing a Netflix SDK integration on a new device, we usually lack relevant data for model training in the early phases of integration. In this situation training an agent is a great fit as it allows us to start with very little input data and let the agent explore and exploit the patterns it learns in the process of SDK integration and regression testing. The agent in reinforcement learning is an entity that performs a decision on what action to take considering the current state of the environment, and gets a reward based on the quality of the action.

Solution

We built a system called Lerner that consists of a set of microservices and a python library that allows scalable agent training and inference for test case scheduling. We also provide an API client in Python.

Lerner works in tandem with our continuous integration framework that executes on-device tests using the Netflix Test Studio platform. Tests are run on Netflix Reference Applications (running as containers on Titus), as well as on physical devices.

There were several motivations that led to building a custom solution:

  1. We wanted to keep the APIs and integrations as simple as possible.
  2. We needed a way to run agents and tie the runs to the internal infrastructure for analytics, reporting, and visualizations.
  3. We wanted the to tool be available as a standalone library as well as scalable API service.

Lerner provides ability to setup any number of agents making it the first component in our re-usable reinforcement learning framework for device certification.

Lerner, as a web-service, relies on Amazon Web Services (AWS) and Netflix’s Open Source Software (OSS) tools. We use Spinnaker to deploy instances and host the API containers on Titus — which allows fast deployment times and rapid scalability. Lerner uses AWS services to store binary versions of the agents, agent configurations, and training data. To maintain the quality of Lerner APIs, we are using the server-less paradigm for Lerner’s own integration testing by utilizing AWS Lambda.

The agent training library is written in Python and supports versions 2.7, 3.5, 3.6, and 3.7. The library is available in the artifactory repository for easy installation. It can be used in Python notebooks — allowing for rapid experimentation in isolated environments without a need to perform API calls. The agent training library exposes different types of learning agents that utilize neural networks to approximate action.

The neural network (NN)-based agent uses a deep net with fully connected layers. The NN gets the state of a particular test case (the input) and outputs a continuous value, where a higher number means an earlier position in a test execution schedule. The inputs to the neural network include: general historical features such as the last N executions and several domain specific features that provide meta-information about a test case.

The Lerner APIs are split into three areas:

  1. Storing execution results.
  2. Getting recommendations based on the current state of the environment.
  3. Assign reward to the agent based on the execution result and predicted recommendations.

A process of getting recommendations and rewarding the agent using APIs consists of 4 steps:

  1. Out of all available test cases for a particular job — form a request that can be interpreted by Lerner. This involves aggregation of historical results and additional features.
  2. Lerner returns a recommendation identified with a unique episode id.
  3. A CI system can execute the recommendation and submit the execution results to Lerner based on the episode id.
  4. Call an API to assign a reward based on the agent id and episode id.

Below is a diagram of the services and persistence layers that support the functionality of the Lerner API.

The self-service nature of the tool makes it easy for service owners to integrate with Lerner, create agents, ask agents for recommendations and reward them after execution results are available.

The metrics relevant to the training and recommendation process are reported to Atlas and visualized using Netflix’s Lumen. Users of the service can track the statistics specific to the agents they setup and deploy, which allows them to build their own dashboards.

We have identified some interesting patterns while doing online reinforcement learning.

  • The recommendation/execution reward cycle can happen without any prior training data.
  • We can bootstrap several CI jobs that would use agents with different reward functions, and gain additional insight based on agents performance. It could help us design and implement more targeted reward functions.
  • We can keep a small amount of historical data to train agents. The data can be truncated after each execution and offloaded to a long-term storage for further analysis.

Some of the downsides:

  • It might take time for an agent to stop exploring and start exploiting the accumulated experience.
  • As agents stored in a binary format in the database, an update of an agent from multiple jobs could cause a race condition in its state. Handling concurrency in the training process is cumbersome and requires trade offs. We achieved the desired state by relying on the locking mechanisms of the underlying persistence layer that stores and serves agent binaries.

Thus, we have the luxury of training as many agents as we want that could prioritize and recommend test cases based on their unique learning experiences.

Outcome

We are currently piloting the system and have live agents serving predictions for various CI runs. At the moment we run Lerner-based CIs in parallel with CIs that either execute test cases in random order or use simple heuristics as sorting test cases by time and execute everything that previously failed.

The system was built with simplicity and performance in mind, so the set of APIs are minimal. We developed client libraries that allow seamless, but opinionated, integration with Lerner.

We collect several metrics to evaluate the performance of a recommendation, with main metrics being time taken to first failure and time taken to complete a whole scheduled run.

Lerner-based recommendations are proving to be different and more insightful than random runs, as they allow us to fit a particular time budget and detect patterns such as cases that tend to fail together in a cluster, cases that haven’t been run in a long time, and so on.

The below graphs shows more or less an artificial case when a schedule of 100+ test cases would contain several flaky tests. The Y-axis represents how many minutes it took to complete the schedule or reach a first failed test case. In blue, we have random recommendations with no time budget constraints. In green you can see executions based on Lerner recommendations under a time constraint of 60 minutes. The green spikes represent Lerner exploring the environment, where the wiggly lines around 0 are the executions that failed quickly as Lerner was exploiting its policy.

Execution of schedules that were randomly generated. Y-axis represents time to finish execution or reach first failure.
Execution of Lerner based schedules. You can see moments when Lerner was exploring the environment, and the wiggly lines represent when the schedule was generated based on exploiting existing knowledge.

Next Steps

The next phases of the project will focus on:

  • Reward functions that are aware of a comprehensive domain context, such as assigning appropriate rewards to states where infrastructure is fragile and test case could not be run appropriately.
  • Administrative user-interface to manage agents.
  • More generic, simple, and user-friendly framework for reinforcement learning and agent deployment.
  • Using Lerner on all available CIs jobs against all SDK versions.
  • Experiment with different neural network architectures.

If you would like to be a part of our team, come join us.


Lerner — using RL agents for test case scheduling was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scheduling GPUs for deep learning tasks on Amazon ECS

Post Syndicated from Anuneet Kumar original https://aws.amazon.com/blogs/compute/scheduling-gpus-for-deep-learning-tasks-on-amazon-ecs/

This post is contributed by Brent Langston – Sr. Developer Advocate, Amazon Container Services

Last week, AWS announced enhanced Amazon Elastic Container Service (Amazon ECS) support for GPU-enabled EC2 instances. This means that now GPUs are first class resources that can be requested in your task definition, and scheduled on your cluster by ECS.

Previously, to schedule a GPU workload, you had to maintain your own custom configured AMI, with a custom configured Docker runtime. You also had to use custom vCPU logic as a stand-in for assigning your GPU workloads to GPU instances. Even when all that was in place, there was still no pinning of a GPU to a task. One task might consume more GPU resources than it should. This could cause other tasks to not have a GPU available.

Now, AWS maintains an ECS-optimized AMI that includes the correct NVIDIA drivers and Docker customizations. You can use this AMI to provision your GPU workloads. With this enhancement, GPUs can also be requested directly in the task definition. Like allocating CPU or RAM to a task, now you can explicitly request a number of GPUs to be allocated to your task. The scheduler looks for matching resources on the cluster to place those tasks. The GPUs are pinned to the task for as long as the task is running, and can’t be allocated to any other tasks.

I thought I’d see how easy it is to deploy GPU workloads to my ECS cluster. I’m working in the US-EAST-2 (Ohio) region, from my AWS Cloud9 IDE, so these commands work for Amazon Linux. Feel free to adapt to your environment as necessary.

If you’d like to run this example yourself, you can find all the code in this GitHub repo. If you run this example in your own account, be aware of the instance pricing, and clean up your resources when your experiment is complete.

Clone the repo using the following command:

git clone https://github.com/brentley/tensorflow-container.git

Setup

You need the latest version of the AWS CLI (for this post, I used 1.16.98):

echo “export PATH=$HOME/.local/bin:$HOME/bin:$PATH” >> ~/.bash_profile
source ~/.bash_profile
pip install --user -U awscli

Provision an ECS cluster, with two C5 instances, and two P3 instances:

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --capabilities CAPABILITY_IAM                            

While AWS CloudFormation is provisioning resources, examine the template used to build your infrastructure. Open `cluster-cpu-gpu.yml`, and you see that you are provisioning a test VPC with two c5.2xlarge instances, and two p3.2xlarge instances. This gives you one NVIDIA Tesla V100 GPU per instance, for a total of two GPUs to run training tasks.

I adapted the TensorFlow benchmark Docker container to create a training workload. I use this container to compare the GPU scheduling and runtime.

When the CloudFormation stack is deployed, register a task definition with the ECS service:

aws ecs register-task-definition --cli-input-json file://gpu-1-taskdef.json

To request GPU resources in the task definition, the only change needed is to include a GPU resource requirement in the container definition:

            "resourceRequirements": [
                {
                    "type": "GPU",
                    "value": "1"
                }
            ],

Including this resource requirement ensures that the ECS scheduler allocates the task to an instance with a free GPU resource.

Launch a single-GPU training workload

Now you’re ready to launch the first GPU workload.

export cluster=$(aws cloudformation describe-stacks --stack-name tensorflow-test --query 
'Stacks[0].Outputs[?OutputKey==`ClusterName`].OutputValue' --output text) 
echo $cluster
aws ecs run-task --cluster $cluster --task-definition tensorflow-1-gpu

When you launch the task, the output shows the `guIds` values that are assigned to the task. This GPU is pinned to this task, and can’t be shared with any other tasks. If all GPUs are allocated, you can’t schedule additional GPU tasks until a running task with a GPU completes. That frees the GPU to be scheduled again.

When you look at the log output in Amazon CloudWatch Logs, you see that the container discovered one GPU: `/gpu0` and the training benchmark trained at a rate of 321.16 images/sec.

With your two p3.2xlarge nodes in the cluster, you are limited to two concurrent single GPU based workloads. To scale horizontally, you could add additional p3.2xlarge nodes. This would limit your workloads to a single GPU each.  To scale vertically, you could bump up the instance type,  which would allow you to assign multiple GPUs to a single task.  Now, let’s see how fast your TensorFlow container can train when assigned multiple GPUs.

Launch a multiple-GPU training workload

To begin, replace the p3.2xlarge instances with p3.16xlarge instances. This gives your cluster two instances that each have eight GPUs, for a total of 16 GPUs that can be allocated.

aws cloudformation deploy --stack-name tensorflow-test --template-file cluster-cpu-gpu.yml --parameter-overrides GPUInstanceType=p3.16xlarge --capabilities CAPABILITY_IAM

When the CloudFormation deploy is complete, register two more task definitions to launch your benchmark container requesting more GPUs:

aws ecs register-task-definition --cli-input-json file://gpu-4-taskdef.json  
aws ecs register-task-definition --cli-input-json file://gpu-8-taskdef.json 

Next, launch two TensorFlow benchmark containers, one requesting four GPUs, and one requesting eight GPUs:

aws ecs run-task --cluster $cluster --task-definition tensorflow-4-gpu
aws ecs run-task --cluster $cluster --task-definition tensorflow-8-gpu

With each task request, GPUs are allocated: four in the first request, and eight in the second request. Again, these GPUs are pinned to the task, and not usable by any other task until these tasks are complete.

Check the log output in CloudWatch Logs:

On the “devices” lines, you can see that the container discovered and used four (or eight) GPUs. Also, the total images/sec improved to 1297.41 with four GPUs, and 1707.23 with eight GPUs.

Because you can pin single or multiple GPUs to a task, running advanced GPU based training tasks on Amazon ECS is easier than ever!

Cleanup

To clean up your running resources, delete the CloudFormation stack:

aws cloudformation delete-stack --stack-name tensorflow-test

Conclusion

For more information, see Working with GPUs on Amazon ECS.

If you want to keep up on the latest container info from AWS, please follow me on Twitter and tweet any questions! @brentContained

Algorithmic and Technological Transparency

Post Syndicated from Bozho original https://techblog.bozho.net/algorithmic-and-technological-transparency/

Today I had a talk on OpenFest about algorithmic and technological transparency. It was a somewhat abstract and high-level talk but I hoped to make it more meaningful than if some random guy just spoke about techno-dystopias.

I don’t really like these kinds of talks where someone who has no idea what a “training set” is, how dirty the data is and how to write five lines in Python talks about an AI revolution. But I think being a technical person let’s me put some details into the high-level stuff and make it less useless.

The problem I’m trying to address is opaque systems – we don’t know how recommendation systems work, why are seeing certain videos or ads, how decisions are taken, what happens with our data and how secure the systems we use are, including our cars and future self-driving cars.

Algorithms have real impact on society – recommendation engines make conspiracy theories mainstream, motivate fascists, create echo chambers. Algorithms decide whether we get loans, welfare, a conviction, or even if we get hit by a car (as in the classical trolley problem).

How do we make the systems more transparent, though? There are many approaches, some overlapping. Textual description of how things operate, tracking the actions of back-office admins and making them provable to third parties, implementing “Why am I seeing this?”, tracking each step in machine learning algorithms, publishing datasets, publishing source code (as I’ve previously discussed open-sourcing some aspects of self-driving cars).

We don’t have to reveal trade secrets and lose competitive advantage in order to be transparent. Transparency doesn’t have to come at the expense of profit. In some cases it can even improve the perception of a company.

I think we should start with defining best practices then move to industry codes and standards, and if that doesn’t work, carefully think of focused regulation (but have in mind politicians are mostly clueless about technology and may do more damage)

The slides from the talk are below (the talk itself was not in English)

It can be seen as our responsibility as engineers to steer our products in that direction. Not all of them require or would benefit from such transparency (rarely anyone cares about how an accounting system works), and it would normally be the product people that can decide what should become more transparent, but we also have a stake and should be morally accountable for what we build.

I hope we start thinking and working towards more transparent systems. And of course, each system has its specifics and ways to become more transparent, but I think it should be a general mode of thinking – how do we make our systems not only better and more secure, but also more understandable so that we don’t become the “masterminds” behind black-boxes that decide lives.

The post Algorithmic and Technological Transparency appeared first on Bozho's tech blog.

Running GPU-Accelerated Kubernetes Workloads on P3 and P2 EC2 Instances with Amazon EKS

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/

This post contributed by Scott Malkie, AWS Solutions Architect

Amazon EC2 P3 and P2 instances, featuring NVIDIA GPUs, power some of the most computationally advanced workloads today, including machine learning (ML), high performance computing (HPC), financial analytics, and video transcoding. Now Amazon Elastic Container Service for Kubernetes (Amazon EKS) supports P3 and P2 instances, making it easy to deploy, manage, and scale GPU-based containerized applications.

This blog post walks through how to start up GPU-powered worker nodes and connect them to an existing Amazon EKS cluster. Then it demonstrates an example application to show how containers can take advantage of all that GPU power!

Prerequisites

You need an existing Amazon EKS cluster, kubectl, and the aws-iam-authenticator set up according to Getting Started with Amazon EKS.

Two steps are required to enable GPU workloads. First, join Amazon EC2 P3 or P2 GPU compute instances as worker nodes to the Kubernetes cluster. Second, configure pods to enable container-level access to the node’s GPUs.

Spinning up Amazon EC2 GPU instances and joining them to an existing Amazon EKS Cluster

To start the worker nodes, use the standard AWS CloudFormation template for Amazon EKS worker nodes, specifying the AMI ID of the new Amazon EKS-optimized AMI for GPU workloads. This AMI is available on AWS Marketplace.

Subscribe to the AMI and then launch it using the AWS CloudFormation template. The template takes care of networking, configuring kubelets, and placing your worker nodes into an Auto Scaling group, as shown in the following image.

This template creates an Auto Scaling group with up to two p3.8xlarge Amazon EC2 GPU instances. Powered by up to eight NVIDIA Tesla V100 GPUs, these instances deliver up to 1 petaflop of mixed-precision performance per instance to significantly accelerate ML and HPC applications. Amazon EC2 P3 instances have been proven to reduce ML training times from days to hours and to reduce time-to-results for HPC.

After the AWS CloudFormation template completes, the Outputs view contains the NodeInstanceRole parameter, as shown in the following image.

NodeInstanceRole needs to be passed in to the AWS Authenticator ConfigMap, as documented in the AWS EKS Getting Started Guide. To do so, edit the ConfigMap template and run the command kubectl apply -f aws-auth-cm.yaml in your terminal to apply the ConfigMap. You can then run kubectl get nodes —watch to watch the two Amazon EC2 GPU instances join the cluster, as shown in the following image.

Configuring Kubernetes pods to access GPU resources

First, use the following command to apply the NVIDIA Kubernetes device plugin as a daemon set on the cluster.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

This command produces the following output:

Once the daemon set is running on the GPU-powered worker nodes, use the following command to verify that each node has allocatable GPUs.

kubectl get nodes \
"-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

The following output shows that each node has four GPUs available:

Next, modify any Kubernetes pod manifests, such as the following one, to take advantage of these GPUs. In general, adding the resources configuration (resources: limits:) to pod manifests gives containers access to one GPU. A pod can have access to all of the GPUs available to the node that it’s running on.

apiVersion: v1
kind: Pod
metadata:
  name: pod-name
spec:
  containers:
  - name: container-name
    ...
    resources:
      limits:
        nvidia.com/gpu: 4

As a more specific example, the following sample manifest displays the results of the nvidia-smi binary, which shows diagnostic information about all GPUs visible to the container.

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:latest
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 4

Download this manifest as nvidia-smi-pod.yaml and launch it with kubectl apply -f nvidia-smi-pod.yaml.

To confirm successful nvidia-smi execution, use the following command to examine the log.

kubectl logs nvidia-smi

The above commands produce the following output:

Existing limitations

  • GPUs cannot be overprovisioned – containers and pods cannot share GPUs
  • The maximum number of GPUs that you can schedule to a pod is capped by the number of GPUs available to that pod’s node
  • Depending on your account, you might have Amazon EC2 service limits on how many and which type of Amazon EC2 GPU compute instances you can launch simultaneously

For more information about GPU support in Kubernetes, see the Kubernetes documentation. For more information about using Amazon EKS, see the Amazon EKS documentation. Guidance setting up and running Amazon EKS can be found in the AWS Workshop for Kubernetes on GitHub.

Please leave any comments about this post and share what you’re working on. I can’t wait to see what you build with GPU-powered workloads on Amazon EKS!

Rock, paper, scissors, lizard, Spock, fire, water balloon!

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/rock-paper-scissors-lizard-spock-fire-water-balloon/

Use a Raspberry Pi and a Pi Camera Module to build your own machine learning–powered rock paper scissors game!

Rock-Paper-Scissors game using computer vision and machine learning on Raspberry Pi

A Rock-Paper-Scissors game using computer vision and machine learning on the Raspberry Pi. Project GitHub page: https://github.com/DrGFreeman/rps-cv PROJECT ORIGIN: This project results from a challenge my son gave me when I was teaching him the basics of computer programming making a simple text based Rock-Paper-Scissors game in Python.

Virtual rock paper scissors

Here’s why you should always leave comments on our blog: this project from Julien de la Bruère-Terreault instantly had our attention when he shared it on our recent Android Things post.

Julien and his son were building a text-based version of rock paper scissors in Python when his son asked him: “Could you make a rock paper scissors game that uses the camera to detect hand gestures?” Obviously, Julien really had no choice but to accept the challenge.

“The game uses a Raspberry Pi computer and Raspberry Pi Camera Module installed on a 3D-printed support with LED strips to achieve consistent images,” Julien explains in the tutorial for the build. “The pictures taken by the camera are processed and fed to an image classifier that determines whether the gesture corresponds to ‘Rock’, ‘Paper’, or ‘Scissors’ gestures.”

How does it work?

Physically, the build uses a Pi 3 Model B and a Camera Module V2 alongside 3D-printed parts. The parts are all green, since a consistent colour allows easy subtraction of background from the captured images. You can download the files for the setup from Thingiverse.

rock paper scissors raspberry pi

To illustrate how the software works, Julien has created a rather delightful pipeline demonstrating where computer vision and machine learning come in.

rock paper scissors using raspberry pi

The way the software works means the game doesn’t need to be limited to the standard three hand signs. If you wanted to, you could add other signs such as ‘lizard’ and ‘Spock’! Or ‘fire’ and ‘water balloon’. Or any other alterations made to the game in your pop culture favourites.

rock paper scissors lizard spock

Check out Julien’s full tutorial to build your own AI-powered rock paper scissors game here on Julien’s GitHub. Massive kudos to Julien for spending a year learning the skills required to make it happen. And a massive thank you to Julien’s son for inspiring him! This is why it’s great to do coding and digital making with kids — they have the best project ideas!

Sharing is caring

If you’ve built your own project using Raspberry Pi, please share it with us in the comments below, or via social media. As you can tell from today’s blog post, we love to see them and share them with the whole community!

The post Rock, paper, scissors, lizard, Spock, fire, water balloon! appeared first on Raspberry Pi.

Astronaut-made virtual co-pilot

Post Syndicated from Liz Upton original https://www.raspberrypi.org/blog/virtual-co-pilot-solar-pilot-guard/

This project features several of our favourite things. Astronauts! Machine learning! High-altitude danger! Graphs! (It could only get slightly better with the addition of tap-dancing centaurs.) Read on to have your nerdliest pleasure centres tickled.

Solar Pilot Guard - wing of a plane in flight

Your interest should be focussed on the strange fin with the red tip. Although we agree the mountains look nice too.

Solar Pilot Guard, a Foale family project

Michael Foale is a former astronaut with dual British/American citizenship; and thanks to that dual citizenship was revered by British kids like me as some kind of Superman when he spent time on the Russian Mir space station back in the 1990s. It’s always great to see one of our heroes using the Raspberry Pi, but it’s doubly great when the use it’s being put to is so very, very cool.

Foale’s daughter Jenna is a PhD candidate in computational fluid dynamics, and together they have engineered a machine-learning system called Solar Pilot Guard to help prevent aircraft crashes, using the Wolfram Language on a Raspberry Pi. A solar-powered probe (that fin in the image above) detects changes in acceleration and air pressure to spot potential loss-of-control (LOC) events in flight, calculating the probability of each pressure/acceleration event representing a possible LOC event.

Solar Pilot Guard schematic cross-section

Click to embiggen

If it detects a possible LOC event, the system issues a voice command to the pilot over Bluetooth speakers, using machine learning to tell the pilot what corrective measures they should take.

Here it is in action:

Solar Pilot Guard use in-flight

An example of in-flight operation of the Solar Pilot Guard (SPG), issuing commands for correction of flight behavior that could lead to loss of control (LOC). Demonstrated commands: Push, Power – Left, Left – Right, Right Submitted to EAA AirVenture, Oshkosh 2017.

Losing control to generate training data

In order to train the network, Michael Foale had to feed the machine data about what LOCs and normal flight look like — which meant flying the kit in ways which would make the plane lose control, not just once, but over and over, until the neural net had the data it needed to differentiate different sorts of LOC events. Told you he was a superhero.

A stack of different machine learning functions at different levels of abstraction are working together here. This is a training set from one of the (presumably terrifying) training flights:

Solar Pilot Guard training set

The Pi processes and learns from this data; if you’re interested in a very deep dive into the way this all works, and how you can build your own neural networks using the Wolfram Language, there’s a very comprehensive treatment over at the Wolfram blog.

We love seeing projects like this that recognise just how robust and powerful a little Raspberry Pi can be. Jenna and Michael: thank you for sharing what you’ve been working on here. It’s one of the coolest and most audacious projects we’ve seen in a long time.

The post Astronaut-made virtual co-pilot appeared first on Raspberry Pi.

The Android Things flower that smiles with you

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/android-things-expression-flower/

Smile, and the world smiles with you — or, in this case, a laser-cut flower running Android Things on a Raspberry Pi does.

Android Things flower Raspberry Pi Smile recognition Expression Flower

Expression Flower

The aim of the Expression Flower is to “challenge the perception of what robotics can be while exploring the possibility for a whimsical experience that is engaging, natural, and fun.”

Tl;dr: cute interactive flower. No Skynet.

Android Things

The flower is powered by Google’s IoT platform Android Things, running on a Raspberry Pi, and it has a camera mounted in the centre. It identifies facial expressions using the ML Kit machine learning package, also from Google. The software categorises expressions, and responds with a specific action: smile at the flower, and it will open up its petals with a colourful light show; wink at it, and its petals will close up bashfully.

Android Things flower Raspberry Pi Smile recognition Expression Flower

The build is made of laser-cut and 3D-printed parts, alongside off-the-shelf components. The entire build protocol, including video, parts, and code, is available on hackster.io, so all makers can give Expression Flower a go.

Android Things flower Raspberry Pi Smile recognition Expression Flower

Seriously, this may be the easiest-to-follow tutorial we’ve ever seen. So many videos. So much helpful information. It’s pure perfection!

Machine learning and Android Things

For more Raspberry Pi–based machine learning projects, see:

Adrian Rosebeck deep learning pokemon pokedex
Raspberry Pi Santa/Not Santa detector

And for more Android Things projects, we highly recommend:

Demonstation of Joe Birch's BrailleBox
Android Things Candy Dispenser Raspberry Pi
Lantern Raspberry Pi powered augmented reality projector lamp

Aaaand, for getting started with all things Android on your Raspberry Pi, check out issue 71 of The MagPi!

The post The Android Things flower that smiles with you appeared first on Raspberry Pi.

Machine Learning with AWS Fargate and AWS CodePipeline at Corteva Agriscience

Post Syndicated from Nathan Taber original https://aws.amazon.com/blogs/compute/machine-learning-with-aws-fargate-and-aws-codepipeline-at-corteva-agriscience/

This post contributed by Duke Takle and Kevin Hayes at Corteva Agriscience

At Corteva Agriscience, the agricultural division of DowDuPont, our purpose is to enrich the lives of those who produce and those who consume, ensuring progress for generations to come. As a global business, we support a network of research stations to improve agricultural productivity around the world

As analytical technology advances the volume of data, as well as the speed at which it must be processed, meeting the needs of our scientists poses unique challenges. Corteva Cloud Engineering teams are responsible for collaborating with and enabling software developers, data scientists, and others. Their work allows Corteva research and development to become the most efficient innovation machine in the agricultural industry.

Recently, our Systems and Innovations for Breeding and Seed Products organization approached the Cloud Engineering team with the challenge of how to deploy a novel machine learning (ML) algorithm for scoring genetic markers. The solution would require supporting labs across six continents in a process that is run daily. This algorithm replaces time-intensive manual scoring of genotypic assays with a robust, automated solution. When examining the solution space for this challenge, the main requirements for our solution were global deployability, application uptime, and scalability.

Before the implementing this algorithm in AWS, ML autoscoring was done as a proof of concept using pre-production instances on premises. It required several technicians to continue to process assays by hand. After implementing on AWS, we have enabled those technicians to be better used in other areas, such as technology development.

Solutions Considered

A RESTful web service seemed to be an obvious way to solve the problem presented. AWS has several patterns that could implement a RESTful web service, such as Amazon API Gateway, AWS Lambda, Amazon EC2, AWS Auto Scaling, Amazon Elastic Container Service (ECS) using the EC2 launch type, and AWS Fargate.

At the time, the project came into our backlog, we had just heard of Fargate. Fargate does have a few limitations (scratch storage, CPU, and memory), none of which were a problem. So EC2, Auto Scaling, and ECS with the EC2 launch type were ruled out because they would have introduced unneeded complexity. The unneeded complexity is mostly around management of EC2 instances to either run the application or the container needed for the solution.

When the project came into our group, there had been a substantial proof-of-concept done with a Docker container. While we are strong API Gateway and Lambda proponents, there is no need to duplicate processes or services that AWS provides. We also knew that we needed to be able to move fast. We wanted to put the power in the hands of our developers to focus on building out the solution. Additionally, we needed something that could scale across our organization and provide some rationalization in how we approach these problems. AWS services, such as Fargate, AWS CodePipeline, and AWS CloudFormation, made that possible.

Solution Overview

Our group prefers using existing AWS services to bring a complete project to the production environment.

CI/CD Pipeline

A complete discussion of the CI/CD pipeline for the project is beyond the scope of this post. However, in broad strokes, the pipeline is:

  1. Compile some C++ code wrapped in Python, create a Python wheel, and publish it to an artifact store.
  2. Create a Docker image with that wheel installed and publish it to ECR.
  3. Deploy and test the new image to our test environment.
  4. Deploy the new image to the production environment.

Solution

As mentioned earlier, the application is a Docker container deployed with the Fargate launch type. It uses an Aurora PostgreSQL DB instance for the backend data. The application itself is only needed internally so the Application Load Balancer is created with the scheme set to “internal” and deployed into our private application subnets.

Our environments are all constructed with CloudFormation templates. Each environment is constructed in a separate AWS account and connected back to a central utility account. The infrastructure stacks export a number of useful bits like the VPC, subnets, IAM roles, security groups, etc. This scheme allows us to move projects through the several accounts without changing the CloudFormation templates, just the parameters that are fed into them.

For this solution, we use an existing VPC, set of subnets, IAM role, and ACM certificate in the us-east-1 Region. The solution CloudFormation stack describes and manages the following resources:

AWS::ECS::Cluster*
AWS::EC2::SecurityGroup
AWS::EC2::SecurityGroupIngress
AWS::Logs::LogGroup
AWS::ECS::TaskDefinition*
AWS::ElasticLoadBalancingV2::LoadBalancer
AWS::ElasticLoadBalancingV2::TargetGroup
AWS::ElasticLoadBalancingV2::Listener
AWS::ECS::Service*
AWS::ApplicationAutoScaling::ScalableTarget
AWS::ApplicationAutoScaling::ScalingPolicy
AWS::ElasticLoadBalancingV2::ListenerRule

A complete discussion of all the resources for the solution is beyond the scope of this post. However, we can explore the resource definitions of the components specific to Fargate. The following three simple segments of CloudFormation are all that is needed to create a Fargate stack: an ECS cluster, task definition, and service. More complete examples of the CloudFormation templates are linked at the end of this post, with stack creation instructions.

AWS::ECS::Cluster:

"ECSCluster": {
    "Type":"AWS::ECS::Cluster",
    "Properties" : {
        "ClusterName" : { "Ref": "clusterName" }
    }
}

The ECS Cluster resource is a simple grouping for the other ECS resources to be created. The cluster created in this stack holds the tasks and service that implement the actual solution. Finally, in the AWS Management Console, the cluster is the entry point to find info about your ECS resources.

AWS::ECS::TaskDefinition

"fargateDemoTaskDefinition": {
    "Type": "AWS::ECS::TaskDefinition",
    "Properties": {
        "ContainerDefinitions": [
            {
                "Essential": "true",
                "Image": { "Ref": "taskImage" },
                "LogConfiguration": {
                    "LogDriver": "awslogs",
                    "Options": {
                        "awslogs-group": {
                            "Ref": "cloudwatchLogsGroup"
                        },
                        "awslogs-region": {
                            "Ref": "AWS::Region"
                        },
                        "awslogs-stream-prefix": "fargate-demo-app"
                    }
                },
                "Name": "fargate-demo-app",
                "PortMappings": [
                    {
                        "ContainerPort": 80
                    }
                ]
            }
        ],
        "ExecutionRoleArn": {"Fn::ImportValue": "fargateDemoRoleArnV1"},
        "Family": {
            "Fn::Join": [
                "",
                [ { "Ref": "AWS::StackName" }, "-fargate-demo-app" ]
            ]
        },
        "NetworkMode": "awsvpc",
        "RequiresCompatibilities" : [ "FARGATE" ],
        "TaskRoleArn": {"Fn::ImportValue": "fargateDemoRoleArnV1"},
        "Cpu": { "Ref": "cpuAllocation" },
        "Memory": { "Ref": "memoryAllocation" }
    }
}

The ECS Task Definition is where we specify and configure the container. Interesting things to note are the CPU and memory configuration items. It is important to note the valid combinations for CPU/memory settings, as shown in the following table.

CPUMemory
0.25 vCPU0.5 GB, 1 GB, and 2 GB
0.5 vCPUMin. 1 GB and Max. 4 GB, in 1-GB increments
1 vCPUMin. 2 GB and Max. 8 GB, in 1-GB increments
2 vCPUMin. 4 GB and Max. 16 GB, in 1-GB increments
4 vCPUMin. 8 GB and Max. 30 GB, in 1-GB increments

AWS::ECS::Service

"fargateDemoService": {
     "Type": "AWS::ECS::Service",
     "DependsOn": [
         "fargateDemoALBListener"
     ],
     "Properties": {
         "Cluster": { "Ref": "ECSCluster" },
         "DesiredCount": { "Ref": "minimumCount" },
         "LaunchType": "FARGATE",
         "LoadBalancers": [
             {
                 "ContainerName": "fargate-demo-app",
                 "ContainerPort": "80",
                 "TargetGroupArn": { "Ref": "fargateDemoTargetGroup" }
             }
         ],
         "NetworkConfiguration":{
             "AwsvpcConfiguration":{
                 "SecurityGroups": [
                     { "Ref":"fargateDemoSecuityGroup" }
                 ],
                 "Subnets":[
                    {"Fn::ImportValue": "privateSubnetOneV1"},
                    {"Fn::ImportValue": "privateSubnetTwoV1"},
                    {"Fn::ImportValue": "privateSubnetThreeV1"}
                 ]
             }
         },
         "TaskDefinition": { "Ref":"fargateDemoTaskDefinition" }
     }
}

The ECS Service resource is how we can configure where and how many instances of tasks are executed to solve our problem. In this case, we see that there are at least minimumCount instances of the task running in any of three private subnets in our VPC.

Conclusion

Deploying this algorithm on AWS using containers and Fargate allowed us to start running the application at scale with low support overhead. This has resulted in faster turnaround time with fewer staff and a concomitant reduction in cost.

“We are very excited with the deployment of Polaris, the autoscoring of the marker lab genotyping data using AWS technologies. This key technology deployment has enhanced performance, scalability, and efficiency of our global labs to deliver over 1.4 Billion data points annually to our key customers in Plant Breeding and Integrated Operations.”

Sandra Milach, Director of Systems and Innovations for Breeding and Seed Products.

We are distributing this solution to all our worldwide laboratories to harmonize data quality, and speed. We hope this enables an increase in the velocity of genetic gain to increase yields of crops for farmers around the world.

You can learn more about the work we do at Corteva at www.corteva.com.

Try it yourself:

The snippets above are instructive but not complete. We have published two repositories on GitHub that you can explore to see how we built this solution:

Note: the components in these repos do not include our production code, but they show you how this works using Amazon ECS and AWS Fargate.

There’s Waldo! Finding the elusive traveller using AI

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/theres-waldo-wheres-wally/

Let me start by stating that here in the UK, we call Waldo Wally. And as I’m writing this post at my desk at Pi Towers, Cambridge, I have taken the decision to refer to the red and white-clad fellow as Wally moving forward.

Just so you know.

There’s Waldo is a robot that finds Waldo

There’s Waldo is a robot built to find Waldo and point at him. The robot arm is controlled by a Raspberry Pi using the PYARM Python library for the UARM Metal. Once initialized the arm is instructed to extend and take a photo of the canvas below.

The magical mind of Matt Reed

Both in his work and personal time, Matt Reed is a maker. In a nutshell, he has the job we all want — Creative Technologist — and gets to spend his working hours building interesting marketing projects for companies such as Redbull and Pi Towers favourite, Oreo. And lucky for us, he uses a Raspberry Pi in many of his projects — hurray!

Where’s Waldo Wally

With There’s Waldo, Matt has trained the AutoML Vision app, Google’s new image content analysis AI service, to recognise Wally in a series of images. With an AI model trained to recognise the features of the elusive traveller, a webcam attached to a Raspberry Pi 3B snaps a photo and the AI algorithm scans all faces, finding familiarities.

Matt Reed on Twitter

model is predicting WAAAYYY better than expected as this webcam image here wasn’t even part the training set. You can run from #ai but apparently can’t hide from #GoogleCloud

Once a match for Wally’s face is found with 95% or higher confidence, a robotic arm, controlled by the Pyarm Python library, points a comically small, plastic hand at where it believes Wally to be.

Deep learning and model training

We’ve started to discover more and more deep learning projects using Raspberry Pi — and with the recent release of TensorFlow 1.9 for the Pi, we’re sure this will soon become an even more common occurrence.

Adrian Rosebeck deep learning pokemon pokedex

For more projects using deep learning and the Raspberry Pi, check out Adrian Rosebrock’s deep learning Pokédex, and his Santa Detector.

And for more projects from Matt Reed and the redpepper team, you can follow Matt’s Twitter, visit his website, and check out his community profile in The MagPi.

The post There’s Waldo! Finding the elusive traveller using AI appeared first on Raspberry Pi.

Take a photo of yourself as an unreliable cartoon

Post Syndicated from Helen Lynn original https://www.raspberrypi.org/blog/take-a-photo-of-yourself-unreliable-cartoon/

Take a selfie, wait for the image to appear, and behold a cartoon version of yourself. Or, at least, behold a cartoon version of whatever the camera thought it saw. Welcome to Draw This by maker Dan Macnish.

Dan has made code, instructions, and wiring diagrams available to help you bring this beguiling weirdery into your own life.

raspberry pi cartoon polaroid camera

Neural networks, object recognition, and cartoons

One of the fun things about this re-imagined polaroid is that you never get to see the original image. You point, and shoot – and out pops a cartoon; the camera’s best interpretation of what it saw. The result is always a surprise. A food selfie of a healthy salad might turn into an enormous hot dog, or a photo with friends might be photobombed by a goat.

OK. Let’s take this one step at a time.

Pi + camera + button + LED

Draw This uses a Raspberry Pi 3 and a Camera Module, with a button and a useful status LED connected to the GPIO pins via a breadboard. You press the button, and the camera captures a still image while the LED comes on and stays lit for a couple of seconds while the Pi processes the image. So far, so standard Pi camera build.

Interpreting and re-interpreting the camera image

Dan uses Python to process the captured photograph, employing a pre-trained machine learning model from Google to recognise multiple objects in the image. Now he brings the strangeness. The Pi matches the things it sees in the photo with doodles from Google’s huge open-source Quick, Draw! dataset, and generates a new image that represents the objects in the original image as doodles. Then a thermal printer connected to the Pi’s GPIO pins prints the results.

A 28 x 14 grid of kangaroo doodles in dark grey on a white background

Kangaroos from the Quick, Draw! dataset (I got distracted)

Potential for peculiar

Reading about this build leaves me yearning to see its oddest interpretation of a scene, so if you make this and you find it really does turn you or your friend into a goat, please do share that with us.

And as you can see from my kangaroo digression above, there is a ton of potential for bizarro makes that use the Quick, Draw! dataset, object recognition models, or both; it’s not just the marsupials that are inexplicably compelling (I dare you to go and look and see how long it takes you to get back to whatever you were in the middle of). If you’re planning to make this, or something inspired by this, check out Dan’s cartoonify GitHub repo. And tell us all about it in the comments.

The post Take a photo of yourself as an unreliable cartoon appeared first on Raspberry Pi.

HackSpace magazine 8: Raspberry Pi <3 Arduino

Post Syndicated from Andrew Gregory original https://www.raspberrypi.org/blog/hackspace-magazine-8/

Arduino is officially brilliant. It’s the perfect companion for your Raspberry Pi, opening up new possibilities for robotics, drones and all sorts of physical computing projects. In HackSpace magazine issue 8  we’re taking a look at what’s going on on planet Arduino, and how it can make our world better.

HackSpace magazine

This little board and its ecosystem are hugely important to the world of digital making. It’s affordable, it’s powerful, and it’s open hardware so you know that if you embed one of these in a project and the company goes bust tomorrow, the hardware will always be viable.

Arduino has helped power a new generation of digital makers, and now with a new team in charge, new boards and new software, it’s ready for the next generation.

Noisy toys

We get to speak to loads of fascinating people, but this month marks the first time we’ve ever met a science busker. Meet Stephen Summers, a former teacher who makes a mess with cornflour, water, and sound waves, all in the name of sharing the joy of physics.

HackSpace magazine

Glass-blowing

While we love messing about with digital technologies, we’re also a big fan of good old-fashioned craft skills. And you can’t get much more old-fashioned than traditional glass-blowing. Join us as we attempt to turn red hot molten glass into a multicoloured object without burning ourselves or setting anything on fire.

Guitar synth

People are endlessly clever, inventive, and all-round brilliant. A fantastic example is Björk, the Icelandic musician whose work defies categorisation. Another is Matt Bradshaw, who has made a synthesiser that you play by strumming six metal strings with a plectrum to complete a circuit. Oh, and named it after Björk. Read all about it and get inspired to do something equally bonkers.

HackSpace magazine

Machine learning

Do you have children? Do they leave the lights on all the time, causing you to shout, “THIS ISN’T BLACKPOOL FLAMING ILLUMINATIONS, YOU KNOW!” Well, now you can replace those children with an Arduino. With a bit of machine learning, the Arduino can train itself to turn the lights on and off at the right time, all the time. Plus they don’t cost as much as human children, so it’s a double win!

Dry ice cream

When the sun comes out in Blighty, it doesn’t hang around for long. So why wait for your domestic fridge to freeze your tasty dairy-based desserts, when you can add some solid carbon dioxide and freeze it in a flash? Follow our tutorial and you too can have tasty treats with the ironically warm glow that comes from using chemicals at -78°C.

HackSpace magazine

And there’s more

We’ve filled the rest of the magazine with a robot orchestra, watch restoration, audio boards for Raspberry Pi, magical colour-changing wearables, and more. Get stuck in!



Get your copy of HackSpace magazine

If you like the sound of this month’s content, you can find HackSpace magazine in WHSmith, Tesco, Sainsbury’s, and independent newsagents in the UK. If you live in the US, check out your local Barnes & Noble, Fry’s, or Micro Center next week. We’re also shipping to stores in Australia, Hong Kong, Canada, Singapore, Belgium, and Brazil, so be sure to ask your local newsagent whether they’ll be getting HackSpace magazine.

And if you can’t get to the shops, fear not: you can subscribe from £4 an issue from our online shop. And if you’d rather try before you buy, you can always download the free PDF. Happy reading, and happy making!

The post HackSpace magazine 8: Raspberry Pi <3 Arduino appeared first on Raspberry Pi.

AWS Online Tech Talks – June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/

AWS Online Tech Talks – June 2018

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

 

Analytics & Big Data

June 18, 2018 | 11:00 AM – 11:45 AM PTGet Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.
June 20, 2018 | 11:00 AM – 11:45 AM PT – Insights For Everyone – Deploying Data across your Organization – Learn how to deploy data at scale using AWS Analytics and QuickSight’s new reader role and usage based pricing.

 

AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PTEpisode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Compute

June 25, 2018 | 01:00 PM – 01:45 PM PTAccelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.

June 26, 2018 | 01:00 PM – 01:45 PM PTEnsuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

 

Containers
June 25, 2018 | 09:00 AM – 09:45 AM PTRunning Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS including how setup masters, networking, security, and add auto-scaling to your cluster.

 

Databases

June 18, 2018 | 01:00 PM – 01:45 PM PTOracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.
DevOps

June 20, 2018 | 09:00 AM – 09:45 AM PTSet Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.

 

Enterprise & Hybrid
June 18, 2018 | 09:00 AM – 09:45 AM PTDe-risking Enterprise Migration with AWS Managed Services – Learn how enterprise customers are de-risking cloud adoption with AWS Managed Services.

June 19, 2018 | 11:00 AM – 11:45 AM PTLaunch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new

 

AWS Environments

June 21, 2018 | 11:00 AM – 11:45 AM PTLeading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.

June 21, 2018 | 01:00 PM – 01:45 PM PTEnabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.

June 28, 2018 | 01:00 PM – 01:45 PM PTFireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT

June 27, 2018 | 11:00 AM – 11:45 AM PTAWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

 

Machine Learning

June 19, 2018 | 09:00 AM – 09:45 AM PTIntegrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.

June 21, 2018 | 09:00 AM – 09:45 AM PTBuilding Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

 

Management Tools

June 20, 2018 | 01:00 PM – 01:45 PM PTOptimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

 

Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PTDrive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.

 

Security, Identity & Compliance

June 26, 2018 | 09:00 AM – 09:45 AM PTUnderstanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
June 28, 2018 | 09:00 AM – 09:45 AM PTUsing Amazon Inspector to Discover Potential Security Issues – See how Amazon Inspector can be used to discover security issues of your instances.

 

Serverless

June 19, 2018 | 01:00 PM – 01:45 PM PTProductionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

 

Storage

June 26, 2018 | 11:00 AM – 11:45 AM PTDeep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connecting your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PTChanging the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PTBig Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.

Amazon SageMaker Updates – Tokyo Region, CloudFormation, Chainer, and GreenGrass ML

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/sagemaker-tokyo-summit-2018/

Today, at the AWS Summit in Tokyo we announced a number of updates and new features for Amazon SageMaker. Starting today, SageMaker is available in Asia Pacific (Tokyo)! SageMaker also now supports CloudFormation. A new machine learning framework, Chainer, is now available in the SageMaker Python SDK, in addition to MXNet and Tensorflow. Finally, support for running Chainer models on several devices was added to AWS Greengrass Machine Learning.

Amazon SageMaker Chainer Estimator


Chainer is a popular, flexible, and intuitive deep learning framework. Chainer networks work on a “Define-by-Run” scheme, where the network topology is defined dynamically via forward computation. This is in contrast to many other frameworks which work on a “Define-and-Run” scheme where the topology of the network is defined separately from the data. A lot of developers enjoy the Chainer scheme since it allows them to write their networks with native python constructs and tools.

Luckily, using Chainer with SageMaker is just as easy as using a TensorFlow or MXNet estimator. In fact, it might even be a bit easier since it’s likely you can take your existing scripts and use them to train on SageMaker with very few modifications. With TensorFlow or MXNet users have to implement a train function with a particular signature. With Chainer your scripts can be a little bit more portable as you can simply read from a few environment variables like SM_MODEL_DIR, SM_NUM_GPUS, and others. We can wrap our existing script in a if __name__ == '__main__': guard and invoke it locally or on sagemaker.


import argparse
import os

if __name__ =='__main__':

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.05)

    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    args, _ = parser.parse_known_args()

    # ... load from args.train and args.test, train a model, write model to args.model_dir.

Then, we can run that script locally or use the SageMaker Python SDK to launch it on some GPU instances in SageMaker. The hyperparameters will get passed in to the script as CLI commands and the environment variables above will be autopopulated. When we call fit the input channels we pass will be populated in the SM_CHANNEL_* environment variables.


from sagemaker.chainer.estimator import Chainer
# Create my estimator
chainer_estimator = Chainer(
    entry_point='example.py',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'epochs': 10, 'batch-size': 64}
)
# Train my estimator
chainer_estimator.fit({'train': train_input, 'test': test_input})

# Deploy my estimator to a SageMaker Endpoint and get a Predictor
predictor = chainer_estimator.deploy(
    instance_type="ml.m4.xlarge",
    initial_instance_count=1
)

Now, instead of bringing your own docker container for training and hosting with Chainer, you can just maintain your script. You can see the full sagemaker-chainer-containers on github. One of my favorite features of the new container is built-in chainermn for easy multi-node distribution of your chainer training jobs.

There’s a lot more documentation and information available in both the README and the example notebooks.

AWS GreenGrass ML with Chainer

AWS GreenGrass ML now includes a pre-built Chainer package for all devices powered by Intel Atom, NVIDIA Jetson, TX2, and Raspberry Pi. So, now GreenGrass ML provides pre-built packages for TensorFlow, Apache MXNet, and Chainer! You can train your models on SageMaker then easily deploy it to any GreenGrass-enabled device using GreenGrass ML.

JAWS UG

I want to give a quick shout out to all of our wonderful and inspirational friends in the JAWS UG who attended the AWS Summit in Tokyo today. I’ve very much enjoyed seeing your pictures of the summit. Thanks for making Japan an amazing place for AWS developers! I can’t wait to visit again and meet with all of you.

Randall

Detecting Lies through Mouse Movements

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/detecting_lies_.html

Interesting research: “The detection of faked identity using unexpected questions and mouse dynamics,” by Merulin Monaro, Luciano Gamberini, and Guiseppe Sartori.

Abstract: The detection of faked identities is a major problem in security. Current memory-detection techniques cannot be used as they require prior knowledge of the respondent’s true identity. Here, we report a novel technique for detecting faked identities based on the use of unexpected questions that may be used to check the respondent identity without any prior autobiographical information. While truth-tellers respond automatically to unexpected questions, liars have to “build” and verify their responses. This lack of automaticity is reflected in the mouse movements used to record the responses as well as in the number of errors. Responses to unexpected questions are compared to responses to expected and control questions (i.e., questions to which a liar also must respond truthfully). Parameters that encode mouse movement were analyzed using machine learning classifiers and the results indicate that the mouse trajectories and errors on unexpected questions efficiently distinguish liars from truth-tellers. Furthermore, we showed that liars may be identified also when they are responding truthfully. Unexpected questions combined with the analysis of mouse movement may efficiently spot participants with faked identities without the need for any prior information on the examinee.

Boing Boing post.

AWS IoT 1-Click – Use Simple Devices to Trigger Lambda Functions

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-iot-1-click-use-simple-devices-to-trigger-lambda-functions/

We announced a preview of AWS IoT 1-Click at AWS re:Invent 2017 and have been refining it ever since, focusing on simplicity and a clean out-of-box experience. Designed to make IoT available and accessible to a broad audience, AWS IoT 1-Click is now generally available, along with new IoT buttons from AWS and AT&T.

I sat down with the dev team a month or two ago to learn about the service so that I could start thinking about my blog post. During the meeting they gave me a pair of IoT buttons and I started to think about some creative ways to put them to use. Here are a few that I came up with:

Help Request – Earlier this month I spent a very pleasant weekend at the HackTillDawn hackathon in Los Angeles. As the participants were hacking away, they occasionally had questions about AWS, machine learning, Amazon SageMaker, and AWS DeepLens. While we had plenty of AWS Solution Architects on hand (decked out in fashionable & distinctive AWS shirts for easy identification), I imagined an IoT button for each team. Pressing the button would alert the SA crew via SMS and direct them to the proper table.

Camera ControlTim Bray and I were in the AWS video studio, prepping for the first episode of Tim’s series on AWS Messaging. Minutes before we opened the Twitch stream I realized that we did not have a clean, unobtrusive way to ask the camera operator to switch to a closeup view. Again, I imagined that a couple of IoT buttons would allow us to make the request.

Remote Dog Treat Dispenser – My dog barks every time a stranger opens the gate in front of our house. While it is great to have confirmation that my Ring doorbell is working, I would like to be able to press a button and dispense a treat so that Luna stops barking!

Homes, offices, factories, schools, vehicles, and health care facilities can all benefit from IoT buttons and other simple IoT devices, all managed using AWS IoT 1-Click.

All About AWS IoT 1-Click
As I said earlier, we have been focusing on simplicity and a clean out-of-box experience. Here’s what that means:

Architects can dream up applications for inexpensive, low-powered devices.

Developers don’t need to write any device-level code. They can make use of pre-built actions, which send email or SMS messages, or write their own custom actions using AWS Lambda functions.

Installers don’t have to install certificates or configure cloud endpoints on newly acquired devices, and don’t have to worry about firmware updates.

Administrators can monitor the overall status and health of each device, and can arrange to receive alerts when a device nears the end of its useful life and needs to be replaced, using a single interface that spans device types and manufacturers.

I’ll show you how easy this is in just a moment. But first, let’s talk about the current set of devices that are supported by AWS IoT 1-Click.

Who’s Got the Button?
We’re launching with support for two types of buttons (both pictured above). Both types of buttons are pre-configured with X.509 certificates, communicate to the cloud over secure connections, and are ready to use.

The AWS IoT Enterprise Button communicates via Wi-Fi. It has a 2000-click lifetime, encrypts outbound data using TLS, and can be configured using BLE and our mobile app. It retails for $19.99 (shipping and handling not included) and can be used in the United States, Europe, and Japan.

The AT&T LTE-M Button communicates via the LTE-M cellular network. It has a 1500-click lifetime, and also encrypts outbound data using TLS. The device and the bundled data plan is available an an introductory price of $29.99 (shipping and handling not included), and can be used in the United States.

We are very interested in working with device manufacturers in order to make even more shapes, sizes, and types of devices (badge readers, asset trackers, motion detectors, and industrial sensors, to name a few) available to our customers. Our team will be happy to tell you about our provisioning tools and our facility for pushing OTA (over the air) updates to large fleets of devices; you can contact them at [email protected].

AWS IoT 1-Click Concepts
I’m eager to show you how to use AWS IoT 1-Click and the buttons, but need to introduce a few concepts first.

Device – A button or other item that can send messages. Each device is uniquely identified by a serial number.

Placement Template – Describes a like-minded collection of devices to be deployed. Specifies the action to be performed and lists the names of custom attributes for each device.

Placement – A device that has been deployed. Referring to placements instead of devices gives you the freedom to replace and upgrade devices with minimal disruption. Each placement can include values for custom attributes such as a location (“Building 8, 3rd Floor, Room 1337”) or a purpose (“Coffee Request Button”).

Action – The AWS Lambda function to invoke when the button is pressed. You can write a function from scratch, or you can make use of a pair of predefined functions that send an email or an SMS message. The actions have access to the attributes; you can, for example, send an SMS message with the text “Urgent need for coffee in Building 8, 3rd Floor, Room 1337.”

Getting Started with AWS IoT 1-Click
Let’s set up an IoT button using the AWS IoT 1-Click Console:

If I didn’t have any buttons I could click Buy devices to get some. But, I do have some, so I click Claim devices to move ahead. I enter the device ID or claim code for my AT&T button and click Claim (I can enter multiple claim codes or device IDs if I want):

The AWS buttons can be claimed using the console or the mobile app; the first step is to use the mobile app to configure the button to use my Wi-Fi:

Then I scan the barcode on the box and click the button to complete the process of claiming the device. Both of my buttons are now visible in the console:

I am now ready to put them to use. I click on Projects, and then Create a project:

I name and describe my project, and click Next to proceed:

Now I define a device template, along with names and default values for the placement attributes. Here’s how I set up a device template (projects can contain several, but I just need one):

The action has two mandatory parameters (phone number and SMS message) built in; I add three more (Building, Room, and Floor) and click Create project:

I’m almost ready to ask for some coffee! The next step is to associate my buttons with this project by creating a placement for each one. I click Create placements to proceed. I name each placement, select the device to associate with it, and then enter values for the attributes that I established for the project. I can also add additional attributes that are peculiar to this placement:

I can inspect my project and see that everything looks good:

I click on the buttons and the SMS messages appear:

I can monitor device activity in the AWS IoT 1-Click Console:

And also in the Lambda Console:

The Lambda function itself is also accessible, and can be used as-is or customized:

As you can see, this is the code that lets me use {{*}}include all of the placement attributes in the message and {{Building}} (for example) to include a specific placement attribute.

Now Available
I’ve barely scratched the surface of this cool new service and I encourage you to give it a try (or a click) yourself. Buy a button or two, build something cool, and let me know all about it!

Pricing is based on the number of enabled devices in your account, measured monthly and pro-rated for partial months. Devices can be enabled or disabled at any time. See the AWS IoT 1-Click Pricing page for more info.

To learn more, visit the AWS IoT 1-Click home page or read the AWS IoT 1-Click documentation.

Jeff;

 

Introducing the AWS Machine Learning Competency for Consulting Partners

Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/introducing-the-aws-machine-learning-competency-for-consulting-partners/

Today I’m excited to announce a new Machine Learning Competency for Consulting Partners in the Amazon Partner Network (APN). This AWS Competency program allows APN Consulting Partners to demonstrate a deep expertise in machine learning on AWS by providing solutions that enable machine learning and data science workflows for their customers. This new AWS Competency is in addition to the Machine Learning comptency for our APN Technology Partners, that we launched at the re:Invent 2017 partner summit.

These APN Consulting Partners help organizations solve their machine learning and data challenges through:

  • Providing data services that help data scientists and machine learning practitioners prepare their enterprise data for training.
  • Platform solutions that provide data scientists and machine learning practitioners with tools to take their data, train models, and make predictions on new data.
  • SaaS and API solutions to enable predictive capabilities within customer applications.

Why work with an AWS Machine Learning Competency Partner?

The AWS Competency Program helps customers find the most qualified partners with deep expertise. AWS Machine Learning Competency Partners undergo a strict validation of their capabilities to demonstrate technical proficiency and proven customer success with AWS machine learning tools.

If you’re an AWS customer interested in machine learning workloads on AWS, check out our AWS Machine Learning launch partners below:

 

Interested in becoming an AWS Machine Learning Competency Partner?

APN Partners with experience in Machine Learning can learn more about becoming an AWS Machine Learning Competency Partner here. To learn more about the benefits of joining the AWS Partner Network, see our APN Partner website.

Thanks to the AWS Partner Team for their help with this post!
Randall

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.