ML Platform Meetup: Infra for Contextual Bandits and Reinforcement Learning

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/ml-platform-meetup-infra-for-contextual-bandits-and-reinforcement-learning-4a90305948ef?source=rss----2615bd06b42e---4

Faisal Siddiqi

Infrastructure for Contextual Bandits and Reinforcement Learning — theme of the ML Platform meetup hosted at Netflix, Los Gatos on Sep 12, 2019.

Contextual and Multi-armed Bandits enable faster and adaptive alternatives to traditional A/B Testing. They enable rapid learning and better decision-making for product rollouts. Broadly speaking, these approaches can be seen as a stepping stone to full-on Reinforcement Learning (RL) with closed-loop, on-policy evaluation and model objectives tied to reward functions. At Netflix, we are running several such experiments. For example, one set of experiments is focussed on personalizing our artwork assets to quickly select and leverage the “winning” images for a title we recommend to our members.

As with other traditional machine learning and deep learning paths, a lot of what the core algorithms can do depends upon the support they get from the surrounding infrastructure and the tooling that the ML platform provides. Given the infrastructure space for RL approaches is still relatively nascent, we wanted to understand what others in the community are doing in this space.

This was the motivation for the meetup’s theme. It featured three relevant talks from LinkedIn, Netflix and Facebook, and a platform architecture overview talk from first-time participant Dropbox.



After a brief introduction to the theme and the motivation for choosing it, the talks were kicked off by Kinjal Basu from LinkedIn, who talked about Online Parameter Selection for Web-Based Ranking via Bayesian Optimization. Kinjal used the example of the LinkedIn Feed to demonstrate how they use bandit algorithms to solve the optimal parameter selection problem efficiently.

He started by laying out how much engineering time is lost when weights/parameters in their business objective functions are optimized manually. The key insight was that by assuming a latent Gaussian Process (GP) prior on the key business metric actions like viral engagement, job applications, etc., they were able to reframe the problem as a straightforward black-box optimization problem. This allowed them to use Bayesian Optimization (BayesOpt) techniques to solve it.

The algorithm used to solve this reformulated optimization problem is a popular explore/exploit (E/E) technique known as Thompson Sampling. He talked about the infrastructure used to implement this. They have built an offline BayesOpt library, a parameter store to retrieve the right set of parameters, and an online serving layer to score the objective at serving time given the parameter distribution for a particular member.
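To make the explore/exploit idea concrete, here is a minimal sketch of Thompson Sampling. It is a deliberately simplified stand-in: the talk described a GP prior over continuous parameters, whereas this sketch assumes a small discrete set of candidate parameter settings with binary rewards and a Beta posterior per candidate. The function names and the simulated reward rates are illustrative assumptions, not LinkedIn's implementation.

```python
import random


def thompson_sample(successes, failures, rng):
    """Draw one sample from each arm's Beta posterior and pick the largest."""
    draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def run_bandit(true_rates, rounds, seed=0):
    """Simulate the bandit loop against hypothetical per-arm success rates."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_sample(successes, failures, rng)
        # Simulated feedback; a real system would observe a business metric.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = run_bandit([0.1, 0.5, 0.9], rounds=500)
pulls = [s + f for s, f in zip(successes, failures)]
```

Because each arm is chosen with the probability that it is the best under the current posterior, pulls tend to concentrate on the strongest candidate as evidence accumulates, while weaker candidates are still tried occasionally.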

He also described some practical considerations, like member-parameter stickiness, to avoid per-session variance in a member’s experience. Their offline parameter distribution is recomputed hourly, so the member experience remains consistent within the hour. Some simulation results and some online A/B test results were shared, demonstrating substantial lifts in the primary business metrics, while keeping the secondary metrics above preset guardrails.
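One way to get this kind of stickiness is to make the parameter draw deterministic per member and per hour. The sketch below is an assumption about how that could be done, not a description of LinkedIn's system: `sticky_parameters`, the hashing scheme, and the Gaussian parameter distribution are all hypothetical.

```python
import hashlib
import random
from datetime import datetime, timezone


def sticky_parameters(member_id, when, mean, stddev):
    """Draw parameters that stay fixed for one member within one clock hour."""
    hour_bucket = when.strftime("%Y-%m-%dT%H")
    digest = hashlib.sha256(f"{member_id}:{hour_bucket}".encode()).digest()
    # Seed a private generator from (member, hour) so repeated calls agree.
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(m, s) for m, s in zip(mean, stddev)]


early = datetime(2019, 9, 12, 18, 5, tzinfo=timezone.utc)
late = datetime(2019, 9, 12, 18, 55, tzinfo=timezone.utc)
same_hour = (
    sticky_parameters("member-1", early, [0.5, 1.0], [0.1, 0.2])
    == sticky_parameters("member-1", late, [0.5, 1.0], [0.1, 0.2])
)
```

Two sessions in the same hour see identical parameters, while a new hour (or a different member) yields a fresh draw from the distribution.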

He concluded by stressing the efficiency their teams had achieved by doing online parameter exploration instead of the much slower human-in-the-loop manual explorations. In the future, they plan to explore adding new algorithms like UCB, considering formulating the problem as a grey-box optimization problem, and switching between the various business metrics to identify which is the optimal metric to optimize.



The second talk was by Netflix on our Bandit Infrastructure built for personalization use cases. Fernando Amat and Elliot Chow jointly gave this talk.

Fernando started the first part of the talk and described the core recommendation problem of identifying the top few titles in a large catalog that will maximize the probability of play. Using the example of evidence personalization — images, text, trailers, synopsis, all assets that come together to add meaning to a title — he described how the problem is essentially a slate recommendation task and is well suited to be solved using a Bandit framework.

If such a framework is to be generic, it must support different contexts, attributions and reward functions. He described a simple Policy API that models the Slate tasks. This API supports the selection of a slate given a list of options using the appropriate algorithm and a way to quantify the propensities, so the data can be de-biased. Fernando ended his part by highlighting some of the Bandit Metrics they implemented for offline policy evaluation, like Inverse Propensity Scoring (IPS), Doubly Robust (DR), and Direct Method (DM).
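The three estimators named above have standard textbook forms, sketched here for a single-action (non-slate) setting. The `LoggedEvent` record, the function names, and the callable-based policy and reward model are assumptions made for illustration; they are not the Netflix Policy API.

```python
from typing import NamedTuple


class LoggedEvent(NamedTuple):
    context: str
    action: str
    reward: float
    propensity: float  # probability with which the logging policy chose `action`


def ips(events, target_prob):
    """Inverse Propensity Scoring: reweight logged rewards by pi(a|x) / p."""
    total = sum(target_prob(e.context, e.action) / e.propensity * e.reward for e in events)
    return total / len(events)


def direct_method(events, target_prob, reward_model, actions):
    """Direct Method: average the reward model's prediction under the target policy."""
    total = 0.0
    for e in events:
        total += sum(target_prob(e.context, a) * reward_model(e.context, a) for a in actions)
    return total / len(events)


def doubly_robust(events, target_prob, reward_model, actions):
    """Doubly Robust: Direct Method plus an IPS correction on the model's residual."""
    total = 0.0
    for e in events:
        baseline = sum(target_prob(e.context, a) * reward_model(e.context, a) for a in actions)
        weight = target_prob(e.context, e.action) / e.propensity
        total += baseline + weight * (e.reward - reward_model(e.context, e.action))
    return total / len(events)


# Toy log from a uniform logging policy over two actions.
events = [LoggedEvent("x", "a", 1.0, 0.5), LoggedEvent("x", "b", 0.0, 0.5)]
always_a = lambda context, action: 1.0 if action == "a" else 0.0
flat_model = lambda context, action: 0.5  # deliberately poor reward model
```

On this toy log the target policy always picks action `a`, whose true reward is 1.0. IPS recovers 1.0, the Direct Method returns the flat model's 0.5, and Doubly Robust uses the logged propensities to correct the model's error back to 1.0.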

For Bandits, where attribution is a critical part of the equation, it’s imperative to have a flexible and robust data infrastructure. Elliot started the second part of the talk by describing the real-time framework they have built to bring together all signals in one place, making them accessible through a queryable API. These signals include member activity data (login, search, playback), intent-to-treat (what title/assets the system wants to impress to the member) and the treatment (impressions of images, trailers) that actually made it to the member’s device.

Elliot talked about what is involved in “Closing the loop”. First, the intent-to-treat needs to be joined with the treatment, logging along the way the policies in effect, the features used and the various propensities. Next, the reward function needs to be updated, in near real time, on every logged action (like a playback) for both short-term and long-term rewards. Finally, each new observation needs to update the policy and the offline policy evaluation metrics, after which the policy is pushed back to production so it can generate new intents to treat.
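The join at the heart of this loop can be illustrated with a small batch sketch. Everything here is an assumption for illustration: the record types, field names, the one-hour attribution window, and the binary play/no-play reward are hypothetical, and the production system described in the talk is a streaming pipeline rather than an in-memory loop.

```python
from typing import NamedTuple


class Intent(NamedTuple):
    intent_id: str
    member_id: str
    title_id: str
    propensity: float


class Impression(NamedTuple):
    intent_id: str
    timestamp: float


class Playback(NamedTuple):
    member_id: str
    title_id: str
    timestamp: float


def attribute_rewards(intents, impressions, playbacks, window_seconds=3600.0):
    """Join intent -> impression -> playback into (intent_id, propensity, reward) rows."""
    shown_at = {}
    for imp in impressions:
        # Keep the earliest impression seen for each intent.
        if imp.intent_id not in shown_at or imp.timestamp < shown_at[imp.intent_id]:
            shown_at[imp.intent_id] = imp.timestamp
    rows = []
    for intent in intents:
        if intent.intent_id not in shown_at:
            continue  # the treatment never reached the device, so nothing to attribute
        start = shown_at[intent.intent_id]
        played = any(
            p.member_id == intent.member_id
            and p.title_id == intent.title_id
            and start <= p.timestamp <= start + window_seconds
            for p in playbacks
        )
        rows.append((intent.intent_id, intent.propensity, 1.0 if played else 0.0))
    return rows
```

Rows of this shape (treatment identifier, logged propensity, reward) are what an off-policy estimator or a policy update would consume downstream.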

To be able to support this, the team had to standardize on several infrastructure components. Elliot talked about the three key components — a) Standardized Logging from the treatment services, b) Real-time stream processing over Apache Flink for member activity joins, and c) an Apache Spark client for attribution and reward computation. The team has also developed a few common attribution datasets as “out-of-the-box” entities to be used by the consuming teams.

Finally, Elliot ended by talking about some of the challenges in building this Bandit framework. In particular, he talked about the misattribution potential in a complex microservice architecture where intermediary results are often cached. He also talked about common pitfalls of stream-processed data, like out-of-order processing.

This framework has been in production for almost a year now and has been used to support several A/B tests across different recommendation use cases at Netflix.



After a short break, the second session started with a talk from Facebook focussed on practical solutions to exploration problems. Sam Daulton described how the infrastructure and product use cases came about. Their adaptive experimentation efforts aim to enable fast experimentation with varying degrees of automation, from experts using the platform in an ad hoc fashion all the way to efforts with no human in the loop.

He dived into a policy search problem they tried to solve: How many posts to load for a user depending upon their device’s connection quality. They modeled the problem as an infinite-arm bandit problem and used Gaussian Process (GP) regression. They used Bayesian Optimization to perform multi-metric optimization — e.g., jointly optimizing decrease in CPU utilization along with increase in user engagement. One of the challenges he described was how to efficiently choose a decision point, when the joint optimization search presented a Pareto frontier in the possible solution space. They used constraints on individual metrics in the face of noisy experiments to allow business decision makers to arrive at an optimal decision point.
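Choosing a decision point from a Pareto frontier under per-metric constraints can be sketched as follows. This is a simplified illustration that treats the measured outcomes as exact and ignores the experiment noise mentioned in the talk; the function names and the convention that every metric is maximized are assumptions.

```python
def dominates(q, p):
    """True if q is at least as good as p on every metric and better on one."""
    return all(a >= b for a, b in zip(q, p)) and any(a > b for a, b in zip(q, p))


def pareto_frontier(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_decision_point(points, min_values, objective_index=0):
    """Among frontier points meeting every per-metric floor, maximize one metric."""
    feasible = [
        p for p in pareto_frontier(points)
        if all(value >= floor for value, floor in zip(p, min_values))
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p[objective_index])


# Each tuple is (CPU utilization reduction, engagement lift) for one policy.
outcomes = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (1.5, 3.0)]
choice = pick_decision_point(outcomes, min_values=(0.0, 2.0), objective_index=0)
```

Here the frontier is the first three outcomes, the engagement floor of 2.0 rules out `(3.0, 1.0)`, and the largest CPU reduction among the remaining points is `(2.0, 4.0)`.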

Not all spaces can be efficiently explored online, so several research teams at Facebook use simulations offline. For example, a ranking team would ingest live user traffic and subject it to a number of ranking configurations and simulate the event outcomes using predictive models running on canary rankers. The simulations were often biased and needed de-biasing (using multi-task GP regression) for them to be used alongside online results. They observed that by combining their online results with de-biased simulation results they were able to substantially improve their model fit.

To support these efforts, they developed and open sourced some tools along the way. Sam described Ax and BoTorch: Ax is a library for managing adaptive experiments and BoTorch is a library for Bayesian Optimization research. These tools already have many applications in production, ranging from basic hyperparameter exploration to more involved AutoML use cases.

The final section of Sam’s talk focussed on Constrained Bayesian Contextual Bandits. He described the problem of video uploads to Facebook where the goal is to maximize the quality of the video without a decrease in reliability of the upload. They modeled it as a Thompson Sampling optimization problem using a Bayesian Linear model. To enforce the constraints, they used a modified algorithm, Constrained Thompson Sampling, to ensure a non-negative change in reliability. The reward function similarly needed some shaping to align with the constrained objective. With this reward shaping optimization, Sam shared some results that showed how the Constrained Thompson Sampling algorithm surfaced many actions that satisfied the reliability constraints, where vanilla Thompson Sampling had failed.



The last talk of the event was a system architecture introduction by Dropbox’s Tsahi Glik. As Dropbox was a first-time participant, the talk was more of an architecture overview of the ML infrastructure in place at Dropbox.

Tsahi started off by giving some ML usage examples at Dropbox, such as Smart Sync, which predicts which file you will use on a particular device so that it can be preloaded. Some of the challenges he called out were the diversity and size of the disparate data sources that Dropbox has to manage. Data privacy is increasingly important and presents its own set of challenges. From an ML practice perspective, they also have to deal with a wide variety of development processes and ML frameworks, custom work for new use cases and challenges with reproducibility of training.

He shared a high-level overview of their ML platform showing the various common stages of developing and deploying a model, categorized by the online and offline components. He then dived into some individual components of the platform.

The first component he talked about was a user activity service to collect the input signals for the models. This service, Antenna, provides a way to query user activity events and summarizes the activity with various aggregations. The next component he dived deeper into was a content ingestion pipeline for OCR (optical character recognition). As an example, he explained how the image of a receipt is converted into contextual text. The pipeline takes the image through multiple models for various subtasks. The first classifies whether the image has some detectable text, the second does corner detection, and the third does word box detection, followed by a deep LSTM neural net that does the core sequence-based OCR. The final stage performs some lexicographical post-processing.

He talked about the practical considerations of ingesting user content: they need to prevent malicious content from impacting the service. To enable this, they have adopted a plugin-based architecture in which each task plugin runs in a sandboxed jail environment.

Their offline data preparation ETLs run on Spark and they use Airflow as the orchestration layer. Their training infrastructure relies on a hybrid cloud approach. They have built a layer and command line tool called dxblearn that abstracts the training paths, allowing the researchers to train either locally or leverage AWS. dxblearn also allows them to fire off training jobs for hyperparameter tuning.

Published models are sent to a model store in S3, from which they are picked up by a central model prediction service that does online inferencing for all use cases. Using a central inferencing service allows them to partition compute resources appropriately, and having a standard API makes it easy to share models and to run inferencing in the cloud.

They have also built a common “suggest backend”, a generic predictive application that the various edge and production-facing services can use. It standardizes the data fetching, prediction and experiment configuration needed for a product prediction use case, which allows them to do live experimentation more easily.

The last part of Tsahi’s talk described a product use case leveraging their ML Platform. He used the example of a promotion campaign ranker (e.g., “Try Dropbox business”) for up-selling. This is modeled as a multi-armed bandit problem, an example well in line with the meetup theme.

The biggest value of such meetups lies in the high-bandwidth exchange of ideas among like-minded practitioners. In addition to some great questions after the talks, the 150+ attendees stayed well past 2 hours in the reception exchanging stories and lessons learnt solving similar problems at scale.

In the Personalization org at Netflix, we are always interested in exchanging ideas about this rapidly evolving ML space in general and the bandits and reinforcement learning space in particular. We are committed to sharing our learnings with the community and hope to discuss progress here, especially our work on Policy Evaluation and Bandit Metrics, in future meetups. If you are interested in working in this exciting space, there are many open opportunities in both engineering and research.

This post was originally published in Netflix TechBlog on Medium.

Lerner — using RL agents for test case scheduling

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/lerner-using-rl-agents-for-test-case-scheduling-3e0686211198?source=rss----2615bd06b42e---4

By: Stanislav Kirdey, Kevin Cureton, Scott Rick, Sankar Ramanathan


Netflix brings delightful customer experiences to homes on a variety of devices, and that variety continues to grow each day. The device ecosystem is rich with partners ranging from System-on-Chip (SoC) manufacturers to Original Design Manufacturer (ODM) and Original Equipment Manufacturer (OEM) vendors.

Partners across the globe leverage the Netflix device certification process on a continual basis to ensure that quality products and experiences are delivered to their customers. The certification process involves verifying a partner’s implementation of the features provided by the Netflix SDK.

The Partner Device Ecosystem organization in Netflix is responsible for ensuring successful integration and testing of the Netflix application on all partner devices. Netflix engineers run a series of tests and benchmarks to validate the device across multiple dimensions including compatibility of the device with the Netflix SDK, device performance, audio-video playback quality, license handling, encryption and security. All this leads to a plethora of test cases, most of them automated, that need to be executed to validate the functionality of a device running Netflix.


With a collection of tests that, by nature, are time consuming to run and sometimes require manual intervention, we need to prioritize and schedule test executions in a way that will expedite detection of test failures. There are several problems efficient test scheduling could help us solve:

  1. Quickly detect a regression in the integration of the Netflix SDK on a consumer electronic or MVPD (multichannel video programming distributor) device.
  2. Detect a regression in a test case. Using the Netflix Reference Application and known good devices, ensure the test case continues to function and tests what is expected.
  3. When code that many test cases depend on has changed, choose the right test cases among thousands of affected tests to quickly validate the change before committing it and running extensive, and expensive, tests.
  4. Choose the most promising subset of tests out of thousands of test cases available when running continuous integration against a device.
  5. Recommend, in real time, a set of test cases to execute against the device that would increase the probability of failing the device.

Solving the above problems could help Netflix and our Partners save time and money during the entire lifecycle of device design, build, test, and certification.

These problems could be solved in several different ways. In our quest to be objective, scientific, and in line with the Netflix philosophy of using data to drive solutions for intriguing problems, we proceeded by leveraging machine learning.

Our inspiration was the findings in the research paper “Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration” by Helge Spieker et al. We thought that reinforcement learning would be a promising approach that could provide great flexibility in the training process. It also has very low requirements on the initial amount of training data.

In the case of continuously testing a Netflix SDK integration on a new device, we usually lack relevant data for model training in the early phases of integration. In this situation, training an agent is a great fit, as it allows us to start with very little input data and let the agent explore and exploit the patterns it learns in the process of SDK integration and regression testing. The agent in reinforcement learning is an entity that decides what action to take considering the current state of the environment, and gets a reward based on the quality of the action.


We built a system called Lerner that consists of a set of microservices and a Python library that allows scalable agent training and inference for test case scheduling. We also provide an API client in Python.

Lerner works in tandem with our continuous integration framework that executes on-device tests using the Netflix Test Studio platform. Tests are run on Netflix Reference Applications (running as containers on Titus), as well as on physical devices.

There were several motivations that led to building a custom solution:

  1. We wanted to keep the APIs and integrations as simple as possible.
  2. We needed a way to run agents and tie the runs to the internal infrastructure for analytics, reporting, and visualizations.
  3. We wanted the tool to be available as a standalone library as well as a scalable API service.

Lerner provides the ability to set up any number of agents, making it the first component in our re-usable reinforcement learning framework for device certification.

Lerner, as a web service, relies on Amazon Web Services (AWS) and Netflix’s Open Source Software (OSS) tools. We use Spinnaker to deploy instances and host the API containers on Titus, which allows fast deployment times and rapid scalability. Lerner uses AWS services to store binary versions of the agents, agent configurations, and training data. To maintain the quality of Lerner APIs, we are using the serverless paradigm for Lerner’s own integration testing by utilizing AWS Lambda.

The agent training library is written in Python and supports versions 2.7, 3.5, 3.6, and 3.7. The library is available in the artifactory repository for easy installation. It can be used in Python notebooks — allowing for rapid experimentation in isolated environments without a need to perform API calls. The agent training library exposes different types of learning agents that utilize neural networks to approximate action.

The neural network (NN)-based agent uses a deep net with fully connected layers. The NN gets the state of a particular test case (the input) and outputs a continuous value, where a higher number means an earlier position in a test execution schedule. The inputs to the neural network include general historical features, such as the last N executions, and several domain-specific features that provide meta-information about a test case.
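The scoring-then-sorting idea can be sketched in a few lines of plain Python. This is only an illustration of the mechanism: the network below has random, untrained weights, and the layer sizes, feature layout (three recent results plus one normalized duration), and function names are assumptions rather than Lerner's actual architecture.

```python
import random


def make_network(layer_sizes, seed=0):
    """Build fully connected layers with random (untrained) weights."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def score(layers, features):
    """Forward pass: ReLU on hidden layers, a single linear output value."""
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        activations = [
            sum(w * a for w, a in zip(row, activations)) + b
            for row, b in zip(weights, biases)
        ]
        if index < len(layers) - 1:
            activations = [max(0.0, a) for a in activations]
    return activations[0]


def build_schedule(layers, test_cases):
    """Order test case names so that higher scores run earlier."""
    return sorted(test_cases, key=lambda name: score(layers, test_cases[name]), reverse=True)


# Hypothetical state: last three results (1 = failed) and a normalized duration.
cases = {
    "playback": [1, 0, 1, 0.3],
    "login": [0, 0, 0, 0.1],
    "drm": [1, 1, 1, 0.9],
}
network = make_network([4, 8, 1])
order = build_schedule(network, cases)
```

With untrained weights the resulting order is arbitrary; training against a reward signal is what would push frequently failing or otherwise informative test cases toward the front.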

The Lerner APIs are split into three areas:

  1. Storing execution results.
  2. Getting recommendations based on the current state of the environment.
  3. Assigning a reward to the agent based on the execution result and predicted recommendations.

The process of getting recommendations and rewarding the agent through the APIs consists of four steps:

  1. From all available test cases for a particular job, form a request that Lerner can interpret. This involves aggregating historical results and additional features.
  2. Lerner returns a recommendation identified with a unique episode id.
  3. A CI system can execute the recommendation and submit the execution results to Lerner based on the episode id.
  4. Call an API to assign a reward based on the agent id and episode id.
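The four steps can be walked through with a small in-memory stand-in. To be clear, this is not the Lerner API: the class `InMemoryLerner`, its method names, the request shape, the failures-first placeholder policy, and the reward formula are all hypothetical and exist only to show how the episode id ties the steps together.

```python
import uuid


class InMemoryLerner:
    """Hypothetical stand-in for the service; not the real Lerner API."""

    def __init__(self):
        self.episodes = {}

    def recommend(self, agent_id, test_cases):
        """Step 2: return a schedule tagged with a unique episode id."""
        episode_id = str(uuid.uuid4())
        # Placeholder policy: test cases with more recent failures run first.
        ordered = sorted(test_cases, key=lambda t: sum(t["history"]), reverse=True)
        schedule = [t["name"] for t in ordered]
        self.episodes[episode_id] = {"agent_id": agent_id, "schedule": schedule, "results": None}
        return episode_id, schedule

    def submit_results(self, episode_id, results):
        """Step 3: attach execution results (name -> passed) to the episode."""
        self.episodes[episode_id]["results"] = results

    def assign_reward(self, agent_id, episode_id):
        """Step 4: compute a toy reward, the fraction of scheduled tests that failed."""
        episode = self.episodes[episode_id]
        if episode["agent_id"] != agent_id or episode["results"] is None:
            raise ValueError("episode does not belong to agent or has no results")
        failed = sum(1 for passed in episode["results"].values() if not passed)
        return failed / len(episode["schedule"])


lerner = InMemoryLerner()
# Step 1: aggregate history (1 = failed) into a request.
request = [
    {"name": "login", "history": [0, 0, 0]},
    {"name": "playback", "history": [1, 0, 1]},
]
episode_id, schedule = lerner.recommend("agent-1", request)
results = {name: name != "playback" for name in schedule}  # only playback fails
lerner.submit_results(episode_id, results)
reward = lerner.assign_reward("agent-1", episode_id)
```

The episode id is the only piece of state the CI system needs to carry between the recommendation, the result submission, and the reward call.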

Below is a diagram of the services and persistence layers that support the functionality of the Lerner API.

The self-service nature of the tool makes it easy for service owners to integrate with Lerner, create agents, ask agents for recommendations and reward them after execution results are available.

The metrics relevant to the training and recommendation process are reported to Atlas and visualized using Netflix’s Lumen. Users of the service can track the statistics specific to the agents they set up and deploy, which allows them to build their own dashboards.

We have identified some interesting patterns while doing online reinforcement learning.

  • The recommendation/execution reward cycle can happen without any prior training data.
  • We can bootstrap several CI jobs that would use agents with different reward functions, and gain additional insight based on the agents’ performance. It could help us design and implement more targeted reward functions.
  • We can keep a small amount of historical data to train agents. The data can be truncated after each execution and offloaded to a long-term storage for further analysis.

Some of the downsides:

  • It might take time for an agent to stop exploring and start exploiting the accumulated experience.
  • As agents are stored in a binary format in the database, an update of an agent from multiple jobs could cause a race condition in its state. Handling concurrency in the training process is cumbersome and requires trade-offs. We achieved the desired state by relying on the locking mechanisms of the underlying persistence layer that stores and serves agent binaries.
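The race condition in the last point, and one common way to avoid it, can be sketched with a versioned compare-and-swap write. The post does not say which locking mechanism the persistence layer provides, so this optimistic scheme is an assumption; `AgentStore` and `update_agent` are hypothetical names, and an in-process lock stands in for the database.

```python
import threading


class AgentStore:
    """In-memory stand-in for a persistence layer with compare-and-swap writes."""

    def __init__(self, blob):
        self._lock = threading.Lock()
        self._blob = blob
        self._version = 0

    def load(self):
        with self._lock:
            return self._blob, self._version

    def save_if_unchanged(self, blob, expected_version):
        """Write only if nobody else has written since `expected_version`."""
        with self._lock:
            if self._version != expected_version:
                return False
            self._blob = blob
            self._version += 1
            return True


def update_agent(store, train_step, max_retries=50):
    """Load, train, and write back; retry when another job wrote first."""
    for _ in range(max_retries):
        blob, version = store.load()
        if store.save_if_unchanged(train_step(blob), version):
            return True
    return False
```

A job that loses the race reloads the latest agent and reapplies its training step instead of silently overwriting another job's update. The trade-off is wasted training work under heavy contention.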

Thus, we have the luxury of training as many agents as we want that could prioritize and recommend test cases based on their unique learning experiences.


We are currently piloting the system and have live agents serving predictions for various CI runs. At the moment we run Lerner-based CIs in parallel with CIs that either execute test cases in random order or use simple heuristics, such as sorting test cases by time and executing everything that previously failed.

The system was built with simplicity and performance in mind, so the set of APIs is minimal. We developed client libraries that allow seamless, but opinionated, integration with Lerner.

We collect several metrics to evaluate the performance of a recommendation, the main ones being the time taken to reach the first failure and the time taken to complete a whole scheduled run.
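Both metrics can be computed from an executed schedule in one pass. The sketch below assumes each entry is a `(duration_minutes, passed)` pair, that tests run sequentially, and that the time to first failure includes the failing test's own duration; these conventions and the function name are illustrative assumptions.

```python
def schedule_metrics(executed):
    """Return (time_to_first_failure, total_time) for an ordered run.

    `executed` is an ordered list of (duration_minutes, passed) tuples.
    time_to_first_failure is None when every test case passed.
    """
    elapsed = 0.0
    time_to_first_failure = None
    for duration, passed in executed:
        elapsed += duration
        if not passed and time_to_first_failure is None:
            time_to_first_failure = elapsed
    return time_to_first_failure, elapsed


slow_to_fail = schedule_metrics([(5, True), (10, True), (3, False)])
quick_to_fail = schedule_metrics([(3, False), (5, True), (10, True)])
```

The two runs contain the same test cases and take the same total time, but the second ordering reaches the failure after 3 minutes instead of 18, which is the effect a good schedule is meant to produce.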

Lerner-based recommendations are proving to be different and more insightful than random runs, as they allow us to fit a particular time budget and detect patterns such as cases that tend to fail together in a cluster, cases that haven’t been run in a long time, and so on.

The graphs below show a more or less artificial case in which a schedule of 100+ test cases contains several flaky tests. The Y-axis represents how many minutes it took to complete the schedule or reach a first failed test case. In blue, we have random recommendations with no time budget constraints. In green, you can see executions based on Lerner recommendations under a time constraint of 60 minutes. The green spikes represent Lerner exploring the environment, while the wiggly lines around 0 are the executions that failed quickly as Lerner was exploiting its policy.

Execution of schedules that were randomly generated. Y-axis represents time to finish execution or reach first failure.
Execution of Lerner based schedules. You can see moments when Lerner was exploring the environment, and the wiggly lines represent when the schedule was generated based on exploiting existing knowledge.

Next Steps

The next phases of the project will focus on:

  • Reward functions that are aware of a comprehensive domain context, such as assigning appropriate rewards to states where infrastructure is fragile and test cases could not be run appropriately.
  • Administrative user-interface to manage agents.
  • More generic, simple, and user-friendly framework for reinforcement learning and agent deployment.
  • Using Lerner on all available CI jobs against all SDK versions.
  • Experimenting with different neural network architectures.

If you would like to be a part of our team, come join us.

This post was originally published in Netflix TechBlog on Medium.