Tag Archives: Data Science

Reimagining Experimentation Analysis at Netflix

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21?source=rss----2615bd06b42e---4

Toby Mao, Sri Sri Perangur, Colin McFarland

Another day, another custom script to analyze an A/B test. Maybe you’ve done this before and have an old script lying around. If it’s new, it’s probably going to take some time to set up, right? Not at Netflix.

ABlaze: The standard view of analyses in the XP UI

Suppose you’re running a new video encoding test and theorize that the two new encodes should reduce play delay, a metric describing how long it takes for a video to play after you press the start button. You can look at ABlaze (our centralized A/B testing platform) and take a quick look at how it’s performing.

Simulated dataset that shows what the distribution of play delay may look like. Note that the new encodes perform well in the lower quantiles but worse in the higher ones

You notice that the first new encode (Cell 2 — Encode 1) increased the mean of the play delay but decreased the median!

After recreating the dataset, you can plot the raw numbers and perform custom analyses to understand the distribution of the data across test cells.

With our new platform for experimentation analysis, it’s easy for scientists to perfectly recreate analyses on their laptops in a notebook. They can then choose from a library of statistics and visualizations or contribute their own to get a deeper understanding of the metrics.

Extending the same view of ABlaze with other contributed models and visualizations

Why it Matters

Netflix runs on an A/B testing culture: nearly every decision we make about our product and business is guided by member behavior observed in tests. At any point a Netflix user is in many different A/B tests orchestrated through ABlaze. This enables us to optimize their experience at speed. Our A/B tests range across UI, algorithms, messaging, marketing, operations, and infrastructure changes. A user might be in a title artwork test, a personalization algorithm test, or a video encoding test, or all three at the same time.

The analysis reports tell us whether or not a new experience made statistically significant changes to relevant metrics, such as member behavior, or technical metrics that describe streaming video quality. However, the default reports only provide a summary view of the data with some powerful but limited filtering options. Our data scientists often want to apply their knowledge of the business and statistics to fully understand the outcome of an experiment.

Instead of relying on engineers to productionize scientific contributions, we’ve made a strategic bet to build an architecture that enables data scientists to easily contribute.

The two main challenges with this approach are establishing an easy contribution framework and handling Netflix’s scale of data. When dealing with ‘big data’, it’s common to perform computation on frameworks like Apache Spark or Map Reduce. In order to reduce the learning curve of contributing analyses, we’ve decided to take an alternative path by performing all of our analyses on one machine. Due to compression and high performance computing, scientists can analyze billions of rows of raw data on their laptops using languages and statistical libraries they are familiar with like Python and R.

Challenges with Pre-existing Infrastructure

Netflix’s well-known experimentation culture was fueled by our previous infrastructure: an optimized framework that scaled to the wide variety of use cases across Netflix. But as our experimentation culture grew, so too did our product areas, users, and ambitions around more sophisticated measurement methodology.

Our data scientists faced numerous challenges in our previous infrastructure. Complex business logic was embedded directly into the ETL pipelines by data engineers. In order to replicate results, scientists had to delve deep into the data, code, and documentation. Due to Netflix’s scale of over 150 million subscribers, scientists also frequently encountered issues while fetching data and running custom statistical models in Python or R.

To offer new methods to the community and overcome any existing engineering barriers, scientists would have to run custom scripts outside of the centralized platform. Heavily used or high value scripts were sometimes converted into Shiny apps, allowing easy access to these novel features. However, because these apps lived separately from the platform, they could be difficult to maintain as the underlying data and platform evolved. Also, since these apps were generally written for specific use cases, they were difficult to generalize and graduate back into the platform.

Our scientists come from many backgrounds, such as neuroscience, biostatistics, economics, and physics; each of these backgrounds has a meaningful contribution to how experiments should be analyzed. Instead of spending their time wrangling data and conducting the same ad-hoc analyses multiple times, we would like our data scientists to focus on contributing new and innovative techniques for analyzing tests, such as Interleaving, Quantile Bootstrapping, Quasi Experiments, Quantile Regression, and Heterogeneous Treatment Effects. Additionally, as these new techniques are contributed, we want them to be effortlessly leveraged across the Netflix experimentation community.

Previous XP architecture: all systems are engineering-owned and not easily introspectable

Reimagining our Infrastructure: Democratization Across 3 Tracks

We are reimagining our infrastructure to make the scientific development experience better. We’ve chosen to break down the contribution framework into 3 steps.

1. Getting Data with the Metrics Repo
2. Computing Statistics with Causal Models
3. Rendering Visualizations with Plotly

Democratization across 3 tracks: Metrics, Stats, Viz

The new architecture employs a modular design that permits data scientists to contribute using SQL, Python, and R, the tools of their trade. Users can contribute metrics and methods directly, without needing to master data engineering tools. We’ve also made sure that both production and local workflows use the same code base, so reproducibility is a given and promotion to production is just a pull request away.

New XP architecture: Systems highlighted in red are introspectable and contributable by data scientists

Getting data with Metrics Repo

Metrics Repo is an in-house Python framework where users define programmatically generated SQL queries and metric definitions. It centralizes metric definitions, which used to be scattered across many teams. Previously, many teams at Netflix had their own pipelines to calculate success metrics, which caused a lot of fragmentation and discrepancies in calculations.

A key design decision of Metrics Repo is that it moves the last mile of metric computation away from engineering-owned ETL pipelines into dynamically generated SQL. This allows scientists to add metrics and join arbitrary tables. The new architecture is much more flexible compared to the previous Spark based jobs. Views of reports are only calculated on demand and take a couple of minutes to execute, so there are no migrations or backfills when making changes or updates to metrics. Adding a new metric is as easy as adding a new field or joining a different table in SQL. By leveraging PyPika, we represent each table as a Python class that can be customized with filters and additional joins. The code is self documenting and serializes to JSON so it can be easily exposed as an API.
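
To make this concrete, here is a rough sketch of what a PyPika-backed metric query could look like. The table names, fields, and builder style are illustrative, not Metrics Repo’s actual interface:

from pypika import Query, Table, functions as fn

# Hypothetical tables; real definitions live in Metrics Repo classes.
playback = Table("playback_events")
allocations = Table("test_allocations")

# Join an arbitrary table and aggregate a metric per test cell.
query = (
    Query.from_(playback)
    .join(allocations)
    .on(playback.account_id == allocations.account_id)
    .where(allocations.test_id == 12345)  # placeholder test id
    .groupby(allocations.cell)
    .select(allocations.cell, fn.Avg(playback.play_delay_ms).as_("avg_play_delay"))
)

print(str(query))  # serializes to plain SQL, ready to run or expose via an API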

Calculating Statistics with Causal Models

Causal Models is an in-house Python library that allows scientists to contribute generic models for causal inference. Previously, the centralized platform only supported the t-test and the Mann-Whitney test, while advanced statistical tests were only available via scripts or Shiny apps. Scientists can now add their statistical models by overriding two functions in a model subclass. Many of the models are simple wrappers over Scipy, but the framework is flexible enough to do arbitrarily complex calculations. The library also provides helper methods that abstract accessing compressed or raw data. We use rpy2 so that models can be written in either R or Python.
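
The base class is internal, so the sketch below only assumes its rough shape; the class and method names are ours, not the library’s:

from abc import ABC, abstractmethod

from scipy import stats

class CausalModel(ABC):
    # Assumed shape of the base class; the real interface is internal to Netflix.

    @abstractmethod
    def fit(self, control, treatment):
        ...

    @abstractmethod
    def summary(self):
        ...

class TTest(CausalModel):
    # A thin wrapper over Scipy, as many contributed models are.
    def fit(self, control, treatment):
        self.result = stats.ttest_ind(treatment, control, equal_var=False)

    def summary(self):
        return {"statistic": self.result.statistic, "p_value": self.result.pvalue}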

We do not want data scientists to have to go outside of their comfort zone by writing Spark Scala or Map Reduce jobs. We also want to leverage the large ecosystem of statistical libraries written in Python and R. However, many analyses have raw datasets that don’t fit on one machine. So, we’ve implemented an optional compression layer that drastically reduces the size of the data. Depending on the statistic, the compression can be either lossless or tunably lossy. Additionally, we’ve structured the API so that model implementors don’t need to distinguish between compressed and uncompressed data. When contributing a new statistical test, the data scientist only needs to think about one comparison computation at a time. We take the functions they’ve written and parallelize them through multiprocessing.
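
For instance, the fan-out might look like the following sketch, where the contributor writes only the pairwise comparison and the platform parallelizes it (the function and data shapes are illustrative):

from itertools import combinations
from multiprocessing import Pool

from scipy import stats

def my_statistic(a, b):
    # Stand-in for a contributor's comparison; here, a Mann-Whitney U test.
    return stats.mannwhitneyu(a, b).pvalue

def compare_pair(pair):
    (name_a, data_a), (name_b, data_b) = pair
    return name_a, name_b, my_statistic(data_a, data_b)

def run_all_comparisons(cells, workers=8):
    # cells: {"cell_1": [...], "cell_2": [...], ...} of raw (or decompressed) values
    pairs = list(combinations(cells.items(), 2))
    with Pool(workers) as pool:
        return pool.map(compare_pair, pairs)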

Sometimes statistical models are expensive to run even on compressed data. It can be difficult to efficiently perform linear algebra operations in native Python or R. In those cases, our mathematical engineering team writes custom C++ in order to speed through those bottlenecks. Our scientists can then reference them easily in Python via pybind11 or in R via Rcpp.

As a result, innovative methods like Quantile Bootstrapping and OLS with heterogeneous effects are no longer confined to un-version-controlled notebooks and scripts. The barrier to entry for developing on the production system is very low, and sharing methods across metrics and business areas is effortless.

Rendering Visualizations with Plotly

In the old model, visualizations in the experimentation platform were created by UI engineers in React. The new architecture is still based on React, but we allow data scientists to contribute arbitrary graphs and plots using Plotly. We chose to use Plotly because it has a JSON specification that is implemented in many different frameworks and languages, including R and Python. Scientists can pick and choose from a wide variety of pre-made visualizations or create their own for others to use.
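
A contributed visualization only needs to produce a Plotly figure, which serializes to the JSON spec that the React front end can then render with plotly.js. A toy example with made-up numbers:

import plotly.graph_objects as go

fig = go.Figure(
    data=[
        go.Box(y=[120, 180, 150, 210], name="Cell 1 - Default"),
        go.Box(y=[110, 160, 140, 400], name="Cell 2 - Encode 1"),
    ],
    layout=go.Layout(yaxis_title="Play delay (ms)"),
)

spec = fig.to_json()  # language-agnostic JSON the UI can render with plotly.js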

This work kickstarted an initiative called Netflix Vizkit to create a cross-library shared design that lowers the barrier for a unified look and feel in contributions.

Many scientists at Netflix primarily use notebooks for day to day development, so we wanted to make sure they could perform A/B test analysis in them as well. To ensure that the analysis shown in ABlaze can be replicated in a notebook, we run the exact same code in both environments, even the visualizations!

Now scientists can easily introspect the data and extend it in an ad-hoc analysis. They can develop new metrics, statistical models, and visualizations in their notebooks and contribute them to the platform, knowing the results will be identical because their exact code will be running in production. As a result, anyone at Netflix looking at ABlaze can now view these new contributions when looking at test analyses.

XP: Combining contributions into analyses

Next Steps

We aim to accelerate research in causal inference methodology, expedite product innovation, and ultimately delight our members. We’re looking forward to enhancing our frameworks to tackle experimentation automation. This is an ongoing journey. If you are passionate about the field, we have opportunities to join our dream team!



Data first, SLA always

Post Syndicated from Grab Tech original https://engineering.grab.com/data-first-sla-always

Introducing Trailblazer, the Data Engineering team’s solution for implementing change data capture of all upstream databases. In this article, we explain why we needed to move away from periodic batch ingestion towards a real time solution, and show how we achieved this through an end to end streaming pipeline.

Context

Our mission as Grab’s Data Engineering team is to fulfill 100% of SLAs for data availability to our downstream users. Our 40-person team is responsible for providing accurate and reliable data to data analysts and data scientists so that they can produce actionable reports that help Grab’s leadership team make data-driven decisions. We maintain data for a variety of business intelligence tools such as Tableau, Presto, and Holistics, as well as predictive algorithms for all of Grab.

We ingest data from multiple upstream sources, such as relational databases, Kafka, or third party applications such as Salesforce or Zendesk. The majority of this source data exists in MySQL, and we run ETL pipelines to mirror any updates into our data lake. These pipelines are triggered on an hourly or daily basis and are powered by an in-house Loader application which performs Spark batch ingestion and loading of data from source to sink.

Problems with the Loader application started to surface when Grab’s data exceeded the petabyte threshold. As such, for larger tables, the most practical method to ingest data was to perform ETL only on rows that were updated within a specified timeframe. This is akin to issuing the query:

SELECT * FROM table WHERE updated >= [start_time] AND updated < [end_time]

Now imagine two situations. One, firing this query at a huge table without an updated field. Two, firing the same query at the huge table, this time without indexes on the updated field. In the first scenario, the query will never work and we can never perform incremental ingestion on the table based on a timed window. The second scenario carries the danger of creating high CPU load on the replica of the database that we are querying from. Neither has an ideal outcome.

One other problem that we identified was the unpredictability of growth in data volume. Tables smaller than one gigabyte were ingested by fully scanning the table and overwriting the data in the data lake. This worked out well for us until the table size increased exponentially, at which point our Spark jobs failed due to JDBC timeouts. If we were only dealing with a handful of tables, this issue could have been addressed by switching our data ingestion strategy from full scan to a timed window.

When assessing the issue, we discovered that there were hundreds of tables running under the full scan strategy, all of them potentially crashing our data system, all time bombs silently waiting to explode.

The team urgently needed a new approach to ETL. Our Loader application was highly coupled to upstream table characteristics. We needed to find solutions that were truly scalable, which meant decoupling our pipelines from the upstream.

Change data capture (CDC)

Much like event sourcing, every change to the database is captured and streamed out for downstream applications to consume. This process is lightweight, since any row-level update to the table is instantly captured by a real time processor, avoiding the need for large chunked queries on the table. In addition, CDC works regardless of upstream table definition, so we do not need to worry about missing updated columns impacting our data migration process.

Binary Logs (binlogs) are the CDC agents of MySQL. All updates, insertions, or deletions performed on the table are captured as a series of logged events containing the past state of the row and its newly modified state. Check out the binlogs reference to find out more.

In order to persist all binlogs generated upstream, our team created a Spark Structured Streaming application called Trailblazer. Trailblazer streams all MySQL binlogs to our data lake. These binlogs serve as a foundation for us to build Presto tables for data auditing and help to remove the direct dependency of our batch ETL jobs to the source MySQL.

Trailblazer is an amalgamation of various data streaming stacks. Binlogs are captured by Debezium, which runs on Kafka Connect clusters. All binlogs are sent to our Kafka cluster, which is managed by the Data Engineering Infrastructure team, and are streamed out to a real time bucket via a Spark structured streaming application. Hourly or daily ETL compaction jobs ingest the change logs from the real time bucket to materialize tables for downstream users to consume.

CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services

 

Some statistics

To date, we are streaming hundreds of tables across 60 Spark streaming jobs, and with the constant increase in Grab’s database instances, the numbers are expected to keep growing.

Designing Trailblazer streams

We built our streaming application using Spark structured streaming 2.3. Structured streaming was designed to remove the technical aspects of provisioning streams. Developers can focus on perfecting business logic without worrying about fundamentals such as checkpoint management or reading and writing to data sources.

Key architecture for Trailblazer streaming

 

In the design phase, we made sure to follow several key principles that helped in managing our streams.

Checkpoints have to be externally managed

Structured streaming manages checkpoints both in a local directory and in a ‘_metadata’ directory on S3 buckets, such that the state of the stream can be restored in the event of failure and restart.

This is all well and good, with two exceptions. First, changing the starting point of data ingestion meant ssh-ing into the machine and manipulating metadata, which could be extremely dangerous. Second, we could not assume the cluster would persist, since clusters can die and be recreated with data erased from local disk or the distributed file system.

Our solution was to work around this at the application level. All checkpoints are stored in temporary directories with the current timestamp appended to the path (e.g. /tmp/checkpoint/job_A/1560697200/…). A linearly progressive timestamp guarantees that the same directory will never be reused by new instances of the stream. This is why we never restore state from local disk; instead, we store all checkpoints in a highly available Redis cluster, with the Kafka topic as key and a JSON map of partition : offset as value.

Key

debz-schema-A.schema_A.table_B

Value

{"11":19183566,"12":19295602,"13":18992606[[a]](#cmnt1)[[b]](#cmnt2)[[c]](#cmnt3)[[d]](#cmnt4)[[e]](#cmnt5)[[f]](#cmnt6),"14":19269499,"15":19197199,"16":19060873,"17":19237853,"18":19107959,"19":19188181,"0":19193976,"1":19072585,"2":19205764,"3":19122454,"4":19231068,"5":19301523,"6":19287447,"7":19418871,"8":19152003,"9":19112431,"10":19151479}
Example of how offsets are stored in Redis as Key : Value pairs

 

Fortunately, structured streaming provides the StreamingQueryListener class, which we can use to register checkpoints after the completion of each microbatch.
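
A sketch of such a listener, following the Redis layout above, is shown below. The listener API shown here is the PySpark one from newer Spark releases; on Spark 2.3 the equivalent Scala class plays this role, and the Redis endpoint is hypothetical:

import json

import redis
from pyspark.sql.streaming import StreamingQueryListener

r = redis.Redis(host="redis.internal", port=6379)  # hypothetical HA Redis cluster

class RedisCheckpointListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # For the Kafka source, endOffset is a JSON string shaped like
        # {"topic": {"partition": offset, ...}}, matching the Redis layout above.
        for source in event.progress.sources:
            offsets = json.loads(source.endOffset)
            for topic, partitions in offsets.items():
                r.set(topic, json.dumps(partitions))

    def onQueryTerminated(self, event):
        pass

# Registered once per session: spark.streams.addListener(RedisCheckpointListener())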

Streams must handle 0, 1, or 1 million records

Scalability is at the heart of all well-designed applications. Spark streaming jobs are built for scalability in the face of varying data volumes.

In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing

 

There are a few settings that we can configure to influence the degree of scalability for a streaming app, as illustrated in the sketch after this list:

  • spark.dynamicAllocation.enabled=true gives Spark autonomy to provision / revoke executors to suit the workload
  • spark.dynamicAllocation.maxExecutors controls the maximum job parallelism
  • maxOffsetsPerTrigger controls the maximum number of messages ingested from Kafka per microbatch
  • trigger controls the duration between microbatches and is a property of the DataStreamWriter class
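
Putting these settings together with the Redis-backed checkpoints described earlier, a Trailblazer-like stream definition might look like this sketch (brokers, topics, paths, and values are illustrative):

import json

import redis
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("trailblazer-stream")
    .config("spark.dynamicAllocation.enabled", "true")     # let Spark provision/revoke executors
    .config("spark.dynamicAllocation.maxExecutors", "16")  # cap job parallelism
    .getOrCreate()
)

topic = "debz-schema-A.schema_A.table_B"
saved = redis.Redis(host="redis.internal").get(topic)  # offsets written by the listener
starting_offsets = json.dumps({topic: json.loads(saved)}) if saved else "earliest"

df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.internal:9092")  # hypothetical brokers
    .option("subscribe", topic)
    .option("startingOffsets", starting_offsets)
    .option("maxOffsetsPerTrigger", 100000)  # bound messages per microbatch
    .load()
)

query = (
    df.writeStream.format("parquet")
    .option("path", "s3a://realtime-bucket/table_B/")  # hypothetical real time bucket
    .option("checkpointLocation", "/tmp/checkpoint/job_A/1560697200/")
    .trigger(processingTime="30 seconds")  # microbatch duration
    .start()
)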

Data as key health indicator

Scaling the number of streaming jobs without prior collection of performance metrics is a bad idea. There is a high chance that you will discover a dead stream when checking your stream hours after initialization. I’ll cite Murphy’s law as proof.

Thus we vigilantly monitored our data streams. We used tools such as Datadog for metric monitoring, Slack for oncall issue reporting, PagerDuty for urgent cases, and our in-house data auditor as a service (DASH) for count discrepancy reporting between streamed and source data. More details on monitoring are discussed later in this post.

Streams are ephemeral

Streams may die due to a hundred and one reasons so don’t blame yourself or your programming insecurities. Issues with upstream dependencies, such as a node within your Kafka cluster running out of disk space, could lead to partition unavailability which would crash the application. On one occasion, our streaming application was unable to resolve DNS when writing to AWS S3 storage. This amounted to multiple failures within our Spark job that eventually culminated in the termination of the stream.

In this case, allow the stream to shut down gracefully, send out your alerts, and have a mechanism in place to retry the failed stream. We run all streaming jobs on Airflow, and any failure of the stream will automatically be retried through a new task issued by the scheduler.

If you have had experience with large scale management of streams, please leave a comment so we can continue this discussion!

Monitoring data streams

Here are some key features that were set up to monitor our streams.

Running : Active jobs ratio

The number of streaming jobs could increase in the future, making it a challenge for the oncall team to track all the jobs that are supposed to be up and running.

One proposal is to track the number of jobs in production against the number of jobs that are actually running. By querying MySQL tables, we can filter out all the jobs that are meant to be active. Since Trailblazer streams are spark-submit jobs managed by YARN, we can query YARN’s resource manager REST API to retrieve all the jobs that are running. We then construct a ratio of running : active jobs and report it to Datadog. If the ratio is not 1 for an extended duration, an alert is issued for the oncall to take action.

If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert
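
A sketch of this check is shown below; the YARN endpoint, metric name, and the shape of the active-jobs lookup are assumptions:

import requests
from datadog import statsd  # datadogpy, assuming a local Datadog agent

def report_running_active_ratio(rm_host, active_jobs):
    # active_jobs: set of job names that should be live, e.g. from the MySQL job registry.
    resp = requests.get(
        f"http://{rm_host}:8088/ws/v1/cluster/apps",
        params={"states": "RUNNING", "applicationTypes": "SPARK"},
        timeout=10,
    )
    apps = (resp.json().get("apps") or {}).get("app", [])
    running = {app["name"] for app in apps}
    ratio = len(running & active_jobs) / max(len(active_jobs), 1)
    statsd.gauge("trailblazer.jobs.running_active_ratio", ratio)  # metric name assumed
    return ratio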

 

Microbatch runtime

We define a 30 second window for each microbatch and track the actual runtime using metrics reported by the query listener. A runtime that exceeds the designated window is a potential indicator that the streaming job is deprived of resources and needs to be scaled up.

Job liveliness

Each job reports its health by emitting a heartbeat count of 1 at the end of every microbatch, via a query listener. This is useful in detecting stale jobs (jobs that are registered as RUNNING in YARN but are actually hung).

Kafka offset divergence

In order to ensure that the rate of consumption keeps up with the rate of message production, we sum up all presently ingested topic-partition offsets and compare that value to the sum of all topic-partition end offsets in Kafka. We then add alerting logic on top of these metrics to inform the oncall team if the difference between the two values grows too large.

It is important to track the offset divergence parameter as streams can lag. Should the rate of consumption fall below the rate of message production, we would run the risk of falling outside Kafka’s retention window, leading to data loss.
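
A sketch of the divergence computation using kafka-python (the broker address is hypothetical, and the ingested offsets would come from the Redis checkpoints described earlier):

from kafka import KafkaConsumer, TopicPartition  # kafka-python

def offset_divergence(topic, ingested_offsets, brokers="kafka.internal:9092"):
    # ingested_offsets: {partition: last offset written by the stream}, e.g. from Redis.
    consumer = KafkaConsumer(bootstrap_servers=brokers)
    partitions = [TopicPartition(topic, int(p)) for p in ingested_offsets]
    end_offsets = consumer.end_offsets(partitions)  # latest offsets held by Kafka
    return sum(end_offsets.values()) - sum(int(o) for o in ingested_offsets.values())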

Hourly data checks

DASH runs hourly and serves as our first line of defence to detect any data quality issues within the streams. We issue queries to the source database and our streaming layer to confirm that the ID counts of data created within the last hour match.

DASH helps in the early detection of upstream issues. We have noticed cases where our Debezium connectors failed and our checker reported less data than expected, since there were no incoming messages to Kafka.

DASH matches and mismatches reported to Slack

 

Materializing tables through compaction

Having CDC data in our data lake does not conclude our responsibilities. Batched compaction applies all captured CDC events and makes them available as Presto tables for downstream consumption. The job is set to trigger hourly and processes all changes to the database within the past hour. For example, changes to a record are visible in real time, but the latest state of the record will not be reflected until the next batch job runs. We addressed several issues with streaming during this phase.

Deduplication of data

Trailblazer was not built to deliver exactly-once guarantees. Instead, we ensure that issues regarding duplicated CDC events are addressed during compaction.

Availability of all data until a certain hour

We want to make sure that downstream pipelines use the output data of the hourly batch job only when it contains all the records for that hour. If an event is processed late by the stream, the pipeline waits until the data is complete. In this case, we are consciously choosing consistency over availability for our downstream users. For example, missing a few insert booking records in peak hours due to consumer processing delay could produce wrong downstream results, leading to miscalculation in revenue. We want to start downstream processes only when the data for the hour or day is complete.

Need for latest state of each event

Our compaction job performs upserts on the data to ensure that our downstream users consume records in their latest state.

Future applications

Trailblazer is a milestone for the Data Engineering team as it represents our commitment to achieving large scale data streams that reduce latencies for our end users. Moving ahead, our team will explore how we can further optimize streaming jobs by analysing data trends over time, and build applications such as snapshot tables on top of the CDC streams in our data lake.

Making Grab’s everyday app super

Post Syndicated from Grab Tech original https://engineering.grab.com/grab-everyday-super-app

Grab is Southeast Asia’s leading superapp, providing highly-used daily services such as ride-hailing, food delivery, payments, and more. Our goal is to give people better access to the services that matter to them, with more value and convenience, so we’ve been expanding our ecosystem to include bill payments, hotel bookings, trip planners, and videos – with more to come. We want to outserve our customers – not just by packing the Grab app with useful features and services, but by making the whole experience a unique and personalized one for each of them.

To realize our super app ambitions, we work with partners who, like us, want to help drive Southeast Asia forward.

A lot of the collaborative work we do with our partners can be seen in the Grab Feed. This is where we broadcast various types of content about Grab and our partners in an aggregated manner, adding value to the overall user experience. Here’s what the feed looks like:

Grab super app feed
Waiting for the next promo? Check the Feed.
Looking for news and entertainment? Check the Feed.
Want to know if it’s a good time to book a car? CHECK. THE. FEED.

 

As we continue to add more cards, services, and chunks of content into Grab Feed, there’s a risk that our users will find it harder to find the information relevant to them. So we work to ensure that our platform is able to distinguish and show information based on what’s most suited to the user’s profile. This goes back to what has always been our central focus – the customer – and is why we put so much importance on personalising the Grab experience for each of them.

To excel in a heavily diversified market like Southeast Asia, we leverage on the depth of our data to understand what sorts of information users want to see and when they should see them. In this article we will discuss Grab Feed’s recommendation logic and strategies, as well as its future roadmap.

Start your Engines

Grab super app feed

The problem we’re trying to solve here is known as the recommendations problem. In a nutshell, this problem is about inferring the preference of consumers to recommend content and services to them. In Grab Feed, we have different types of content that we want to show to different types of consumers and our challenge is to ensure that everyone gets quality content served to them.

Grab super app feed

To solve this, we have built a recommendation engine, which is a system that suggests the type of content a user should consider consuming. In order to make a recommendation, we need to understand three factors:

  1. Users. There’s a lot we can infer about our users based on how they’ve used the Grab app, such as the number of rides they’ve taken, the type of food they like to order, the movie voucher deals they’ve purchased, the games they’ve played, and so on.
    This information gives us the opportunity to understand our users’ preferences better, enabling us to match their profiles with relevant and suitable content.
  2. Items. These are the characteristics of the content. We consider the type of the content (e.g. video, games, rewards) and consumability (e.g. purchase, view, redeem). We also consider other metadata such as store hours for merchants, points to burn for rewards, and GPS coordinates for points of interest.
  3. Context. This pertains to the setting in which a user is consuming our content. It could be the time of day, the user’s location, or the current feed category.

Using signals from all these factors, we build a model that returns a ranked set of cards to the user. More on this in the next few sections.

Understanding our User

Grab super app feed

Interpreting user preference from the signals mentioned above is a whole challenge in itself. It’s important to note that we are in a constant state of experimentation. Slowly but surely, we are continuing to fine-tune how we measure content preferences. That being said, we look at two areas:

  1. Action. We firmly believe that not all interactions are made equal. Does liking a card actually mean you like it? Do you like things at the same rate as your friends? What about transactions, are those more preferred? The feed introduces a lot of ways for the users to give feedback to the platform. These events include likes, clicks, swipes, views, transactions, and call-to-actions.

Depending on the model, we can take slightly different approaches: we can learn the importance of each event and aggregate them into an expected rating, or we can predict the probability of each event and rank accordingly (see the sketch after this list).

  2. Recency. Old interactions are probably not as useful as new ones. The feed is a product that is constantly evolving, and so are the preferences of our users. Failing to decay the weight of older interactions will give us recommendations that are no longer meaningful to our users.
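
Combining the two ideas, a toy preference score might weight each interaction type and decay it by age. The weights and half-life below are made up for illustration; in production they would be learned and tuned:

import time

# Hypothetical event weights; in practice these are learned, not hand-set.
EVENT_WEIGHTS = {"view": 0.1, "swipe": 0.2, "click": 0.5, "like": 1.0, "transaction": 3.0}
HALF_LIFE_DAYS = 14  # assumed decay half-life

def preference_score(events, now=None):
    # events: iterable of (event_type, unix_timestamp) for one user and one content type.
    now = now or time.time()
    score = 0.0
    for event_type, ts in events:
        age_days = (now - ts) / 86400
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)  # older interactions count less
        score += EVENT_WEIGHTS.get(event_type, 0.0) * decay
    return score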

Optimising the Experience

Grab super app feed

Building a viable recommendation engine requires several phases. Working iteratively, we created a few core recommendation strategies that together produce the final model for determining the content’s relevance to the user. We’ll discuss each strategy in this section.

  1. Popularity. This strategy is better known as trending recommendations. We capture online clickstream events over a rolling time window and aggregate the events to show the user what’s popular to everyone at that point in time. Listening to the crowds is generally an effective strategy, but this particular strategy also helps us address the cold start problem by providing recommendations for new feed users.
  2. User Favourites. We understand that our users have different tastes and that users will have content that they engage with more than other users would. In this strategy, we capture that personal engagement and the user’s evolving preferences.
  3. Collaborative Filtering. A key goal in building our everyday super app is to let users experience different services. To allow discoverability, we study similar users to uncover the set of similar preferences they may have, which we can then use to guide what we show other users.
  4. Habitual Behaviour. There will be times where users only want to do a specific thing, and we wouldn’t want them to scroll all the way down just to do it. We’ve built in habitual recommendations to address this. So if users always use the feed to scroll through food choices at lunch or to take a peek at ride peaks (pun intended) on Sunday morning, we’ve still got them covered.
  5. Deep Recommendations. We’ve shown you how we use Feed data to drive usage across the platform. But what about using the platform data to drive the user feed behaviour? By embedding users’ activities from across our multiple businesses, we’re also able to leverage this data along with clickstream to determine the content preferences for each user.

We apply all these strategies to find out the best recommendations to serve the users either by selection or by aggregation. These decisions are determined through regular experiments and studies of our users.

Always Learning

We’re constantly learning and relearning about our users. There are a lot of ways to understand behaviour and a lot of different ways to incorporate different strategies, so we’re always iterating on these to deliver the most personal experience on the app.

To identify a user’s preferences and optimal strategy exposure, we capitalise on our Experimentation Platform to expose different configurations of our Recommendation Engine to different users. To monitor the quality of our recommendations, we measure the impact with online metrics such as interaction, clickthrough, and engagement rates, and with offline ranking metrics such as Normalized Discounted Cumulative Gain (NDCG).
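
As a quick illustration of the offline side, NDCG rewards ranking the most relevant cards near the top of the feed. A toy implementation with made-up relevance grades:

import math

def dcg(relevances):
    # Discounted cumulative gain: relevance discounted by log2 of the rank position.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    # relevances: graded relevance of the cards, in the order the feed showed them.
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k([3, 1, 0, 2], k=3))  # toy relevance grades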

Future Work

Through our experience building out this recommendations platform, we realised that the space is large and that there are a lot of pieces that can continuously be built. To keep improving, we’re already working on the following items:

  1. Multi-objective optimisation for business and technical metrics
  2. Building out automation pipelines for hyperparameter optimisation
  3. Incorporating online learning for real-time model updates
  4. Multi-armed bandits for user personalised recommendation strategies
  5. Recsplanation system to allow stakeholders to better understand the system

Conclusion

Grab is one of Southeast Asia’s fastest growing companies. As its business, partnerships, and offerings continue to grow, the super app real estate problem will only keep getting bigger. In this post, we discuss how we are addressing that problem by building out a recommendation system that understands our users and personalises the experience for each of them. This system (us included) continues to learn and iterate on our users’ feedback to deliver the best version for them.

If you’ve got any feedback, suggestions, or other great ideas, feel free to reach me at [email protected]. Interested in working on these technologies yourself? Check out our career page.

Catwalk: Serving Machine Learning Models at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-serving-machine-learning-models-at-scale

Introduction

Grab’s unwavering ambition is to be the best Super App in Southeast Asia that adds value to the everyday for our customers. In order to achieve that, the customer experience must be flawless for each and every Grab service. Let’s take our frequently used ride-hailing service as an example. We want fair pricing for both drivers and passengers, accurate estimation of ETAs, effective detection of fraudulent activities, and ensured ride safety for our customers. The key to perfecting these customer journeys is artificial intelligence (AI).

Grab has a tremendous amount of data that we can leverage to solve complex problems such as fraudulent user activity, and to provide our customers personalized experiences on our products. One of the tools we are using to make sense of this data is machine learning (ML).

As Grab made giant strides towards increasingly using machine learning across the organization, more and more teams were organically building model serving solutions for their own use cases. Unfortunately, these model serving solutions required data scientists to understand the infrastructure underlying them. Moreover, there was a lot of overlap in the effort it took to build these model serving solutions.

That’s why we came up with Catwalk: an easy-to-use, self-serve, machine learning model serving platform for everyone at Grab.

Goals

To determine what we wanted Catwalk to do, we first looked at the typical workflow of our target audience – data scientists at Grab:

  • Build a trained model to solve a problem.
  • Deploy the model to their project’s particular serving solution. If this involves writing to a database, then the data scientists need to programmatically obtain the outputs, and write them to the database. If this involves running the model on a server, the data scientists require a deep understanding of how the server scales and works internally to ensure that the model behaves as expected.
  • Use the deployed model to serve users, and obtain feedback such as user interaction data. Retrain the model using this data to make it more accurate.
  • Deploy the retrained model as a new version.
  • Use monitoring and logging to check the performance of the new version. If the new version is misbehaving, revert to the old version so that production traffic is not affected. Otherwise, run an A/B test between the new version and the previous one.

We discovered an obvious pain point – the process of deploying models requires additional effort and attention, which results in data scientists being distracted from their problem at hand. Apart from that, having many data scientists build and maintain their own serving solutions meant there was a lot of duplicated effort. With Grab increasingly adopting machine learning, this was a state of affairs that could not be allowed to continue.

To address the problems, we came up with Catwalk with goals to:

  1. Abstract away the complexities and expose a minimal interface for data scientists
  2. Prevent duplication of effort by creating an ML model serving platform for everyone in Grab
  3. Create a highly performant, highly available, model versioning supported ML model serving platform and integrate it with existing monitoring systems at Grab
  4. Shorten time to market by making model deployment self-service

What is Catwalk?

In a nutshell, Catwalk is a platform where we run Tensorflow Serving containers on a Kubernetes cluster integrated with the observability stack used at Grab.

In the next sections, we are going to explain the two main components in Catwalk – Tensorflow Serving and Kubernetes, and how they help us obtain our outlined goals.

What is Tensorflow Serving?

Tensorflow Serving is an open-source ML model serving project by Google. In Google’s own words, “Tensorflow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Tensorflow Serving provides out-of-the-box integration with Tensorflow models, but can be easily extended to serve other types of models and data.”

Why Tensorflow Serving?

There are a number of ML model serving platforms in the market right now. We chose Tensorflow Serving because of these three reasons, ordered by priority:

  1. Highly performant. It has proven performance handling tens of millions of inferences per second at Google according to their website.
  2. Highly available. It has a model versioning system to make sure there is always a healthy version being served while loading a new version into its memory.
  3. Actively maintained by the developer community and backed by Google.

Even though, by default, Tensorflow Serving only supports models built with Tensorflow, this is not a constraint for us, because Grab is actively moving toward using Tensorflow.

How are we using Tensorflow Serving?

In this section, we will explain how we are using Tensorflow Serving and how it helps abstract away complexities for data scientists.

Here are the steps showing how we are using Tensorflow Serving to serve a trained model:

  1. Data scientists export the model using the tf.saved_model API and drop it into an S3 models bucket. The exported model is a folder containing model files that can be loaded into Tensorflow Serving (a sketch of this export step follows the diagram below).
  2. Data scientists are granted permission to manage their folder.
  3. We run Tensorflow Serving and point it to load the model files directly from the S3 models bucket. Tensorflow Serving supports loading models directly from S3 out of the box. The model is served!
  4. Data scientists come up with a retrained model. They export and upload it to their model folder.
  5. As Tensorflow Serving keeps watching the S3 models bucket for new models, it automatically loads the retrained model and serves. Depending on the model configuration, it can either gracefully replace the running model version with a newer version or serve multiple versions at the same time.
Tensorflow Serving Diagram
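
The export step itself is small. A minimal sketch using the TF2-style API (the model, paths, and bucket layout here are illustrative, not Catwalk’s actual conventions):

import tensorflow as tf

# Hypothetical model; any trained Tensorflow/Keras model exports the same way.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# Tensorflow Serving watches for numbered version folders (1/, 2/, ...) and
# loads new versions automatically, which is how retrained models roll out.
tf.saved_model.save(model, "/tmp/pricing/1")

# The version folder is then uploaded to the team's folder in the S3 models
# bucket (e.g. with boto3 or the aws CLI), where Tensorflow Serving picks it up.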

The only interface to data scientists is a path to their model folder in the S3 models bucket. To update their model, they upload exported models to their folder and the models will automatically be served. The complexities are gone. We’ve achieved one of the goals!

Well, not really…

Imagine you are going to run Tensorflow Serving to serve one model in a cloud provider, which means you need a compute resource from a cloud provider to run it. Running it on one box doesn’t provide high availability, so you need another box running the same model. Auto scaling is also needed in order to scale out based on the traffic. On top of these many boxes lies a load balancer. The load balancer evenly spreads incoming traffic to all the boxes, thus ensuring that there is a single point of entry for any clients, which can be abstracted away from the horizontal scaling. The load balancer also exposes an HTTP endpoint to external users. As a result, we form a Tensorflow Serving cluster that is ready to serve.

Next, imagine you have more models to deploy. You have three options:

  1. Load the models into the existing cluster – having one cluster serve all models.
  2. Spin up a new cluster to serve each model – having multiple clusters, one cluster serves one model.
  3. Combination of 1 and 2 – having multiple clusters, one cluster serves a few models.

The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.

The second option will definitely work, but it doesn’t sound like an efficient process, as you need to create a new set of resources every time you have a new model to deploy. Additionally, how do you optimize the usage of resources? There might be unutilized resources in your clusters that could potentially be shared by the rest.

The third option looks promising: you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is that you have to manually manage it. Managing 100 models using 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause problems, as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other model won’t be able to serve anymore.

Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that is exactly what Kubernetes is meant to do!

So what is Kubernetes?

Kubernetes abstracts a cluster of physical/virtual hosts (such as EC2) into a cluster of logical hosts (pods in Kubernetes terms). It provides a container-centric management environment. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads.

Let’s look at some of the definitions of Kubernetes resources:

Tensorflow Serving Diagram
  • Cluster – a cluster of nodes running Kubernetes.
  • Node – a node inside a cluster.
  • Deployment – a configuration that instructs Kubernetes about the desired state of an application. It also takes care of rolling out an update (canary, percentage rollout, etc), rolling back, and horizontal scaling.
  • Pod – a single processing unit. In our case, Tensorflow Serving will be running as a container in a pod. Pods can have CPU/memory limits defined.
  • Service – an abstraction layer that abstracts out a group of pods and exposes the application to clients.
  • Ingress – a collection of routing rules that govern how external users access services running in a cluster.
  • Ingress Controller – a controller responsible for reading the ingress information and processing that data accordingly such as creating a cloud-provider load balancer or spinning up a new pod as a load balancer using the rules defined in the ingress resource.

Essentially, we deploy resources to instruct Kubernetes about the desired state of our application, and Kubernetes will make sure that it is always the case.

How are we using Kubernetes?

In this section, we will walk you through how we deploy Tensorflow Serving in Kubernetes cluster and how it makes managing model deployments very convenient.

We used a managed Kubernetes service to create a Kubernetes cluster and manually provisioned compute resources as nodes. As a result, we have a Kubernetes cluster with nodes that are ready to run applications.

An application to serve one model consists of

  1. Two or more Tensorflow Serving pods that serve a model, with an autoscaler to scale pods based on resource consumption
  2. A load balancer to evenly spread incoming traffic to pods
  3. An exposed HTTP endpoint to external users

In order to deploy the application, we need to

  1. Deploy a deployment resource specifying:
    • the number of Tensorflow Serving pods
    • an S3 url for Tensorflow Serving to load model files
  2. Deploy a service resource to expose it
  3. Deploy an ingress resource to define an HTTP endpoint url

Kubernetes then allocates Tensorflow Serving pods to the cluster, with the number of pods according to the value defined in the deployment resource. Pods can be allocated to any node inside the cluster; Kubernetes makes sure that the node it allocates a pod to has sufficient resources for what the pod needs. In case no node has sufficient resources, we can easily scale out the cluster by adding new nodes into it.

In order for the rules defined in the ingress resource to work, the cluster must have an ingress controller running, which is what guided our choice of the load balancer. What an ingress controller does is simple: it keeps checking the ingress resource, creates a load balancer, and defines rules based on the rules in the ingress resource. Once the load balancer is configured, it will be able to redirect incoming requests to the Tensorflow Serving pods.

That’s it! We have a scalable Tensorflow Serving application that serves a model through a load balancer! In order to serve another model, all we need to do is to deploy the same set of resources but with the model’s S3 url and HTTP endpoint.

To illustrate what is running inside the cluster, let’s see what it looks like when we deploy two applications: one serving a pricing model and another serving a fraud-check model. Each application is configured to have two Tensorflow Serving pods and is exposed at /v1/models/model

Tensorflow Serving Diagram

There are two Tensorflow Serving pods that serve the fraud-check model, exposed through a load balancer. The same goes for the pricing model; the only differences are the model being served and the exposed HTTP endpoint url. The load balancer rules for the pricing and fraud-check models look like this:

If                              Then forward to
Path is /v1/models/pricing      pricing pod ip-1, pricing pod ip-2
Path is /v1/models/fraud-check  fraud-check pod ip-1, fraud-check pod ip-2
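
From a client’s point of view, calling a served model is then a single HTTP request against Tensorflow Serving’s REST predict API (the host and feature vector below are made up):

import requests

# Tensorflow Serving's REST predict API: POST /v1/models/<model_name>:predict
resp = requests.post(
    "http://catwalk.internal/v1/models/pricing:predict",
    json={"instances": [[1.0, 2.0, 5.0, 7.0]]},
)
print(resp.json()["predictions"])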

Stats and Logs

The last piece is how stats and logs work. Before getting to that, we need to introduce DaemonSets. According to the documentation, a DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created.

We deployed datadog-agent and filebeat as DaemonSets. As a result, we always have one datadog-agent pod and one filebeat pod on every node, and they are accessible from the Tensorflow Serving pods on the same node. Tensorflow Serving pods emit a stats event for every request to the datadog-agent pod on the node they are running on.

Here is a sample of DataDog stats:

DataDog stats

And logs that we put in place:

Logs

Benefits Gained from Catwalk

Catwalk has become the go-to, centralized system for serving machine learning models. Data scientists are not required to take care of the serving infrastructure, so they can focus on what matters most: coming up with models to solve customer problems. They are only required to provide exported model files and an estimate of expected traffic so that sufficient resources can be prepared to run their model. In return, they are presented with an endpoint to make inference calls to their model, along with all the necessary tools for monitoring and debugging. Updating the model version is self-service, and the model improvement cycle is much shorter than before. We used to count in days; we now count in minutes.

Future Plans

Improvement on Automation

Currently, the first deployment of any model still needs some manual work from the platform team. We aim to automate this process entirely, and we’ll work with our awesome CI/CD team, who is making the best use of Spinnaker.

Model serving on mobile devices

As a platform, we are looking at setting standards for model serving across Grab. This includes model serving on mobile devices as well. Tensorflow provides a Lite version for use on mobile devices. It is a whole new paradigm with vastly different tradeoffs for machine learning practitioners. We are quite excited to set some best practices in this area.

gRPC support

Catwalk currently supports HTTP/1.1. We’ll hook into Grab’s service discovery mechanism to open up gRPC traffic, which Tensorflow Serving already supports.

If you are interested in building pipelines for machine learning related topics, and you share our vision of driving South East Asia forward, come join us!

Tourists on GrabChat!

Post Syndicated from Grab Tech original https://engineering.grab.com/tourist-chat-data-story

Just over two years ago we introduced GrabChat, Southeast Asia’s first-of-its-kind in-app messaging platform. Since then we’ve added all sorts of useful features to it: auto-translated messages, the ability to send photos, and even voice messages! It’s been a great tool to facilitate smoother communication between our driver-partners and our passengers, and one group in particular has found it incredibly useful: tourists!

Now, we’ve analysed tourist data before, but we were curious about how GrabChat in particular has served this demographic. So we looked for interesting insights using sampled tourist chat data from Singapore, Malaysia, and Indonesia for the period of December 2018 to March 2019. That’s more than 3.7 million individual GrabChat messages sent by tourists! Here’s what we found.

Average chats per booking per country

Looking at the volume of the chats being transmitted per booking, we can see that the “chattiest” tourists are from East Timor, Nigeria, and Ukraine with averages of 6.0, 5.6, and 5.1 chats per booking respectively.

Then we wondered: if tourists from all over the world are talking this much to our driver-partners, how are they actually communicating if their mother-tongue is not the local language?

Need a Translator?

When we go to another country, we eat all the heavenly good food, fall in love with the culture, and admire the scenery. Language and communication barriers shouldn’t get in the way of all of that. That’s why Grab’s Chat feature has got it covered!

With Grab’s in-house translation solutions, any Grab passenger can send messages in their preferred language without fear of being misunderstood – or not understood at all! Their messages will be automatically translated into Bahasa Indonesia, Bahasa Melayu, Simplified Chinese, Thai, or Vietnamese depending on where they are. This doesn’t only apply to Grab’s transport services: GrabChat can be used when ordering GrabFood too!

Percentage of translated GrabChat messages
Indonesia saw the highest usage of translations on a by-booking basis!

 

Let’s look deeper into the tourist translation statistics for each country with the donut charts below. We can see that the most popular translation route for tourists in Indonesia was from English to Indonesian. The story is different for Singapore and Malaysia: we can see that there are translations to and from a more diverse set of languages, reflecting a more multicultural demographic.

Percentage of translated GrabChat messages
The most popular translation routes for tourist bookings in Indonesia, Malaysia, and Singapore.

 

Tap for Templates!

GrabChat also provides a chat template feature. Templates are prewritten messages that you can send with just one tap! Did we mention that they are translated automatically too? Passengers and drivers can have a fast, simple, and translated conversation with each other without typing a single word – and sometimes, templates are really all you need.

Examples of chat templates, as they appear in GrabChat!

 

As if all this wasn’t convenient enough, you can also make your own custom templates! Use them for those repetitive, identical messages you always seem to be sending out like telling your drivers where the hotel lobby is, or how to navigate right to your doorstep, or even to send a quick description of what you look like to make it easier for a driver to find you!

Template message usage

Taking a look at individual country data, tourists in Indonesia used templates the most, with almost 60% of them using a template in their conversations at least once. Malaysia and Singapore saw lower but still sizeable utilisation rates of this feature, at 53% and 33% respectively.

Template message usage percentage
Indonesia saw the highest usage of templates on a by-booking basis.

 

In our analysis, we found an interesting insight: there was a positive correlation between template usage and the success rate of rides. Overall, bookings that used templates in their conversations saw 10% more completions than bookings that didn’t.

Template vs completed bookings

Picture this: a hassle-free experience

A picture says a thousand words, and for tourists using GrabChat’s image feature, those thousand words don’t even need to be translated. Instead of typing out a description of where they are standing for pickup, they can just click, snap, and send an image!

Our data revealed that GrabChat’s image functionality is most frequently used in areas where tourist traffic is highest. In fact, the image function in GrabChat saw the most use in pickup areas such as airports, large shopping malls, public transport stations, and hotels, because it is harder for drivers to find their passengers in these crowded areas. Even with our super convenient Entrances feature, every little bit of information goes a long way to help your driver find you!

Pickup locations

If we take it a step further and look at the actual areas within the cities where images were sent the most, we see that our initial hypothesis still holds fast.

Pickup locations
The top 5 pickup areas per country in which images were the most prevalent in GrabChat (for tourists).


In Singapore, we see the most images being sent in the Downtown Core area – this area contains the majestic Marina Bay Sands, the Merlion statue, and the Esplanade, amongst other iconic attractions.

In Malaysia, the highest image usage occurs at none other than the Kuala Lumpur City Centre (KLCC) itself. This area includes the Twin Towers, a plethora of malls and hotels, Bukit Bintang (a bustling and lively night-life zone), and even an aquarium.

Indonesia’s top location for image chats is Kuta. A beach village in Bali, Kuta is a tourist hotspot with surfing, water parks, bars, budget-friendly yet delicious food, and numerous cultural attractions.

Speak up!

Allowing for two-way communication via GrabChat empowers both passengers and drivers to improve their journeys by sharing useful information and asking clarifying questions: how many bags do you have? Does your car accommodate my pet dog? I’m standing by the lobby with my two kids – these are the sorts of things that are talked about in GrabChat messages.

During the analysis of our multitudes of wide-ranging GrabChat conversations, we picked up some pro-tips for you to get a Grab ride with even more convenience and ease, whether you’re a tourist or not:

Tip #1: Did some shopping on your trip? Swamped with bags? Send a message to your driver to let them know how many pieces of luggage you have with you.

As one might expect, chats that contain keywords such as “luggage” or “baggage” (or any other related term) occur most often when riders are going to, or leaving, an airport. Most of the tourists on GrabChat asked the drivers if there was space for all of their things in the car. Interestingly, some of them also told the drivers how to recognise them for pickup based on descriptions of their bags!

Tip #2: Your children make good landmarks! If you’re in a crowded spot and you’re worried your driver can’t find you, drop them a message to let them know you’re that family with a baby and a little girl in pigtails.

When it comes to children, we found that passengers mainly use them to help identify themselves to the driver. Messages like “I’m with my two kids” or “We are a family with a baby” came up numerous times, and served as descriptions to facilitate fast pickup. These sorts of chats were the most prevalent in crowded areas like airports and shopping centres.

Tip #3: Don’t get caught off guard- be sure your furry friends have a seat!

Taking a look at pet-related chats, we learned that our tourists have used GrabChat to ask clarifying questions of the driver. Passengers have likely considered that not every driver or vehicle is accommodating towards animals. The most common type of message was about whether pets are allowed in the vehicle. For example: “Is it okay if I bring a puppy?” or “I have a dog with me in a carrier, is that alright?”. Better safe than sorry! Alternatively, if you’re travelling with a pet, why not see if GrabPet is available in your country?

From the chat content analysis we have learned that tourists do indeed use GrabChat to talk to their drivers about specific details of their trip. We see that the chat feature is an invaluable tool that anyone can use to clear up any ambiguities and make their journeys more pleasant.

Lerner — using RL agents for test case scheduling

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/lerner-using-rl-agents-for-test-case-scheduling-3e0686211198?source=rss----2615bd06b42e---4

By: Stanislav Kirdey, Kevin Cureton, Scott Rick, Sankar Ramanathan

Introduction

Netflix brings delightful customer experiences to homes on a variety of devices that continues to grow each day. The device ecosystem is rich with partners ranging from Silicon-on-Chip (SoC) manufacturers to Original Design Manufacturer (ODM) and Original Equipment Manufacturer (OEM) vendors.

Partners across the globe leverage the Netflix device certification process on a continual basis to ensure that quality products and experiences are delivered to their customers. The certification process involves the verification of a partner’s implementation of features provided by the Netflix SDK.

The Partner Device Ecosystem organization in Netflix is responsible for ensuring successful integration and testing of the Netflix application on all partner devices. Netflix engineers run a series of tests and benchmarks to validate the device across multiple dimensions including compatibility of the device with the Netflix SDK, device performance, audio-video playback quality, license handling, encryption and security. All this leads to a plethora of test cases, most of them automated, that need to be executed to validate the functionality of a device running Netflix.

Problem

With a collection of tests that, by nature, are time consuming to run and sometimes require manual intervention, we need to prioritize and schedule test executions in a way that will expedite detection of test failures. There are several problems efficient test scheduling could help us solve:

  1. Quickly detect a regression in the integration of the Netflix SDK on a consumer electronic or MVPD (multichannel video programming distributor) device.
  2. Detect a regression in a test case. Using the Netflix Reference Application and known good devices, ensure the test case continues to function and tests what is expected.
  3. When code that many test cases depend on has changed, choose the right test cases among thousands of affected tests to quickly validate the change before committing it and running extensive, and expensive, tests.
  4. Choose the most promising subset of tests out of thousands of test cases available when running continuous integration against a device.
  5. Recommend a set of test cases to execute against the device that would increase the probability of failing the device in real-time.

Solving the above problems could help Netflix and our Partners save time and money during the entire lifecycle of device design, build, test, and certification.

These problems could be solved in several different ways. In our quest to be objective, scientific, and in line with the Netflix philosophy of using data to drive solutions for intriguing problems, we proceeded by leveraging machine learning.

Our inspiration was the findings in the research paper “Reinforcement Learning for Automatic Test Case Prioritization and Selection in Continuous Integration” by Helge Spieker et al. We thought that reinforcement learning would be a promising approach that could provide great flexibility in the training process. Likewise, it has very low requirements on the initial amount of training data.

In the case of continuously testing a Netflix SDK integration on a new device, we usually lack relevant data for model training in the early phases of integration. In this situation training an agent is a great fit as it allows us to start with very little input data and let the agent explore and exploit the patterns it learns in the process of SDK integration and regression testing. The agent in reinforcement learning is an entity that performs a decision on what action to take considering the current state of the environment, and gets a reward based on the quality of the action.
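To make that concrete, here is a minimal, bandit-style sketch of the explore/exploit loop described above. It ignores state for brevity, and the class and names are illustrative, not Lerner’s API:

```python
import random

class EpsilonGreedyAgent:
    """Toy bandit-style agent: explores a random action with probability
    epsilon, otherwise exploits the action with the best average reward."""

    def __init__(self, actions, epsilon=0.1):
        self.epsilon = epsilon
        self.totals = {a: 0.0 for a in actions}
        self.counts = {a: 0 for a in actions}

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.totals))  # explore
        return max(self.totals,
                   key=lambda a: self.totals[a] / (self.counts[a] or 1))  # exploit

    def give_reward(self, action, reward):
        self.totals[action] += reward
        self.counts[action] += 1

agent = EpsilonGreedyAgent(actions=["test_a", "test_b", "test_c"])
for _ in range(100):
    action = agent.act()
    agent.give_reward(action, reward=1.0 if action == "test_b" else 0.0)
print(agent.counts)  # test_b should dominate once exploration tapers off
```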

Solution

We built a system called Lerner that consists of a set of microservices and a python library that allows scalable agent training and inference for test case scheduling. We also provide an API client in Python.

Lerner works in tandem with our continuous integration framework that executes on-device tests using the Netflix Test Studio platform. Tests are run on Netflix Reference Applications (running as containers on Titus), as well as on physical devices.

There were several motivations that led to building a custom solution:

  1. We wanted to keep the APIs and integrations as simple as possible.
  2. We needed a way to run agents and tie the runs to the internal infrastructure for analytics, reporting, and visualizations.
  3. We wanted the tool to be available as a standalone library as well as a scalable API service.

Lerner provides the ability to set up any number of agents, making it the first component in our reusable reinforcement learning framework for device certification.

Lerner, as a web-service, relies on Amazon Web Services (AWS) and Netflix’s Open Source Software (OSS) tools. We use Spinnaker to deploy instances and host the API containers on Titus — which allows fast deployment times and rapid scalability. Lerner uses AWS services to store binary versions of the agents, agent configurations, and training data. To maintain the quality of Lerner APIs, we are using the server-less paradigm for Lerner’s own integration testing by utilizing AWS Lambda.

The agent training library is written in Python and supports versions 2.7, 3.5, 3.6, and 3.7. The library is available in the artifactory repository for easy installation. It can be used in Python notebooks — allowing for rapid experimentation in isolated environments without a need to perform API calls. The agent training library exposes different types of learning agents that utilize neural networks to approximate action.

The neural network (NN)-based agent uses a deep net with fully connected layers. The NN gets the state of a particular test case (the input) and outputs a continuous value, where a higher number means an earlier position in a test execution schedule. The inputs to the neural network include: general historical features such as the last N executions and several domain specific features that provide meta-information about a test case.
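As an illustration only (not Lerner’s actual network), a small fully connected model of this shape could be sketched with the public Keras API; the feature counts here are assumptions:

```python
import numpy as np
import tensorflow as tf

N_HISTORY = 10  # assumed: last N execution outcomes per test case
N_META = 5      # assumed: domain-specific metadata features

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu",
                          input_shape=(N_HISTORY + N_META,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),  # continuous score; higher = earlier in schedule
])
model.compile(optimizer="adam", loss="mse")

# Rank a batch of candidate test cases by predicted score, highest first.
features = np.random.rand(100, N_HISTORY + N_META).astype("float32")
order = np.argsort(-model.predict(features, verbose=0).ravel())
```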

The Lerner APIs are split into three areas:

  1. Storing execution results.
  2. Getting recommendations based on the current state of the environment.
  3. Assigning a reward to the agent based on the execution result and predicted recommendations.

A process of getting recommendations and rewarding the agent using the APIs consists of 4 steps (a hypothetical client sketch follows the list):

  1. Out of all available test cases for a particular job — form a request that can be interpreted by Lerner. This involves aggregation of historical results and additional features.
  2. Lerner returns a recommendation identified with a unique episode id.
  3. A CI system can execute the recommendation and submit the execution results to Lerner based on the episode id.
  4. Call an API to assign a reward based on the agent id and episode id.
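A hypothetical client for these four steps might look like the following. The base URL, endpoint paths, and payload fields are invented for illustration and are not Lerner’s real API:

```python
import requests

BASE = "https://lerner.example.net"  # placeholder URL, not Lerner's real host
AGENT_ID = "sdk-regression-agent"    # hypothetical agent

# Step 1: form a request from candidate test cases and historical features.
candidates = [{"test_id": "test_playback", "features": [1, 0, 1]},
              {"test_id": "test_login", "features": [0, 1, 1]}]
payload = {"agent_id": AGENT_ID, "candidates": candidates}

# Step 2: Lerner returns a recommendation tagged with a unique episode id.
rec = requests.post(f"{BASE}/recommendations", json=payload).json()
episode_id, schedule = rec["episode_id"], rec["schedule"]

# Step 3: the CI system runs the schedule and submits results by episode id.
results = [{"test_id": t, "passed": True} for t in schedule]  # stubbed runs
requests.post(f"{BASE}/episodes/{episode_id}/results", json=results)

# Step 4: assign a reward keyed on agent id and episode id.
requests.post(f"{BASE}/agents/{AGENT_ID}/episodes/{episode_id}/reward",
              json={"reward": 1.0})
```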

Below is a diagram of the services and persistence layers that support the functionality of the Lerner API.

The self-service nature of the tool makes it easy for service owners to integrate with Lerner, create agents, ask agents for recommendations and reward them after execution results are available.

The metrics relevant to the training and recommendation process are reported to Atlas and visualized using Netflix’s Lumen. Users of the service can track the statistics specific to the agents they setup and deploy, which allows them to build their own dashboards.

We have identified some interesting patterns while doing online reinforcement learning.

  • The recommendation/execution reward cycle can happen without any prior training data.
  • We can bootstrap several CI jobs that would use agents with different reward functions, and gain additional insight based on the agents’ performance. It could help us design and implement more targeted reward functions.
  • We can keep a small amount of historical data to train agents. The data can be truncated after each execution and offloaded to a long-term storage for further analysis.

Some of the downsides:

  • It might take time for an agent to stop exploring and start exploiting the accumulated experience.
  • As agents are stored in a binary format in the database, an update of an agent from multiple jobs could cause a race condition in its state. Handling concurrency in the training process is cumbersome and requires trade-offs. We achieved the desired state by relying on the locking mechanisms of the underlying persistence layer that stores and serves agent binaries.

Thus, we have the luxury of training as many agents as we want that could prioritize and recommend test cases based on their unique learning experiences.

Outcome

We are currently piloting the system and have live agents serving predictions for various CI runs. At the moment we run Lerner-based CIs in parallel with CIs that either execute test cases in random order or use simple heuristics, such as sorting test cases by time and executing everything that previously failed first.

The system was built with simplicity and performance in mind, so the set of APIs are minimal. We developed client libraries that allow seamless, but opinionated, integration with Lerner.

We collect several metrics to evaluate the performance of a recommendation, with the main metrics being time to first failure and time to complete a whole scheduled run.

Lerner-based recommendations are proving to be different and more insightful than random runs, as they allow us to fit a particular time budget and detect patterns such as cases that tend to fail together in a cluster, cases that haven’t been run in a long time, and so on.

The graphs below show a more or less artificial case in which a schedule of 100+ test cases contains several flaky tests. The Y-axis represents how many minutes it took to complete the schedule or reach a first failed test case. In blue, we have random recommendations with no time budget constraints. In green, you can see executions based on Lerner recommendations under a time constraint of 60 minutes. The green spikes represent Lerner exploring the environment, while the wiggly lines around 0 are the executions that failed quickly as Lerner was exploiting its policy.

Execution of schedules that were randomly generated. Y-axis represents time to finish execution or reach first failure.
Execution of Lerner based schedules. You can see moments when Lerner was exploring the environment, and the wiggly lines represent when the schedule was generated based on exploiting existing knowledge.

Next Steps

The next phases of the project will focus on:

  • Reward functions that are aware of a comprehensive domain context, such as assigning appropriate rewards to states where infrastructure is fragile and test cases could not be run properly.
  • Administrative user-interface to manage agents.
  • More generic, simple, and user-friendly framework for reinforcement learning and agent deployment.
  • Using Lerner on all available CI jobs against all SDK versions.
  • Experiment with different neural network architectures.

If you would like to be a part of our team, come join us.


Lerner — using RL agents for test case scheduling was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Bubble Tea Craze on GrabFood!

Post Syndicated from Grab Tech original https://engineering.grab.com/bubble-tea-craze-on-grabfood

Bigger and More Bubble Tea!

Bubble tea orders on GrabFood have been increasing constantly and dramatically, with an impressive regional average growth rate of 3,000% in 2018! Just look at the percentage increase over the year, across all countries!

Bubble tea growth by percentage in 2018*, by country:

  • Indonesia: >8,500% growth from Jan 2018 to Dec 2018
  • Philippines: >3,500% growth from Jun 2018 to Dec 2018
  • Thailand: >3,000% growth from Jan 2018 to Dec 2018
  • Vietnam: >1,500% growth from May 2018 to Dec 2018
  • Singapore: >700% growth from May 2018 to Dec 2018
  • Malaysia: >250% growth from May 2018 to Dec 2018

*Time period: January 2018 to December 2018, or from the time GrabFood was launched.

What’s driving this growth is not just die-hard bubble tea fans who can’t go a week without drinking this sweet treat, but a growing bubble tea fan club in Southeast Asia. The number of bubble tea lovers on GrabFood grew over 12,000% in 2018 – and there’s no sign of stopping!

With increasing consumer demand, how is Southeast Asia’s bubble tea supply catching up? As of December 2018, GrabFood had close to 4,000 bubble tea outlets from a network of over 1,500 brands – a 200% growth in bubble tea outlets in Southeast Asia!

Bubble-Tea-Lover growth on GrabFood

If this stat doesn’t stick, here is a map to show you how much bubble tea orders in different Southeast Asian cities have grown!

Maps of bubble tea merchants on GrabFood

And here is a little shoutout to our star merchants including Chatime, Coco Fresh Tea & Juice, Macao Imperial Tea, Ochaya, Koi Tea, Cafe Amazon, The Alley, iTEA, Gong Cha, and Serenitea.

Just how much do you drink?

On average, Southeast Asians drink 4 cups of bubble tea per person per month on GrabFood. Thai consumers top the regional average by 2 cups, consuming about 6 cups of bubble tea per person per month. This is closely followed by Filipino consumers, who drink an average of 5 cups per person per month.

Average bubble tea consumption by cups per person per month

Favourite Flavours!

Have a look at the dazzling array of Bubble Tea flavours available on GrabFood today and you’ll find some uniquely Southeast Asian flavours like Chendol, Durian, and Gula Melaka, as well as rare flavours like salted cream and cheese! Can you spot your favourite flavours here?

Bubble tea flavour consumption per month

Let’s break it down by the countries that GrabFood serves, and see who likes which flavours of Bubble Tea more!

Bubble tea flavour consumption per month by country

Top the Toppings!

Pearl seems to be the unbeatable top topping in most countries, except Vietnam, where the No. 1 topping turned out to be Cheese Pudding! The top 3 toppings for your favourite bubble tea are:

Top list of toppings

Best Time for Bubble Tea!

Don’t we all need a cup of sweet Bubble Tea in the afternoon to get us through the day? Across Southeast Asia, GrabFood’s data reveals that most people order bubble tea to accompany their meals at lunch, or as a perfect midday energizer!

Times of the day when most people order bubble tea

Conclusion

So, hazelnut or chocolate? Pearl or pudding – or both (who says we can’t have the best of both worlds)? The options are abundant and the choice is yours to enjoy!

If you have a sweet tooth, or simply want to reward yourself with Southeast Asia’s most popular drink, go ahead – you are only a couple of taps away from savouring this cup full of delight!

Python at Netflix

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/python-at-netflix-bba45dae649e?source=rss----2615bd06b42e---4

By Pythonistas at Netflix, coordinated by Amjith Ramanujam and edited by Ellen Livengood

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there.

Open Connect

Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens afterwards (i.e., video streaming) takes place in the Open Connect network. Content is placed on the network of servers in the Open Connect CDN as close to the end user as possible, improving the streaming experience for our customers and reducing costs for both Netflix and our Internet Service Provider (ISP) partners.

Various software systems are needed to design, build, and operate this CDN infrastructure, and a significant number of them are written in Python. The network devices that underlie a large portion of the CDN are mostly managed by Python applications. Such applications track the inventory of our network gear: what devices, of which models, with which hardware components, located in which sites. The configuration of these devices is controlled by several other systems including source of truth, application of configurations to devices, and back up. Device interaction for the collection of health and other operational data is yet another Python application. Python has long been a popular programming language in the networking space because it’s an intuitive language that allows engineers to quickly solve networking problems. As a result, many useful libraries have been developed, making the language even more desirable to learn and use.

Demand Engineering

Demand Engineering is responsible for Regional Failovers, Traffic Distribution, Capacity Operations, and Fleet Efficiency of the Netflix cloud. We are proud to say that our team’s tools are built primarily in Python. The service that orchestrates failover uses numpy and scipy to perform numerical analysis, boto3 to make changes to our AWS infrastructure, and rq to run asynchronous workloads, all wrapped up in a thin layer of Flask APIs. The ability to drop into a bpython shell and improvise has saved the day more than once.
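As a rough illustration of that stack (not our actual service), a thin Flask endpoint that hands work to an rq queue could look like this:

```python
# Illustrative only: a thin Flask API over an rq queue.
from flask import Flask, jsonify
from redis import Redis
from rq import Queue

app = Flask(__name__)
queue = Queue(connection=Redis())  # assumes a reachable Redis instance

def evaluate_failover(region):
    """Placeholder for the numerical work (numpy/scipy) done per region."""
    return {"region": region, "ok": True}

@app.route("/failover/<region>", methods=["POST"])
def trigger_failover(region):
    job = queue.enqueue(evaluate_failover, region)  # runs asynchronously
    return jsonify({"job_id": job.get_id()}), 202
```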

We are heavy users of Jupyter Notebooks and nteract to analyze operational data and prototype visualization tools that help us detect capacity regressions.

CORE

The CORE team uses Python in our alerting and statistical analytical work. We lean on many of the statistical and mathematical libraries (numpy, scipy, ruptures, pandas) to help automate the analysis of thousands of related signals when our alerting systems indicate problems. We’ve developed a time series correlation system used both inside and outside the team, as well as a distributed worker system to parallelize large amounts of analytical work to deliver results quickly.
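For a flavor of the kind of analysis these libraries enable, here is a hedged sketch using ruptures to find a change point in a synthetic signal (the data and parameters are invented, not real telemetry):

```python
import numpy as np
import ruptures as rpt

# Synthetic signal with a mean shift at index 200.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=10)  # indices where behavior shifts, e.g. [200, 400]
```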

Python is also a tool we typically use for automation tasks, data exploration and cleaning, and as a convenient source for visualization work.

Monitoring, alerting and auto-remediation

The Insight Engineering team is responsible for building and operating the tools for operational insight, alerting, diagnostics, and auto-remediation. With the increased popularity of Python, the team now supports Python clients for most of their services. One example is the Spectator Python client library, a library for instrumenting code to record dimensional time series metrics. We build Python libraries to interact with other Netflix platform level services. In addition to libraries, the Winston and Bolt products are also built using Python frameworks (Gunicorn + Flask + Flask-RESTPlus).

Information Security

The information security team uses Python to accomplish a number of high leverage goals for Netflix: security automation, risk classification, auto-remediation, and vulnerability identification to name a few. We’ve had a number of successful Python open-source releases, including Security Monkey (our team’s most active open source project). We leverage Python to protect our SSH resources using Bless. Our Infrastructure Security team leverages Python to help with IAM permission tuning using Repokid. We use Python to help generate TLS certificates using Lemur.

Some of our more recent projects include Prism: a batch framework to help security engineers measure paved road adoption, risk factors, and identify vulnerabilities in source code. We currently provide Python and Ruby libraries for Prism. The Diffy forensics triage tool is written entirely in Python. We also use Python to detect sensitive data using Lanius.

Personalization Algorithms

We use Python extensively within our broader Personalization Machine Learning Infrastructure to train some of the Machine Learning models for key aspects of the Netflix experience: from our recommendation algorithms to artwork personalization to marketing algorithms. For example, some algorithms use TensorFlow, Keras, and PyTorch to learn Deep Neural Networks, XGBoost and LightGBM to learn Gradient Boosted Decision Trees or the broader scientific stack in Python (e.g. numpy, scipy, sklearn, matplotlib, pandas, cvxpy). Because we’re constantly trying out new approaches, we use Jupyter Notebooks to drive many of our experiments. We have also developed a number of higher-level libraries to help integrate these with the rest of our ecosystem (e.g. data access, fact logging and feature extraction, model evaluation, and publishing).
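As a small, self-contained illustration of the gradient-boosted-tree workflow mentioned above, here is the public xgboost API on synthetic data (this is not our actual model code):

```python
import numpy as np
import xgboost as xgb

# Synthetic data standing in for real training features and labels.
X = np.random.rand(1000, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=50)
scores = booster.predict(xgb.DMatrix(X))  # predicted probabilities
```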

Machine Learning Infrastructure

Besides personalization, Netflix applies machine learning to hundreds of use cases across the company. Many of these applications are powered by Metaflow, a Python framework that makes it easy to execute ML projects from the prototype stage to production.

Metaflow pushes the limits of Python: We leverage well parallelized and optimized Python code to fetch data at 10Gbps, handle hundreds of millions of data points in memory, and orchestrate computation over tens of thousands of CPU cores.
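A minimal flow written against Metaflow’s public API gives a sense of how such projects are structured; this toy example is not one of our production flows:

```python
from metaflow import FlowSpec, step

class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.shards = ["shard-a", "shard-b", "shard-c"]
        self.next(self.train, foreach="shards")  # fan out over shards

    @step
    def train(self):
        self.result = f"model for {self.input}"  # placeholder training step
        self.next(self.join)

    @step
    def join(self, inputs):
        self.models = [i.result for i in inputs]  # gather fan-out results
        self.next(self.end)

    @step
    def end(self):
        print(self.models)

if __name__ == "__main__":
    TrainFlow()
```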

Notebooks

We are avid users of Jupyter notebooks at Netflix, and we’ve written about the reasons and nature of this investment before.

But Python plays a huge role in how we provide those services. Python is a primary language when we need to develop, debug, explore, and prototype different interactions with the Jupyter ecosystem. We use Python to build custom extensions to the Jupyter server that allow us to manage tasks like logging, archiving, publishing, and cloning notebooks on behalf of our users.
We provide many flavors of Python to our users via different Jupyter kernels, and manage the deployment of those kernel specifications using Python.

Orchestration

The Big Data Orchestration team is responsible for providing all of the services and tooling to schedule and execute ETL and Adhoc pipelines.

Many of the components of the orchestration service are written in Python, starting with our scheduler, which uses Jupyter Notebooks with papermill to provide templatized job types (Spark, Presto, …). This allows our users to have a standardized and easy way to express work that needs to be executed. You can see some deeper details on the subject here. We have been using notebooks as real runbooks for situations where human intervention is required — for example: to restart everything that has failed in the last hour.
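Illustratively, executing a parameterized template with papermill’s public API looks like this (the notebook paths and parameters here are made up):

```python
import papermill as pm

pm.execute_notebook(
    "templates/spark_job.ipynb",         # parameterized template notebook
    "runs/spark_job_2019-06-01.ipynb",   # executed output, kept as a record
    parameters={"table": "playback_events", "ds": "2019-06-01"},
)
```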

Internally, we also built an event-driven platform that is fully written in Python. We have created streams of events from a number of systems that get unified into a single tool. This allows us to define conditions to filter events, and actions to react or route them. As a result of this, we have been able to decouple microservices and get visibility into everything that happens on the data platform.

Our team also built the pygenie client which interfaces with Genie, a federated job execution service. Internally, we have additional extensions to this library that apply business conventions and integrate with the Netflix platform. These libraries are the primary way users interface programmatically with work in the Big Data platform.

Finally, it’s been our team’s commitment to contribute to papermill and scrapbook open source projects. Our work there has been both for our own and external use cases. These efforts have been gaining a lot of traction in the open source community and we’re glad to be able to contribute to these shared projects.

Experimentation Platform

The scientific computing team for experimentation is creating a platform for scientists and engineers to analyze AB tests and other experiments. Scientists and engineers can contribute new innovations on three fronts: data, statistics, and visualizations.

The Metrics Repo is a Python framework based on PyPika that allows contributors to write reusable parameterized SQL queries. It serves as an entry point into any new analysis.
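As a sketch of what such a reusable, parameterized query might look like with PyPika’s public API (the table and column names are invented):

```python
from pypika import Query, Table, functions as fn

def metric_query(start_date, end_date):
    t = Table("play_delay")  # hypothetical metrics table
    return (Query.from_(t)
            .select(t.cell, fn.Avg(t.delay_ms).as_("avg_value"))
            .where(t.ds.between(start_date, end_date))
            .groupby(t.cell))

print(metric_query("2019-06-01", "2019-06-07"))  # renders to SQL
```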

The Causal Models library is a Python & R framework for scientists to contribute new models for causal inference. It leverages PyArrow and RPy2 so that statistics can be calculated seamlessly in either language.

The Visualizations library is based on Plotly. Since Plotly is a widely adopted visualization spec, there are a variety of tools that allow contributors to produce an output that is consumable by our platforms.
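For example, a figure built with Plotly’s public API can be serialized to a JSON spec that any Plotly-compatible renderer can consume (the data here is invented):

```python
import plotly.graph_objects as go

# Invented data: average play delay per test cell.
fig = go.Figure(go.Bar(x=["Cell 1", "Cell 2"], y=[0.12, 0.09],
                       name="play delay (s)"))
spec = fig.to_json()  # portable spec for downstream renderers
```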

Partner Ecosystem

The Partner Ecosystem group is expanding its use of Python for testing Netflix applications on devices. Python is forming the core of a new CI infrastructure, including controlling our orchestration servers, controlling Spinnaker, test case querying and filtering, and scheduling test runs on devices and containers. Additional post-run analysis is being done in Python using TensorFlow to determine which tests are most likely to show problems on which devices.

Video Encoding and Media Cloud Engineering

Our team takes care of encoding (and re-encoding) the Netflix catalog, as well as leveraging machine learning for insights into that catalog.
We use Python for ~50 projects such as vmaf and mezzfs; we build computer vision solutions using a media map-reduce platform called Archer; and we use Python for many internal projects.
We have also open sourced a few tools to ease development/distribution of Python projects, like setupmeta and pickley.

Netflix Animation and NVFX

Python is the industry standard for all of the major applications we use to create Animated and VFX content, so it goes without saying that we are using it very heavily. All of our integrations with Maya and Nuke are in Python, and the bulk of our Shotgun tools are also in Python. We’re just getting started on getting our tooling in the cloud, and anticipate deploying many of our own custom Python AMIs/containers.

Content Machine Learning, Science & Analytics

The Content Machine Learning team uses Python extensively for the development of machine learning models that are the core of forecasting audience size, viewership, and other demand metrics for all content.


Python at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How we harnessed the wisdom of crowds to improve restaurant location accuracy

Post Syndicated from Grab Tech original https://engineering.grab.com/correcting-restaurant-locations-harnessing-wisdom-of-the-crowd

While studying GPS ping data to understand how long our driver-partners needed to spend at restaurants during a GrabFood delivery, we came across an interesting observation. We realized that there was a significant proportion of restaurants where our driver-partners were waiting for abnormally short durations, often for just seconds.

Considering that it typically takes a driver a few minutes to enter the restaurant, pick up the order and then leave, we decided to dig further into this phenomenon. What we uncovered was that these super short pit stops were restaurants that were registered at incorrect coordinates within the system, due to reasons such as the restaurant moving to a new location, or human error during onboarding. Incorrectly registered locations within our system impact all involved parties – eaters may not see the restaurant because it falls outside their delivery radius, or they may see an incorrect ETA; drivers may have trouble finding the restaurant and may end up having to cancel the order; and restaurants may get fewer orders without really knowing why.

So we asked ourselves – how can we improve this situation by leveraging the wealth of data that we have? 

The Solution

One of the biggest advantages we have is the huge driver-partner fleet we have on the ground in cities across Southeast Asia. They know the roads and cities like the back of their hand, and they are resourceful. As a result, they are often able to find the restaurants and complete orders even if the location was registered incorrectly. Knowing this, we looked at GPS pings and timestamps from these drivers, and combined this information with when they indicated that they have ordered or collected food from the restaurant. This is then used to infer the “pick-up location” from which the food was collected. 

Inferring this location is not so straightforward though. GPS ping quality can vary significantly across devices and will be affected by whether the device is outdoors or indoors (e.g. if the restaurant is inside a mall). Hence we compute metrics from times and distances between pings, ping frequency and ping quality to filter out orders where the GPS quality is determined to be sub-par. The thresholds for such filtering are determined based on a statistical analysis of orders by regions and times of day. 

One of the outcomes of such an analysis is that we deemed it acceptable to consider a driver “at” a restaurant if their GPS ping falls within a predetermined radius of the registered location of the restaurant. However, knowing that a driver is at the restaurant does not necessarily tell us “when” he or she is actually at the restaurant. See the following figure for an example.

Map showing driver paths and GPS location


As you can see from the area covered by the green circle, there are 3 distinct occurrences or “streaks” when the driver can be determined to be at the restaurant location – once when they are approaching the restaurant from the southwest before taking two right turns, then again when they are actually at the restaurant coming in from the northeast, and again when they leave the restaurant heading southwest before making a U-turn and then heading northeast. In this case, if the driver indicates that they have collected the food during the second streak, chronology is respected – the driver reaches the restaurant, the driver collects the food, the driver leaves the restaurant. However if the driver indicates that they have collected the food during one of the other streaks, that is an invalid pick-up even though it is “at” the restaurant.

Such potentially invalid pick-ups could result in noisy estimates of restaurant location, as well as hamper us in our parent task of accurately estimating how long drivers need to wait at restaurants. Therefore, we modify the definition of the driver being at the restaurant to only include the longest streak, i.e. the time when the driver spent the longest within the registered location radius.
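A hedged sketch of this “longest streak” heuristic, with an invented ping format and using ping count as a proxy for time spent, might look like:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def longest_streak(pings, restaurant, radius_m=150):
    """pings: time-ordered (timestamp, lat, lon) tuples. Returns the longest
    contiguous run of pings within radius_m of the registered location."""
    best, current = [], []
    for ts, lat, lon in pings:
        if haversine_m(lat, lon, *restaurant) <= radius_m:
            current.append((ts, lat, lon))
        else:
            if len(current) > len(best):
                best = current
            current = []
    return max(best, current, key=len)

pings = [(0, 1.2800, 103.8500), (30, 1.2801, 103.8501), (60, 1.2900, 103.8600)]
streak = longest_streak(pings, restaurant=(1.28005, 103.85005))
```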

Extending this across multiple orders and drivers, we can form a cluster of pick-up locations (both “at” and otherwise) for each restaurant. Each restaurant then gets ranked through a combination of:

  • Order volume: Restaurants which receive more orders are likely to have more valid signals for any predictions we make, increasing the confidence we have in our estimates.

  • Fraction of orders where the pick-up location was not “at” the restaurant: This fraction indicates the number of orders with a pick-up location not near the registered restaurant location (with “near” defined both spatially and temporally as above). A higher value indicates a higher likelihood of the restaurant not being at the registered location, subject to order volume.

  • Median distance between registered and estimated locations: This factor is used to rank restaurants by a notion of “importance”. A restaurant which is just outside the fixed radius from above can be addressed after another restaurant which is a kilometre away.

This ranked list of restaurants is then passed on to our mapping operations team to verify. The team checks various sources to verify if the restaurant is incorrectly located which is then fed back to the GrabFood system and the locations updated accordingly.

Results

  • We have a system to catch and fix obvious errors

The table below shows a few examples of errors we were able to catch and fix. The image on the left shows the distance between an incorrectly registered address and the actual location of the restaurant.

Restaurant              Path from registered to estimated location    Zoomed-in view of estimated location
Sederhana Minang        (map image)                                   (map image)
Papa Ron’s Pizza        (map image)                                   (map image)
Rich-O Donuts & Cafe    (map image)                                   (map image)

Fixing these errors periodically greatly reduced the median error distance (measured as the straight line distance between the estimated location and registered location) in each city as restaurant locations were corrected.

Median error distance over time in Bangkok and Ho Chi Minh City
  • We helped to reduce cancellations

We also tracked the number of GrabFood orders cancelled because the restaurant could not be found by our driver-partners as indicated on the app. Once we started making periodic updates, we saw a 5x decrease in cancellations because of incorrect restaurant locations. 

Relative cancellation rate due to incorrect location
  • We discovered some interesting findings!

In some cases, we were actually stumped when trying to correct some of the locations according to what the system estimated. One of the most interesting examples was the restaurant “Waroeng Steak and Shake” in Bekasi. According to our system, the restaurant’s location was further up Jalan Raya Jatiwaringin than we thought it to be. 

Waroeng Steak and Shake map location

Examining this on Google Maps, we noticed that both locations oddly seemed to have a branch of the restaurant. What was going on here? 

Waroeng Steak and Shake map location on Google Maps

By looking at Google Reviews (credit to my colleague Kenneth Loh for the idea), we realized that the restaurant seemed to have changed its location, and this is what our system was picking up on.

Waroeng Steak and Shake Google Maps reviews

In summary, the system was able to respond to a change in location for the restaurant without any active action taken by the restaurant and while other data sources had duplicates. 

What’s Next?

Going forward, we are looking to automate some aspects of this workflow. Currently, the validation part is handled by our mapping operations team, and we are looking to feed their validation and actions back into the system so that we can fine-tune various hyperparameters (registered location radii, normalization factors, etc.) and/or train more advanced models that are cognizant of different geo and driver characteristics in different markets.

Additionally, while we know that we should expect poor results for some scenarios (e.g. inside malls, due to poor GPS quality and often approximate registered locations), we can extract such information (that a restaurant is inside a mall, in this case) through a combination of manual feedback from operations teams and drivers, as well as automated NLP techniques such as name and address parsing and entity recognition.

In the end, it is always useful to question the predictions that a system makes. By looking at some abnormally small wait times at restaurants, we were able to discover, provide feedback and continually update restaurant locations within the GrabFood ecosystem resulting in an overall better experience for our eaters, driver-partners and merchant-partners.

Design Principles for Mathematical Engineering in Experimentation Platform at Netflix

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/design-principles-for-mathematical-engineering-in-experimentation-platform-15b3ea143b1f?source=rss----2615bd06b42e---4

Jeffrey Wong, Senior Modeling Architect, Experimentation Platform
Colin McFarland, Director, Experimentation Platform

At Netflix, we have data scientists coming from many backgrounds such as neuroscience, statistics and biostatistics, economics, and physics; each of these backgrounds has a meaningful contribution to how experiments should be analyzed. To unlock these innovations we are making a strategic choice that our focus should be geared towards developing the surrounding infrastructure so that scientists’ work can be easily absorbed into the wider Netflix Experimentation Platform. There are 2 major challenges to succeed in our mission:

  1. We want to democratize the platform and create a contribution model: with a developer and production deployment experience that is designed for data scientists and friendly to the stacks they use.
  2. We have to do it at Netflix’s scale: For hundreds of millions of users across hundreds of concurrent tests, spanning many deployment strategies from traditional A/B experiments, to evolving areas like quasi experiments.

Mathematical engineers at Netflix in particular work on the scalability and engineering of models that estimate treatment effects. They develop scientific libraries that scientists can apply to analyze experiments, and also contribute to the engineering foundations to build a scientific platform where new research can graduate to. In order to produce software that improves a scientist’s productivity we have come up with the following design principles.

1. Composition

Data Science is a curiosity-driven field, and should not be unnecessarily constrained[1]. We support data scientists to have freedom to explore research in any new direction. To help, we provide software autonomy for data scientists by focusing on composition, a design principle popular in data science software like ggplot2 and dplyr[2]. Composition exposes a set of fundamental building blocks that can be assembled in various combinations to solve complex problems. For example, ggplot2 provides several lightweight functions like geom_bar, geom_point, geom_line, and theme, that allow the user to assemble custom visualizations; every graph whether simple or complex can be composed of small, lightweight ggplot2 primitives.

In the democratization of the experimentation platform we also want to allow custom analysis. Since converting every experiment analysis into its own function for the experimentation platform is not scalable, we are making the strategic bet to invest in building high quality causal inference primitives that can be composed into an arbitrarily complex analysis. The primitives include a grammar for describing the data generating process, generic counterfactual simulations, regression, bootstrapping, and more.

2. Performance

If our software is not performant it could limit adoption, subsequent innovation, and business impact. This will also make graduating new research into the experimentation platform difficult. Performance can be tackled from at least three angles:

A) Efficient computation

We should leverage the structure of the data and of the problem as much as possible to identify the optimal compute strategy. For example, if we want to fit ridge regression with various different regularization strengths we can do an SVD upfront and express the full solution path very efficiently in terms of the SVD.
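As a sketch of that trick (one reasonable implementation, not the platform’s actual code), a single SVD of X yields the ridge solution for every regularization strength almost for free:

```python
import numpy as np

def ridge_path(X, y, lambdas):
    """Ridge solutions for many lambdas from one SVD of X.
    Uses beta(lambda) = V diag(s / (s^2 + lambda)) U^T y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # computed once
    Uty = U.T @ y
    return np.array([Vt.T @ ((s / (s**2 + lam)) * Uty) for lam in lambdas])

X, y = np.random.rand(500, 20), np.random.rand(500)
betas = ridge_path(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])  # one row per lambda
```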

B) Efficient use of memory

We should optimize for sparse linear algebra. When there are many linear algebra operations, we should understand them holistically so that we can optimize the order of operations and not materialize unnecessary intermediate matrices. When indexing into vectors and matrices, we should index contiguous blocks as much as possible to improve spatial locality[3].

C) Compression

Algorithms should be able to work on raw data as well as compressed data. For example, regression adjustment algorithms should be able to use frequency weights, analytic weights, and probability weights[4]. Compression algorithms can be lossless, or lossy with a tuning parameter to control the loss of information and impact on the standard error of the treatment effect.
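For instance, a lossless version of this idea collapses duplicate rows into frequency weights; fitting with sample weights then reproduces the point estimates of a fit on the raw data (a toy sketch with invented data):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

raw = pd.DataFrame({"treated": [0, 0, 0, 1, 1, 1, 1],
                    "y":       [1.0, 1.0, 2.0, 3.0, 3.0, 3.0, 4.0]})

# Compress: identical (treated, y) rows become one row plus a count.
compressed = raw.groupby(["treated", "y"]).size().reset_index(name="n")

model = LinearRegression().fit(compressed[["treated"]], compressed["y"],
                               sample_weight=compressed["n"])
print(model.coef_, model.intercept_)  # same point estimates as fitting `raw`
```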

3. Graduation

We need a process for graduating new research into the experimentation platform. The end to end data science cycle usually starts with a data scientist writing a script to do a new analysis. If the script is used several times it is rewritten into a function and moved into the Analysis Library. If performance is a concern, it can be refactored to build on top of high performance causal inference primitives made by mathematical engineers. This is the first phase of graduation.

The first phase will have a lot of iterations. The iterations go in both directions: data scientists can promote functions into the library, but they can also use functions from the library in their analysis scripts.

The second phase interfaces the Analysis Library with the rest of the experimentation ecosystem. This is the promotion of the library into the Statistics Backend, and negotiating engineering contracts for input into the Statistics Backend and output from the Statistics Backend. This can be done in an experimental notebook environment, where data scientists can demonstrate end to end what their new work will look like in the platform. This enables them to have conversations with stakeholders and other partners, and get feedback on how useful the new features are. Once the concepts have been proven in the experimental environment, the new research can graduate into the production experimentation platform. Now we can expose the innovation to a large audience of data scientists, engineers and product managers at Netflix.

4. Reproducibility

Reproducibility builds trustworthiness, transparency, and understanding for the platform. Developers should be able to reproduce an experiment analysis report outside of the platform using only the backend libraries. The ability to replicate, as well as rerun the analysis programmatically with different parameters is crucial for agility.

5. Introspection

In order to get data scientists involved with the production ecosystem, whether for debugging or innovation, they need to be able to step through the functions the platform is calling. This level of interaction goes beyond reproducibility. Introspectable code allows data scientists to check data, the inputs into models, the outputs, and the treatment effect. It also allows them to see where the opportunities are to insert new code. To make this easy we need to understand the steps of the analysis, and expose functions to see intermediate steps. For example we could break down the analysis of an experiment as

  • Compose data query
  • Retrieve data
  • Preprocess data
  • Fit treatment effect model
  • Use treatment effect model to estimate various treatment effects and variances
  • Post process treatment effects, for example with multiple hypothesis correction
  • Serialize analysis results to send back to the Experimentation Platform

It is difficult for a data scientist to step through the online analysis code. Our path to introspectability is to power the analysis engine using python and R, a stack that is easy for a data scientist to step through. By making the analysis engine a python and R library we will also gain reproducibility.

6. Scientific Code in Production and in Offline Environments

In the causal inference domain data scientists tend to write code in python and R. We intentionally are not rewriting scientific functions into a new language like Java, because that will render the library useless for data scientists since they cannot integrate optimized functions back into their work. Rewriting poses reproducibility challenges since the python/R stack would need to match the Java stack. Introspection is also more difficult because the production code requires a separate development environment.

We choose to develop high performance scientific primitives in C++, which can easily be wrapped into both python and R, and also delivers on highly performant, production quality scientific code. In order to support the diversity of the data science teams and offer first class support for hybrid stacks like python and R, we standardize data on the Apache Arrow format in order to facilitate data exchange to different statistics languages with minimal overhead.
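A small example of that exchange using pyarrow’s public API (the file name is invented): write an Arrow file once, and both the Python and R sides (via the arrow R package) can read it with minimal overhead:

```python
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"cell": [1, 1, 2, 2],
                  "play_delay_ms": [410, 395, 420, 388]})

# Feather v2 is the Arrow IPC file format; R's `arrow` package reads it too.
feather.write_feather(table, "analysis_input.feather")
roundtrip = feather.read_table("analysis_input.feather")
```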

7. Well Defined Point of Entry, Well Defined Point of Exit

Our causal inference primitives are developed in a pure, scientific library, without business logic. For example, regression can be written to accept a feature matrix and a response vector, without any specific experimentation data structures. This makes the library portable, and allows data scientists to write extensions that can reuse the highly performant statistics functions for their own adhoc analysis. It is also portable enough for other teams to share.

Since these scientific libraries are decoupled from business logic, they will always be sandwiched in any engineering platform; upstream will have a data layer, and downstream will have a visualization and interpretation layer. To facilitate a smooth data flow, we need to design simple connectors. For example, all analyses need to receive data and a description of the data generating process. By focusing on composition, an arbitrary analysis can be constructed by layering causal analysis primitives on top of that starting point. Similarly, the end of an analysis will always consolidate into one data structure. This simplifies the workflow for downstream consumers so that they know what data type to consume.

Next Steps

We are actively developing high performance software for regression, heterogeneous treatment effects, longitudinal studies and much more for the Experimentation Platform at Netflix. We aim to accelerate research in causal inference methodology, expedite product innovation, and ultimately bring the best experience and delight to our members. This is an ongoing journey, and if you are passionate about our exciting work, join our all-star team!


Design Principles for Mathematical Engineering in Experimentation Platform at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Post Syndicated from Grab Tech original https://engineering.grab.com/peak-shift-demand-travel-trends

Credits: Photo by rawpixel on Unsplash


Stuck in traffic in a Grab ride? Pass the time by opening your Grab app and checking out the Feed – just scroll down! You’ll find widgets for games, polls, videos, news and even food recommendations!

Beyond serving your everyday needs, we want to provide our users with information that is interesting, useful and relevant. That’s why we’re always coming up with new widgets.

Building each widget takes close collaboration across multiple different teams – from Product Management to Design, Engineering, Behavioral Science, and Data Science and Analytics. Sounds like a lot of people, doesn’t it? But you’ll be surprised to hear that this behind-the-scenes collaboration works rapidly, usually in the span of one month! Which means we’re often moving from ideation phase to product release in just a few weeks.

Travel Trends Widget

This fast-and-furious process is anchored on one word – “customer-centric”. And that’s how it all began with our “Travel Trends Widget” – a widget that provides passengers with an overview of historical supply and demand trends for their current location and nearby time periods.

Because we had so much fun developing this widget, we wanted to write a blog post to share with you what we did and how we did it!

Inspiration: Where it all started

Transport demand can be rather lumpy. Owing to organic patterns (e.g. office hours), a lot of passengers tend to request cars around the same time. In periods like this, the increase in demand can outpace the arrival of driver supply, increasing the waiting time for passengers.

Our goal at Grab is to make sure people get a ride when they want it and at the price they want, so we got to thinking about how we can ease this friction by leveraging our treasure trove – Big Data! – to help our passengers better plan their trips.

As we were looking at the data, we noticed that there is a seasonality to demand and supply: at certain times and days, imbalances appear, peak and disappear, and the process repeats itself. Studies say that humans in general, unless shown a compelling reason or benefit for change, are habitual beings subject to inertia. So we set out to achieve exactly that: To create a widget to surface information to our passengers that may help them alter their decisions on when they choose to book a ride, thereby redistributing some of the present peak demands to periods just before and after peak – also known as “peak shifting the demand”!

While this widget is the first-of-its-kind in the ride-hailing industry, “peak-shifting” was actually coined and introduced long ago!

London Transport Museum Trends

As you can see from this post from the London Transport Museum (Source: Transport for London), London tube tried peak-shifting long before anyone else: Original Ad from 1928 displayed on the left, and Ad from 2015 displayed on the right, comparing the trends to 1928.

Trends from a hotel in Beijing

You may also have seen something similar at the last hotel you stayed at. Notice here a poster in an elevator at a Beijing hotel, announcing the best times to eat breakfast in comfort and avoid the crowd. (Photo credits to Prashant, our Product Manager, who saw this on holiday.)

To apply “peak-shifting” and help our users better plan their trips, we decided to dig in and leverage our data. It was way more complex than we had initially thought, as market conditions could be different on different days. This meant that generic statements like “5PM-8PM are peak hours and prices will be high” would not hold true. Contrary to general perception, we observed that even during peak hours, there are buckets of time when there is no surge or low surge.

For instance, plots 1 and 2 below show what typical Monday and Tuesday surges look like in a given month. One of the key insights is that surge trends during peak hours differ between Monday and Tuesday, reinforcing our initial hypothesis that every day is unique.

So we used machine learning techniques to build a forecasting widget that helps our users plan their trips beforehand by providing pricing trends for the next 2 hours. With a bit of flexibility, riders can ride the tide!

Grab trends

So how exactly does this widget work?!

Historical trends for Monday

It pulls together historically observed imbalances between supply and demand for the consumer’s current location and nearby time periods. Aggregated data is displayed to consumers in easily interpreted visualisations, so that they can plan to leave at times when there is more supply, and with potentially more savings on fares.

How did we build the widget? Loop, agile working process, POC & workstream

Widget-building is an agile, collaborative, and simultaneous process. First, we started with analysis from the Product Analytics team, pulling out data on traffic trends, surge patterns, and behavioural insights of both passengers and drivers in Singapore.

When we noticed the existence of seasonality for each day of the week, we came up with more precise analytical and business questions to dig deeper into the data. Upon verification of hypotheses, we decided that we will build a widget.

Then the Behavioural Science, UX (User Experience) Design and Product Management teams joined, and started giving shape to the problem we were solving. Our Behavioural Scientists shared their expertise on how information, suggestions and choices should be presented to enable easy assimilation and beneficial action. Daily whiteboarding breakouts, endless back-and-forth conversations, and a healthy amount of challenge-and-accept culture ensured that we distilled the idea down to its core. We then presented the relevant information with just the right level of detail, and with the right amount of messaging, to allow users to take the intended action, i.e. shift their demand outside of peak periods if possible.

Our amazing regional Copywriting team then swung into action to put our intent into words in 7 different languages for our users across South-East Asia. Simultaneously, our UX designers and Full-stack Engineers started exploring the best visual components to communicate the time-trend data to users. More on this later, but suffice it to say that plenty of ideas were explored and discarded in a collaborative process that aimed to create something intuitive and engaging, yet robust and scalable enough to work across all types of devices.

While these designs made their way to engineering, the Data Science team worked on finding the most rigorous method to deduce the historical trend of surge across all our cities and areas, and the time periods within them. There were discussions on how best to store and update this data reliably so that the widget could access it with great performance.

Soon after, we went into development, and voila! We had the first iteration of the widget ready on our staging (internal testing) servers in just 2 weeks! This prototype was opened up to the core team for an influx of feedback.

And just two weeks later, the widget made its way to our Singapore and Jakarta Feeds, accessible to the world at large! Feedback from our users started pouring in almost immediately (thanks to the rich feedback functionality that comes with each widget), ranging from great to sometimes not-so-great, and we listened to all of it with a keen ear! And thus began a new cycle of iterations and continuous improvement, more of which we will share in a subsequent post.

In the trenches with the creators: How multiple teams got together to make this come true

Various disciplines within our cross-functional team came together to whip up this widget, each contributing their expertise to the end product.

Using Behavioural Science to simplify choices and design good outcomes

Behavioural Science helped us explore many facets of consumer behaviour in planning and designing the widget: understanding how consumers think, and conceptualizing a widget that can be easily understood and used by consumers.

While fares are governed entirely by market conditions, it's important for us to explain the economics to customers. As a customer-centric company, we aim to make consumers feel that they own their decisions, which they can make based on full information. And this is the role of Behavioural Scientists at Grab!

In guiding customers through the information, the Behavioural Science team had the following three objectives in mind while building this Travel Trends widget:

  1. Offer transparency on fares: By exposing our historic surge levels for a 4-hour period, we wanted to ensure that passengers are aware of the surge levels and do not treat the fare as a nasty shock.
  2. Give information that helps them plan: By showing surge levels for the next 2 hours, we wanted to help customers who have the flexibility to plan for a better time, giving them the power to decide based on transparent information.
  3. Provide helpful tips: Every bar gives users tips on the conditions at that time and in the immediate future. For instance, a low surge bar followed by a high surge bar gives the tip "Psst… Leave now, It might get busy later!", helping people understand the graph and nudging them to take action; a toy version of this rule is sketched after this list. If you are interested in saving on fares, may we suggest tapping around all the bars to reveal the secret pro-tips?
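Here is what such a rule could look like in Python. The threshold and every message except the first tip are invented for illustration; the real logic and copy are Grab's own:

```python
HIGH_SURGE = 1.5  # illustrative threshold, not a production value

def tip_for_bar(bars, i):
    """Pick the pro-tip revealed when a user taps bar i of the widget."""
    current, upcoming = bars[i], bars[i + 1:]
    if current < HIGH_SURGE and any(b >= HIGH_SURGE for b in upcoming):
        return "Psst… Leave now, It might get busy later!"
    if current >= HIGH_SURGE and upcoming and all(b < HIGH_SURGE for b in upcoming):
        return "Hang on a bit, fares should ease up soon."  # invented copy
    return "Fares look steady around this time."  # invented copy

bars = [1.1, 1.2, 1.9, 1.7, 1.3, 1.1, 1.0, 1.0]
print(tip_for_bar(bars, 0))  # low bar with a high bar ahead -> "leave now" tip
```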

Designing interfaces that lead to consumer success by abstracting complexity

The Design team is the one behind the colors and shapes that make up the widget you see and interact with! The team took inspiration from Google's Popular Times.

Source/Credits: Google Live Popular Times

Right from the outset, our content and product teams were keen to surface additional information and actions with each bar to keep the widget interactive and useful. One of the early challenges was to arrive at the right gesture: one that invites the user to interact with and intuitively navigate the bars, but does not conflict with other gestures (e.g. scrolling and scrubbing) that the user was pre-trained to perform on the feed. We found that tapping was an unused and yet intuitive gesture that we could use for interaction with the bars.

We then went through rounds of iteration on the visual design of the widget, involving multiple stakeholders ranging from Product to Content to Engineering. We had to overcome a number of constraints, i.e. the limited canvas of a widget and the context of a user exploring the feed. By re-using existing libraries and components, we managed to keep the development light and ship something fast.

GrabCar trends near you

Dozens of revisions and four iterations later, we landed on a design that we felt equipped the feature for its user-facing goal, and did so in a manner that was aesthetically appealing!

And finally, we managed to deliver on the feature's goal by surfacing just the right level of detail, in a manner that is intuitive yet effective at peak-shifting demand.

Bringing all of this to fruition through high performance engineering

Our Development Engineering team was in charge of building the widget and making it available to our users in just a few weeks' time, materialising the work of the other teams.

One of their challenges was to find the best way to process the vast amount of data (millions of database entries) so that it could be visualised as simple bar charts. Grab's engineers had to achieve this while keeping performance as resilient as possible.

There were two options in doing this:

a) Fetch the data directly from the DB for each API call; or

b) Store the data in an in-memory data structure, refreshed on a schedule, so that a user's API call no longer has to hit the DB.

Considering that this feature would likely attract a lot of traffic, and thus a high QPS, we decided that the former option would be too costly. Ultimately, we chose the latter since it is more performant and more scalable.
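A minimal sketch of the latter approach in Python, assuming hypothetical names like fetch_trends_from_db and an illustrative refresh interval; the production service is, of course, more involved:

```python
import threading

REFRESH_INTERVAL_SECONDS = 900  # illustrative: rebuild the cache every 15 min

# In-memory cache: (city, area) -> list of per-bucket surge levels.
_trends_cache = {}
_cache_lock = threading.Lock()

def fetch_trends_from_db():
    """Stand-in for the periodic DB aggregation over millions of entries."""
    return {("singapore", "orchard"): [1.0, 1.2, 1.8, 1.4, 1.1, 1.0, 1.3, 1.9]}

def refresh_cache():
    """Rebuild the cache on a schedule so user API calls never touch the DB."""
    global _trends_cache
    fresh = fetch_trends_from_db()
    with _cache_lock:
        _trends_cache = fresh
    timer = threading.Timer(REFRESH_INTERVAL_SECONDS, refresh_cache)
    timer.daemon = True  # don't keep the process alive just for refreshes
    timer.start()

def get_trends(city, area):
    """API handler body: a pure in-memory lookup, cheap even at high QPS."""
    with _cache_lock:
        return _trends_cache.get((city, area), [])

refresh_cache()
print(get_trends("singapore", "orchard"))
```

A staleness of up to one refresh interval is acceptable here because the widget shows historical trends rather than live prices.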

At the frontend, the challenge was to cater to the intricate requests from our designers. We used chart libraries to increase our development speed, but not all of the requirements were readily supported by these libraries.

For instance, a library may make visualising charts easy but not customising them. When our designers wanted the average line rendered as a dotted line, the library did not support this out of the box. Likewise, the moving arrow pointers as you move between bars, and the bars changing color when tapped, all required countless CSS tweaks.

CSS tweak on trends widget

Closing the product loop with user feedback and data driven insights

One of the most crucial parts of launching any product is ensuring that customers engage with it and find it useful.

To understand what customers think about the widget, whether they find it useful, and whether it helps them plan better, we delved into the huge mine of clickstream data.
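As a stylised illustration of the kind of query involved, consider the snippet below; the event names and data are made up, not Grab's actual instrumentation:

```python
import pandas as pd

# Toy clickstream extract: one row per event.
events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "event": ["booking_created", "trends_widget_tap",
              "booking_created",
              "booking_created", "trends_widget_tap", "trends_widget_tap"],
})

bookers = set(events.loc[events["event"] == "booking_created", "user_id"])
tappers = set(events.loc[events["event"] == "trends_widget_tap", "user_id"])

# Share of booking users who also interacted with the widget.
interaction_rate = len(bookers & tappers) / len(bookers)
print(f"{interaction_rate:.0%} of booking users interacted with the widget")
```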

User feedback on the trends widget

We found that 1 in 3 users who make a booking each day interact with the widget, and of those, more than 70% have given it a positive rating. This validates our initial hypothesis that, given the option, our customers will love the freedom to plan their trips, and that the widget fosters a more transparent ecosystem.

These users also indicated what they like most about the widget: 61% gave a positive rating for usefulness, 20% were impressed by the design (kudos to our fantastic designer Ajmal!!), and 13% praised its usability.

Tweet about the widget

Beyond internal data, our widget also made some rounds on social media channels. For example, here is a screenshot of what our users have to say on Twitter.

We closely track these metrics on user engagement and feedback to ensure that we keep improving and coming up with new iterations that help us serve our customers better.

Conclusion

We hope you enjoyed reading about how we went from ideation, through iterations to a finished widget in the hands of the user, all in 1 month! Many hands helped along the way. If you are interested in joining this hyper-proactive problem-solving team, please check out Grab’s career site!

And if you have feedback for us, we are here to listen! While we could not be happier to see the positive reaction from the public, we are also thrilled to hear your suggestions and advice. Please leave us a memo using the widget's comment function!

Epilogue

We just released an upgrade to this widget that allows users to set reminders and be notified about the availability of good fares in a time period of their choosing. We will keep watch and come knocking! Go ahead, find the widget on your Grab feed, set a reminder, and save on fares on your next ride!