
Open-Sourcing a Monitoring GUI for Metaflow

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/open-sourcing-a-monitoring-gui-for-metaflow-75ff465f0d60

Open-Sourcing a Monitoring GUI for Metaflow, Netflix’s ML Platform

tl;dr Today, we are open-sourcing a long-awaited GUI for Metaflow. The Metaflow GUI allows data scientists to monitor their workflows in real time, track experiments, and see detailed logs and results for every executed task. The GUI can be extended with plugins, allowing the community to build integrations to other systems, add custom visualizations, and embed upcoming features of Metaflow directly into its views.

Metaflow is a full-stack framework for data science that we started developing at Netflix over four years ago and which we open-sourced in 2019. It allows data scientists to define ML workflows, test them locally, scale out to the cloud, and deploy to production, all in idiomatic Python code. Since open-sourcing it, the Metaflow community has been growing quickly: it is now the 7th most starred active project on Netflix’s GitHub account with nearly 4800 stars. Outside Netflix, Metaflow is used to power machine learning in production by hundreds of companies across industries from bioinformatics to real estate.
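
To make this concrete, here is a minimal sketch of what a flow looks like; the flow, step, and variable names are made up for illustration, and only the FlowSpec/@step API comes from Metaflow:

```python
# A minimal, illustrative Metaflow flow. The flow, step, and variable names
# are hypothetical; the FlowSpec/@step/self.next API is Metaflow's.
from metaflow import FlowSpec, step


class TrainingFlow(FlowSpec):

    @step
    def start(self):
        # Artifacts assigned to self are versioned and stored automatically.
        self.alphas = [0.1, 0.5, 1.0]
        self.next(self.train, foreach="alphas")

    @step
    def train(self):
        # Each foreach branch runs as its own task, locally or in the cloud.
        self.score = 1.0 - self.input  # placeholder for real model training
        self.next(self.join)

    @step
    def join(self, inputs):
        self.best = max(i.score for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("best score:", self.best)


if __name__ == "__main__":
    TrainingFlow()
```

Running the file with the run command (python training_flow.py run) executes the flow locally; the same code can be scaled out to the cloud with Metaflow’s decorators.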

Since its inception, Metaflow has been a command-line-centric tool. It makes it easy for data scientists to express even complex machine learning applications in idiomatic Python, test them locally, or scale them out in the cloud — all using their favorite IDEs and terminals. Following our culture of freedom and responsibility, Metaflow grants data scientists the freedom to choose the right modeling approach, handle data and features flexibly, and construct workflows easily while ensuring that the resulting project executes responsibly and robustly on the production infrastructure.

As the number and criticality of projects running on Metaflow increased — some of which are very central to our business — our ML platform team started receiving a growing number of support requests. Frequently, the questions were of the nature “can you help me understand why my flow takes so long to execute” or “how can I find the logs for a model that failed last night.” Technically, Metaflow provides a Python API that allows the user to inspect all these details, e.g., in a notebook, but writing code in a notebook to answer basic questions like this felt like overkill and unnecessarily tedious. After observing the situation for months, we started forming an understanding of the kind of new user interface that could address the growing needs of our users.
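
For reference, answering a question like the ones above with the client API in a notebook looks roughly like this (the flow name and IDs are hypothetical):

```python
# Inspecting the latest run of a flow with the Metaflow client API.
# "TrainingFlow" and the step/task ids are hypothetical; Flow and the
# run/step/task objects are real Metaflow client classes.
from metaflow import Flow

run = Flow("TrainingFlow").latest_run     # most recent execution of the flow
print(run.finished_at, run.successful)

for step in run:
    for task in step:
        if not task.successful:
            print(task.pathspec)          # e.g. TrainingFlow/1234/train/5678
            print(task.stderr)            # logs of the failed task
```

This works, but it is exactly the kind of boilerplate the GUI removes.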

Requirements for a Metaflow GUI

Metaflow is a human-centered system by design. We consider our Python API and the CLI to be integral parts of the overall user interface and user experience, which singularly focuses on making it easier to build production-ready ML projects from scratch. In our approach, Python code provides a highly expressive and productive user interface for expressing complex business logic, such as ML models and workflows. At the same time, the CLI allows users to execute specific commands quickly and even automate common actions. When it comes to complex, real-life development work like this, it would be hard to achieve the same level of productivity on a graphical user interface.

However, textual UIs are quite lacking when it comes to discoverability and getting a holistic understanding of the system’s state. The questions we were hearing reflected this gap: we were lacking a user interface that would allow the users, quite simply, to figure out quickly what is happening in their Metaflow projects.

Netflix has a long history of developing innovative tools for observability, so when we began to specify requirements for the new GUI, we were able to leverage experiences from the previous GUIs built for other use cases, as well as real-life user stories from Metaflow users. We wanted to scope the GUI tightly, focusing on a specific gap in the Metaflow experience:

  1. The GUI should allow the users to see what flows and tasks are executing and what is happening inside them. Notably, we didn’t want to replace any of the functionality in the Metaflow APIs or CLI with the GUI — just to complement them. This meant that the GUI would be read-only: all actions like writing code and starting executions should happen in the users’ IDE and terminal as before. We also had no need to build a model-monitoring GUI yet, which is a wholly separate problem domain.
  2. The GUI would be targeted at professional data scientists. Instead of a fancy GUI for demos and presentations, we wanted a serious productivity tool with carefully thought-out user workflows that would fit seamlessly into our data science toolchain. This requires attention to small details: for instance, users should be able to copy a link to any view in the GUI and share it, e.g., on Slack, for easy collaboration and support (or to integrate with the Metaflow Slack bot). And there should be natural affordances for navigating between the CLI, the GUI, and notebooks.
  3. The GUI should be scalable and snappy: it should handle our existing repository consisting of millions of runs, some of which contain tens of thousands of tasks, without hiccups. Based on our experience with other GUIs operating at Netflix scale, this is not a trivial requirement: scalability needs to be baked into the design from the very beginning. Sluggish GUIs are hard to debug and fix afterwards, and they can have a significantly negative impact on productivity.
  4. The GUI should integrate well with other GUIs. A modern ML stack consists of many independent systems like data warehouses, compute layers, model serving systems, and, in particular, notebooks. It should be possible to find runs and tasks of interest in the Metaflow GUI and use a task-specific view to jump to other GUIs for further information. Our landscape of tools is constantly evolving, so we didn’t want to hardcode these links and views in the GUI itself. Instead, following the integration-friendly ethos of Metaflow, we want to embed relevant information in the GUI as plugins.
  5. Finally, we wanted to minimize the operational overhead of the GUI. In particular, under no circumstances should the GUI impact Metaflow executions. The GUI backend should be a simple service, optionally sitting alongside the existing Metaflow metadata service, providing a read-only, real-time view to the stored state. The frontend side should be easily extensible and maintainable, suggesting that we wanted a modern React app.

Monitoring GUI for Metaflow

As our ML Platform team had limited frontend resources, we reached out to Codemate to help with the implementation. As often happens in software engineering projects, the project took longer than expected to finish, mostly because tracking and visualizing thousands of concurrent objects in real time in a highly distributed environment is a surprisingly non-trivial problem (duh!). After countless iterations, we are finally very happy with the outcome, which we have now used in production for a few months.

When you open the GUI, you see an overview of all flows and runs, both current and historical, which you can group and filter in various ways:

Runs Grouped by flows

This view doubles as experiment tracking: Metaflow records every execution automatically, so data scientists can see all their work here. Naturally, the view can be grouped by user. Users can also tag their runs and filter the view by tags, allowing them to focus on particular subsets of experiments.

After you click a specific run, you see all its tasks on a timeline:

Timeline view for a run

The timeline view is extremely useful for understanding performance bottlenecks, the distribution of task runtimes, and finding failed tasks. At the top, you can see global attributes of the run, such as its status, start time, and parameters. You can click a specific task to see more details:

Task view

This task view shows logs produced by a task, its results, and optionally links to other systems that are relevant to the task. For instance, if the task had deployed a model to a model serving platform, the view could include a link to a UI used for monitoring microservices.

As specified in our requirements, the GUI should work well with the Metaflow CLI. To facilitate this, the top bar includes a navigation component where the user can copy-paste any pathspec, i.e., a path to any object in the Metaflow universe; pathspecs are prominently shown in the CLI output. This way, the user can easily move from the CLI to the GUI to observe runs and tasks in detail.
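
Pathspecs are plain slash-delimited identifiers, so the same string that appears in the CLI output or in the GUI address bar also resolves to an object in the Python client. A small sketch with made-up IDs:

```python
# A pathspec such as "TrainingFlow/1234/train/5678" identifies one task
# (flow / run id / step / task id). The IDs here are made up; Run and Task
# are real Metaflow client classes.
from metaflow import Run, Task

run = Run("TrainingFlow/1234")                # a whole run
task = Task("TrainingFlow/1234/train/5678")   # a single task within it
print(task.stdout)                            # the task's captured logs
```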

While the CLI is great, it is not well suited to visualizing flows. Each flow can be represented as a Directed Acyclic Graph (DAG), and the GUI provides a much better way to visualize one. The DAG view presents all the steps of a flow and how they are related. Each step may carry developer comments, and steps are colored to indicate their current state. Split steps are grouped by shaded boxes, while steps that participated in a foreach are grouped by a double-shaded box. Clicking on a step takes you to the Task view.

DAG View

Users at different organizations will likely have some special use cases that are not directly supported, so the Metaflow GUI is extensible through its plugin API. For example, Netflix has its own container orchestration platform called Titus, and users can configure tasks to run on Titus to scale up or out. When failures happen, users need to access their Titus containers for more information; within the task view, a simple plugin provides a link for further troubleshooting.

Example task-level plugin

Try it at home!

We know that our user stories and requirements for a Metaflow GUI are not unique to Netflix. A number of companies in the Metaflow community have requested a GUI for Metaflow in the past. To support the thriving community and invite third-party contributions to the GUI, we are open-sourcing our Monitoring GUI for Metaflow today!

You can find detailed instructions for how to deploy the GUI here. If you want to see the GUI in action before deploying it, Outerbounds, a new startup founded by our ex-colleagues, has deployed a public demo instance of the GUI. Outerbounds also hosts an active Slack community of Metaflow users where you can find support for GUI-related issues and share feedback and ideas for improvement.

With the new GUI, data scientists don’t have to fly blind anymore. Instead of reaching out to a platform team for support, they can easily see the state of their workflows on their own. We hope that Metaflow users outside Netflix will find the GUI equally beneficial, and companies will find creative ways to improve the GUI with new plugins.

For more context on the development process and motivation for the GUI, you can watch this recording of the GUI launch meetup.



CAMBI, a banding artifact detector

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/cambi-a-banding-artifact-detector-96777ae12fe2

by Joel Sole, Mariana Afonso, Lukas Krasula, Zhi Li, and Pulkit Tandon

Introducing the banding artifact detector developed by Netflix to further improve the quality of the video we deliver

Banding artifacts can be pretty annoying. But, first of all, you may wonder, what is a banding artifact?

Banding artifact?

You are at home enjoying a show on your brand-new TV. Great content delivered at excellent quality. But then, you notice some bands in an otherwise beautiful sunset scene. What was that? A sci-fi plot twist? Some device glitch? More likely, banding artifacts, which appear as false staircase edges in what should be smoothly varying image areas.

Bands can show up in the sky in that sunset scene, in dark scenes, in flat backgrounds, etc. In any case, we don’t like them, nor should anybody be distracted from the storyline by their presence.

Just a subtle change in the video signal can cause banding artifacts. This slight variation in the value of some pixels disproportionately impacts the perceived quality. Bands are more visible (and annoying) when the viewing conditions are right: large TV with good contrast and a dark environment without screen reflections.

Some examples below. Since we don’t know where and when you are reading this blog post, we exaggerate the banding artifacts, so you get the gist. The first example is from the opening scene of one of our first shows. Check out the sky. Do you see the bands? The viewing environment (background brightness, ambient lighting, screen brightness, contrast, viewing distance) influences the bands’ visibility. You may play with those factors and observe how the perception of banding is affected.

Banding artifacts are also found in compressed images, as in this one we have often used to illustrate the point:

Even the Voyager encountered banding along the way; xkcd 🙂

How annoying is it?

We set up an experiment to measure the perceived quality in the presence of banding artifacts. We asked participants to rate the impact of the banding artifacts on a scale from 0 (unwatchable) to 100 (imperceptible) for a range of videos with different resolutions, bit-rates, and dithering. Participants rated 86 videos in total. Most of the content was banding-prone, while some was not. The collected mean opinion scores (MOS) covered the entire scale.

According to usual metrics, the videos in the experiment with perceptible banding should be mid to high-quality (i.e., PSNR>40dB and VMAF>80). However, the experiment scores show something entirely different, as we’ll see below.

You can’t fix it if you don’t know it’s there

Netflix encodes video at scale. Likewise, video quality is assessed at scale within the encoding pipeline, not by an army of humans rating each video. This is where objective video quality metrics come in, as they automatically provide actionable insights into the actual quality of an encode.

PSNR has been the primary video quality metric for decades: it is based on the average pixel distance of the encoded video to the source video. In the case of banding, this distance is tiny compared to its perceptual impact. Consequently, there is little information about banding in the PSNR numbers. The data from the subjective experiment confirms this lack of correlation between PSNR and MOS:

Another video quality metric is VMAF, which Netflix jointly developed with several collaborators and open-sourced on Github. VMAF has become a de facto standard for evaluating the performance of encoding systems and driving encoding optimizations, being a crucial factor for the quality of Netflix encodes. However, VMAF does not specifically target banding artifacts. It was designed with our streaming use case in mind, in particular, to capture the video quality of movies and shows in the presence of encoding and scaling artifacts. VMAF works exceptionally well in the general case, but, like PSNR, lacks correlation with MOS in the presence of banding:

VMAF, PSNR, and other commonly used video quality metrics don’t detect banding artifacts properly and, if we can’t catch the issue, we cannot take steps to fix it. Ideally, our wish list for a banding detector would include the following items:

  • High correlation with MOS for content distorted with banding artifacts
  • Simple, intuitive, distortion-specific, and based on human visual system principles
  • Consistent performance across the different resolutions, qualities, and bit-depths delivered in our service
  • Robust to dithering, which video pipelines commonly introduce

We didn’t find any algorithm in the literature that fit our purposes. So we set out to develop one.

CAMBI

We hand-crafted an algorithm to meet our requirements in a traditional NNN (non-neural-network) way. The result is a white-box solution derived from first principles, with just a few visually motivated parameters: the contrast-aware multiscale banding index (CAMBI).

A block diagram describing the steps involved in CAMBI is shown below. CAMBI operates as a no-reference banding detector taking a (distorted) video as an input and producing a banding visibility score as the output. The algorithm extracts pixel-level maps at multiple scales for frames of the encoded video. Subsequently, it combines these maps into a single index motivated by the human contrast sensitivity function (CSF).

Pre-processing

Each input frame goes through up to three pre-processing steps.

The first step extracts the luma component: although chromatic banding exists, like most past works, we assume that most of the banding can be captured in the luma channel. The second step is converting the luma channel to 10-bit (if the input is 8-bit).

Third, we account for the presence of dithering in the frame. Dithering is intentionally applied noise that randomizes quantization error and has been shown to reduce banding visibility. To handle both dithered and non-dithered content, we use a 2×2 filter to smooth the intensity values, replicating the low-pass filtering done by the human visual system.
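
As a rough illustration, the pre-processing above could be sketched as follows with numpy; this is not the libvmaf implementation, and details such as whether the 2×2 filter also downsamples are glossed over:

```python
# Simplified sketch of CAMBI-style pre-processing (illustrative only).
import numpy as np


def preprocess(luma: np.ndarray, bit_depth: int) -> np.ndarray:
    """luma: 2-D array of luma samples; returns a smoothed 10-bit-range map."""
    luma = luma.astype(np.uint16)

    # Convert 8-bit luma to the 10-bit range so all content shares one scale.
    if bit_depth == 8:
        luma = luma << 2

    # 2x2 mean filter to smooth dithering, mimicking the low-pass filtering
    # of the human visual system. (The real filter may differ.)
    padded = np.pad(luma, ((0, 1), (0, 1)), mode="edge").astype(np.float32)
    return (padded[:-1, :-1] + padded[:-1, 1:] +
            padded[1:, :-1] + padded[1:, 1:]) / 4.0
```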

Multiscale Banding Confidence

We consider banding detection a contrast-detection problem, and hence banding visibility is largely governed by the CSF. The CSF itself depends mainly on the perceived contrast across a step and the spatial frequency of the steps. CAMBI explicitly accounts for the contrast across pixels by looking at the differences in pixel intensity, and does this at multiple scales to account for spatial frequency. This is done by calculating pixel-wise banding confidence at different contrasts and scales, each referred to as a CAMBI map for the frame. The banding confidence computation also considers the sensitivity to change in brightness, which depends on the local brightness. At the end of this process, twenty CAMBI maps are obtained per frame, capturing banding across four contrast steps and five scales.
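
The shape of that computation can be pictured as a nested loop over contrasts and scales, producing the twenty maps; the confidence function below is a toy stand-in, not the actual CAMBI computation:

```python
# 4 contrast steps x 5 scales -> 20 CAMBI maps per frame (structure only;
# banding_confidence() is a toy stand-in for the real computation).
import numpy as np

NUM_CONTRASTS = 4   # contrast steps, in 10-bit code values
NUM_SCALES = 5      # spatial scales


def banding_confidence(luma: np.ndarray, contrast: int) -> np.ndarray:
    # Toy stand-in: flag pixels whose right/bottom neighbor differs by exactly
    # `contrast` code values. The real confidence also weights the result by
    # sensitivity to change at the local brightness.
    luma = np.rint(luma)
    conf = np.zeros(luma.shape, dtype=np.float32)
    dx = np.abs(np.diff(luma, axis=1))
    dy = np.abs(np.diff(luma, axis=0))
    conf[:, :-1] += (dx == contrast)
    conf[:-1, :] += (dy == contrast)
    return conf / 2.0


def cambi_maps(luma: np.ndarray) -> list:
    maps = []
    for _scale in range(NUM_SCALES):
        for contrast in range(1, NUM_CONTRASTS + 1):
            maps.append(banding_confidence(luma, contrast))
        luma = luma[::2, ::2]  # placeholder for a proper downscale
    return maps  # 20 maps, grouped by scale
```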

Spatio-Temporal Pooling

CAMBI maps are spatiotemporally pooled to obtain the final banding index. Spatial pooling is done based on the observation that CAMBI maps belong to the initial linear phase of the CSF. First, pooling is applied in the contrast dimension by keeping the maximum weighted contrast for each position. The result is five maps, one per scale. There is an example of such maps further down in this post.

Since regions with the poorest quality dominate the perceived quality of the video, only a percentage of the pixels, those with the most banding, is considered during spatial pooling for the maps at each scale. The resulting scores per scale are linearly combined with CSF-based weights to derive the CAMBI for each frame.

According to our experiments, CAMBI is temporally stable within a single video shot, so a simple average suffices as a temporal pooling mechanism across frames. However, note that this assumption breaks down for videos with multiple shots with different characteristics.
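
Putting the pooling stages together, a highly simplified sketch might look like this; the percentile and per-scale weights are placeholders, not the tuned CAMBI values:

```python
# Sketch of CAMBI-style pooling: contrast pooling -> spatial pooling of the
# worst pixels per scale -> weighted combination -> temporal average.
# SCALE_WEIGHTS and WORST_FRACTION are placeholders, not the real constants.
import numpy as np

SCALE_WEIGHTS = [0.2, 0.2, 0.2, 0.2, 0.2]   # placeholder CSF-based weights
WORST_FRACTION = 0.1                         # placeholder "worst pixels" share


def pool_frame(maps):  # maps: 20 arrays grouped by scale (see previous sketch)
    num_scales = len(SCALE_WEIGHTS)
    num_contrasts = len(maps) // num_scales
    scores = []
    for s in range(num_scales):
        # Contrast pooling: keep the maximum (weighted) contrast per pixel.
        group = maps[s * num_contrasts:(s + 1) * num_contrasts]
        per_scale = np.stack(group).max(axis=0)
        # Spatial pooling: only the pixels with the most banding count.
        flat = np.sort(per_scale.ravel())[::-1]
        k = max(1, int(len(flat) * WORST_FRACTION))
        scores.append(flat[:k].mean())
    return float(np.dot(SCALE_WEIGHTS, scores))


def cambi_score(frames_maps):
    # Temporal pooling: a simple average across the frames of a shot.
    return float(np.mean([pool_frame(m) for m in frames_maps]))
```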

CAMBI agrees with the subjective assessments

Our results show that CAMBI provides a high correlation with MOS while, as illustrated above, VMAF and PSNR have very little correlation. The table reports two correlation coefficients, namely Spearman Rank Order Correlation (SROCC) and Pearson’s Linear Correlation (PLCC):

The following plot shows that CAMBI correlates well with subjective scores and that a CAMBI of around 5 is where banding starts to be slightly annoying. Note that, unlike the two quality metrics, CAMBI correlates inversely with MOS: the higher the CAMBI score, the more perceptible the banding, and thus the lower the quality.

Staring at the sunset

We use this sunset as an example of banding and how CAMBI scores it. Below we also show the same sunset in false color, so the bands stand out even more.

There is no banding in the sea portion of the image. In the sky, the size of the bands increases with the distance from the sun. The five maps below, one per scale, capture the confidence of banding at different spatial frequencies. These maps are further spatially pooled, accounting for the CSF, giving a CAMBI score of 19 for the frame, which perceptually corresponds to somewhere between ‘annoying’ and ‘very annoying’ banding according to the MOS data.

Open-source and next steps

A banding detection mechanism robust to multiple encoding parameters can help identify the onset of banding in videos and serve as the first step towards its mitigation. In the future, we hope to leverage CAMBI to develop a new version of VMAF that can account for banding artifacts.

We open-sourced CAMBI as a new standalone feature in libvmaf. Similar to VMAF, CAMBI is an organic project expected to be gradually improved over time. We welcome any feedback and contributions.

Acknowledgments

We want to thank Christos Bampis, Kyle Swanson, Andrey Norkin, and Anush Moorthy for the fruitful discussions and all the participants in the subjective tests that made this work possible.



The Show Must Go On: Securing Netflix Studios At Scale

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/the-show-must-go-on-securing-netflix-studios-at-scale-19b801c86479

Written by Jose Fernandez, Arthur Gonigberg, Julia Knecht, and Patrick Thomas

Netflix Zuul Open Source Logo

In 2017, Netflix Studios was hitting an inflection point from a period of merely rapid growth to the sort of explosive growth that throws “how do we scale?” into every conversation. The vision was to create a “Studio in the Cloud”, with applications supporting every part of the business from pitch to play. The security team was working diligently to support this effort, faced with two apparently contradictory priorities:

  1. streamline any security processes so that we could get applications built and deployed to the public internet faster
  2. raise the overall security bar so that the accumulated risk of this giant and growing portfolio of newly internet-facing, high-sensitivity assets didn’t exceed its value

The journey to resolve that contradiction has been a collaboration that we’re proud of, and that we think exemplifies how Netflix approaches infrastructure product development and product security partnerships. You’ll hear from two teams here: first Application Security, and then Cloud Gateway.

Julia & Patrick (Netflix Application Security): In deciding how to address this, we focused on two observations. The first was that there were too many security things that each software team needed to think about — things like TLS certificates, authentication, security headers, request logging, rate limiting, among many others. There were security checklists for developers, but they were lengthy and mostly manual, neither of which contributed to the goal of accelerating development. Adding to the complexity, many of the checklist items themselves had a variety of different options to fulfill them (“new apps do this, but legacy apps do that”; “Java apps should use this approach, but Ruby apps should try one of these four things”… yes, there were flowcharts inside checklists. Ouch.). For development teams, just working through the flowcharts of requirements and options was a monumental task. Supporting developers through those checklists for edge cases, and then validating that each team’s choices resulted in an architecture with all the desired security properties, was similarly not scalable for our security engineers.

Our second observation centered on strong authentication as our highest-leverage control. Missing or incomplete authentication in an application was the most critical type of issue we regularly faced, while at the same time, an application that had a bulletproof authentication story was an application we considered to be lower risk. Concepts like Zero Trust, Beyond Corp, and Identity Aware Proxies all seemed to point the same way: there is powerful assurance in making 100% authentication a property of the architecture of the application rather than an implementation detail within an application.

With both of these observations in hand, we looked at the challenge through a lens that we have found incredibly valuable: how do we productize it? Netflix engineers talk a lot about the concept of a “Paved Road”. One especially attractive part of a Paved Road approach for security teams with a large portfolio is that it helps turn lots of questions into a boolean proposition: Instead of “Tell me how your app does this important security thing?”, it’s just “Are you using this paved road product that handles that?”. So, what would a product look like that could tackle most of the security checklist for a team, and that also could give us that architectural property of guaranteed authentication? With these lofty goals in mind, we turned to our central engineering teams to help get us there.

Partnering to Productize Security

Jose & Arthur (Netflix Cloud Gateway): The Cloud Gateway team develops and operates Netflix’s “Front Door”. Historically we have been responsible for connecting, routing, and steering internet traffic from Netflix subscribers to services in the cloud. Our gateways are powered by our flagship open-source technology Zuul. When Netflix Studios and our security partners approached us, the proposal was conceptually simple and a good fit for our modular, filter-based approach. To try it out, we deployed a custom Zuul build (which we named “API Wall” and eventually, more affectionately, “Wall-E”) with a new filter for Netflix’s Single-Sign-On provider, enabled it for all requests, and boom! — an application deployment strategy that guarantees authentication for services behind it.

Wall-E logical diagram showing a proxy with distinct filters

Killing the Checklist

Once we worked together to integrate our SSO with Wall-E, we had established a pretty exciting pattern of adding security requirements as filters. We thought back to our checklist through the lens of: which of these things are consistent enough across applications to add as a required filter? Our web application firewall (WAF), DDoS prevention, security header validation, and durable logging all fit the bill. One by one, we saw our checklists’ requirements bite the dust, and shift from ‘individual app developer-owned’ to ‘Wall-E owned’ (and consistently implemented!).

By this point, it was clear that we had achieved the vision in the AppSec team’s original request. We eventually were able to add so much security leverage into Wall-E that the bulk of the “going internet-facing” checklist for Studio applications boiled down to one item: Will you use Wall-E?

A small section of our go-external security questionnaire and checklist for studio apps before Wall-E and after Wall-E.

The Early Adopter Challenge

Wall-E’s early adopters were handpicked and nudged along by the Application Security team. Back then, the Cloud Gateway team had to work closely with application developers to provide a seamless migration without disrupting users. These joint efforts took several weeks for both parties. During our initial consultations, it was clear that developers preferred prioritizing product work over security or infrastructure improvements. Our meetings usually ended like this: “Security suggested we talk to you, and we like the idea of improving our security posture, but we have product goals to meet. Let’s talk again next quarter”. These conversations surfaced a couple of problems we knew we had to overcome to address this early adopter challenge:

  1. Setting up Wall-E for an application took too much time and effort, and the hands-on approach would not scale.
  2. Security improvements alone were not enough to drive organic adoption in Netflix’s “context not control” culture.

We were under pressure to improve our adoption numbers and decided to focus first on the setup friction by improving the developer experience and automating the onboarding process.

Scaling With Developer Experience

Developers in the Netflix streaming world compose the customer-facing Netflix experience out of hundreds of microservices, reachable by complex routing rules. On the Netflix Studio side, in Content Engineering, each team develops distinct products with simpler routing needs. To support that very different model, we did another thing that seemed simple at the time but has had an outsized impact over the years: we asked app teams to integrate with us by creating a version-controlled YAML file. Originally this was intended as a simplified and developer-friendly way to help collect domain names and some routing rules into a versionable package, but we quickly realized we had stumbled into a powerful model: we were harvesting developer intent.

An interactive Wall-E configuration wizard, and a concise declarative format for an application’s routing, resource, and authentication decisions

This small change was a kind of magic, and completely flipped our relationship with development teams: since we had a concise, standardized definition of the app they intended to expose, we could proactively automate a lot of the setup. Specify a domain name? Wall-E can ensure that it automagically exists, with DNS and TLS configured correctly. Iterating on this experience eventually led to other intent-based streamlining, like asking about intended user populations and related applications (to select OAuth configs and claims). We could now tell developers that setting up Wall-E would only take a few minutes and that our tooling would automate everything.

Going Faster, Faster

As all of these pieces came together, app teams outside Studio took notice. For a typical paved road application with no unusual security complications, a team could go from “git init” to a production-ready, fully authenticated, internet-accessible application in a little less than 10 minutes. The automation of the infrastructure setup, combined with reducing risk enough to streamline security review, saves developers days, if not weeks, on each application. Developers didn’t necessarily care that the original motivating factor was security: what they saw in practice was that apps using Wall-E could get in front of users sooner and iterate faster.

This created that virtuous cycle that core engineering product teams get incredibly excited about: more users make the amortized platform investment more valuable, but they also bring more ideas and clarity for feature ideas, which in turn attract more users. This set the tone for the next year of development, along two tracks: fixing adoption blockers, and turning more “developer intent” into product features to just handle things for them.

For adoption, both the security team and our team were asking the same question of developers: Is there anything that prevents you from using Wall-E? Each time we got an answer to that question, we tried to figure out how we could address it. Nearly all of the blockers related to systems in which (usually for historical reasons) some application team was solving both authentication and application routing in a custom way. Examples include legacy mTLS and various webhook schemes. With Wall-E as a clear, durable, paved road choice, we finally had enough of a carrot to move these teams away from supporting unique, potentially risky features. The value proposition wasn’t just “let us help you migrate and you’ll only ever have to deal with incoming traffic that is already properly authenticated”, it was also “you can throw away the services and manual processes that handled your custom mechanisms and offload any responsibility for authentication, WAF integration and monitoring, and DDoS protection to the platform”. Overall, we cannot overstate the value of organizationally committing to a single paved road product to handle these kinds of concerns. It creates amazing clarity and strategic pressure that helps align the actual services that teams operate with the charters and expertise that define them. The difference between 2–4 “right-ish” ways and a single paved road one is powerful.

Also, with fewer exceptions and clearer criteria for apps that should adopt this paved road, our AppSec Engineering and User Focused Security Engineering (UFSE) teams could automate security guidance to give more appropriate automated nudges for adoption. Every leader’s security risk dashboard now includes a Wall-E adoption metric, and roughly ⅔ of recommended apps have chosen to adopt it. Wall-E now fronts over 350 applications, and is adding roughly 3 new production applications (mostly internet-facing) per week.

Automated guidance data, showing the percentage of applications recommended to use Wall-E which have taken it up. The jumpiness in the number of apps recommended for adoption is real: as adoption blockers were discovered then eventually solved, and as we standardized guidance across the company, our automated recommendations reflected these developments.

As adoption continued to increase, we looked at various signals of developer intent for good functionality to move from development-team-owned to platform-owned. One particularly pleasing example turned out to be UI hosting: it popped up over and over again as both an awkward exception to our “full authentication” goal, and also oftentimes the only thing that required Single Page App (SPA) UI teams to run actual cloud instances and have to be on-call for infrastructure. This eventually matured into an opinionated, declarative asset service that abstracts static file hosting for application teams: developers get fast static asset deployments, security gets strong guardrails around UI applications, and Netflix overall has fewer cloud instances to manage (and pay for!). Wall-E became a requirement for the best UI developer experience, and that drove even more adoption.

A productized approach also meant that we could efficiently enable lots of complex but “nice to have” features to enhance the developer experience, like Atlas metrics for free, and integration with our request tracing tool, Edgar.

From Product to Platform

You may have noticed a word sneak into the conversation up there… “platform”. Netflix has a Developer Productivity organization: teams dedicated to helping other developers be more effective. A big part of their work is this idea of harvesting developer intent and automating the necessary touchpoints across our systems. As these teams came to see Wall-E as the clear answer for many of their customers, they started integrating their tools to configure Wall-E from the even higher-level developer intents they were harvesting. In effect, this moves authentication and traffic routing (and everything else that Wall-E handles) from being a specific product that developers need to think about and make a choice about, to just a fact that developers can trust and generally ignore. In 2019, essentially 100% of Wall-E app configuration was done manually by developers. In 2021, that interaction has changed dramatically: now more than 50% of app configuration in Wall-E is done by automated tools (which act on higher-level abstractions on behalf of developers).

This scale and standardization again multiplies value: our internal risk quantification forecasts show compelling annualized savings in risk and incident response costs across the Wall-E portfolio. These applications have fewer, less severe, and less exploitable bugs compared to non-Wall-E apps, and we rarely need an urgent response from app owners (we call this not-getting-paged-at-midnight-as-a-service). Developer time saved on initial application setup and unneeded services additionally adds up on the order of team-months of productivity per year.

Looking back to the core need that started us down this road (“streamline any security processes […]” and “raise the overall security bar […]”), Wall-E’s evolution to being part of the platform cements and extends the initial success. Going forward, more and more apps and developers can benefit from these security assurances while needing to think less and less about them. It’s an outcome we’re quite proud of.

Let’s Do More Of That

To briefly recap, here are a few of the things that we take away from this journey:

  • If you can do one thing to manage a large product security portfolio, do bulletproof authentication; preferably as a property of the architecture
  • Security teams and central engineering teams can and should have a collaborative, mutually supportive partnership
  • “Productizing” a capability (e.g., clearly articulated, with a defined value proposition, branded, and measured), even for internal tools, is useful to drive adoption and find further value
  • A specific product makes the “paved road” clearer; a boolean “uses/doesn’t use” is strongly preferable to various options with subtle caveats
  • Hitch the security wagon to developer productivity
  • Harvesting intent is powerful; it lets many teams add value

What’s Next

We see incredible power in this kind of security/infrastructure partnership work, and we’re excited to leverage these wins into our next goal: to truly become an infrastructure-as-a-service provider by building a full-fledged Gateway API, thereby handing off ownership of the developer experience to our partner teams in the Developer Productivity organization. This will allow us to focus on the challenges that will come our way to the next milestone: 1000 applications behind Wall-E.

If this kind of thing is exciting to you, we are hiring for both of these teams: Senior Software Engineer and Engineering Manager on Application Networking; and Senior Security Partner and Appsec Senior Software Engineer.

With special thanks to Cloud Gateway and InfoSec team members past and present, especially Sunil Agrawal, Mikey Cohen, Will Rose, Dilip Kancharla, our partners on Studio & Developer Productivity, and the early Wall-E adopters that provided valuable feedback and ideas. And also to Queen for the song references we slipped in; tell us if you find ’em all.



Netflix Drive

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/netflix-drive-a607538c3055

A file and folder interface for Netflix Cloud Services

Written by Vikram Krishnamurthy, Kishore Kasi, Abhishek Kapatkar, Tejas Chopra, Prudhviraj Karumanchi, Kelsey Francis, Shailesh Birari

In this post, we are introducing Netflix Drive, a Cloud drive for media assets, and providing a high-level overview of some of its features and interfaces. We intend this to be the first post in a series covering Netflix Drive. In future posts, we will do an architectural deep dive into the various components of Netflix Drive.

Netflix, and particularly its Studio applications (and Studio in the Cloud), produces petabytes of data backed by billions of media assets. Artists and workflows may be globally distributed and work on different projects, and each of these projects produces content that forms part of a large corpus of assets.

Here is an example of globally distributed production where several artists and workflows work in conjunction to create and share assets for one or many projects.

Fig 1: Globally distributed production with artists working on different assets from different parts of the world

There are workflows in which these artists may want to view a subset of these assets from this large dataset, for example, pertaining to a specific project. These artists may want to create personal workspaces and work on generating intermediate assets. To support such use cases, access control at the user workspace and project workspace granularity is extremely important for presenting a globally consistent view of pertinent data to these artists.

Netflix Drive aims to solve this problem of exposing different namespaces and attaching appropriate access control to help build a scalable, performant, globally distributed platform for storing and retrieving pertinent assets.

Netflix Drive is envisioned to be a Cloud Drive for Studio and Media applications, and it lends itself to being a generic paved-path solution for all content in Netflix.

It exposes a file/folder interface for applications to save their data and an API interface for control operations. Netflix Drive relies on a data store that will be the persistent storage layer for assets, and a metadata store which will provide a relevant mapping from the file system hierarchy to the data store entities. The major pieces, as shown in Fig. 2, are the file system interface, the API interface, and the metadata and data stores. We will delve into these in the following sections.

Fig 2: Netflix Drive components

File interface for Netflix Drive

Creative applications such as Nuke, Maya, and Adobe Photoshop store and retrieve content using files and folders. Netflix Drive relies on FUSE (Filesystem in Userspace) to provide a POSIX files-and-folders interface to such applications. A FUSE-based POSIX interface provides feature customization elasticity and deployment configuration flexibility, as well as a standard and seamless file/folder interface. A similar user-space abstraction is available for Windows (WinFSP) and macOS (macFUSE).

The operations that originate from user, application, and system actions on files and folders translate to a well-defined set of function and system calls, which are forwarded by the Linux Virtual File System layer (or a pass-through/filter driver in Windows) to the FUSE layer in user space. The resulting metadata and data operations will be implemented by appropriate metadata and data adapters in Netflix Drive.

Fig 3: POSIX interface of Netflix Drive

The POSIX files and folders interface for Netflix Drive is designed as a layered system with the FUSE implementation hooks forming the top layer. This layer will provide entry points for all of the relevant VFS calls that will be implemented. Netflix Drive contains an abstraction layer below FUSE which allows different metadata and data stores to be plugged into the architecture by having their corresponding adapters implement the interface. We will discuss more about the layered architecture in the section below.
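
To give a feel for what such a layered FUSE implementation can look like, here is a heavily simplified sketch using the fusepy bindings; the adapter classes and their methods are hypothetical, not the actual Netflix Drive code:

```python
# Minimal sketch of a FUSE layer delegating to pluggable metadata/data
# adapters, in the spirit of Netflix Drive's layered design. Only the fusepy
# API (FUSE, Operations, FuseOSError) is real; the adapters are hypothetical
# stubs to be implemented per store.
import errno

from fuse import FUSE, FuseOSError, Operations


class MetadataAdapter:
    """Hypothetical interface to a metadata store (e.g., Redis, CockroachDB)."""
    def stat(self, path):       # returns a stat-like dict, or None if missing
        return None
    def list_dir(self, path):   # returns child entry names
        return []


class DataAdapter:
    """Hypothetical interface to a data store (e.g., S3, Ceph)."""
    def read(self, path, size, offset):
        return b""


class DriveLikeFS(Operations):
    def __init__(self, meta, data):
        self.meta = meta
        self.data = data

    def getattr(self, path, fh=None):
        st = self.meta.stat(path)
        if st is None:
            raise FuseOSError(errno.ENOENT)
        return st  # dict with st_mode, st_size, st_mtime, ...

    def readdir(self, path, fh):
        return [".", ".."] + list(self.meta.list_dir(path))

    def read(self, path, size, offset, fh):
        return self.data.read(path, size, offset)


if __name__ == "__main__":
    FUSE(DriveLikeFS(MetadataAdapter(), DataAdapter()), "/mnt/drive",
         foreground=True)
```

Swapping in a different store is then a matter of implementing another adapter, without touching the FUSE layer.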

API Interface for Netflix Drive

Along with exposing a file interface which will be a hub of all abstractions, Netflix Drive also exposes API and Polled Task interfaces to allow applications and workflow tools to trigger control operations in Netflix Drive.

For example, applications can explicitly use REST endpoints to publish files stored in Netflix Drive to the cloud, and later use a REST endpoint to retrieve a subset of the published files from the cloud. The API interface can also be used to track the transfers of large files, and it allows other applications to be built on top of Netflix Drive.

Fig 4: Control interface of Netflix Drive
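
The endpoints below are purely illustrative (the actual Netflix Drive API is not described here), but they show the kind of control-plane calls an application or workflow tool might make against a local Netflix Drive instance:

```python
# Hypothetical control-plane calls to a local Netflix Drive instance.
# The host, port, endpoint paths, and payload fields are all made up.
import requests

DRIVE_API = "http://localhost:9000/v1"

# Publish files from a namespace to the cloud data store.
requests.post(f"{DRIVE_API}/publish", json={
    "namespace": "show_x/artist_workspace",
    "paths": ["shots/sh010/comp_v003.exr"],
})

# Later, retrieve a subset of the published files back into the namespace.
requests.post(f"{DRIVE_API}/fetch", json={
    "namespace": "show_x/artist_workspace",
    "filter": {"project": "show_x", "stage": "final"},
})
```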

The Polled Task interface allows studio and media workflow orchestrators to post or dispatch tasks to Netflix Drive instances on disparate workstations or containers. This allows Netflix Drive to be bootstrapped with an empty namespace when the workstation comes up and to dynamically project a specific set of assets relevant to the artists’ work sessions or workflow stages. Further, these assets can be projected into a namespace of the artist’s or application’s choosing.

Alternatively, workstations/containers can be launched with the assets of interest prefetched at startup. These allow artists and applications to obtain a workstation which already contains relevant files and optionally add and delete asset trees during the work session. For example, artists perform transformative work on files, and use Netflix Drive to store/fetch intermediate results as well as the final copy which can be transformed back into a media asset.

Bootstrapping Netflix Drive

Given the two different modes in which applications can interact with Netflix Drive, now let us discuss how Netflix Drive is bootstrapped.

On startup, Netflix Drive expects a manifest that contains information about the data store, metadata store, and credentials (tied to a user login) to form an instance of namespace hierarchy. A Netflix Drive mount point may contain multiple Netflix Drive namespaces.

A dynamic instance allows Netflix Drive to show a user-selected and user-accessible subset of data from a large corpus of assets. A user instance allows it to act like a Cloud Drive, where users can work on content which is automatically synced in the background periodically to Cloud. On restart on a new machine, the same files and folders will be prefetched from the cloud. We will cover the different namespaces of Netflix Drive in more detail in a subsequent blog post.

Here is an example of a typical bootstrap manifest file.

This image shows a bootstrap manifest JSON, which highlights how Netflix Drive can work with different metadata stores (such as Redis and CockroachDB) and data stores (such as Ceph and S3) and tie them together to provide a persistence layer for assets.
A sample manifest file.
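
In the same spirit, a hypothetical manifest could be written out as JSON like this; the field names and values are illustrative, not the actual Netflix Drive schema:

```python
# A hypothetical bootstrap manifest, serialized to JSON. Field names and
# values are illustrative only.
import json

manifest = {
    "mount_point": "/mnt/netflixdrive",
    "namespaces": [
        {
            "name": "artist_workspace",
            "mode": "dynamic",  # e.g., dynamic instance vs. user instance
            "metadata_store": {"type": "cockroachdb",
                               "endpoint": "metadata.example.internal:26257"},
            "data_store": {"type": "s3", "bucket": "studio-assets",
                           "region": "us-west-2"},
            "credentials": {"user": "artist@example.com",
                            "auth": "oauth-token-reference"},
        }
    ],
}

with open("netflix_drive_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```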

The manifest is a persistent artifact that gives a user workstation its Netflix Drive personality. It survives instance failures and is able to recreate the same stateful interface on any newly deployed instance.

Metadata and Data Store Abstractions

In order to allow a variety of different metadata stores and data stores to be easily plugged into the architecture, Netflix Drive exposes abstract interfaces for both metadata and data stores. Here is a high-level diagram explaining the different layers of abstraction in Netflix Drive:

Fig 5: Layered architecture of Netflix Drive

Metadata Store Characteristics

Each file in Netflix Drive would have one or more metadata nodes, each corresponding to a different version of the file. The file system hierarchy would be modeled as a tree in the metadata store, where the root node is the top-level folder for the application.

Each metadata node will contain several attributes, such as checksum of the file, location of the data, user permissions to access data, file metadata such as size, modification time, etc. A metadata node may also provide support for extended attributes which can be used to model ACLs, symbolic links, or other expressive file system constructs.
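
A metadata node can be pictured as a record along these lines; the attribute names are a hypothetical sketch of what was just described:

```python
# Hypothetical shape of a metadata node; attribute names are illustrative.
from dataclasses import dataclass, field


@dataclass
class MetadataNode:
    path: str                    # position in the file system tree
    version: int                 # a file may have one node per version
    checksum: str                # e.g., SHA-256 of the content
    data_location: str           # opaque id handed back by the data store
    size: int                    # file size in bytes
    mtime: float                 # modification time
    permissions: str             # user permissions to access the data
    extended_attrs: dict = field(default_factory=dict)  # ACLs, symlinks, ...
```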

The metadata store may also expose the concept of workspaces, where each user/application can have several workspaces and can share workspaces with other users/applications. These are higher-level constructs that are very useful to Studio applications.

Data Store Characteristics

Netflix Drive relies on a data store that allows streaming bytes into files/objects persisted on the storage media. The data store should expose APIs that allow Netflix Drive to perform I/O operations. The transfer mechanism for transport of bytes is a function of the data store.

In the first manifestation, Netflix Drive is using an object store (such as Amazon S3) as a data store. In order to expose file-store-like properties, some changes were needed in the object store. Each file can be stored as one or more objects. For Studio applications, file sizes may exceed the maximum object size for cloud storage, so the data store service should have the ability to store multiple parts of a file as separate objects. It is the responsibility of the data store service to tie these objects to a single file and inform the metadata store of the single unique ID for these object parts. The data store internally implements the chunking of a file into several parts, encryption of the content, and life-cycle management of the data.
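
A sketch of that chunking responsibility, with a hypothetical put_object callable standing in for the underlying object store client:

```python
# Sketch of a data store service splitting a large file into several objects
# and returning the single id that is handed to the metadata store.
# `put_object` is a hypothetical stand-in for an object-store client call.
import uuid

CHUNK_SIZE = 64 * 1024 * 1024  # illustrative part size, under any object limit


def store_file(path, put_object):
    file_id = str(uuid.uuid4())
    part = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            put_object(f"{file_id}/part-{part:05d}", chunk)
            part += 1
    return file_id  # the metadata store records only this single unique id
```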

Multi-tiered architecture

Netflix Drive allows multiple data stores to be a part of the same installation via its bootstrap manifest.

Fig 6: Multiple data stores of Netflix Drive

Some studio applications such as encoding and transcoding have different I/O characteristics than a typical cloud drive.

Most of the data produced by these applications is ephemeral in nature and is read often initially. The final encoded copy needs to be persisted, while the ephemeral data can be deleted. To serve such applications, Netflix Drive can keep the ephemeral data in storage tiers closer to the application, which allow lower read latencies and better economics for read requests, since cloud storage reads incur an egress cost. Finally, once the encoded copy is prepared, it can be persisted by Netflix Drive to a persistent storage tier in the cloud. A single data store may also choose to archive some subset of its content in cheaper alternatives.

Security

Studio applications require strict adherence to security models where only users or applications with specific permissions are allowed to access specific assets. Security is one of the cornerstones of the Netflix Drive design. Netflix Drive's dynamic namespace design allows an artist or workflow to access only a small subset of the assets, based on workspace information and access control, and is one of the benefits of using Netflix Drive in Studio workflows. Netflix Drive encapsulates the authentication and authorization models in its metadata store. These are translated into POSIX ACLs in Netflix Drive. In the future, Netflix Drive can allow more expressive ACLs by leveraging the extended attributes associated with the metadata nodes corresponding to an asset.

Netflix Drive is currently being used by several Studio teams as the paved-path solution for working with assets and is integrated with several media suite applications. As of today, Netflix Drive can be installed on CentOS, macOS, and Windows. In future blog posts, we will cover implementation details, learnings, performance analysis of Netflix Drive, and some of the applications and workflows built on top of it.

If you are passionate about building storage and infrastructure solutions for the Netflix Data Platform, we are always looking for talented engineers and managers. Please check out our job listings.



Packaging award-winning shows with award-winning technology

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/packaging-award-winning-shows-with-award-winning-technology-c1010594ba39

By Cyril Concolato

Introduction

In previous blog posts, our colleagues at Netflix have explained how 4K video streams are optimized, how even legacy video streams are improved, and, more recently, how new audio codecs can provide better aural experiences to our members. In all these cases, prior to being delivered through our content delivery network, Open Connect, our award-winning TV shows, movies, and documentaries like The Crown need to be packaged to enable crucial features for our members. In this post, we explain these features and how we rely on award-winning standard formats and open source software to enable them.

The Crown

Key Packaging Features

In typical streaming pipelines, packaging is the step that happens just after encoding, as depicted in the figure below. The output of an encoder is a sequence of bytes, called an elementary stream, which can only be parsed with some understanding of the elementary stream syntax. For example, detecting frame boundaries in an AV1 video stream requires being able to parse so-called Open Bitstream Units (OBUs) and to identify Temporal Delimiter OBUs. However, high-level operations performed on client devices, such as seeking, do not need to be aware of the elementary syntax and benefit from a codec-agnostic format. The packaging step aims at producing such a codec-agnostic sequence of bytes, called a packaged format, or container format, which can be manipulated, to some extent, without deep knowledge of the coding format.

Figure 1 — Simplified architecture of a streaming preparation pipeline
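
To see how a container format allows a file to be traversed without any knowledge of the codec, here is a minimal sketch that walks the top-level boxes of an ISOBMFF (MP4) file; each box starts with a 32-bit size and a four-character type (64-bit sizes are handled, while other refinements such as a size of 0 meaning "to end of file" are omitted):

```python
# Minimal top-level ISOBMFF box walker: print each box's type and size
# without parsing any codec-specific elementary stream syntax.
import struct
import sys


def list_boxes(path):
    with open(path, "rb") as f:
        while (header := f.read(8)) and len(header) == 8:
            size, box_type = struct.unpack(">I4s", header)
            header_len = 8
            if size == 1:  # 64-bit extended size follows the type
                size = struct.unpack(">Q", f.read(8))[0]
                header_len = 16
            print(box_type.decode("ascii", "replace"), size)
            f.seek(size - header_len, 1)  # skip the payload to the next box


if __name__ == "__main__":
    list_boxes(sys.argv[1])
```

Typical output starts with boxes such as ftyp, moov, and mdat, regardless of whether the tracks inside are AVC, AAC, or AV1.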

A key feature that our members rightfully deserve when playing audio, video, and timed text is synchronization. At Netflix, we strive to provide an experience where you never see the lips of the Queen of England move before you hear her corresponding dialog in The Crown. Synchronization is achieved by fundamental elements of signaling such as clocks or time lines, time stamps, and time scales that are provided in packaged content.

Our members don’t simply watch our series from beginning to end. They seek into Bridgerton when they resume watching. They rewind and replay their favorite chess move in The Queen’s Gambit. They skip introductions and recaps when they frantically binge-watch Lupin. They make playback decisions when they watch interactive titles such as You vs. Wild. Due to the nature of the audio or video compression techniques, a player cannot necessarily start decoding the stream exactly where our members want. Under the hood, players have to locate points in the stream where decoding can start, decode as quickly as they can, until the user seek point is reached before starting playback. This is another basic feature of packaging: signaling frame types and particularly Random Access Points.

When our members’ kids watch Carmen Sandiego in the back seats of their parents’ car or more generally when the network throughput varies, adaptive streaming technologies are applied to provide the best viewing experience under the network conditions. Adaptive streaming technologies require that streams of various qualities be encoded to common constraints but they also rely on another key feature of packaging to offer seamless quality switching, called indexing. Indexing lets the player fetch only the corresponding segments of the new stream.

Many other elements of signaling are provided in our packaged content to enable the viewing to start as quickly as possible and in the best possible conditions. Decryption modules need to be initialized with the appropriate scheme and initialization vector. Hardware video decoders need to know in advance the resolution and bit depth of the video streams to allocate their decoding buffers. Rendering pipelines need to know ahead of time the speaker configuration of audio streams or whether the video streams are HDR or SDR. Being able to signal all these elements is also a key feature of modern packaging formats.

The role of standards and open source software

Our 200+ million members watch Netflix on a wide variety of devices, from smartphones, to laptops, to TVs and many more, developed by a large number of partners. Reducing the friction when on-boarding a new device and making sure that our content will be playable on old devices for a long time is very important. That is where standards play a key role. The ISO Base Media File Format (ISOBMFF) is the key packaging standard in the entertainment industry as recently recognized with a Technology & Engineering Emmy® Award by the National Academy of Television Arts & Sciences (NATAS).

ISOBMFF provides all the key packaging features mentioned above, and, as history proves, it is also versatile and extensible, both in its capability to add new signaling features and in its support of codecs. Streams encoded with well-established codecs such as AVC and AAC can be carried in ISOBMFF files, but the specification is also regularly extended to support the latest codecs. The Media Systems team at Netflix actively contributes to the development, maintenance, and adoption of ISOBMFF. As an example, Netflix led the specification for the carriage of AOM’s AV1 video streams in ISOBMFF.

With 20+ years of existence, ISOBMFF has accumulated a lot of technical tools for various use cases. Figure 2 illustrates the complexity of ISOBMFF today through the concept of ‘brands’, similar to profiles in audio or video standards. Initially limited and well-nested, the standard is now very broad and evolving in various directions.

Figure 2 — Illustrating the complexity of the 6th edition of ISOBMFF. Each rectangle represents a ‘brand’ (indicated by a four character code in bold), and its required set of tools (indicated by a ‘+’ line). Brands are nested. All the tools of inner brands are required by outer brands.

For the Netflix streaming service, we rely on a subset of these tools as identified by the Common Media Application Format (CMAF) standard, and the content protection tools defined in the Common Encryption (CENC) standard.

Multimedia standards like ISOBMFF, CMAF and CENC go hand in hand with open source software implementations. Open source software can demonstrate the features of the standard, enabling the industry to understand its benefits and broadening its adoption. Open source software can also help improve the quality of a standard by highlighting possible ambiguities through a neutral, reference implementation. The Media Systems team at Netflix maintains such a reference open source implementation, called Photon, for the SMPTE IMF standard. For ISOBMFF, Netflix uses MP4Box, the reference open source implementation from the GPAC team.

In this packaging ecosystem of standards and open source software, our work within the Media Systems team includes identifying the tools within the existing standards to address new streaming use cases. When such tools don’t exist, we define new standards or expand existing ones, including ISOBMFF and CMAF, and support open source software to match these standards. For example, when our video encoding colleagues design dynamically optimized encoding schemes producing streaming segments with variable durations, we modify our workflow to ensure that segments across video streams with different bit rates remain time aligned. Similarly, when our audio encoding colleagues introduce xHE-AAC, which invalidates the old assumption that every audio frame is decodable, we guarantee that audio/video segments remain aligned too. Finally, when we want to help the industry converge to a common encryption scheme for new video codecs such as AV1, we coordinate the discussions to select the scheme, in this case pattern-based subsample encryption (a.k.a. ‘cbcs’), and lead the way by providing reference bitstreams. And of course, our work includes handling the many types of devices in the field that don’t properly support the standards.

Conclusion

We hope that this post gave you a better understanding of a part of the work of the Media Systems team at Netflix, and hopefully next time you watch one of our award-winning shows, you will recognize the part played by ISOBMFF, a key, award-winning technology. If you want to explore another facet of the team’s work, have a look at another award-winning technology, TTML, which we use for our Japanese subtitles.

We’re hiring!

If this work sounds exciting to you and you’d like to help the Media Systems team deliver an even better experience, Netflix is searching for an experienced Engineering Manager for the team. Please contact Anne Aaron for more info.


Packaging award-winning shows with award-winning technology was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Growth Engineering at Netflix- Creating a Scalable Offers Platform

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/growth-engineering-at-netflix-creating-a-scalable-offers-platform-69330136dd87

by Eric Eiswerth

Background

Netflix has been offering streaming video-on-demand (SVOD) for over 10 years. Throughout that time we’ve primarily relied on 3 plans (Basic, Standard, & Premium), combined with the 30-day free trial to drive global customer acquisition. The world has changed a lot in this time. Competition for people’s leisure time has increased, the device ecosystem has grown phenomenally, and consumers want to watch premium content whenever they want, wherever they are, and on whatever device they prefer. We need to be constantly adapting and innovating as a result of this change.

The Growth Engineering team is responsible for executing growth initiatives that help us anticipate and adapt to this change. In particular, it’s our job to design and build the systems and protocols that enable customers from all over the world to sign up for Netflix with the plan features and incentives that best suit their needs. For more background on Growth Engineering and the signup funnel, please have a look at our previous blog post that covers the basics. Alternatively, here’s a quick review of what the typical user journey for a signup looks like:

Signup Funnel Dynamics

There are 3 steps in a basic Netflix signup. We refer to these steps that comprise a user journey as a signup flow. Each step of the flow serves a distinct purpose.

  1. Introduction and account creation
    Highlight our value propositions and begin the account creation process.
  2. Plans & offers
    Highlight the various types of Netflix plans, along with any potential offers.
  3. Payment
    Highlight the various payment options we have so customers can choose what suits their needs best.

The primary focus for the remainder of this post will be step 2: plans & offers. In particular, we’ll define plans and offers, review the legacy architecture and some of its shortcomings, and dig into our new architecture and some of its advantages.

Plans & Offers

Definitions

Let’s define what a plan and an offer are at Netflix. A plan is essentially a set of features with a price.

An offer is an incentive that typically involves a monetary discount or superior product features for a limited amount of time. Broadly speaking, an offer consists of one or more incentives and a set of attributes.

When we merge these two concepts together and present them to the customer, we have the plan selection page (shown above). Here, you can see that we have 3 plans and a 30-day free trial offer, regardless of which plan you choose. Let’s take a deeper look at the architecture, protocols, and systems involved.

Legacy Architecture

As previously mentioned, Netflix has had a relatively static set of plans and offers since the inception of streaming. As a result of this simple product offering, the architecture was also quite straightforward. It consisted of a small set of XML files that were loaded at runtime and stored in local memory. This was a perfectly sufficient design for many years. However, there are some downsides as the company continues to grow and the product continues to evolve. To name a few:

  • Updating XML files is error-prone and manual in nature.
  • A full deployment of the service is required whenever the XML files are updated.
  • Updating the XML files requires engaging domain experts from the backend engineering team that owns these files. This pulls them away from other business-critical work and can be a distraction.
  • A flat domain object structure that pushed logic to the client side in order to extract the relevant plan and offer information and render the UI. For example, consider the data structure for a 30-day free trial on the Basic plan.
{
"offerId": 123,
"planId": 111,
"price": "$8.99",
"hasSD": true,
"hasHD": false,
"hasFreeTrial": true,
etc…
}
  • As the company matures and our product offering adapts to our global audience, all of the above issues are exacerbated further.

Below is a visual representation of the various systems involved in retrieving plan and offer data. Moving forward, we’ll refer to the combination of plan and offer data simply as SKU (Stock Keeping Unit) data.

New Architecture

If you recall from our previous blog post, Growth Engineering owns the business logic and protocols that allow our UI partners to build lightweight and flexible applications for almost any platform. This implies that the presentation layer should be devoid of any business logic and should simply be responsible for rendering data that is passed to it. To accomplish this, we designed a microservice architecture that emphasizes the Separation of Concerns design principle. Consider the updated system interaction diagram below:

There are 2 noteworthy changes that are worth discussing further. First, notice the presence of a dedicated SKU Eligibility Service. This service contains specialized business logic that used to be part of the Orchestration Service. By migrating this logic to a new microservice we simplify the Orchestration Service, clarify ownership over the domain, and unlock new use cases since it is now possible for other services not shown in this diagram to also consume eligible SKU data.

Second, notice that the SKU Service has been extended to a platform, which now leverages a rules engine and SKU catalog DB. This platform unlocks tremendous business value since product-oriented teams are now free to use the platform to experiment with different product offerings for our global audience, with little to no code changes required. This means that engineers can spend less time doing tedious work and more time designing creative solutions to better prepare us for future needs. Let’s take a deeper look at the role of each service involved in retrieving SKU data, starting from the visitor’s device and working our way down the stack.

Step 1 — Device sends a request for the plan selection page
As discussed in our previous Growth Engineering blog post, we use a custom JSON protocol between our client UIs and our middle-tier Orchestration Service. For a browser request to retrieve the plan selection page shown above, the protocol might look as follows:

GET /plans
{
  "flow": "browser",
  "mode": "planSelection"
}

As you can see, there are 2 critical pieces of information in this request:

  • Flow — The flow is a way to identify the platform. This allows the Orchestration Service to route the request to the appropriate platform-specific request handling logic.
  • Mode — This is essentially the name of the page being requested.

Given the flow and mode, the Orchestration Service can then process the request.
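As a rough sketch of how that routing might work, the flow selects the platform-specific handling and the mode selects the page handler. The class and handler names below are hypothetical, not the actual Orchestration Service code.

import java.util.Map;

// Illustrative only: routing a request to platform-specific handling based on flow and mode.
class OrchestrationRouter {
    interface PageHandler {
        String buildResponse(Map<String, String> request);
    }

    private final Map<String, Map<String, PageHandler>> handlersByFlowAndMode;

    OrchestrationRouter(Map<String, Map<String, PageHandler>> handlersByFlowAndMode) {
        this.handlersByFlowAndMode = handlersByFlowAndMode;
    }

    String route(Map<String, String> request) {
        String flow = request.get("flow");   // e.g. "browser"
        String mode = request.get("mode");   // e.g. "planSelection"
        PageHandler handler = handlersByFlowAndMode
                .getOrDefault(flow, Map.of())
                .get(mode);
        if (handler == null) {
            throw new IllegalArgumentException("Unsupported flow/mode: " + flow + "/" + mode);
        }
        return handler.buildResponse(request);
    }
}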

Step 2 — Request is routed to the Orchestration Service for processing
The Orchestration Service is responsible for validating upstream requests, orchestrating calls to downstream services, and composing JSON responses during a signup flow. For this particular request the Orchestration Service needs to retrieve the SKU data from the SKU Eligibility Service and build the JSON response that can be consumed by the UI layer.

The JSON response for this request might look something like below. Notice the difference in data structures from the legacy implementation. This new contextual representation facilitates greater reuse, as well as potentially supporting offers other than a 30 day free trial:

{
  "flow": "browser",
  "mode": "planSelection",
  "fields": {
    "skus": [
      {
        "id": 123,
        "incentives": ["FREE_TRIAL"],
        "plan": {
          "name": "Basic",
          "quality": "SD",
          "price": "$8.99",
          ...
        }
        ...
      },
      {
        "id": 456,
        "incentives": ["FREE_TRIAL"],
        "plan": {
          "name": "Standard",
          "quality": "HD",
          "price": "$13.99",
          ...
        }
        ...
      },
      {
        "id": 789,
        "incentives": ["FREE_TRIAL"],
        "plan": {
          "name": "Premium",
          "quality": "UHD",
          "price": "$17.99",
          ...
        }
        ...
      }
    ],
    "selectedSku": {
      "type": "Numeric",
      "value": 789
    },
    "nextAction": {
      "type": "Action",
      "withFields": [
        "selectedSku"
      ]
    }
  }
}

As you can see, the response contains a list of SKUs, the selected SKU, and an action. The action corresponds to the button on the page, and withFields specifies which fields the server expects to be sent back when the button is clicked.

Step 3 & 4 — Determine eligibility and retrieve eligible SKUs from SKU Eligibility Service
Netflix is a global company and we often have different SKUs in different regions. This means we need to distinguish between availability of SKUs and eligibility for SKUs. You can think of eligibility as something that is applied at the user level, while availability is at the country level. The SKU Platform contains the global set of SKUs and as a result, is said to control the availability of SKUs. Eligibility for SKUs is determined by the SKU Eligibility Service. This distinction creates clear ownership boundaries and enables the Growth Engineering team to focus on surfacing the correct SKUs for our visitors.
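The split can be summarized in a short sketch: the SKU Platform answers “what is available in this country”, the SKU Eligibility Service answers “what is this user eligible for”, and the eligible set is the intersection of the two. The interfaces and types below are hypothetical, not the real services.

import java.util.List;

// Illustrative only: availability is resolved per country, eligibility per user.
class EligibleSkuService {
    record Sku(long id, String country) {}

    interface SkuPlatform {                      // owns availability (country level)
        List<Sku> availableSkus(String country);
    }

    interface EligibilityRules {                 // owns eligibility (user level)
        boolean isEligible(String userId, Sku sku);
    }

    private final SkuPlatform skuPlatform;
    private final EligibilityRules eligibilityRules;

    EligibleSkuService(SkuPlatform skuPlatform, EligibilityRules eligibilityRules) {
        this.skuPlatform = skuPlatform;
        this.eligibilityRules = eligibilityRules;
    }

    List<Sku> eligibleSkus(String userId, String country) {
        return skuPlatform.availableSkus(country).stream()
                .filter(sku -> eligibilityRules.isEligible(userId, sku))
                .toList();
    }
}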

This centralization of eligibility logic in the SKU Eligibility Service also enables innovation in different parts of the product that have traditionally been ignored. Different services can now interface directly with the SKU Eligibility Service in order to retrieve SKU data.

Step 5 — Retrieve eligible SKUs from SKU Platform
The SKU Platform consists of a rules engine, a database, and application logic. The database contains the plans, prices and offers. The rules engine provides a means to extract available plans and offers when certain conditions within a rule match. Let’s consider a simple example where we attempt to retrieve offers in the US.
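Since the rule format itself isn’t shown here, the following is only a sketch of the idea: a rule pairs a condition on the customer context with the SKUs it unlocks, and the engine returns the SKUs whose conditions match. The rule contents (the country check and the SKU ids) are hypothetical.

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: a tiny rules engine returning SKUs whose conditions match the customer context.
class SkuRulesEngine {
    record Rule(List<Long> skuIds, Predicate<Map<String, String>> condition) {}

    private final List<Rule> rules;

    SkuRulesEngine(List<Rule> rules) {
        this.rules = rules;
    }

    List<Long> match(Map<String, String> customerContext) {
        return rules.stream()
                .filter(rule -> rule.condition().test(customerContext))
                .flatMap(rule -> rule.skuIds().stream())
                .distinct()
                .toList();
    }

    public static void main(String[] args) {
        // Hypothetical rule: three plans with a free-trial offer, available in the US.
        Rule usOffers = new Rule(List.of(123L, 456L, 789L),
                ctx -> "US".equals(ctx.get("country")));
        SkuRulesEngine engine = new SkuRulesEngine(List.of(usOffers));
        System.out.println(engine.match(Map.of("country", "US"))); // [123, 456, 789]
    }
}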

Keeping the Separation of Concerns in mind, notice that the SKU Platform has only one core responsibility. It is responsible for managing all Netflix SKUs. It provides access to these SKUs via a simple API that takes customer context and attempts to match it against the set of SKU rules. SKU eligibility is computed upstream and is treated just as any other condition would be in the SKU ruleset. By not coupling the concepts of eligibility and availability into a single service, we enable increased developer productivity since each team is able to focus on their core competencies and any change in eligibility does not affect the SKU Platform. One of the core tenets of a platform is the ability to support self-service. This negates the need to engage the backend domain experts for every desired change. The SKU Platform supports this via lightweight configuration changes to rules that do not require a full deployment. The next step is to invest further into self-service and support rule changes via a SKU UI. Stay tuned for more details on this, as well as more details on the internals of the new SKU Platform in one of our upcoming blog posts.

Conclusion

This work was a large cross-functional effort. We rebuilt our offers and plans from the ground up. It resulted in systems changes, as well as interaction changes between teams. Where there was once ambiguity, we now have clearly defined ownership over SKU availability and eligibility. We are now capable of introducing new plans and offers in various markets around the globe to meet our customers’ needs, with minimal engineering effort.

Let’s review some of the advantages the new architecture has over the legacy implementation. To name a few:

  • Domain objects that have a more reusable and extensible “shape”. This shape facilitates code reuse at the UI layer as well as the service layers.
  • A SKU Platform that enables product innovation with minimal engineering involvement. This means engineers can focus on more challenging and creative solutions for other problems. It also means fewer engineering teams are required to support initiatives in this space.
  • Configuration instead of code for updating SKU data, which improves innovation velocity.
  • Lower latency as a result of fewer service calls, which means fewer errors for our visitors.

The world is constantly changing. Device capabilities continue to improve. How, when, and where people want to be entertained continues to evolve. With these types of continued investments in infrastructure, the Growth Engineering team is able to build a solid foundation for future innovations that will allow us to continue to deliver the best possible experience for our members.

Join Growth Engineering and help us build the next generation of services that will allow the next 200 million subscribers to experience the joy of Netflix.


Growth Engineering at Netflix- Creating a Scalable Offers Platform was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Growth Engineering at Netflix — Automated Imagery Generation

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/growth-engineering-at-netflix-automated-imagery-generation-5a105fd51569

Growth Engineering at Netflix — Automated Imagery Generation

by Eric Eiswerth

Background

There’s a good chance you’ve visited the Netflix homepage. In the Growth Engineering team, we refer to this as the top of the signup funnel. For more background on the signup funnel and Growth Engineering’s role in the signup funnel, please read our initial post on the topic: Growth Engineering at Netflix — Accelerating Innovation. The primary focus of this post will be the top of the signup funnel. In particular, the Netflix homepage:

As discussed in our previous post, Growth Engineering owns the business logic and protocols that allow our UI partners to build lightweight and flexible applications for almost any platform. In some cases, like the homepage, this even involves providing appropriate imagery (e.g., the background image shown above). In this post, we’ll take a deep dive into the journey of content-based imagery on the Netflix homepage.

Motivation

At Netflix we do one thing — entertainment — and we aim to do it really well. We live and breathe TV shows and films, and we want everyone to be able to enjoy them too. That’s why we aspire to have best-in-class stories across genres, and believe people should have access to new voices, cultures, and perspectives. The member-focused teams at Netflix are responsible for making sure the member experience is relevant and personalized, ensuring that this content is shown to the right people at the right time. But what about non-members, those who are simply interested in signing up for Netflix? How should we highlight our content and convey our value propositions to them?

The Solution

The main mechanism for highlighting our content in the signup flow is through content-based imagery. Before designing a solution it’s important to understand the main product requirements for such a feature:

  • The content needs to be new, relevant, and regional (not all countries have the same catalogue).
  • The artwork needs to appeal to a broader audience. The non-member homepage serves a very broad audience and is not personalized to the extent of the member experience.
  • The imagery needs to be localized.
  • We need to be able to easily determine what imagery is present for a given platform, region, and language.
  • The homepage needs to load in a reasonable amount of time, even in poor network conditions.

Unpacking Product Requirements

Given the scale we require and the product requirements listed above, there are a number of technical requirements:

  • A list of titles for the asset, in some order.
  • Ensure the titles are appropriate for a broad audience, which means all titles need to be tagged with metadata.
  • Localized images for each of the titles.
  • Different assets for different device types and screen sizes.
  • Server-generated assets, since client-side generation would require the retrieval of many individual images, which would increase latency and time-to-render.
  • To reduce latency, assets should be generated in an offline fashion and not in real time.
  • The assets need to be compressed, without reducing quality significantly.
  • The assets will need to be stored somewhere and we’ll need to generate URLs for each of them.
  • We’ll need to figure out how to provide the correct asset URL for a given request.
  • We’ll need to build a search index so that the assets can be searchable.

Given this set of requirements, we can effectively break this work down into 3 functional buckets:

The Design

For our design, we decided to build 3 separate microservices, mapping to the aforementioned functional buckets. Let’s take a look at each of these services in turn.

Asset Generation

The Asset Generation Service is responsible for generating groups of assets. We call these groups of assets asset groups. Each request will generate a single asset group that will contain one or more assets. To support the demands of our stakeholders we designed a Domain Specific Language (DSL) that we call an asset generation recipe. An asset generation request contains a recipe. Below is an example of a simple recipe:

{
  "titleIds": [12345, 23456, 34567, …],
  "countries": ["US"],
  "type": "perspective", // this is the design of the asset
  "rows": 10, // the number of rows and columns can control density
  "cols": 15,
  "padding": 10, // padding between individual images
  "columnOffsets": [0, 0, 0, 0…], // the y-offset for each column
  "rowOffsets": [0, -100, 0, -100, …], // the x-offset for each row
  "size": [1920, 1080] // size in pixels
}

This recipe can then be issued via an HTTP POST request to the Asset Generation Service. The recipe can then be translated into ImageMagick commands that can do the heavy lifting. At a high level, the following diagram captures the necessary steps required to build an asset.
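As a hedged illustration of that translation (not the service’s actual code), a few of the recipe fields map naturally onto ImageMagick’s montage options; per-row and per-column offsets and the final resize would need additional compositing steps that are omitted here.

import java.util.List;

// Illustrative only: turning a simplified recipe into an ImageMagick `montage` invocation.
class RecipeToImageMagick {
    record Recipe(int rows, int cols, int padding) {}

    static String montageCommand(Recipe recipe, List<String> localizedBoxArtPaths) {
        // -tile lays the box art out on a cols x rows grid; -geometry adds padding between images.
        return "montage " + String.join(" ", localizedBoxArtPaths)
                + " -tile " + recipe.cols() + "x" + recipe.rows()
                + " -geometry +" + recipe.padding() + "+" + recipe.padding()
                + " -background none asset.png";
    }
}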

Generating a single localized asset is a big achievement, but we still need to store the asset somewhere and have the ability to search for it. This requires an asset storage solution.

Asset Storage

We refer to asset storage and management simply as asset management. We felt it would be beneficial to create a separate microservice for asset management for 2 reasons. First, asset generation is CPU intensive and bursty. We can leverage high performance VMs in AWS to generate the assets, scaling up when generation is occurring and scaling down when there is no batch in the queue. Second, it would be cost-inefficient to use this same hardware for the lightweight and more consistent traffic patterns that an asset management service requires.

Let’s take a look at the internals of the Asset Management Service.

At this point we’ve laid out all the details in order to generate a content-based asset and have it stored as part of an asset group, which is persisted and indexed. The next thing to consider is, how do we retrieve an asset in real time and surface it on the Netflix homepage?

If you recall from our previous blog post, Growth Engineering owns a service called the Orchestration Service. It is a mid-tier service that emits a custom JSON data structure that contains fields that are consumed by the UI. The UI can then use these fields to control the presentation in the UI layer. There are two approaches for adding fields to the Orchestration Service’s response. First, the fields can be coded by hand. Second, fields can be added via configuration, through a service we call the Customization Service. Since assets will need to be periodically refreshed and we want this process to be entirely automated, it makes sense to pursue the configuration-based approach. To accomplish this, the Asset Management Service needs to translate an asset group into a rule definition for the Customization Service.

Customization Service

Let’s review the Orchestration Service and introduce the Customization Service. The Orchestration Service emits fields in response to upstream requests. For the homepage, the Orchestration Service typically provides only a small number of fields, which are supplied by application code. For example:

{
  "fields": {
    "email": {
      "type": "StringField",
      "value": ""
    },
    "nextAction": {
      "type": "Action",
      "withFields": ["email"]
    }
  }
}

The Orchestration Service also supports fields supplied by configuration. We call these adaptive fields. Adaptive fields are provided by the Customization Service. The Customization Service is a rules engine that emits the adaptive fields. For example, a rule to provide the background image for the homepage in the en-US locale would look as follows:

{
  "country": "US",
  "language": "en",
  "platform": "browser",
  "resolution": "high"
}

The corresponding payload for such a rule might look as follows:

{
  "backgroundImage": "https://cdn.netflix.com/bgimageurl.jpg"
}

Bringing this all together, the response from the Orchestration Service would now look as follows:

{
  "fields": {
    "email": {
      "type": "StringField",
      "value": ""
    },
    "nextAction": {
      "type": "Action",
      "withFields": ["email"]
    }
  },
  "adaptiveFields": {
    "backgroundImage": "https://cdn.netflix.com/bgimageurl.jpg"
  }
}

At this point, we are now able to generate an asset, persist it, search it, and generate customization rules for it. The generated rules then enable us to return a particular asset for a particular request. Let’s put it all together and review the system interaction diagram.

We now have all the pieces in place to automatically generate artwork and have that artwork appear on the Netflix homepage for a given request. At least one open question remains, how can we scale asset generation?

Scaling Asset Generation

Arguably, there are a number of approaches that could be used to scale asset generation. We decided to opt for an all-or-nothing approach, meaning all assets for a given recipe need to be generated as a single asset group. This enables smooth rollback in case of any errors. Additionally, asset generation is CPU intensive and each recipe can produce 1000s of assets as a result of the number of platform, region, and language permutations. Even with high performance VMs, generating 1000s of assets can take a long time. As a result, we needed to find a way to distribute asset generation across multiple VMs. Here’s what the final architecture looked like.

Briefly, let’s review the steps:

  1. The batch process is initiated by a cron job. The job executes a script that contains an asset generation recipe.
  2. The Asset Generation Service receives the request and creates asset generation tasks that can be distributed across any number of Asset Generation Worker nodes. One of the nodes is elected as the leader via Zookeeper; its job is to coordinate asset generation across the other workers and ensure all assets get generated (a sketch of one way to do this election follows these steps).
  3. Once the leader worker node has all the assets, it creates an asset group in the Asset Management Service. The Asset Management Service persists, indexes, and uploads the assets to the CDN.
  4. Finally, the Asset Management Service creates rules from the asset group and pushes the rules to the Customization Service. Once the data is published in the Customization Service, the Orchestration Service can supply the correct URLs in its JSON response by invoking the Customization Service with a request context that matches a given set of rules.
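The post only says that one worker is elected leader via Zookeeper (step 2 above). One common way to do that on the JVM is Apache Curator’s LeaderLatch; the sketch below is an assumption about how such an election could look, not a description of the actual worker code.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative only: electing one asset-generation worker as coordinator via ZooKeeper.
class WorkerLeaderElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        try (LeaderLatch latch = new LeaderLatch(client, "/asset-generation/leader")) {
            latch.start();
            latch.await();               // blocks until this worker becomes leader
            // Leader-only work: hand out generation tasks and collect results from the other workers.
            coordinateAssetGeneration();
        } finally {
            client.close();
        }
    }

    static void coordinateAssetGeneration() { /* omitted */ }
}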

Conclusion

Automated asset generation has proven to be an extremely valuable investment. It is low-maintenance, high-leverage, and has allowed us to experiment with a variety of different types of assets on different platforms and on different parts of the product. This project was technically challenging and highly rewarding, both to the engineers involved in the project, and to the business. The Netflix homepage has come a long way over the last several years.

We’re hiring! Join Growth Engineering and help us build the future of Netflix.


Growth Engineering at Netflix — Automated Imagery Generation was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Mythbusting the Analytics Journey

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/mythbusting-the-analytics-journey-58d692ea707e

Part of our series on who works in Analytics at Netflix — and what the role entails

by Alex Diamond

This Q&A aims to mythbust some common misconceptions about succeeding in analytics at a big tech company.

This isn’t your typical recruiting story. I wasn’t actively looking for a new job and Netflix was the only place I applied. I didn’t know anyone who worked there and just submitted my resume through the Jobs page 🤷🏼‍♀️ . I wasn’t even entirely sure what the right role fit would be and originally applied for a different position, before being redirected to the Analytics Engineer role. So if you find yourself in a similar situation, don’t be discouraged!

How did you come to Netflix?

Movies and TV have always been one of my primary sources of joy. I distinctly remember being a teenager, perching my laptop on the edge of the kitchen table to “borrow” my neighbor’s WiFi (back in the days before passwords 👵🏻), and streaming my favorite Netflix show. I felt a little bit of ✨magic✨ come through the screen each time, and that always stuck with me. So when I saw the opportunity to actually contribute in some way to making the content I loved, I jumped at it. Working in Studio Data Science & Engineering (“Studio DSE”) was basically a dream come true.

Not only did I find the subject matter interesting, but the Netflix culture seemed to align with how I do my best work. I liked the idea of Freedom and Responsibility, especially if it meant having autonomy to execute projects all the way from inception through completion. Another major point of interest for me was working with “stunning colleagues”, from whom I could continue to learn and grow.

What was your path to working with data?

My road-to-data was more of a stumbling-into-data. I went to an alternative high school for at-risk students and had major gaps in my formal education — not exactly a head start. I then enrolled at a local public college at 16. When it was time to pick a major, I was struggling in every subject except one: Math. I completed a combined math bachelors + masters program, but without any professional guidance, networking, or internships, I was entirely lost. I had the piece of paper, but what next? I held plenty of jobs as a student, but now I needed a career.

A visual representation of all the jobs I had in high school and college: From pizza, to gourmet rice krispie treats, to clothing retail, to doors and locks

After receiving a grand total of *zero* interviews from sending out my resume, the natural next step was…more school. I entered a PhD program in Computer Science and shortly thereafter discovered I really liked the coding aspects more than the theory. So I earned the honor of being a PhD dropout.

A visual representation of all the hats I’ve worn

And here’s where things started to click! I used my newfound Python and SQL skills to land an entry-level Business Intelligence Analyst position at a company called Big Ass Fans. They make — you guessed it — very large industrial ventilation fans. I was given the opportunity to branch out and learn new skills to tackle any problem in front of me, aka my “becoming useful” phase. Within a few months I’d picked up BI tools, predictive modeling, and data ingestion/ETL. After a few years of wearing many different proverbial hats, I put them all to use in the Analytics Engineer role here. And ever since, Netflix has been a place where I can do my best work, put to use the skills I’ve gathered over the years, and grow in new ways.

What does an ordinary day look like?

As part of the Studio DSE team, our work is focused on aiding the movie-making process for our Netflix Originals, leading all the way up to a title’s launch on the service. Despite the affinity for TV and movies that brought me here, I didn’t actually know very much about how they got made. But over time, and by asking lots of questions, I’ve picked up the industry lingo! (Can you guess what “DOOD” stands for?)

My main stakeholders are members of our Studio team. They’re experts on the production process and an invaluable resource for me, sharing their expertise and providing context when I don’t know what something means. True to the “people over process” philosophy, we adapt alongside our stakeholders’ needs throughout the production process. That means the work products don’t always fit what you might imagine a traditional Analytics Engineer builds — if such a thing even exists!

A typical production lifecycle

On an ordinary day, my time is generally split evenly across:

  • 🤝📢 Speaking with stakeholders to understand their primary needs
  • 🐱💻 Writing code (SQL, Python)
  • 📊📈 Building visual outputs (Tableau, memos, scrappy web apps)
  • 🤯✍️ Brainstorming and vision planning for future work

Some days have more of one than the others, but variety is the spice of life! The one constant is that my day always starts with a ridiculous amount of coffee. And that it later continues with even more coffee. ☕☕☕

My road-to-data was more of a stumbling-into-data.

What advice would you give to someone just starting their career in data?

🐾 Dip your toes in things. As you try new things, your interests will evolve and you’ll pick up skills across a broad span of subject areas. The first time I tried building the front-end for a small web app, it wasn’t very pretty. But it piqued my interest and after a few times it started to become second nature.

💪 Find your strengths and weaknesses. You don’t have to be an expert in everything. Just knowing when to reach out for guidance on something allows you to uplevel your skills in that area over time. My weakness is statistics: I can use it when needed but it’s just not a subject that comes naturally to me. I own that about myself and lean on my stats-loving peers when needed.

🌸 Look for roles that allow you to grow. As you grow in your career, you’ll provide impact to the business in ways you didn’t even expect. As a business intelligence analyst, I gained data science skills. And in my current Analytics Engineer role, I’ve picked up a lot of product management and strategic thinking experience.

This is what I look like.

☝️ One Last Thing

I started off my career with the vague notion of, “I guess I want to be a data scientist?” But what that’s meant in practice has really varied depending on the needs of each job and project. It’s ok if you don’t have it all figured out. Be excited to try new things, lean into strengths, and don’t be afraid of your weaknesses — own them.

If this post resonates with you and you’d like to explore opportunities with Netflix, check out our analytics site, search open roles, and learn about our culture. You can also find more stories like this here.


Mythbusting the Analytics Journey was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Life of a Netflix Partner Engineer — The case of extra 40 ms

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/life-of-a-netflix-partner-engineer-the-case-of-extra-40-ms-b4c2dd278513

Life of a Netflix Partner Engineer — The case of the extra 40 ms

By: John Blair, Netflix Partner Engineering

The Netflix application runs on hundreds of smart TVs, streaming sticks and pay TV set top boxes. The role of a Partner Engineer at Netflix is to help device manufacturers launch the Netflix application on their devices. In this article we talk about one particularly difficult issue that blocked the launch of a device in Europe.

The mystery begins

Towards the end of 2017, I was on a conference call to discuss an issue with the Netflix application on a new set top box. The box was a new Android TV device with 4k playback, based on Android Open Source Project (AOSP) version 5.0, aka “Lollipop”. I had been at Netflix for a few years, and had shipped multiple devices, but this was my first Android TV device.

All four players involved in the device were on the call: there was the large European pay TV company (the operator) launching the device, the contractor integrating the set-top-box firmware (the integrator), the system-on-a-chip provider (the chip vendor), and myself (Netflix).

The integrator and Netflix had already completed the rigorous Netflix certification process, but during the TV operator’s internal trial an executive at the company reported a serious issue: Netflix playback on his device was “stuttering”, i.e. video would play for a very short time, then pause, then start again, then pause. It didn’t happen all the time, but would reliably start to happen within a few days of powering on the box. They supplied a video and it looked terrible.

The device integrator had found a way to reproduce the problem: repeatedly start Netflix, start playback, then return to the device UI. They supplied a script to automate the process. Sometimes it took as long as five minutes, but the script would always reliably reproduce the bug.

Meanwhile, a field engineer for the chip vendor had diagnosed the root cause: Netflix’s Android TV application, called Ninja, was not delivering audio data quickly enough. The stuttering was caused by buffer starvation in the device audio pipeline. Playback stopped when the decoder waited for Ninja to deliver more of the audio stream, then resumed once more data arrived. The integrator, the chip vendor and the operator all thought the issue was identified and their message to me was clear: Netflix, you have a bug in your application, and you need to fix it. I could hear the stress in the voices from the operator. Their device was late and running over budget and they expected results from me.

The investigation

I was skeptical. The same Ninja application runs on millions of Android TV devices, including smart TVs and other set top boxes. If there was a bug in Ninja, why was it only happening on this device?

I started by reproducing the issue myself using the script provided by the integrator. I contacted my counterpart at the chip vendor and asked if he’d seen anything like this before (he hadn’t). Next I started reading the Ninja source code. I wanted to find the precise code that delivers the audio data. I recognized a lot, but I started to lose the plot in the playback code and I needed help.

I walked upstairs and found the engineer who wrote the audio and video pipeline in Ninja, and he gave me a guided tour of the code. I spent some quality time with the source code myself to understand its working parts, adding my own logging to confirm my understanding. The Netflix application is complex, but at its simplest it streams data from a Netflix server, buffers several seconds worth of video and audio data on the device, then delivers video and audio frames one-at-a-time to the device’s playback hardware.

A diagram showing content downloaded to a device into a streaming buffer, then copied into the device decode buffer.
Figure 1: Device Playback Pipeline (simplified)

Let’s take a moment to talk about the audio/video pipeline in the Netflix application. Everything up until the “decoder buffer” is the same on every set top box and smart TV, but moving the A/V data into the device’s decoder buffer is a device-specific routine running in its own thread. This routine’s job is to keep the decoder buffer full by calling a Netflix-provided API which provides the next frame of audio or video data. In Ninja, this job is performed by an Android Thread. There is a simple state machine and some logic to handle different play states, but under normal playback the thread copies one frame of data into the Android playback API, then tells the thread scheduler to wait 15 ms and invoke the handler again. When you create an Android thread, you can request that the thread be run repeatedly, as if in a loop, but it is the Android Thread scheduler that calls the handler, not your own application.
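This is not Ninja’s actual code, but a minimal sketch of the delivery pattern just described: a handler copies one frame of audio into the Android playback API, then asks to be invoked again in 15 ms. The class, the buffer interface, and the wiring are illustrative assumptions.

import android.media.AudioTrack;
import android.os.Handler;
import android.os.HandlerThread;

// Illustrative only: the audio delivery loop described above, simplified.
class AudioDeliveryLoop {
    private static final long DELIVERY_INTERVAL_MS = 15;

    private final Handler handler;
    private final AudioTrack audioTrack;
    private final StreamingBuffer streamingBuffer;   // hypothetical source of buffered audio frames

    AudioDeliveryLoop(AudioTrack audioTrack, StreamingBuffer streamingBuffer) {
        HandlerThread thread = new HandlerThread("audio-delivery");
        thread.start();
        this.handler = new Handler(thread.getLooper());
        this.audioTrack = audioTrack;
        this.streamingBuffer = streamingBuffer;
    }

    void start() {
        handler.post(this::deliverOneFrame);
    }

    private void deliverOneFrame() {
        byte[] frame = streamingBuffer.nextAudioFrame();  // a Netflix-provided API in the real app
        if (frame != null) {
            audioTrack.write(frame, 0, frame.length);     // copy one frame into the Android audio API
        }
        // Ask the scheduler to invoke us again in ~15 ms. The extra 40 ms this post is named after
        // showed up in waits like this one.
        handler.postDelayed(this::deliverOneFrame, DELIVERY_INTERVAL_MS);
    }

    interface StreamingBuffer {
        byte[] nextAudioFrame();
    }
}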

To play a 60fps video, the highest frame rate available in the Netflix catalog, the device must render a new frame every 16.66 ms, so checking for a new sample every 15ms is just fast enough to stay ahead of any video stream Netflix can provide. Because the integrator had identified the audio stream as the problem, I zeroed in on the specific thread handler that was delivering audio samples to the Android audio service.

I wanted to answer this question: where is the extra time? I assumed some function invoked by the handler would be the culprit, so I sprinkled log messages throughout the handler, assuming the guilty code would be apparent. What was soon apparent was that there was nothing in the handler that was misbehaving, and the handler was running in a few milliseconds even when playback was stuttering.

Aha, Insight

In the end, I focused on three numbers: the rate of data transfer, the time when the handler was invoked and the time when the handler passed control back to Android. I wrote a script to parse the log output, and made the graph below which gave me the answer.

A graph showing time spent in the thread handler and audio data throughput.
Figure 2: Visualizing Audio Throughput and Thread Handler Timing

The orange line is the rate that data moved from the streaming buffer into the Android audio system, in bytes/millisecond. You can see three distinct behaviors in this chart:

  1. The two tall, spiky parts where the data rate reaches 500 bytes/ms. This phase is buffering, before playback starts; the handler is copying data as fast as it can.
  2. The region in the middle is normal playback. Audio data is moved at about 45 bytes/ms.
  3. The stuttering region is on the right, when audio data is moving at closer to 10 bytes/ms. This is not fast enough to maintain playback.

The orange line leads to an unavoidable conclusion, confirming what the chip vendor’s engineer reported: Ninja is not delivering audio data quickly enough.

To understand why, let’s see what story the yellow and grey lines tell.

The yellow line shows the time spent in the handler routine itself, calculated from timestamps recorded at the top and the bottom of the handler. In both normal and stutter playback regions, the time spent in the handler was the same: about 2 ms. The spikes show instances when the runtime was slower due to time spent on other tasks on the device.

The real root cause

The grey line, the time between calls invoking the handler, tells a different story. In the normal playback case you can see the handler is invoked about every 15 ms. In the stutter case, on the right, the handler is invoked approximately every 55 ms. There are an extra 40 ms between invocations, and there’s no way that can keep up with playback. But why?

I reported my discovery to the integrator and the chip vendor (look, it’s the Android Thread scheduler!), but they continued to push back on the Netflix behavior. Why don’t you just copy more data each time the handler is called? This was a fair criticism, but changing this behavior involved deeper changes than I was prepared to make, and I continued my search for the root cause. I dove into the Android source code, and learned that Android Threads are a userspace construct, and the thread scheduler uses the epoll() system call for timing. I knew epoll() performance isn’t guaranteed, so I suspected something was affecting epoll() in a systematic way.

At this point I was saved by another engineer at the chip supplier, who discovered a bug that had already been fixed in the next version of Android, named Marshmallow. The Android thread scheduler changes the behavior of threads depending on whether an application is running in the foreground or the background. Threads in the background are assigned an extra 40 ms (40000000 ns) of wait time.

A bug deep in the plumbing of Android itself meant this extra timer value was retained when the thread moved to the foreground. Usually the audio handler thread was created while the application was in the foreground, but sometimes the thread was created a little sooner, while Ninja was still in the background. When this happened, playback would stutter.

Lessons learned

This wasn’t the last bug we fixed on this platform, but it was the hardest to track down. It was outside the Netflix application, in a part of the system outside the playback pipeline, and all of the initial data pointed to a bug in the Netflix application itself.

This story really exemplifies an aspect of my job I love: I can’t predict all of the issues that our partners will throw at me, and I know that to fix them I have to understand multiple systems, work with great colleagues, and constantly push myself to learn more. What I do has a direct impact on real people and their enjoyment of a great product. I know when people enjoy Netflix in their living room, I’m an essential part of the team that made it happen.


Life of a Netflix Partner Engineer — The case of extra 40 ms was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Netflix Scales its API with GraphQL Federation (Part 2)

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-2-bbe71aaec44a

In our previous post and QConPlus talk, we discussed GraphQL Federation as a solution for distributing our GraphQL schema and implementation. In this post, we shift our attention to what is needed to run a federated GraphQL platform successfully — from our journey implementing it to lessons learned.

Netflix GraphQL Federation

Our Journey so Far

Over the past year, we’ve implemented the core infrastructure pieces necessary for a federated GraphQL architecture as described in our previous post:

Studio Edge Architecture Diagram
Studio Edge Architecture

The first Domain Graph Service (DGS) on the platform was the former GraphQL monolith that we discussed in our first post (Studio API). Next, we worked with a few other application teams to make DGSs that would expose their APIs alongside the former monolith. We had our first Studio applications consuming the federated graph, without any performance degradation, by the end of 2019. Once we knew that the architecture was feasible, we focused on readying it for broader usage. Our goal was to open up the Studio Edge platform for self-service in April 2020.

April 2020 was a turbulent time with the pandemic and overnight transition to working remotely. Nevertheless, teams started to jump into the graph in droves. Soon we had hundreds of engineers contributing directly to the API on a daily basis. And what about that Studio API monolith that used to be a bottleneck? We migrated the fields exposed by Studio API to individually owned DGSs without breaking the API for consumers. The original monolith is slated to be completely deprecated by the end of 2020.

This journey hasn’t been without its challenges. The biggest challenge was aligning on this strategy across the organization. Initially, there was a lot of skepticism and dissent; the concept was fairly new and would require high alignment across the organization to be successful. Our team spent a lot of time addressing dissenting points and making adjustments to the architecture based on feedback from developers. Through our prototype development and proactive partnership with some key critical voices, we were able to instill confidence and close crucial gaps.

Once we achieved broad alignment on the idea, we needed to ensure that adoption was seamless. This required building robust core infrastructure, ensuring a great developer experience, and solving for key cross-cutting concerns.

Core Infrastructure

Our GraphQL Gateway is based on Apollo’s reference implementation and is written in Kotlin. This gives us access to Netflix’s Java ecosystem, while also giving us robust language features such as coroutines for efficient parallel fetches, and an expressive type system with null safety.

The schema registry is developed in-house, also in Kotlin. For storing schema changes, we use an internal library that implements the event sourcing pattern on top of the Cassandra database. Using event sourcing allows us to implement new developer experience features such as the Schema History view. The schema registry also integrates with our CI/CD systems like Spinnaker to automatically setup cloud networking for DGSs.
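The registry itself is internal, so the sketch below only illustrates the event sourcing idea it is built on: schema changes are appended as immutable events, and the current schema (or the full Schema History view) is rebuilt by replaying them. All names are hypothetical, and the real store sits on Cassandra rather than an in-memory list.

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: event sourcing for schema changes, enabling views like Schema History.
class SchemaEventStore {
    record SchemaChanged(String dgsName, String schemaSdl, Instant recordedAt) {}

    private final List<SchemaChanged> events = new ArrayList<>();   // Cassandra-backed in the real registry

    void append(SchemaChanged event) {
        events.add(event);
    }

    // Rebuild the latest known schema for a DGS by replaying its change events in order.
    String currentSchema(String dgsName) {
        String current = null;
        for (SchemaChanged event : events) {
            if (event.dgsName().equals(dgsName)) {
                current = event.schemaSdl();
            }
        }
        return current;
    }

    // The history view is just the ordered list of events for that DGS.
    List<SchemaChanged> history(String dgsName) {
        return events.stream().filter(e -> e.dgsName().equals(dgsName)).toList();
    }
}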

Developer Education & Experience

In the previous architecture, only the monolith Studio API team needed to learn GraphQL. In Studio Edge, every DGS team needs to build expertise in GraphQL. GraphQL has its own learning curve and can get especially tricky for complex cases like batching & lookahead. Also, as discussed in the previous post, understanding GraphQL Federation and implementing entity resolvers is not trivial either.

We partnered with Netflix’s Developer Experience (DevEx) team to build out documentation, training materials, and tutorials for developers. For general GraphQL questions, we lean on the open source community plus cultivate an internal GraphQL community to discuss hot topics like pagination, error handling, nullability, and naming conventions.

DGS Framework & Developer Tools

To make it easy for backend engineers to build a GraphQL DGS, the DevEx team built a “DGS Framework” on top of GraphQL Java and Spring Boot. The framework takes care of all the cross-cutting concerns of running a GraphQL service in production while also making it easier for developers to write GraphQL resolvers. In addition, DevEx built robust tooling for pushing schemas to the Schema Registry and a Self Service UI for browsing the various DGSs’ schemas. Check out their conference talk and expect a future blog post from our colleagues. The DGS framework is planned to be open-sourced in early 2021.
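The framework wasn’t public when this was written, so the sketch below should be read as an approximation: it follows the shape of the annotations in the since open-sourced DGS framework (@DgsComponent, @DgsQuery, @InputArgument), with a hypothetical backing service and schema, rather than documentation of the framework itself.

import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsQuery;
import com.netflix.graphql.dgs.InputArgument;
import java.util.List;

// Illustrative only: roughly what a resolver looks like with the DGS framework.
@DgsComponent
public class MovieDataFetcher {

    // Hypothetical backing service for this example.
    private final MovieService movieService;

    public MovieDataFetcher(MovieService movieService) {
        this.movieService = movieService;
    }

    // Resolves a hypothetical `movies(titleFilter: String)` field on the Query type.
    @DgsQuery
    public List<Movie> movies(@InputArgument String titleFilter) {
        return movieService.findByTitle(titleFilter);
    }

    record Movie(String id, String title) {}

    interface MovieService {
        List<Movie> findByTitle(String titleFilter);
    }
}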

Schema Governance

Netflix’s studio data is extremely rich and complex. Early on, we anticipated that active schema management would be crucial for schema evolution and overall health. We had a Studio Data Architect already in the org who was focused on data modeling and alignment across Studio. We engaged with them to determine graph schema best practices to best suit the needs of Studio Engineering.

Our goal was to design a GraphQL schema that was reflective of the domain itself, not the database model. UI developers should not have to build Backends For Frontends (BFF) to massage the data for their needs; rather, they should help shape the schema so that it satisfies their needs. Embracing a collaborative schema design approach was essential to achieving this goal.

Schema Design Workflow Diagram
Schema Design Workflow

The collaborative design process involves feedback and reviews across team boundaries. To streamline schema design and review, we formed a schema working group and a managed technical program for on-boarding to the federated architecture. While reviews add overhead to the product development process, we believe that prioritizing the quality of the graph model will reduce the amount of future changes and reworking needed. The level of review varies based on the entities affected; for the core federated types, more rigor is required (though tooling helps streamline that flow).

We have a deprecation workflow in place for evolving the schema. We’ve leveraged GraphQL’s deprecation feature and also track usage stats for every field in the schema. Once the stats show that a deprecated field is no longer used, we can make a backward incompatible change to remove the field from the schema.

Clients with Deprecated Field Usage
Clients with Deprecated Field Usage

We embraced a schema-first approach instead of generating our schema from existing models such as the Protobuf objects in our gRPC APIs. While Protobufs and gRPC are excellent solutions for building service APIs, we prefer decoupling our GraphQL schema from those layers to enable cleaner graph design and independent evolvability. In some scenarios, we implement generic mapping code from GraphQL resolvers to gRPC calls, but the extra boilerplate is worth the long-term flexibility of the GraphQL API.

Underlying our approach is a foundation of “context over control”, which is a key tenet of Netflix’s culture. Instead of trying to hold tight control of the entire graph, we give guidance and context to product teams so that they can apply their domain knowledge to make a flexible API for their domain. As this architecture matures, we will continue to monitor schema health and develop new tooling, processes, and best practices where needed.

Observability

In our previous architecture, observability was achieved through manual analysis and routing via the API team, which scaled poorly. For our federated architecture, we prioritized solving observability needs in a more scalable manner. We prioritized three areas:

  • Alerting — report when something goes awry
  • Discovery — easily determine what isn’t working
  • Diagnosis — debug why something isn’t working

Our guiding metrics in this space are mean time to resolution (MTTR) and service level objectives and indicators (SLO/SLI).

We teamed up with experts from Netflix’s Telemetry team. We integrated the Gateway and DGS architectural components with Zipkin, the internal distributed tracing tool Edgar, and application monitoring tool TellTale. In GraphQL, almost every response is a 200 with custom errors in the error block. We introspect these custom error codes from the response and emit them to our metrics server, Atlas. These integrations created a great foundation of rich visibility and insights for the consumers and developers of the GraphQL API.

Trace for a Federated Request Lifecycle
Edgar Trace for a Federated Request Lifecycle
Timeline View for a Federated Request lifecycle
Timeline View for a Federated Request

Distributed Log Correlation helps with debugging more complex server issues. By surfacing the application level logging details for all systems involved in processing a request, we gain deeper insights into what happened across the stack. Developers can easily see what was happening around the same time as a given request, to inspect surrounding factors that might have impacted an interaction.

Log correlation across multiple services for a request lifecycle
Logs across multiple services for a Federated Request

To solve the “who do I ask about…” routing problem, we integrated deep linking from GraphQL types and fields to their owning team’s support channels. Finding support is now as simple as clicking a link from a trace, which helps shorten MTTR and reduce the number of times the gateway team needs to get involved.

Securing the Federated Graph

Our goal is to enable robust and consistent security practices across the federated architecture. To achieve this, we partnered with the security experts at Netflix to build security into the graph. Let’s look at two essential parts of our security solution: AuthN and AuthZ.

Authentication

All of our product experiences in the Studio space require an authenticated account, so we restrict the GraphQL Gateway access to only trusted authenticated callers. Additionally, Graph Introspection is restricted to Netflix internal developers.

Authorization

Before Studio Edge, authorization logic was fragmented across teams. Some teams implemented authorization in their BFFs, some in microservices, and others did both for good measure. The result was often a different authorization story for a given piece of data depending on which UI a user was accessing it through. UI teams also found themselves needing to implement (and re-implement) authorization checks with each new frontend.

In Studio Edge, we delegated the authorization responsibility to DGS owners. This resulted in consistent authorization for the same user across different applications. Plus, Product Managers, Engineers and the Security team can easily get a bird’s eye view of who has access to each data type and how.

We have multiple authorization offerings within Netflix: from a simple system that grants access based on user identity to a more granular system that brings in the concept of roles and capabilities. DGS developers can choose a solution based on their needs. Then they simply annotate their resolvers with the @Secured annotation and configure it to use one of the available systems. If needed, more complex authorization can be implemented in the resolver or in downstream systems.
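The post names an @Secured annotation but doesn’t show its signature, so the following is a hypothetical sketch of how an annotated resolver might look; the annotation definition, the policy string, and the services are assumptions, not the real API.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Illustrative only: a hypothetical @Secured annotation applied to a resolver method.
class SecuredResolverSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Secured {
        String policy();                 // which authorization system / policy to evaluate
    }

    record Production(String id, String title) {}

    interface ProductionService {
        Production findById(String id);
    }

    static class ProductionDataFetcher {
        private final ProductionService productionService;

        ProductionDataFetcher(ProductionService productionService) {
            this.productionService = productionService;
        }

        // The framework (or an aspect) would evaluate the policy before invoking the resolver.
        @Secured(policy = "studio-production-read")
        Production production(String id) {
            return productionService.findById(id);
        }
    }
}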

Future of Authorization

We are currently prototyping a GraphQL-aware authorization solution. The Schema Registry automatically generates Access Control Groups (ACGs) for each field and its corresponding type when its schema is registered. Product managers & DGS Engineers decide membership and rules for these generated ACGs. Since the ACGs map to a field in GraphQL, the DGS framework then automatically applies the rules associated with the ACG during execution.

Architecting for Failure

The GraphQL Gateway is the single entry point for all requests; a failure on the gateway can cause significant disruptions. Following Netflix engineering best practices, we assume failures will happen and design ways to mitigate the impact of those failures. These are our design principles for ensuring the gateway layer is resilient:

  1. Single purpose
  2. Stateless service
  3. Demand controlled
  4. Multi-region
  5. Sharded by functionality

First, we focus the responsibilities of the gateway layer on a single purpose: parse client queries, then build and execute query plans. By reducing the scope, we limit the range of problems that can occur. We aim to perform any additional resource-intensive operations off-box with the exception of logging and metrics. Taking on additional unrelated logic in the gateway layer could increase surface area for failures in this critical tier.

Second, we run multiple stateless instances of the gateway service. Any gateway instance is able to generate and execute a query plan for any request. When we do code changes to the gateway layer, we rigorously test them before rolling out to production.

Third, we seek to balance the resources each request consumes through applying demand control. We rate-limit callers to avoid overloading the underlying databases that are the source of most of our domain elements. We also run a static query cost calculation on all incoming queries and reject expensive queries to avoid gridlock in gateway and DGS resources. Our partners understand these tradeoffs and work with us to meet these requirements, reworking expensive queries and reducing high volume callers.
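
To make the static cost idea concrete, here is a hedged Python sketch that walks a simplified query shape and charges more for deeper selections and list fields. The real gateway is a JVM service and does not work this way line for line; the field weights, list multiplier, and budget below are hypothetical knobs, not the values we use.

    # Toy static cost estimator; not the gateway's actual implementation.
    MAX_ALLOWED_COST = 500          # hypothetical per-request budget
    LIST_FIELD_MULTIPLIER = 10      # assume a list field fans out to ~10 children

    def estimate_cost(selection, field_weights, depth=1):
        """Recursively sum a cost for every selected field.

        `selection` is a nested dict of {field_name: sub_selection_or_None};
        deeper selections and list-like fields are charged more heavily.
        """
        cost = 0
        for field, sub_selection in selection.items():
            field_cost = field_weights.get(field, 1) * depth
            if sub_selection:
                child_cost = estimate_cost(sub_selection, field_weights, depth + 1)
                if field.endswith("s"):          # crude list-field heuristic
                    child_cost *= LIST_FIELD_MULTIPLIER
                field_cost += child_cost
            cost += field_cost
        return cost

    query = {"productions": {"talent": {"roles": None}, "schedule": None}}
    if estimate_cost(query, field_weights={"productions": 5}) > MAX_ALLOWED_COST:
        raise RuntimeError("query rejected: estimated cost over budget")

A query whose estimated cost exceeds the budget is rejected up front rather than being allowed to gridlock gateway and DGS resources.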

Fourth, we deploy our gateway layer to multiple AWS regions around the world. This allows us to limit the blast radius for problems that inevitably arise. When problems happen, we can fail over to another region to ensure our clients are minimally impacted.

Last, we deploy multiple functional shards of our gateway layer. The code is the same in each shard and incoming requests are routed based on category. For example, GraphQL subscriptions generally result in long-lived connections while Queries & Mutations are short-lived. We use a separate fleet of instances for Subscriptions so “running out of connections” does not affect the availability of Queries and Mutations.

There is more we can do to improve resilience. We have plans to do canary deployments and analysis for gateway deployments and, eventually, schema changes. Today, our gateway dynamically updates its schema by polling the schema registry. We are in the process of decoupling these by storing the federation config in a versioned S3 bucket, making the gateway resilient to schema registry failures.

Closing Thoughts

GraphQL and Federation have been a productivity multiplier for Studio applications. Motivated by this, we’ve recently prototyped using GraphQL Federation for the Netflix consumer app search page on iOS & Android. To do this, we created three DGSs to provide the data for a minimal portion of the consumer graph. We are sending a small subset of users to this alternative stack and measuring high-level metrics. We are excited to see the results and explore further applicability in the Netflix consumer space.

Despite our positive experience, GraphQL Federation is early in its maturity lifecycle and may not be the best fit for every team or organization. Learning GraphQL and DGS development, running a federation layer, and doing a migration require high commitment from partner teams and seamless cross-functional collaboration. If you’re considering going in this direction, we recommend checking out Apollo’s SaaS offering for Federation and the many online resources for learning GraphQL. For ecosystems like ours with a large swath of microservices that need to be aggregated together, the development velocity and improved operability have made the transition worth it.

In closing, we want to hear from you! If you have already implemented federation or tried to solve this problem with another approach, we would love to learn more. Sharing knowledge is one of the ways our industry learns and improves rapidly. Finally, if you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

By Tejas Shikhare, Edited by Philip Fisher-Ogden

Additional Credits: Stephen Spalding, Jennifer Shin, Robert Reta, Antoine Boyer, Bruce Wang, David Simmer


How Netflix Scales its API with GraphQL Federation (Part 2) was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Supporting content decision makers with machine learning

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/supporting-content-decision-makers-with-machine-learning-995b7b76006f

by Melody Dye*, Chaitanya Ekanadham*, Avneesh Saluja*, Ashish Rastogi
* contributed equally

Netflix is pioneering content creation at an unprecedented scale. Our catalog of thousands of films and series caters to 195M+ members in over 190 countries who span a broad and diverse range of tastes. Content, marketing, and studio production executives make the key decisions that aspire to maximize each series’ or film’s potential to bring joy to our subscribers as it progresses from pitch to play on our service. Our job is to support them.

The commissioning of a series or film, which we refer to as a title, is a creative decision. Executives consider many factors including narrative quality, relation to the current societal context or zeitgeist, creative talent relationships, and audience composition and size, to name a few. The stakes are high (content is expensive!) as is the uncertainty of the outcome (it is difficult to predict which shows or films will become hits). To mitigate this uncertainty, executives throughout the entertainment industry have always consulted historical data to help characterize the potential audience of a title using comparable titles, if they exist. Two key questions in this endeavor are:

  • Which existing titles are comparable and in what ways?
  • What audience size can we expect and in which regions?

The increasing vastness and diversity of what our members are watching make answering these questions particularly challenging using conventional methods, which draw on a limited set of comparable titles and their respective performance metrics (e.g., box office, Nielsen ratings). This challenge is also an opportunity. In this post we explore how machine learning and statistical modeling can aid creative decision makers in tackling these questions at a global scale. The key advantage of these techniques is twofold. First, they draw on a much wider range of historical titles (spanning global as well as niche audiences). Second, they leverage each historical title more effectively by isolating the components (e.g., thematic elements) that are relevant for the title in question.

Our approach is rooted in transfer learning, whereby performance on a target task is improved by leveraging model parameters learned on a separate but related source task. We define a set of source tasks that are loosely related to the target tasks represented by the two questions above. For each source task, we learn a model on a large set of historical titles, leveraging information such as title metadata (e.g., genre, runtime, series or film) as well as tags or text summaries curated by domain experts describing thematic/plot elements. Once we learn this model, we extract model parameters constituting a numerical representation or embedding of the title. These embeddings are then used as inputs to downstream models specialized on the target tasks for a smaller set of titles directly relevant for content decisions (Figure 1). All models were developed and deployed using Metaflow, Netflix’s open source framework for bringing models into production.

To assess the usefulness of these embeddings, we look at two indicators: 1) Do they improve the performance on the target task via downstream models? And just as importantly, 2) Are they useful to our creative partners, i.e. do they lend insight or facilitate apt comparisons (e.g., revealing that a pair of titles attracts similar audiences, or that a pair of countries have similar viewing behavior)? These considerations are key in informing subsequent lines of research and innovation.

Figure 1: Similar title identification and audience sizing can be supported by a common learned title embedding.

Similar titles

In entertainment, it is common to contextualize a new project in terms of existing titles. For example, a creative executive developing a title might wonder: Does this teen movie have more of the wholesome, romantic vibe of To All the Boys I’ve Loved Before or more of the dark comedic bent of The End of the F***ing World? Similarly, a marketing executive refining her “elevator pitch” might summarize a title with: “The existential angst of Eternal Sunshine of the Spotless Mind meets the surrealist flourishes of The One I Love.”

To make these types of comparisons even richer we “embed” titles in a high-dimensional space or “similarity map,” wherein more similar titles appear closer together with respect to a spatial distance metric such as Euclidean distance. We can then use this similarity map to identify clusters of titles that share common elements (Figure 2), as well as surface candidate similar titles for an unlaunched title.
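
As a minimal sketch of that lookup, assuming we already have one embedding vector per title, the snippet below ranks titles by Euclidean distance; the titles and vectors are invented for illustration.

    import numpy as np

    # Invented titles and 3-dimensional embeddings, purely for illustration.
    title_embeddings = {
        "Dark Teen Comedy A": np.array([0.9, 0.1, 0.3]),
        "Wholesome Teen Romance B": np.array([0.2, 0.8, 0.4]),
        "Surreal Indie Drama C": np.array([0.5, 0.5, 0.9]),
    }

    def most_similar(query_title, k=2):
        """Return the k titles closest to `query_title` by Euclidean distance."""
        query_vec = title_embeddings[query_title]
        distances = {
            title: float(np.linalg.norm(vec - query_vec))
            for title, vec in title_embeddings.items()
            if title != query_title
        }
        return sorted(distances, key=distances.get)[:k]

    print(most_similar("Dark Teen Comedy A"))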

Notably, there is no “ground truth” about what is similar: embeddings optimized on different source tasks will yield different similarity maps. For example, if we derive our embeddings from a model that classifies genre, the resulting map will minimize the distance between titles that are thematically similar (Figure 2). By contrast, embeddings derived from a model that predicts audience size will align titles with similar performance characteristics. By offering multiple views into how a given title is situated within the broader content universe, these similarity maps offer a valuable tool for ideation and exploration for our creative decision makers.

Figure 2: t-SNE visualization of embeddings learned from the content categorization task.

Transfer learning for audience sizing

Another crucial input for content decision makers is an estimate of how large the potential audience will be (and ideally, how that audience breaks down geographically). For example, knowing that a title will likely drive a primary audience in Spain along with sizable audiences in Mexico, Brazil, and Argentina would aid in deciding how best to promote it and what localized assets (subtitles, dubbings) to create ahead of time.

Predicting the potential audience size of a title is a complex problem in its own right, and we leave a more detailed treatment for the future. Here, we simply highlight how embeddings can be leveraged to help tackle this problem. We can include any combination of the following as features in a supervised modeling framework that predicts audience size in a given country:

  • Embedding of a title
  • Embedding of a country we’d like to predict audience size in
  • Audience sizes of past titles with similar embeddings (or some aggregation of them)
Figure 3: How we can use transfer-learned embeddings to help with demand prediction.

As an example, if we are trying to predict the audience size of a dark comedic title in Brazil, we can leverage the aforementioned similarity maps to identify similar dark comedies with an observed audience size in Brazil. We can then include these observed audience sizes (or some weighted average based on similarity) as features. These features are interpretable (they are associated with known titles and one can reason/debate about whether those titles’ performances should factor into the prediction) and significantly improve prediction accuracy.
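
One hedged way to compute such a feature is a similarity-weighted average of observed audience sizes from the nearest historical titles, as sketched below; the weighting scheme and data are illustrative rather than the production model.

    import numpy as np

    def similarity_weighted_audience(candidate_vec, history, country="BR"):
        """Average past titles' audience sizes in one country, weighted by similarity.

        `history` is a list of (embedding, {country: audience_size}) pairs.
        Inverse distance is used as the weight here; a real model could learn it.
        """
        weights, sizes = [], []
        for vec, sizes_by_country in history:
            if country not in sizes_by_country:
                continue
            weights.append(1.0 / (1.0 + float(np.linalg.norm(vec - candidate_vec))))
            sizes.append(sizes_by_country[country])
        return float(np.average(sizes, weights=weights)) if sizes else None

    # Hypothetical historical titles with observed Brazilian audience sizes.
    history = [(np.array([0.9, 0.1]), {"BR": 1_200_000}),
               (np.array([0.7, 0.2]), {"BR": 800_000})]
    feature = similarity_weighted_audience(np.array([0.85, 0.15]), history)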

Learning embeddings

How do we produce these embeddings? The first step is to identify source tasks that will produce useful embeddings for downstream model consumption. Here we discuss two types of tasks: supervised and self-supervised.

Supervised

A major motivation for transfer learning is to “pre-train” model parameters by first learning them on a related source task for which we have more training data. Inspecting the data we have on hand, we find that for any title on our service with sufficient viewing data, we can (1) categorize the title based on who watched it (a.k.a. “content category”) and (2) observe how many subscribers watched it in each country (“audience size”). From this title-level information, we devise the following supervised learning tasks:

  • {metadata, tags, summaries} → content category
  • {metadata, tags, summaries, country} → audience size in country

When implementing specific solutions to these tasks, two important modeling decisions we need to make are selecting a) a suitable method (“encoder”) for converting title-level features (metadata, tags, summaries) into an amenable representation for a predictive model and b) a model (“predictor”) that predicts labels (content category, audience size) given an encoded title. Since our goal is to learn somewhat general-purpose embeddings that can plug into multiple use cases, we generally prefer parameter-rich models for the encoder and simpler models for the predictor.

Our choice of encoder (Figure 4) depends on the type of input. For text-based summaries, we leverage pre-trained models like BERT to provide context-dependent word embeddings that are then run through a recurrent neural network style architecture, such as a bidirectional LSTM or GRU. For tags, we directly learn tag representations by considering each title as a tag collection, or a “bag-of-tags”. For audience size models where predictions are country-specific, we also directly learn country embeddings and concatenate the resulting embedding to the tag or summary-based representation. Essentially, conversion of each tag and country to its resulting embedding is done via a lookup table.

Likewise, the predictor depends on the task. For category prediction, we train a linear model on top of the encoder representation, apply a softmax operation, and minimize the negative log likelihood. For audience size prediction, we use a single hidden-layer feedforward neural network to minimize the mean squared error for a given title-country pair. Both the encoder and predictor models are optimized via backpropagation, and the representation produced by the optimized encoder is used in downstream models.

Figure 4: encoder architectures to handle various kinds of title-related inputs. For text summaries, we first convert each word to its context-dependent representation via BERT or a related model, followed by a biGRU to convert the sequence of embeddings to a single (final-state) representation. For tags, we compute the average tag representation (since each title is associated with multiple tags).
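
To make the encoder-predictor split concrete, here is a minimal PyTorch sketch of the tag-and-country branch for audience size prediction. It leaves out the text branch (BERT followed by a biGRU), and the vocabulary sizes, layer widths, and toy batch are assumptions rather than our actual configuration.

    import torch
    import torch.nn as nn

    class TagCountryEncoder(nn.Module):
        """Averages tag embeddings (the "bag-of-tags") and appends a country embedding."""
        def __init__(self, n_tags, n_countries, dim=32):
            super().__init__()
            self.tag_bag = nn.EmbeddingBag(n_tags, dim, mode="mean")
            self.country = nn.Embedding(n_countries, dim)

        def forward(self, tag_ids, country_id):
            return torch.cat([self.tag_bag(tag_ids), self.country(country_id)], dim=1)

    class AudiencePredictor(nn.Module):
        """Single hidden-layer feedforward head trained with mean squared error."""
        def __init__(self, in_dim=64, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

        def forward(self, title_repr):
            return self.net(title_repr).squeeze(-1)

    encoder = TagCountryEncoder(n_tags=10_000, n_countries=190)
    predictor = AudiencePredictor()
    tag_ids = torch.randint(0, 10_000, (8, 12))   # toy batch: 8 titles, 12 tags each
    country_id = torch.randint(0, 190, (8,))
    audience = torch.rand(8) * 1e6                # toy audience sizes
    loss = nn.MSELoss()(predictor(encoder(tag_ids, country_id)), audience)
    loss.backward()                               # trains encoder and predictor jointly
    # After training, encoder(...) is the title embedding reused by downstream models.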

Self-supervised

Knowledge graphs are abstract graph-based data structures which encode relations (edges) between entities (nodes). Each edge in the graph, i.e. head-relation-tail triple, is known as a fact, and in this way a set of facts (i.e. “knowledge”) results in a graph. However, the real power of the graph is the information contained in the relational structure.

At Netflix, we apply this concept to the knowledge contained in the content universe. Consider a simplified graph whose nodes consist of three entity types: {titles, books, metadata tags} and whose edges encode relationships between them (e.g., “Apocalypse Now is based on Heart of Darkness” ; “21 Grams has a storyline around moral dilemmas”) as illustrated in Figure 5. These facts can be represented as triples (h, r, t), e.g. (Apocalypse Now, based_on, Heart of Darkness), (21 Grams, storyline, moral dilemmas). Next, we can craft a self-supervised learning task where we randomly select edges in the graph to form a test set, and condition on the rest of the graph to predict these missing edges. This task, also known as link prediction, allows us to learn embeddings for all entities in the graph. There are a number of approaches to extract embeddings and our current approach is based on the TransE algorithm. TransE learns an embedding F that minimizes the average Euclidean distance between (F(h) + F(r)) and F(t).

Figure 5: Left: Illustration of a graph relating titles, books, and thematic elements to each other. Right: Illustration of translational embeddings in which the sum of the head and relation embeddings approximates the tail embedding.
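
Below is a minimal PyTorch sketch of a TransE-style training step using a margin ranking loss over corrupted triples; the entity counts, margin, and triples are invented, and the real training setup is considerably richer.

    import torch
    import torch.nn as nn

    n_entities, n_relations, dim = 1_000, 20, 64   # toy sizes
    entity_emb = nn.Embedding(n_entities, dim)
    relation_emb = nn.Embedding(n_relations, dim)

    def transe_distance(h, r, t):
        """Euclidean distance between the translated head, F(h) + F(r), and the tail F(t)."""
        return (entity_emb(h) + relation_emb(r) - entity_emb(t)).norm(dim=-1)

    # One training step: a true triple should score lower than a corrupted one.
    h, r, t = torch.tensor([3]), torch.tensor([1]), torch.tensor([42])
    t_corrupt = torch.randint(0, n_entities, (1,))           # random negative tail
    margin = 1.0
    loss = torch.clamp(margin + transe_distance(h, r, t)
                       - transe_distance(h, r, t_corrupt), min=0).mean()
    loss.backward()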

The self-supervision is crucial since it allows us to train on titles both on and off our service, expanding the training set considerably and unlocking more gains from transfer learning. The resulting embeddings can then be used in the aforementioned similarity models and audience sizing models.

Epilogue

Making great content is hard. It involves many different factors and requires considerable investment, all for an outcome that is very difficult to predict. The success of our titles is ultimately determined by our members, and we must do our best to serve their needs given the tools and data we have. We identified two ways to support content decision makers: surfacing similar titles and predicting audience size, drawing from various areas such as transfer learning, embedding representations, natural language processing, and supervised learning. Surfacing these types of insights in a scalable manner is becoming ever more crucial as both our subscriber base and catalog grow and become increasingly diverse. If you’d like to be a part of this effort, please contact us!


Supporting content decision makers with machine learning was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Keeping Netflix Reliable Using Prioritized Load Shedding

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94

How viewers are able to watch their favorite show on Netflix while the infrastructure self-recovers from a system failure

By Manuel Correa, Arthur Gonigberg, and Daniel West

Getting stuck in traffic is one of the most frustrating experiences for drivers around the world. Everyone slows to a crawl, sometimes for a minor issue or sometimes for no reason at all. As engineers at Netflix, we are constantly reevaluating how to redesign traffic management. What if we knew the urgency of each traveler and could selectively route cars through, rather than making everyone wait?

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. Yet, as recently as last year, our systems were susceptible to metaphorical traffic jams; we had on/off circuit breakers, but no progressive way to shed load. Motivated by improving the lives of our members, we’ve introduced priority-based progressive load shedding.

The animation below shows the behavior of the Netflix viewer experience when the backend is throttling traffic based on priority. While the lower priority requests are throttled, the playback experience remains uninterrupted and the viewer is able to enjoy their title. Let’s dig into how we accomplished this.

Failure can occur due to a myriad of reasons: misbehaving clients that trigger a retry storm, an under-scaled service in the backend, a bad deployment, a network blip, or issues with the cloud provider. All such failures can put a system under unexpected load, and at some point in the past, every single one of these examples has prevented our members’ ability to play. With these incidents in mind, we set out to make Netflix more resilient with these goals:

  1. Consistently prioritize requests across device types (Mobile, Browser, and TV)
  2. Progressively throttle requests based on priority
  3. Validate assumptions by using Chaos Testing (deliberate fault injection) for requests of specific priorities

The resulting architecture that we envisioned with priority throttling and chaos testing included is captured below.

High level playback architecture with priority throttling and chaos testing

Building a request taxonomy

We decided to focus on three dimensions in order to categorize request traffic: throughput, functionality, and criticality. Based on these characteristics, traffic was classified into the following:

  • NON_CRITICAL: This traffic does not affect playback or members’ experience. Logs and background requests are examples of this type of traffic. These requests are usually high throughput which contributes to a large percentage of load in the system.
  • DEGRADED_EXPERIENCE: This traffic affects members’ experience, but not the ability to play. The traffic in this bucket is used for features like: stop and pause markers, language selection in the player, viewing history, and others.
  • CRITICAL: This traffic affects the ability to play. Members will see an error message when they hit play if the request fails.

Using attributes of the request, the API gateway service (Zuul) categorizes the requests into NON_CRITICAL, DEGRADED_EXPERIENCE and CRITICAL buckets, and computes a priority score between 1 and 100 for each request given its individual characteristics. The computation is done as a first step so that it is available for the rest of the request lifecycle.

Most of the time, the request workflow proceeds normally without taking the request priority into account. However, as with any service, sometimes we reach a point when either one of our backends is in trouble or Zuul itself is in trouble. When that happens requests with higher priority get preferential treatment. The higher priority requests will get served, while the lower priority ones will not. The implementation is analogous to a priority queue with a dynamic priority threshold. This allows Zuul to drop requests with a priority lower than the current threshold.

Finding the best place to throttle traffic

Zuul can apply load shedding in two moments during the request lifecycle: when it routes requests to a specific back-end service (service throttling) or at the time of initial request processing, which affects all back-end services (global throttling).

Service throttling

Zuul can sense when a back-end service is in trouble by monitoring the error rates and concurrent requests to that service. Those two metrics are approximate indicators of failures and latency. When the threshold percentage for one of these two metrics is crossed, we reduce load on the service by throttling traffic.

Global throttling

Another case is when Zuul itself is in trouble. As opposed to the scenario above, global throttling will affect all back-end services behind Zuul, rather than a single back-end service. The impact of this global throttling can cause much bigger problems for members. The key metrics used to trigger global throttling are CPU utilization, concurrent requests, and connection count. When any of the thresholds for those metrics are crossed, Zuul will aggressively throttle traffic to keep itself up and healthy while the system recovers. This functionality is critical: if Zuul goes down, no traffic can get through to our backend services, resulting in a total outage.

Introducing priority-based progressive load shedding

Once we had the prioritization piece in place, we were able to combine it with our load shedding mechanism to dramatically improve streaming reliability. When we’re in a bad situation (i.e. any of the thresholds above are exceeded), we progressively drop traffic, starting with the lowest priority. A cubic function is used to manage the level of throttling. If things get really, really bad the level will hit the sharp side of the curve, throttling everything.

The graph above is an example of how the cubic function is applied. As the overload percentage increases (i.e. how far we are into the range between the throttling threshold and the max capacity), the priority threshold trails it very slowly: at 35% overload, it is still in the mid-90s. If the system continues to degrade, the threshold drops to priority 50 at 80% overload and eventually to 10 at 95%, as sketched below.
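
The exact function and constants are an implementation detail, but a simple cubic reproduces the rough shape of the numbers above. In the sketch below, a score of 1 marks the most critical request and 100 the least critical (the reading that makes the quoted thresholds consistent), and a request is shed once its score exceeds the current threshold; this is illustrative, not Zuul's actual code.

    def priority_threshold(overload_fraction):
        """Highest priority score still allowed through (100 keeps everything)."""
        overload_fraction = min(max(overload_fraction, 0.0), 1.0)
        return 100.0 * (1.0 - overload_fraction ** 3)

    def should_shed(request_priority, overload_fraction):
        """Shed a request once its score exceeds the cubic threshold."""
        return request_priority > priority_threshold(overload_fraction)

    for overload in (0.35, 0.80, 0.95):
        print(overload, round(priority_threshold(overload)))   # roughly 96, 49, 14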

Given that a relatively small amount of requests impact streaming availability, throttling low priority traffic may affect certain product features but will not prevent members pressing “play” and watching their favorite show. By adding progressive priority-based load shedding, Zuul can shed enough traffic to stabilize services without members noticing.

Handling retry storms

When Zuul decides to drop traffic, it sends a signal to devices to let them know that we need them to back off. It does this by indicating how many retries they can perform and what kind of time window they can perform them in. For example:

{ "maxRetries": <max-retries>, "retryAfterSeconds": <seconds> }

Using this backpressure mechanism, we can stop retry storms much faster than we could in the past. We automatically adjust these two dials based on the priority of the request. Requests with higher priority will retry more aggressively than lower ones, also increasing streaming availability.
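
On the device side, honoring those hints might look like the hedged sketch below; each client platform implements its own retry logic, so the function and field handling here are purely illustrative.

    import time

    def retry_with_backpressure(send_request):
        """Retry a failed request only as often, and as soon, as the gateway allows.

        `send_request` is assumed to return (ok, hints), where `hints` carries the
        maxRetries / retryAfterSeconds values from the response shown above.
        """
        ok, hints = send_request()
        attempts = 0
        while not ok and attempts < hints.get("maxRetries", 0):
            time.sleep(hints.get("retryAfterSeconds", 1))
            ok, hints = send_request()
            attempts += 1
        return ok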

Validating which requests are right for the job

To validate our request taxonomy assumptions on whether a specific request fell into the NON_CRITICAL, DEGRADED_EXPERIENCE, or CRITICAL bucket, we needed a way to test the user’s experience when that request was shed. To accomplish this, we leveraged our internal failure injection tool (FIT) and created a failure injection point in Zuul that allowed us to shed any request based on a supplied priority. This enabled us to manually simulate a load shedded experience by blocking ranges of priorities for a specific device or member, giving us an idea of which requests could be safely shed without impacting the user.

Continually ensuring those requests are still right for the job

One of the goals here is to reduce members’ pain by shedding requests that are not expected to affect the user’s streaming experience. However, Netflix changes quickly and requests that were thought to be noncritical can unexpectedly become critical. In addition, Netflix has a wide variety of client devices, client versions, and ways to interact with the system. To make sure we weren’t causing members pain when throttling NON_CRITICAL requests in any of these scenarios, we leveraged our infrastructure experimentation platform ChAP.

This platform allows us to stage an A/B experiment that will allocate a small number of production users to either a control or treatment group for 45 minutes while throttling a range of priorities for the treatment group. This lets us capture a variety of live use cases and measure the impact to their playback experience. ChAP analyzes the members’ KPIs per device to determine if there is a deviation between the control and the treatment groups.

In our first experiment, we detected a race condition in both Android and iOS devices for a low priority request that caused sporadic playback errors. Since we practice continuous experimentation, once the initial experiments were run and the bugs were fixed, we scheduled them to run on a periodic basis. This allows us to detect regressions early and keep users streaming.

Experiment regression detection before and after fix (SPS indicates streaming availability)

Reaping the benefits

In 2019, before progressive load shedding was in place, the Netflix streaming services experienced an outage in which a sizable percentage of members were unable to play for a period of time. In 2020, days after the implementation was deployed, the team started seeing the benefit of the solution. Netflix experienced a similar issue with the same potential impact as the outage seen in 2019. Unlike then, Zuul’s progressive load shedding kicked in and started shedding traffic until the service was in a healthy state, without impacting members’ ability to play at all.

The graph below shows the streaming availability metric, streams per second (SPS), holding stable while Zuul performs progressive load shedding based on request priority during the incident. The different colors in the graph represent requests with different priorities being throttled.

Members were happily watching their favorite show on Netflix while the infrastructure was self-recovering from a system failure.

We are not done yet

For future work, the team is looking into expanding the use of request priority for other use cases like better retry policies between devices and back-ends, dynamically changing load shedding thresholds, tuning the request priorities using Chaos Testing as a guiding principle, and other areas that will make Netflix even more resilient.

If you’re interested in helping Netflix stay up in the face of shifting systems and unexpected failures, reach out to us. We’re hiring!


Keeping Netflix Reliable Using Prioritized Load Shedding was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Day in the Life of a Content Analytics Engineer

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/a-day-in-the-life-of-a-content-analytics-engineer-eb0250b993be

Part of our series on who works in Analytics at Netflix — and what the role entails

by Rocio Ruelas

Back when we were all working in offices, my favorite days were Monday, Wednesday, and Friday. Those were the days with the best hot breakfast, and I’ve always been a sucker for free food. I started the day by arriving at the LA office right before 8am and finding a parking spot close to the entrance. I would greet the familiar faces at the reception desk and take a moment to check out which Netflix Original was currently being projected across the lobby. Take the elevator uninterrupted up to the top floor. Grab myself a plate of scrambled eggs, salsa, and bacon. Pour myself some coffee. Then sit at a small table next to the floor-to-ceiling windows with a clear view of the Hollywood sign.

My morning journey from lobby to elevators to breakfast (Photo Credit: Netflix)

During the day, the LA office buzzes with excitement and conversation. My time in the morning is like the calm before the storm — a chance to reflect before my head is full of numbers and figures. I often think about all the things that led me to becoming a Netflix employee. From my family immigrating to the United States from Mexico when I was very young to the teachers and professors that encouraged a low income student like me to dream big. It has been a journey and I’m grateful to be at a place that values the voice I bring to the table.

At the time of posting we’re working from home due to the pandemic, so my days look a bit different: The hot breakfasts are not as consistent and conversations are mainly with my dog. We still find ways to keep connected, but I for one am looking forward to when the office is fully open and I can look out to the Hollywood sign again.

Ok. But what do I actually do? (Besides eating breakfast)

What do I do at Netflix?

I’m a Senior Analytics Engineer on the Content and Marketing Analytics Research team. My team focuses on innovating and maintaining the metrics Netflix uses to understand performance of our shows and films on the service. We partner closely with the business strategy team to provide as much information as we can to our content executives, so that — combined with their industry experience — they can make the best decisions for Netflix.

Being an Analytics Engineer is like being a hybrid of a librarian 📚 and a Swiss army knife 🛠️: Two good things to have on hand when you’re not quite sure what you will need. Like a librarian, I have access to an encyclopedia of knowledge about our content data and have become the resident expert in one of our most important internal metrics. And like a Swiss army knife, I possess a multitude of tools to get the job done — be it SQL, Jupyter Notebooks, Tableau, or Google Sheets.

One of my favorite things about being an Analytics Engineer is the variety. I have some days where I am brainstorming and collaborating with amazing colleagues and other days where I can put my headphones on to work out a tough problem or build a dashboard.

One of my current projects involves understanding how viewing habits have evolved over the past several years. We started out with a small working group where we brainstormed the key questions to address, what data we could use to answer said questions, and came up with a work plan for how the analysis might take shape. Then I put on my headphones and got to work, writing SQL and using Tableau to present the data in a useful way. We met frequently to discuss our findings and iterate on the analysis. The great thing about these working groups is that we each contribute different skills and ideas. We benefit from both our individual strengths and our willingness to collaborate — Our values of Selflessness and Inclusion, in action.

How did I become interested in Analytics?

I did not set out from the start to be an Analyst. I never had a 5 year plan and my path has been a winding one.

Yours truly, featuring part of my extensive Netflix apparel collection
Yours truly, featuring part of my extensive Netflix apparel collection

In college, I majored in Physics because it was “the science that explains all the other sciences”. But what I ended up liking most about it was the math. Between that and the fact that there aren’t many entry-level physics jobs, I pursued a PhD in Applied Mathematics. This turned out to be a wise choice as I avoided entering the workforce right before the 2008 recession.

I loved grad school. The lectures, the research, and most of all the lifelong friendships. But as much as I enjoyed being a student, the academic track wasn’t for me. So without much of a plan I headed back home to California after graduation.

Looking around to see what I could do with my Applied Math background, I quickly settled on Data Science. I wasn’t well versed in it but I knew it was in demand. I started my new data science career as an analyst at a small marketing company. I had an incredible boss who encouraged me to learn new skills on the job. I honed my SQL and Python skills and implemented a clustering model. I also got my first introduction to working for an actual business.

Later on I went to Hulu to grow in the core skills of a data scientist. But while the predictive modeling I was doing was interesting and challenging, I missed being close to the business. As an analyst, I got to attend more meetings with the decision makers and be part of the conversation.

So by the time the opportunity arose to interview for a position at Netflix, I had figured out that Analytics was the best area for me.

It has been a journey and I’m grateful to be at a place that values the voice I bring to the table.

Why Netflix?

Growing up I watched a lot of TV. I mean a lot of TV. But I never thought I could actually work in the TV and Film business. I feel incredibly fortunate to be working at a job I am passionate about and to be at a company that brings joy to people around the world.

Even though I’d been a loyal Netflix customer since the DVD days, I had not heard about their unique culture until I started interviewing. When I did read the culture doc (which I recently learned is also published in Spanish and 12 other languages!), it sounded pretty intimidating. Phrases like “high performance” and “dream team” made me imagine an almost gladiator-style workplace. But I quickly learned this wasn’t the case. Through a combination of my existing network, the interview process, and other online resources about the company, I found that folks are actually very friendly and helpful! Everyone just wants to do their best work and help you do your best work too. Think more The Great British Baking Show and less Hell’s Kitchen. Selflessness really is embraced as an important Netflix value.

Having been here for 3 years now, I can say that working at Netflix is really special. The company is always evolving, big decisions are made in a transparent way, and I’m encouraged to voice my thoughts. But the single most important factor is the people. My Content Analytics teammates continuously impress me not only with their quality of work, but also with their kindness and mutual trust. This foundation makes innovating more fun, lets us be open about our passions outside of work, and means we genuinely enjoy each other’s company. That balance is crucial for me and is why this truly is the place where I can do my best work.

If this post resonates with you and you’d like to explore opportunities with Netflix, check out our analytics site, search open roles, and learn about our culture. You can also find more stories like this here.


A Day in the Life of a Content Analytics Engineer was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building Netflix’s Distributed Tracing Infrastructure

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/building-netflixs-distributed-tracing-infrastructure-bb856c319304

by Maulik Pandey

Our Team — Kevin Lew, Narayanan Arunachalam, Elizabeth Carretto, Dustin Haffner, Andrei Ushakov, Seth Katz, Greg Burrell, Ram Vaithilingam, Mike Smith and Maulik Pandey

“@Netflixhelps Why doesn’t Tiger King play on my phone?” — a Netflix member via Twitter

This is an example of a question our on-call engineers need to answer to help resolve a member issue — which is difficult when troubleshooting distributed systems. Investigating a video streaming failure consists of inspecting all aspects of a member account. In our previous blog post we introduced Edgar, our troubleshooting tool for streaming sessions. Now let’s look at how we designed the tracing infrastructure that powers Edgar.

Distributed Tracing: the missing context in troubleshooting services at scale

Prior to Edgar, our engineers had to sift through a mountain of metadata and logs pulled from various Netflix microservices in order to understand a specific streaming failure experienced by any of our members. Reconstructing a streaming session was a tedious and time-consuming process that involved tracing all interactions (requests) between the Netflix app, our Content Delivery Network (CDN), and backend microservices. The process started with a manual pull of member account information that was part of the session. The next step was to put all the puzzle pieces together and hope the resulting picture would help resolve the member issue. We needed to increase engineering productivity via distributed request tracing.

If we had an ID for each streaming session then distributed tracing could easily reconstruct session failure by providing service topology, retry and error tags, and latency measurements for all service calls. We could also get contextual information about the streaming session by joining relevant traces with account metadata and service logs. This insight led us to build Edgar: a distributed tracing infrastructure and user experience.

Figure 1. Troubleshooting a session in Edgar

When we started building Edgar four years ago, there were very few open-source distributed tracing systems that satisfied our needs. Our tactical approach was to use Netflix-specific libraries for collecting traces from Java-based streaming services until open source tracer libraries matured. By 2017, open source projects like Open-Tracing and Open-Zipkin were mature enough for use in polyglot runtime environments at Netflix. We chose Open-Zipkin because it had better integrations with our Spring Boot based Java runtime environment. We use Mantis for processing the stream of collected traces, and we use Cassandra for storing traces. Our distributed tracing infrastructure is grouped into three sections: tracer library instrumentation, stream processing, and storage. Traces collected from various microservices are ingested in a stream processing manner into the data store. The following sections describe our journey in building these components.

Trace Instrumentation: how will it impact our service?

That is the first question our engineering teams asked us when integrating the tracer library. It is an important question because tracer libraries intercept all requests flowing through mission-critical streaming services. Safe integration and deployment of tracer libraries in our polyglot runtime environments was our top priority. We earned the trust of our engineers by developing empathy for their operational burden and by focusing on providing efficient tracer library integrations in runtime environments.

Distributed tracing relies on propagating context for local interprocess calls (IPC) and client calls to remote microservices for any arbitrary request. Passing the request context captures causal relationships between microservices during runtime. We adopted Open-Zipkin’s B3 HTTP header based context propagation mechanism. We ensure that context propagation headers are correctly passed between microservices across a variety of our “paved road” Java and Node runtime environments, which include both older environments with legacy codebases and newer environments such as Spring Boot. We execute the Freedom & Responsibility principle of our culture in supporting tracer libraries for environments like Python, NodeJS, and Ruby on Rails that are not part of the “paved road” developer experience. Our loosely coupled but highly aligned engineering teams have the freedom to choose an appropriate tracer library for their runtime environment and have the responsibility to ensure correct context propagation and integration of network call interceptors.
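
For intuition, here is a small Python sketch of B3 context propagation. In practice the tracer library and HTTP client interceptors handle this rather than hand-written code, and the identifiers below are made-up values.

    import secrets

    B3_TRACE, B3_SPAN, B3_PARENT, B3_SAMPLED = (
        "X-B3-TraceId", "X-B3-SpanId", "X-B3-ParentSpanId", "X-B3-Sampled")

    def child_headers(incoming):
        """Build B3 headers for a downstream call from an incoming request's headers.

        The trace id is carried through unchanged, the current span becomes the
        parent, and a fresh span id is minted for the outgoing call.
        """
        return {
            B3_TRACE: incoming.get(B3_TRACE, secrets.token_hex(16)),
            B3_PARENT: incoming.get(B3_SPAN, ""),
            B3_SPAN: secrets.token_hex(8),
            B3_SAMPLED: incoming.get(B3_SAMPLED, "1"),
        }

    incoming = {"X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",
                "X-B3-SpanId": "e457b5a2e4d86bd1"}
    print(child_headers(incoming))   # attach these to the outgoing HTTP request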

Our runtime environment integrations inject infrastructure tags like service name, auto-scaling group (ASG), and container instance identifiers. Edgar uses this infrastructure tagging schema to query and join traces with log data for troubleshooting streaming sessions. Additionally, it became easy to provide deep links to different monitoring and deployment systems in Edgar due to consistent tagging. With runtime environment integrations in place, we had to set an appropriate trace data sampling policy for building a troubleshooting experience.

Stream Processing: to sample or not to sample trace data?

This was the most important question we considered when building our infrastructure because data sampling policy dictates the amount of traces that are recorded, transported, and stored. A lenient trace data sampling policy generates a large number of traces in each service container and can lead to degraded performance of streaming services as more CPU, memory, and network resources are consumed by the tracer library. An additional implication of a lenient sampling policy is the need for scalable stream processing and storage infrastructure fleets to handle increased data volume.

We knew that a heavily sampled trace dataset is not reliable for troubleshooting because there is no guarantee that the request you want is in the gathered samples. We needed a thoughtful approach for collecting all traces in the streaming microservices while keeping low operational complexity of running our infrastructure.

Most distributed tracing systems enforce sampling policy at the request ingestion point in a microservice call graph. We took a hybrid head-based sampling approach that allows for recording 100% of traces for a specific and configurable set of requests, while continuing to randomly sample traffic per the policy set at ingestion point. This flexibility allows tracer libraries to record 100% traces in our mission-critical streaming microservices while collecting minimal traces from auxiliary systems like offline batch data processing. Our engineering teams tuned their services for performance after factoring in increased resource utilization due to tracing. The next challenge was to stream large amounts of traces via a scalable data processing platform.
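
A toy version of that hybrid head-based decision is sketched below; the always-trace set and default rate are hypothetical configuration values, and the real policy is evaluated by the tracer library at the ingestion point.

    import random

    ALWAYS_TRACE_PATHS = {"/playback/start", "/license"}   # hypothetical critical paths
    DEFAULT_SAMPLE_RATE = 0.001                             # hypothetical 0.1% default

    def head_sampling_decision(request_path):
        """Record 100% of configured critical paths; randomly sample everything else."""
        if request_path in ALWAYS_TRACE_PATHS:
            return True
        return random.random() < DEFAULT_SAMPLE_RATE

    # The decision is made once and propagated downstream (e.g. via the B3
    # Sampled header) so every service in the call graph traces consistently.
    print(head_sampling_decision("/playback/start"))   # always True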

Mantis is our go-to platform for processing operational data at Netflix. We chose Mantis as our backbone to transport and process large volumes of trace data because we needed a backpressure-aware, scalable stream processing system. Our trace data collection agent transports traces to Mantis job cluster via the Mantis Publish library. We buffer spans for a time period in order to collect all spans for a trace in the first job. A second job taps the data feed from the first job, does tail sampling of data and writes traces to the storage system. This setup of chained Mantis jobs allows us to scale each data processing component independently. An additional advantage of using Mantis is the ability to perform real-time ad-hoc data exploration in Raven using the Mantis Query Language (MQL). However, having a scalable stream processing platform doesn’t help much if you can’t store data in a cost efficient manner.

Storage: don’t break the bank!

We started with Elasticsearch as our data store due to its flexible data model and querying capabilities. As we onboarded more streaming services, the trace data volume started increasing exponentially. The operational burden of scaling Elasticsearch clusters under the high data write rate became painful for us. Read queries took increasingly longer to finish because the clusters were spending heavy compute resources creating indexes on ingested traces. The high data ingestion rate eventually degraded both read and write operations. We solved this by migrating to Cassandra as our data store for handling high data ingestion rates. Using simple lookup indices in Cassandra gives us the ability to maintain acceptable read latencies while doing heavy writes.

In theory, scaling up horizontally would allow us to handle higher write rates and retain larger amounts of data in Cassandra clusters. This implies that the cost of storing traces grows linearly with the amount of data being stored. We needed to ensure storage cost growth was sub-linear to the amount of data being stored. In pursuit of this goal, we outlined the following storage optimization strategies:

  1. Use cheaper Elastic Block Store (EBS) volumes instead of SSD instance stores in EC2.
  2. Employ better compression technique to reduce trace data size.
  3. Store only relevant and interesting traces by using simple rules-based filters.

We were adding new Cassandra nodes whenever the EC2 SSD instance stores of existing nodes reached maximum storage capacity. The use of a cheaper EBS Elastic volume instead of an SSD instance store was an attractive option because AWS allows dynamic increase in EBS volume size without re-provisioning the EC2 node. This allowed us to increase total storage capacity without adding a new Cassandra node to the existing cluster. In 2019 our stunning colleagues in the Cloud Database Engineering (CDE) team benchmarked EBS performance for our use case and migrated existing clusters to use EBS Elastic volumes. By optimizing the Time Window Compaction Strategy (TWCS) parameters, they reduced the disk write and merge operations of Cassandra SSTable files, thereby reducing the EBS I/O rate. This optimization helped us reduce the data replication network traffic amongst the cluster nodes because SSTable files were created less often than in our previous configuration. Additionally, by enabling Zstd block compression on Cassandra data files, the size of our trace data files was reduced by half. With these optimized Cassandra clusters in place, it now costs us 71% less to operate clusters and we could store 35x more data than our previous configuration.

We observed that Edgar users explored less than 1% of collected traces. This insight led us to believe that we can reduce write pressure and retain more data in the storage system if we drop traces that users will not care about. We currently use a simple rule-based filter in our Storage Mantis job that retains interesting traces even for service call paths that are rarely viewed in Edgar. The filter qualifies a trace as an interesting data point by inspecting all buffered spans of a trace for warnings, errors, and retry tags. This tail-based sampling approach reduced the trace data volume by 20% without impacting user experience. There is an opportunity to use machine learning based classification techniques to further reduce trace data volume.
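
A stripped-down version of that rule is easy to picture; the span structure and tag names below are illustrative rather than our actual trace schema.

    INTERESTING_TAGS = {"error", "warning", "retry"}

    def is_interesting(trace_spans):
        """Keep a trace if any buffered span carries a warning, error, or retry tag."""
        return any(INTERESTING_TAGS & set(span.get("tags", {})) for span in trace_spans)

    trace = [
        {"name": "playback-start", "tags": {"status": "200"}},
        {"name": "license-fetch", "tags": {"retry": "1", "status": "503"}},
    ]
    if is_interesting(trace):
        pass   # write the trace to storage; otherwise drop it to save cost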

While we have made substantial progress, we are now at another inflection point in building our trace data storage system. Onboarding new user experiences on Edgar could require us to store 10x the amount of current data volume. As a result, we are currently experimenting with a tiered storage approach for a new data gateway. This data gateway provides a querying interface that abstracts the complexity of reading and writing data from tiered data stores. Additionally, the data gateway routes ingested data to the Cassandra cluster and transfers compacted data files from Cassandra cluster to S3. We plan to retain the last few hours worth of data in Cassandra clusters and keep the rest in S3 buckets for long term retention of traces.

Table 1. Timeline of Storage Optimizations

Secondary advantages

In addition to powering Edgar, trace data is used for the following use cases:

Application Health Monitoring

Trace data is a key signal used by Telltale in monitoring macro level application health at Netflix. Telltale uses the causal information from traces to infer microservice topology and correlate traces with time series data from Atlas. This approach paints a richer observability portrait of application health.

Resiliency Engineering

Our chaos engineering team uses traces to verify that failures are correctly injected while our engineers stress test their microservices via Failure Injection Testing (FIT) platform.

Regional Evacuation

The Demand Engineering team leverages tracing to improve the correctness of prescaling during regional evacuations. Traces provide visibility into the types of devices interacting with microservices such that changes in demand for these services can be better accounted for when an AWS region is evacuated.

Estimate infrastructure cost of running an A/B test

The Data Science and Product team factors in the costs of running A/B tests on microservices by analyzing traces that have relevant A/B test names as tags.

What’s next?

The scope and complexity of our software systems continue to increase as Netflix grows. We will focus on following areas for extending Edgar:

  • Provide a great developer experience for collecting traces across all runtime environments. With an easy way to try out distributed tracing, we hope that more engineers instrument their services with traces and provide additional context for each request by tagging relevant metadata.
  • Enhance our analytics capability for querying trace data to enable power users at Netflix in building their own dashboards and systems for narrowly focused use cases.
  • Build abstractions that correlate data from metrics, logging, and tracing systems to provide additional contextual information for troubleshooting.

As we progress in building distributed tracing infrastructure, our engineers continue to rely on Edgar for troubleshooting streaming issues like “Why doesn’t Tiger King play on my phone?”. Our distributed tracing infrastructure helps in ensuring that Netflix members continue to enjoy a must-watch show like Tiger King!

We are looking for stunning colleagues to join us on this journey of building distributed tracing infrastructure. If you are passionate about Observability then come talk to us.


Building Netflix’s Distributed Tracing Infrastructure was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Introducing the first video in our new series, Verified, featuring Netflix’s Jason Chan

Post Syndicated from Stephen Schmidt original https://aws.amazon.com/blogs/security/introducing-first-video-new-series-verified-featuring-netflix-jason-chan/

The year has been a profoundly different one for us all, and like many of you, I’ve been adjusting, both professionally and personally, to this “new normal.” Here at AWS we’ve seen an increase in customers looking for secure solutions to maintain productivity in an increased work-from-home world. We’ve also seen an uptick in requests for training; it’s clear, a sense of community and learning are critically important as workforces physically distance.

For these reasons, I’m happy to announce the launch of Verified: Presented by AWS re:Inforce. I’m hosting this series, but I’ll be joined by leaders in cloud security across a variety of industries. The goal is to have an open conversation about the common issues we face in securing our systems and tools. Topics will include how the pandemic is impacting cloud security, tips for creating an effective security program from the ground up, how to create a culture of security, emerging security trends, and more. Learn more by following me on Twitter (@StephenSchmidt), and get regular updates from @AWSSecurityInfo. Verified is just one of the many ways we will continue sharing best practices with our customers during this time. You can find more by reading the AWS Security Blog, reviewing our documentation, visiting the AWS Security and Compliance webpages, watching re:Invent and re:Inforce playlists, and/or reviewing the Security Pillar of Well Architected.

Our first conversation, above, is with Jason Chan, Vice President of Information Security at Netflix. Jason spoke to us about the security program at Netflix, his approach to hiring security talent, and how Zero Trust enables a remote workforce. Jason also has solid insights to share about how he started and grew the security program at Netflix.

“In the early days, what we were really trying to figure out is how do we build a large-scale consumer video-streaming service in the public cloud, and how do you do that in a secure way? There wasn’t a ton of expertise in that, so when I was building the security team at Netflix, I thought, ‘how do we bring in folks from a variety of backgrounds, generalists … to tackle this problem?’”

He also gave his view on how a growing security team can measure ROI. “I think it’s difficult to have a pure equation around that. So what we try to spend our time doing is really making sure that we, as a team, are aligned on what is the most important—what are the most important assets to protect, what are the most critical risks that we’re trying to prevent—and then make sure that leadership is aligned with that, because, as we all know, there’s not unlimited resources, right? You can’t hire an unlimited number of folks or spend an unlimited amount of money, so you’re always trying to figure out how do you prioritize, and how do you find where is going to be the biggest impact for your value?”

Check out Jason’s full interview above, and stay tuned for further videos in this series. If you have an idea or a topic you’d like covered in this series, please drop us a comment below. Thanks!

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Steve Schmidt

Steve is Vice President and Chief Information Security Officer for AWS. His duties include leading product design, management, and engineering development efforts focused on bringing the competitive, economic, and security benefits of cloud computing to business and government customers. Prior to AWS, he had an extensive career at the Federal Bureau of Investigation, where he served as a senior executive and section chief. He currently holds 11 patents in the field of cloud security architecture. Follow Steve on Twitter.

How Our Paths Brought Us to Data and Netflix

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-our-paths-brought-us-to-data-and-netflix-4eced44a6872

Part of our series on who works in Analytics at Netflix — and what the role entails

by Julie Beckley & Chris Pham

This Q&A provides insights into the diverse set of skills, projects, and culture within Data Science and Engineering (DSE) at Netflix through the eyes of two team members: Chris Pham and Julie Beckley.

Photo from a team curling offsite — There’s us to the right!

[Chris] Julie and I joined the Streaming DSE team at Netflix a few years ago and have been close colleagues and friends since then. At work, we regularly lean on each other for help based on our respective areas of expertise — I bring my breadth of big data tools and technologies while Julie has been building statistical models for the past decade. Outside of work, we share a love of good food and coffee, exchanging tips on making espresso.

1. What was your path to working in data?

[Julie] I took a traditional path to data science. Since mathematics was my favorite subject in school, I decided to pursue it for my bachelor’s degree at McGill University (while indulging in French culture in the beautiful city of Montreal). Over the course of the four years, it became clear that I enjoyed combining analytical skills with solving real world problems, so a PhD in Statistics was a natural next step. After completing my education, I was still not certain whether I wanted a job in academia or industry. I took a role as a Research Staff Member at IBM Research, which served as a middle ground with a joint focus on real world applications, academic research, and even allowed me to teach a graduate Machine Learning course! I then transitioned to a full industry role at Netflix.

[Chris] I initially wanted to build a career in consulting after receiving my graduate degree in Economics because I had a passion for analytical problem solving and statistical modeling. A role in data science eventually seemed like a natural transition, but it wasn’t without its hurdles: With my consulting background, I had to go through a few other roles first while learning how to code on the side. A lot of my learning and training was self-guided until 2016, when a manager at my last company took a chance on me and helped me make the rare transfer from a role in HR to Data Science.

2. Tell me about some of the exciting projects you’re a part of.

[Julie] Chris and I have the same primary stakeholders (or engineering team that we support): Encoding Technologies. They are continuously innovating compression algorithms to efficiently send high quality audio and video files to our customers over the internet. I focus on improving experimentation methodology to test how well the newest files are working: do they need fewer bits to stream while providing higher video quality? Do they cause fewer errors? My work is typically developed in R or Python. I love the cross-functional nature of my work, as it allows me to learn from others and creatively explore new statistical methodologies to improve the Netflix service.

[Chris] When I first started working with Encoding Technologies, there was so much data waiting to be translated into actionable insights. It was fun starting from almost nothing and transforming all of that data into self-serve tools and dashboards for the team to understand their contribution to the Netflix streaming experience. These projects have involved using Spark, Python, SQL, Tableau, and Jupyter notebooks. Over the last year, I’ve spent a lot of time analyzing data to inform how we roll out new encoding innovations to the diverse ecosystem of devices that stream Netflix.

3. How do your projects impact the business at Netflix?

[Julie] Encoding experimentation (and more broadly, streaming experimentation) is critical for ensuring our customers have a good Quality of Experience when watching Netflix. In other words, the content you’re about to watch needs to load quickly with high video quality. When we test new encodes, we need effective data science methods to quickly and accurately understand whether customers are having a better experience. With these insights, the engineering teams can quickly understand what’s working well and what needs to be improved. It’s super exciting to see the impact of my work when I hear from friends and family that Netflix is streaming well for them!

[Chris] There are a lot of things to consider when we roll out a new compression algorithm. Which devices get this treatment? What is the benefit to the streaming experience? Is the benefit uniform, or do certain cohorts of members — such as those who stream over a cellular connection — benefit more? How does a decision of this scale affect the efficiency of our globally distributed content delivery network, Open Connect? It’s one big optimization problem that requires balancing several different factors. Streaming DSE is at the center of it all, bringing together different teams at Netflix and using data to drive decisions that impact our members around the world.

4. What does it take to succeed at Netflix in a data role?

[Julie] One of the special things about working at Netflix is that a diverse set of skills and backgrounds is truly appreciated, since there are many ways to add value to the company. From my experience, being proactive in pushing forward on your ideas is key. The values in the Netflix culture document create a framework in which everyone can act as a leader, because we expect initiative, direct and candid feedback, and transparency in everything we do. This leads to a great environment where I am constantly challenged, learning, and receiving constructive feedback on how I can do better!

[Chris] I think a big part of our jobs is continuously thinking about how data can benefit our stakeholders. Julie and I will never know as much about video and audio compression algorithms as our talented Encoding Technologies team, but we should be the ones most familiar with the data: How to access, analyze, and visualize it; how to transform it into metrics that act as strong and accurate proxies for a member’s experience; and how to guide others to draw the right conclusions from data so they can act on it. Writing memos is a big part of Netflix culture, which I’ve found has been helpful for sharing ideas, soliciting feedback, and documenting project details. So writing well, especially the ability to translate technical concepts for a non-technical audience, is also very useful.

5. What piece of advice would you pass along to those just starting out their career in data?

[Julie] One piece of advice I would pass along (and wish I could give to my younger self) is not to stress and try to plan every step of your data science career. Your career is long (and unpredictable!), so as long as you work hard and stay motivated, it will move in an exciting direction.

[Chris] Everyone wants to build fancy models or tools, but fewer are willing to do the foundational things like cleaning the data and writing the documentation. I’ve found that volunteering and being proactive (no matter the task) has been an effective way of building trust with others, and it opened my career up to many more opportunities early on.

If this post resonates with you and you’d like to explore opportunities with Netflix, check out our analytics site, search open roles, and learn about our culture. You can also find more stories like this here.


How Our Paths Brought Us to Data and Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Analytics at Netflix: Who we are and what we do

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/analytics-at-netflix-who-we-are-and-what-we-do-7d9c08fe6965

Analytics at Netflix: Who We Are and What We Do

An Introduction to Analytics and Visualization Engineering at Netflix

by Molly Jackman & Meghana Reddy

Explained: Season 1 (Photo Credit: Netflix)

Across nearly every industry, there is recognition that data analytics is key to driving informed business decision-making. But there is far less agreement on what that term “data analytics” actually means — or what to call the people responsible for the work.

Even within Netflix, we have many groups that do some form of data analysis, including business strategy and consumer insights. But here we are talking about Netflix’s Data Science and Engineering group, which specializes in analytics at scale. The group has technical, engineering-oriented roles that fall under two broad category titles: “Analytics Engineers” and “Visualization Engineers.” In this post, we refer to these two titles collectively as the “analytics role.” These professionals come from a wide range of backgrounds and bring different skills to their work, while sharing a common drive to generate and scale business impact through data.


What’s the purpose of the analytics role at Netflix?

When you think about data at Netflix, what comes to mind? Oftentimes it is our content recommendation algorithm or the online delivery of video to your device at home. Both are integral parts of the business, but far from the whole picture. Data is used to inform a wide range of questions — ‘How can we make the product experience even better?’, ‘Which shows and films bring the most joy to our members?’, ‘Who can we partner with to expand access to our service in new markets?’. Our Analytics and Visualization Engineers are taking on these and other big questions for the company, informing decision-making across every corner of the business.

We align our analytic teams with business area verticals

Since the problem space is so varied, we align our analytics professionals with the listed business area verticals rather than organizing them within a single functional horizontal. The expectation is that individuals in these roles possess deep business context and are thought leaders alongside their business counterparts. This enables them to fully understand where their partners are coming from. It also means Analytics and Visualization Engineers are a specialized resource and a rare commodity. There are many more questions and stakeholders than analytics team members, and the job is not to take on every request. Instead, these individual contributors are given freedom to choose their projects and are responsible for prioritizing the ones that will have the most business impact (and deprioritizing the rest). This requires a lot of judgment and embodies our “context not control” culture.

“OK, but what do they actually do…?”

What does the job entail?

You’ve probably caught on to some common themes: People in the analytics role are highly connected to the business, solve end-to-end problems, and are directly responsible for improving business outcomes. But what makes this group really shine are their differences. They come from lots of backgrounds, which yields different perspectives on how to approach problems. We use the catch-all titles of Analytics and Visualization Engineers so as to not get too hung up on specific credentials. Instead, people are empowered to leverage their unique skills to make Netflix better.

A couple other defining characteristics of the role are full ownership of the problem (in Netflix lingo, you are the “informed captain” of your space) and creating trustworthy outputs. These are only possible through the one-two punch of deep business context 👊 and technical excellence 👊. Full ownership often means building new data pipelines, navigating complex schemas and large data sets, developing or improving metrics for business performance, and creating intuitive visualizations and dashboards — always with an eye towards actionable insights.


Because these professionals vary in their expertise, so too does their day-to-day. Below are three broadly defined personas to help illustrate some of the different backgrounds, motivations, and activities of individuals in the analytics role at Netflix. Many of our colleagues have come in with expertise that spans multiple personas. Others have grown into new areas as part of their professional development at Netflix. Ultimately, these skills are all on a continuum, some broad and some deep, and these are just a few examples of such expertise. So if you find yourself connecting with any part of these descriptions, the analytics role could be for you.

  • The Analyst is motivated by delivering metrics, findings, or dashboards that drive analytical insights and business decisions. They love to communicate their discoveries to nontechnical audiences, explain caveats, and debate analytic choices and strategic implications with peers and stakeholders. Their expertise is descriptive analytic methodology, but they have the necessary tools to be scrappy (e.g. coding, math, stats), and do what’s required to answer the highest priority business questions.
  • The Engineer enjoys making data available by piping it in from new sources in optimal ways, building robust data models, prototyping systems, and doing project-specific engineering. They’re still analysts at heart but, similar to data engineers, they have a deep understanding of data warehouse capabilities and are pros at data processing optimization and performance tuning. Being at this intersection of disciplines allows them to produce full-stack outputs, layering visualizations and analytics on their projects.
  • The Visualizer is passionate about the scalability, beauty, and functionality of dashboards and their capability for telling a visual story. They also have an eye for principled engineering, i.e. managing the data under the surface. They want to pick the perfect chart type for the narrative while also focusing on delivering key analytic insights. They may use industry tools (e.g. Tableau, Looker, Power BI) to their fullest extent, developing a deeper understanding of analytics by examining these tools under the hood. Or they may create sophisticated visuals from scratch and build the type of custom UI that enterprise tools don’t offer (e.g. JavaScript web apps).

Introducing Analytics at Netflix

Whether you’re a data professional, student, or Netflix enthusiast, we invite you to meet our stunning colleagues and hear their stories. If this series resonates with you and you’d like to explore opportunities with us, check out our analytics site, search open roles, and learn about our culture.

Welcome to Analytics at Netflix!


Analytics at Netflix: Who we are and what we do was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Edgar: Solving Mysteries Faster with Observability

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f

Edgar helps Netflix teams troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

by Elizabeth Carretto

Everyone loves Unsolved Mysteries. There’s always someone who seems like the surefire culprit. There’s a clear motive, the perfect opportunity, and an incriminating footprint left behind. Yet, this is Unsolved Mysteries! It’s never that simple. Whether it’s a cryptic note behind the TV or a mysterious phone call from an unknown number at a critical moment, the pieces rarely fit together perfectly. As mystery lovers, we want to answer the age-old question of whodunit; we want to understand what really happened.

For engineers, instead of whodunit, the question is often “what failed and why?” When a problem occurs, we put on our detective hats and start our mystery-solving process by gathering evidence. The more complex a system, the more places to look for clues. An engineer can find herself digging through logs, poring over traces, and staring at dozens of dashboards.

All of these sources make it challenging to know where to begin and add to the time spent figuring out what went wrong. While this abundance of dashboards and information is by no means unique to Netflix, it certainly holds true within our microservices architecture. Each microservice may be easy to understand and debug individually, but what about when combined into a request that hits tens or hundreds of microservices? Searching for key evidence becomes like digging for a needle in a group of haystacks.

Example call graph in Edgar

In some cases, the question we’re answering is, “What’s happening right now??” and every second without resolution can carry a heavy cost. We want to resolve the problem as quickly as possible so our members can resume enjoying their favorite movies and shows. For teams building observability tools, the question is: how do we make understanding a system’s behavior fast and digestible? Quick to parse, and easy to pinpoint where something went wrong even if you aren’t deeply familiar with the inner workings and intricacies of that system? At Netflix, we’ve answered that question with a suite of observability tools. In an earlier blog post, we discussed Telltale, our health monitoring system. Telltale tells us when an application is unhealthy, but sometimes we need more fine-grained insight. We need to know why a specific request is failing and where. We built Edgar to ease this burden, by empowering our users to troubleshoot distributed systems efficiently with the help of a summarized presentation of request tracing, logs, analysis, and metadata.

What is Edgar?

Edgar is a self-service tool for troubleshooting distributed systems, built on a foundation of request tracing, with additional context layered on top. With request tracing and additional data from logs, events, metadata, and analysis, Edgar is able to show the flow of a request through our distributed system — what services were hit by a call, what information was passed from one service to the next, what happened inside that service, how long did it take, and what status was emitted — and highlight where an issue may have occurred. If you’re familiar with platforms like Zipkin or OpenTelemetry, this likely sounds familiar. But, there are a few substantial differences in how Edgar approaches its data and its users.

  • While Edgar is built on top of request tracing, it also uses the traces as the thread to tie additional context together. Deriving meaningful value from trace data alone can be challenging, as Cindy Sridharan articulated in this blog post. In addition to trace data, Edgar pulls in additional context from logs, events, and metadata, sifting through them to determine valuable and relevant information, so that Edgar can visually highlight where an error occurred and provide detailed context.
  • Edgar captures 100% of interesting traces, as opposed to sampling a small fixed percentage of traffic. This difference has substantial technological implications, from classifying what’s interesting, to transporting it, to storing it cost-effectively (keep an eye out for later Netflix Tech Blog posts addressing these topics).
  • Edgar provides a powerful and consumable user experience to both engineers and non-engineers alike. If you embrace the cost and complexity of storing vast amounts of traces, you want to get the most value out of that cost. With Edgar, we’ve found that we can leverage that value by curating an experience for additional teams such as customer service operations, and we have embraced the challenge of building a product that makes trace data easy to access, easy to grok, and easy to gain insight from, across several user personas.

Tracing as a foundation

Logs, metrics, and traces are the three pillars of observability. Metrics communicate what’s happening on a macro scale, traces illustrate the ecosystem of an isolated request, and the logs provide a detail-rich snapshot into what happened within a service. These pillars have immense value and it is no surprise that the industry has invested heavily in building impressive dashboards and tooling around each. The downside is that we have so many dashboards. In one request hitting just ten services, there might be ten different analytics dashboards and ten different log stores. However, a request has its own unique trace identifier, which is a common thread tying all the pieces of this request together. The trace ID is typically generated at the first service that receives the request and then passed along from service to service as a header value. This makes the trace a great starting point to unify this data in a centralized location.
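
As a rough sketch of that propagation, the snippet below shows a service reusing an incoming trace ID or minting a new one, then forwarding it on an outbound call. The header name and helper functions are illustrative only, not the exact convention Netflix uses.

```python
import uuid

import requests  # any HTTP client would do; requests keeps the sketch short

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name

def handle_incoming_request(headers: dict) -> str:
    """Reuse the caller's trace ID if one arrived; otherwise start a new trace."""
    return headers.get(TRACE_HEADER) or str(uuid.uuid4())

def call_downstream(url: str, trace_id: str) -> requests.Response:
    """Forward the same trace ID so every hop shares one thread of context."""
    return requests.get(url, headers={TRACE_HEADER: trace_id})

# An edge service receives a request with no trace ID, mints one, and passes it along.
trace_id = handle_incoming_request({})
# call_downstream("https://some-internal-service/items", trace_id)  # same ID travels on
```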

A trace is a set of segments representing each step of a single request throughout a system. Distributed tracing is the process of generating, transporting, storing, and retrieving traces in a distributed system. As a request flows between services, each distinct unit of work is documented as a span. A trace is made up of many spans, which are grouped together using a trace ID to form a single, end-to-end umbrella. A span:

  • Represents a unit of work, such as a network call from one service to another (a client/server relationship) or a purely internal action (e.g., starting and finishing a method).
  • Relates to other spans through a parent/child relationship.
  • Contains a set of key-value pairs called tags, where service owners can attach helpful values such as URLs, version numbers, regions, corresponding IDs, and errors. Tags can be associated with errors or warnings, which Edgar can display visually on a graph representation of the request.
  • Has a start time and an end time. Thanks to these timestamps, a user can quickly see how long the operation took.

The trace (along with its underlying spans) allows us to graphically represent the request chronologically.

Sample timeline view of a trace, based on Jaeger UI’s timeline view
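
As a rough illustration of the span structure described above (not Edgar’s internal schema), a span could be modeled like this:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Span:
    trace_id: str                      # shared by every span in the request
    span_id: str                       # unique to this unit of work
    parent_id: Optional[str]           # None for the root span
    service: str
    operation: str
    start_ms: float
    end_ms: float
    tags: Dict[str, str] = field(default_factory=dict)  # URLs, versions, errors, ...

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

root = Span("trace-123", "span-1", None, "api-gateway", "GET /play", 0.0, 480.0)
child = Span("trace-123", "span-2", "span-1", "playback-service", "resolve manifest",
             25.0, 310.0, tags={"error": "true", "region": "us-east-1"})
# Grouping spans by trace_id and sorting by start_ms yields the chronological view.
```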

Adding context to traces

With distributed tracing alone, Edgar is able to draw the path of a request as it flows through various systems. This centralized view is extremely helpful to determine which services were hit and when, but it lacks nuance. A tag might indicate there was an error but doesn’t fully answer the question of what happened. Adding logs to the picture can help a great deal. With logs, a user can see what the service itself had to say about what went wrong. If a data fetcher fails, the log can tell you what query it was running and what exact IDs or fields led to the failure. That alone might give an engineer the knowledge she needs to reproduce the issue. In Edgar, we parse the logs looking for error or warning values. We add these errors and warnings to our UI, highlighting them in our call graph and clearly associating them with a given service, to make it easy for users to view any errors we uncovered.

Example view of errors associated with a service, including an error parsed from a log
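
A simplified sketch of that kind of log sifting is shown below. It assumes structured log records with a level field and a trace ID; Edgar’s real parsers are configurable per service, as described later in this post.

```python
from collections import defaultdict

def collect_issues(log_records, trace_id):
    """Group error/warning log lines by service for one trace (illustrative only)."""
    issues = defaultdict(list)
    for record in log_records:
        if record.get("trace_id") != trace_id:
            continue
        if record.get("level", "").upper() in ("ERROR", "WARN", "WARNING"):
            issues[record["service"]].append(record["message"])
    return issues

logs = [
    {"trace_id": "trace-123", "service": "data-fetcher", "level": "ERROR",
     "message": "query failed for id=42: field 'runtime' missing"},
    {"trace_id": "trace-123", "service": "api-gateway", "level": "INFO",
     "message": "request completed"},
]
print(collect_issues(logs, "trace-123"))
# {'data-fetcher': ["query failed for id=42: field 'runtime' missing"]}
```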

With the trace and additional context from logs illustrating the issue, one of the next questions may be how this individual trace fits into the overall health and behavior of each service. Is this an anomaly, or are we dealing with a pattern? To help answer this question, Edgar pulls in anomaly detection from a partner application, Telltale. Telltale provides Edgar with latency benchmarks that indicate whether the individual trace’s latency is abnormal for the given service. A trace alone could tell you that a service took 500ms to respond, but it takes in-depth knowledge of a particular service’s typical behavior to determine whether this response time is an outlier. Telltale’s anomaly analysis looks at historic behavior and can evaluate whether the latency experienced by this trace is anomalous. With this knowledge, Edgar can then visually warn that something happened in a service that caused its latency to fall outside of normal bounds.

Sample latency analysis
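
The check itself can be as simple as comparing a trace’s latency against a per-service benchmark. The threshold rule below is only a stand-in for Telltale’s much richer historical analysis; the function name and tolerance are made up for illustration.

```python
def is_latency_anomalous(observed_ms: float, baseline_p99_ms: float,
                         tolerance: float = 1.1) -> bool:
    """Flag the trace if it exceeds the service's historical p99 by a margin.
    A placeholder rule, not Telltale's actual analysis."""
    return observed_ms > baseline_p99_ms * tolerance

# A 500 ms response is only an outlier if the service normally answers much faster.
print(is_latency_anomalous(500, baseline_p99_ms=120))   # True: worth highlighting
print(is_latency_anomalous(500, baseline_p99_ms=600))   # False: normal for this service
```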

Edgar should reduce burden, not add to it

Presenting all of this data in one interface reduces the footwork of an engineer to uncover each source. However, discovery is only part of the path to resolution. With all the evidence presented and summarized by Edgar, an engineer may know what went wrong and where it went wrong. This is a huge step towards resolution, but not yet cause for celebration. The root cause may have been identified, but who owns the service in question? Many times, finding the right point of contact would require a jump into Slack or a company directory, which costs more time. In Edgar, we have integrated with our services to provide that information in-app alongside the details of a trace. For any service configured with an owner and support channel, Edgar provides a link to a service’s contact email and their Slack channel, smoothing the hand-off from one party to the next. If an engineer does need to pass an issue along to another team or person, Edgar’s request detail page contains all the context — the trace, logs, analysis — and is easily shareable, eliminating the need to write a detailed description or provide a cascade of links to communicate the issue.

Edgar’s request detail page

A key aspect of Edgar’s mission is to minimize the burden on both users and service owners. With all of its data sources, the sheer quantity of data could become overwhelming. It is essential for Edgar to maintain a prioritized interface, built to highlight errors and abnormalities to the user and assist users in taking the next step towards resolution. As our UI grows, it’s important to be discerning and judicious in how we handle new data sources, weaving them into our existing errors and warnings models to minimize disruption and to facilitate speedy understanding. We lean heavily on focus groups and user feedback to ensure a tight feedback loop so that Edgar can continue to meet our users’ needs as their services and use cases evolve.

As services evolve, they might change their log format or use new tags to indicate errors. We built an admin page to give our service owners that configurability and to decouple our product from in-depth service knowledge. Service owners can configure the essential details of their log stores, such as where their logs are located and what fields they use for trace IDs and span IDs. Knowing their trace and span IDs is what enables Edgar to correlate the traces and logs. Beyond that though, what are the idiosyncrasies of their logs? Some fields may be irrelevant or deprecated, and teams would like to hide them by default. Alternatively, some fields contain the most important information, and by promoting them in the Edgar UI, they are able to view these fields more quickly. This self-service configuration helps reduce the burden on service owners.

Initial log configuration in Edgar
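
Concretely, the kind of per-service configuration described above might look roughly like the following. The field names and structure are hypothetical, not Edgar’s actual schema.

```python
# Hypothetical per-service log configuration for illustration purposes.
log_config = {
    "service": "playback-service",
    "log_store": {"type": "elasticsearch", "index": "playback-logs-*"},
    "field_mapping": {
        "trace_id": "tid",        # which log field carries the trace ID
        "span_id": "sid",         # which log field carries the span ID
        "level": "severity",      # which field Edgar scans for errors/warnings
    },
    "hidden_fields": ["host_internal_ip", "legacy_session_token"],  # noise to collapse
    "promoted_fields": ["title_id", "device_type"],                 # surface these first
}
```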

Leveraging Edgar

In order for users to turn to Edgar in a situation when time is of the essence, users need to be able to trust Edgar. In particular, they need to be able to count on Edgar having data about their issue. Many approaches to distributed tracing involve setting a sample rate, such as 5%, and then only tracing that percentage of request traffic. Instead of sampling a fixed percentage, Edgar’s mission is to capture 100% of interesting requests. As a result, when an error happens, Edgar’s users can be confident they will be able to find it. That’s key to positioning Edgar as a reliable source. Edgar’s approach makes a commitment to have data about a given issue.

In addition to storing trace data for all requests, Edgar implemented a feature to collect additional details on-demand at a user’s discretion for a given criteria. With this fine-grained level of tracing turned on, Edgar captures request and response payloads as well as headers for requests matching the user’s criteria. This adds clarity to exactly what data is being passed from service to service through a request’s path. While this level of granularity is unsustainable for all request traffic, it is a robust tool in targeted use cases, especially for errors that prove challenging to reproduce.
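
Conceptually, the on-demand capture is just a predicate evaluated per request: if the request matches the user-specified criteria, the payloads and headers are attached to the stored trace. The criteria fields below are invented for illustration, not Edgar’s rule engine.

```python
def should_capture_payloads(request_meta: dict, criteria: dict) -> bool:
    """Capture request/response payloads only when every user-specified
    criterion matches this request (a sketch, not Edgar's implementation)."""
    return all(request_meta.get(key) == value for key, value in criteria.items())

criteria = {"path": "/play", "device_type": "smart-tv", "app_version": "9.4.1"}
request_meta = {"path": "/play", "device_type": "smart-tv", "app_version": "9.4.1",
                "member_region": "EU"}
if should_capture_payloads(request_meta, criteria):
    pass  # attach headers and payloads to the stored trace for this request
```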

As you can imagine, this comes with very real storage costs. While the Edgar team has done its best to manage these costs effectively and to optimize our storage, the cost is not insignificant. One way to strengthen our return on investment is by being a key tool throughout the software development lifecycle. Edgar is a crucial tool for operating and maintaining a production service, where reducing the time to recovery has direct customer impact. Engineers also rely on our tool throughout development and testing, and they use the Edgar request page to communicate issues across teams.

By providing our tool to multiple sets of users, we are able to leverage our cost more efficiently. Edgar has become not just a tool for engineers, but rather a tool for anyone who needs to troubleshoot a service at Netflix. In Edgar’s early days, as we strove to build valuable abstractions on top of trace data, the Edgar team first targeted streaming video use cases. We built a curated experience for streaming video, grouping requests into playback sessions, marked by starting and stopping playback for a given asset. We found this experience was powerful for customer service operations as well as engineering teams. Our team listened to customer service operations to understand which common issues caused an undue amount of support pain so that we could summarize these issues in our UI. This empowers customer service operations, as well as engineers, to quickly understand member issues with minimal digging. By logically grouping traces and summarizing the behavior at a higher level, trace data becomes extremely useful in answering questions like why a member didn’t receive 4k video for a certain title or why a member couldn’t watch certain content.

An example error viewing a playback session in Edgar

Extending Edgar for Studio

As the studio side of Netflix grew, we realized that our movie and show production support would benefit from a similar aggregation of user activity. Our movie and show production support might need to answer why someone from the production crew can’t log in or access their materials for a particular project. As we worked to serve this new user group, we sought to understand what issues our production support needed to answer most frequently and then tied together various data sources to answer those questions in Edgar.

The Edgar team built out an experience to meet this need, building another abstraction with trace data; this time, the focus was on troubleshooting production-related use cases and applications, rather than a streaming video session. Edgar provides our production support the ability to search for a given contractor, vendor, or member of production staff by their name or email. After finding the individual, Edgar reaches into numerous log stores for their user ID, and then pulls together their login history, role access change log, and recent traces emitted from production-related applications. Edgar scans through this data for errors and warnings and then presents those errors right at the front. Perhaps a vendor tried to log in with the wrong password too many times, or they were assigned an incorrect role on a production. In this new domain, Edgar is solving the same multi-dashboarded problem by tying together information and pointing its users to the next step of resolution.

An example error for a production-related user

What Edgar is and is not

Edgar’s goal is not to be the be-all, end-all of tools or to be the One Tool to Rule Them All. Rather, our goal is to act as a concierge of troubleshooting — Edgar should quickly be able to guide users to an understanding of an issue, as well as usher them to the next location, where they can remedy the problem. Let’s say a production vendor is unable to access materials for their production due to an incorrect role/permissions assignment, and this production vendor reaches out to support for assistance troubleshooting. When a support user searches for this vendor, Edgar should be able to indicate that this vendor recently had a role change and summarize what this role change is. Instead of being assigned to Dead To Me Season 2, they were assigned to Season 1! In this case, Edgar’s goal is to help a support user come to this conclusion and direct them quickly to the role management tool where this can be rectified, not to own the full circle of resolution.

Usage at Netflix

While Edgar was created around Netflix’s core streaming video use case, it has since evolved to cover a wide array of applications. While Netflix streaming video is used by millions of members, some applications using Edgar may measure their volume in requests per minute, rather than requests per second, and may only have tens or hundreds of users rather than millions. While we started with a curated approach to solve a pain point for engineers and support working on streaming video, we found that this pain point is scale agnostic. Getting to the bottom of a problem is costly for all engineers, whether they are building a budget forecasting application used heavily by 30 people or an SVOD application used by millions.

Today, many applications and services at Netflix, covering a wide array of type and scale, publish trace data that is accessible in Edgar, and teams ranging from service owners to customer service operations rely on Edgar’s insights. From streaming to studio, Edgar leverages its wealth of knowledge to speed up troubleshooting across applications with the same fundamental approach of summarizing request tracing, logs, analysis, and metadata.

As you settle into your couch to watch a new episode of Unsolved Mysteries, you may still find yourself with more questions than answers. Why did the victim leave his house so abruptly? How did the suspect disappear into thin air? Hang on, how many people saw that UFO?? Unfortunately, Edgar can’t help you there (trust me, we’re disappointed too). But, if your relaxing evening is interrupted by a production outage, Edgar will be behind the scenes, helping Netflix engineers solve the mystery at hand.

Keeping services up and running allows Netflix to share stories with our members around the globe. Underneath every outage and failure, there is a story to tell, and powerful observability tooling is needed to tell it. If you are passionate about observability then come talk to us.


Edgar: Solving Mysteries Faster with Observability was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Optimized shot-based encodes for 4K: Now streaming!

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/optimized-shot-based-encodes-for-4k-now-streaming-47b516b10bbb

by Aditya Mavlankar, Liwei Guo, Anush Moorthy and Anne Aaron

Netflix has an ever-expanding collection of titles which customers can enjoy in 4K resolution with a suitable device and subscription plan. Netflix creates premium bitstreams for those titles in addition to the catalog-wide 8-bit stream profiles¹. Premium features comprise a title-dependent combination of 10-bit bit-depth, 4K resolution, high frame rate (HFR) and high dynamic range (HDR) and pave the way for an extraordinary viewing experience.

The premium bitstreams, launched several years ago, were rolled out with a fixed-bitrate ladder, with fixed 4K resolution bitrates — 8, 10, 12 and 16 Mbps — regardless of content characteristics. Since then, we’ve developed algorithms such as per-title encode optimizations and per-shot dynamic optimization, but these innovations were not back-ported to these premium bitstreams. Moreover, the encoding group of pictures (GoP) duration (or keyframe period) was constant throughout the stream, causing additional inefficiency because shot boundaries did not align with GoP boundaries.

As the number of 4K titles in our catalog continues to grow and more devices support the premium features, we expect these video streams to have an increasing impact on our members and the network. We’ve worked hard over the last year to leapfrog to our most advanced encoding innovations — shot-optimized encoding and the 4K VMAF model — and applied those to the premium bitstreams. More specifically, we’ve improved the traditional 4K and 10-bit ladder by employing these innovations.

In this blog post, we present benefits of applying the above-mentioned optimizations to standard dynamic range (SDR) 10-bit and 4K streams (some titles are also HFR). As for HDR, our team is currently developing an HDR extension to VMAF, Netflix’s video quality metric, which will then be used to optimize the HDR streams.

¹ The 8-bit stream profiles go up to 1080p resolution.

Bitrate versus quality comparison

For a sample of titles from the 4K collection, the following plots show the rate-quality comparison of the fixed-bitrate ladder and the optimized ladder. The plots have been arranged in decreasing order of the new highest bitrate — which is now content adaptive and commensurate with the overall complexity of the respective title.

Fig. 1: Example of a thriller-drama episode showing new highest bitrate of 11.8 Mbps
Fig. 2: Example of a sitcom episode with some action showing new highest bitrate of 8.5 Mbps
Fig. 3: Example of a sitcom episode with less action showing new highest bitrate of 6.6 Mbps
Fig. 4: Example of a 4K animation episode showing new highest bitrate of 1.8 Mbps

The bitrate as well as quality shown for any point is the average for the corresponding stream, computed over the duration of the title. The annotation next to the point is the corresponding encoding resolution; it should be noted that video received by the client device is decoded and scaled to the device’s display resolution. As for VMAF score computation, for encoding resolutions less than 4K, we follow the VMAF best practice to upscale to 4K assuming bicubic upsampling. Aside from the encoding resolution, each point is also associated with an appropriate pixel aspect ratio (PAR) to achieve a target 16:9 display aspect ratio (DAR). For example, the 640×480 encoding resolution is paired with a 4:3 PAR to achieve 16:9 DAR, consistent with the DAR for other points on the ladder.
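
The display aspect ratio works out as the frame aspect ratio multiplied by the PAR. A quick check of the 640×480 example, using exact fractions (the helper below is just for illustration):

```python
from fractions import Fraction

def display_aspect_ratio(width: int, height: int, par: Fraction) -> Fraction:
    """DAR = (width / height) * PAR."""
    return Fraction(width, height) * par

print(display_aspect_ratio(640, 480, Fraction(4, 3)))    # 16/9, matching the ladder's DAR
print(display_aspect_ratio(3840, 2160, Fraction(1, 1)))  # 16/9 with square pixels at 4K
```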

The last example, showing the new highest bitrate to be 1.8 Mbps, is for a 4K animation title episode which can be very efficiently encoded. It serves as an extreme example of content-adaptive ladder optimization; it should not, however, be interpreted as meaning that all animation titles land at similarly low bitrates.

The resolutions and bitrates for the fixed-bitrate ladder are pre-determined; minor deviation in the achieved bitrate is due to rate control in the encoder implementation not hitting the target bitrate precisely. On the other hand, each point on the optimized ladder is associated with optimal bit allocation across all shots with the goal of maximizing a video quality objective function while resulting in the corresponding average bitrate. Consequently, for the optimized encodes, the bitrate varies shot to shot depending on relative complexity and overall bit budget and in theory can reach the respective codec level maximum. Various points are constrained to different codec levels, so receivers with different decoder level capabilities can stream the corresponding subset of points up to the corresponding level.
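
One common way to reason about that per-shot allocation is a Lagrangian trade-off: for a given multiplier, each shot independently picks the candidate encode that best balances quality against bits, and sweeping the multiplier traces out ladder points at different average bitrates. The numbers and the objective below are purely illustrative, not the production algorithm.

```python
def allocate_bits(shots, lam):
    """For each shot, pick the candidate encode maximizing quality - lam * bits.
    Larger lam favors cheaper encodes; smaller lam favors higher quality."""
    chosen = [max(candidates, key=lambda c: c["vmaf"] - lam * c["mbits"])
              for candidates in shots]
    avg_bitrate = sum(c["mbits"] for c in chosen) / len(chosen)
    avg_vmaf = sum(c["vmaf"] for c in chosen) / len(chosen)
    return avg_bitrate, avg_vmaf

# Two shots with different complexity: the harder shot needs more bits for the same VMAF.
shots = [
    [{"mbits": 2.0, "vmaf": 80}, {"mbits": 4.0, "vmaf": 90}, {"mbits": 8.0, "vmaf": 95}],
    [{"mbits": 1.0, "vmaf": 88}, {"mbits": 2.0, "vmaf": 94}, {"mbits": 4.0, "vmaf": 97}],
]
for lam in (5.0, 2.0, 0.5):   # sweep the rate-quality trade-off
    print(lam, allocate_bits(shots, lam))
```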

The fixed-bitrate ladder often appears like steps — since it is not title adaptive, it switches “late” to most encoding resolutions and, as a result, the quality stays flat within that resolution even with increasing bitrate. For example, the ladder may contain two 1080p points with an identical VMAF score, or four 4K points with an identical VMAF score, which wastes bits and increases the storage footprint.

On the other hand, the optimized ladder appears closer to a monotonically increasing curve — increasing bitrate results in an increasing VMAF score. As a side note, we do have some additional points, not shown in the plots, that are used in resolution limited scenarios — such as a streaming session limited to 720p or 1080p highest encoding resolution. Such points lie under (or to the right of) the convex hull main ladder curve but allow quality to ramp up in resolution limited scenarios.

Challenging-to-encode content

For the optimized ladders we have logic to detect quality saturation at the high end, meaning an increase in bitrate not resulting in material improvement in quality. Once such a bitrate is reached it is a good candidate for the topmost rung of the ladder. An additional limit can be imposed as a safeguard to avoid excessively high bitrates.

Sometimes we ingest a title that would need more bits at the highest end of the quality spectrum — even higher than the 16 Mbps limit of the fixed-bitrate ladder. For example,

  • a rock concert with fast-changing lighting effects and other details or
  • a wildlife documentary with fast action and/or challenging spatial details.

This scenario is generally rare. Nevertheless, the plot below highlights such a case, where the optimized ladder exceeds the fixed-bitrate ladder in terms of the highest bitrate, thereby achieving an improvement in the highest quality.

As expected, the quality is higher for the same bitrate, even when compared in the low or medium bitrate regions.

Fig. 5: Example of a movie with action and great amount of rich spatial details showing new highest bitrate of 17.2 Mbps

Visual examples

As an example, we compare the 1.75 Mbps encode from the fixed-bitrate ladder with the 1.45 Mbps encode from the optimized ladder for one of the titles from our 4K collection. Since 4K resolution entails a rather large number of pixels, we show 1024×512 pixel cutouts from the two encodes. The encodes are decoded and scaled to a 4K canvas prior to extracting the cutouts. We toggle between the cutouts so it is convenient to spot differences. We also show the corresponding full frame which helps to get a sense of how the cutout fits in the corresponding video frame.

Fig. 6: Pristine full frame — the purpose is to give a sense of how below cutouts fit in the frame
Fig. 7: Toggling between 1024×512 pixel cutouts from two encodes as annotated. Corresponding to pristine frame shown in Figure 6.
Fig. 8: Pristine full frame — the purpose is to give a sense of how below cutouts fit in the frame
Fig. 9: Toggling between 1024×512 pixel cutouts from two encodes as annotated. Corresponding to pristine frame shown in Figure 8.
Fig. 10: Pristine full frame — the purpose is to give a sense of how below cutouts fit in the frame
Fig. 11: Toggling between 1024×512 pixel cutouts from two encodes as annotated. Corresponding to pristine frame shown in Figure 10.
Fig. 12: Pristine full frame — the purpose is to give a sense of how below cutouts fit in the frame
Fig. 13: Toggling between 1024×512 pixel cutouts from two encodes as annotated. Corresponding to pristine frame shown in Figure 12.
Fig. 14: Pristine full frame — the purpose is to give a sense of how below cutouts fit in the frame
Fig. 15: Toggling between 1024×512 pixel cutouts from two encodes as annotated. Corresponding to pristine frame shown in Figure 14.

As can be seen, the encode from the optimized ladder delivers crisper textures and higher detail for less bits. At 1.45 Mbps it is by no means a perfect 4K rendition, but still very commendable for that bitrate. There exist higher bitrate points on the optimized ladder that deliver impeccable 4K quality, also for less bits compared to the fixed-bitrate ladder.

Compression and bitrate ladder improvements

Even before testing the new streams in the field, we observe the following advantages of the optimized ladders vs the fixed ladders, evaluated over 100 sample titles:

  • Computing the Bjøntegaard Delta (BD) rate shows 50% gains on average over the fixed-bitrate ladder. In other words, on average we need 50% less bitrate to achieve the same quality with the optimized ladder (a sketch of how a BD-rate figure is computed appears after this list).
  • The highest 4K bitrate on average is 8 Mbps which is also a 50% reduction compared to 16 Mbps of the fixed-bitrate ladder.
  • As mobile devices continue to improve, they adopt premium features (other than 4K resolution) like 10-bit and HFR. These video encodes can be delivered to mobile devices as well. The fixed-bitrate ladder starts at 560 kbps which may be too high for some cellular networks. The optimized ladder, on the other hand, has lower bitrate points that are viable in most cellular scenarios.
  • The optimized ladder entails a smaller storage footprint compared to the fixed-bitrate ladder.
  • The new ladder considers adding 1440p resolution (aka QHD) points if they lie on the convex hull of the rate-quality tradeoff, and most titles seem to get the 1440p treatment. As a result, when averaged over 100 titles, the bitrate required to jump to a resolution higher than 1080p (meaning either QHD or 4K) is 1.7 Mbps, compared to 8 Mbps for the fixed-bitrate ladder. When averaged over 100 titles, the bitrate required to jump to 4K resolution is 3.2 Mbps, compared to 8 Mbps for the fixed-bitrate ladder.
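
For reference, here is a rough sketch of how a BD-rate number like the one above can be computed from two rate-quality curves. The sample points are made up, and the cubic-fit formulation is the classic textbook version rather than necessarily the exact tooling we use.

```python
import numpy as np

def bd_rate(anchor_rq, test_rq):
    """Approximate Bjøntegaard Delta rate: average % bitrate change of `test`
    versus `anchor` at equal quality. Each input is a list of (kbps, quality) points."""
    def fit(points):
        rates = np.log10([r for r, _ in points])
        quals = np.array([q for _, q in points])
        return np.polyfit(quals, rates, 3), quals.min(), quals.max()

    p_a, lo_a, hi_a = fit(anchor_rq)
    p_t, lo_t, hi_t = fit(test_rq)
    lo, hi = max(lo_a, lo_t), min(hi_a, hi_t)           # overlapping quality range
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)              # mean log10 bitrate difference
    return (10 ** avg_diff - 1) * 100                   # negative means bitrate savings

fixed = [(2000, 70), (4000, 80), (8000, 88), (16000, 93)]      # illustrative points only
optimized = [(1000, 70), (2000, 81), (4000, 89), (8000, 94)]
print(f"BD-rate: {bd_rate(fixed, optimized):.1f}%")            # roughly -50% for this toy data
```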

Benefits to members

At Netflix we perform A/B testing of encoding optimizations to detect any playback issues on client devices as well as gauge the benefits experienced by our members. One set of streaming sessions receives the default encodes and the other set of streaming sessions receives the new encodes. This in turn allows us to compare error rates as well as various metrics related to quality of experience (QoE). Although our streams are standard compliant, the A/B testing can and does sometimes find device-side implementations with minor gaps; in such cases we work with our device partners to find the best remedy.

Overall, while A/B testing these new encodes, we have seen the following benefits, which are in line with the offline evaluation covered in the previous section:

  • For members with high-bandwidth connections we deliver the same great quality at half the bitrate on average.
  • For members with constrained bandwidth we deliver higher quality at the same (or even lower) bitrate — higher VMAF at the same encoding resolution and bitrate or even higher resolutions than they could stream before. For example, members who were limited by their network to 720p can now be served 1080p or higher resolution instead.
  • Most streaming sessions start with a higher initial quality.
  • The number of rebuffers per hour goes down by over 65%; members also experience fewer quality drops while streaming.
  • The reduced bitrate together with some Digital Rights Management (DRM) system improvements (not covered in this blog) result in reducing the initial play delay by about 10%.

Next steps

We have started re-encoding the 4K titles in our catalog to generate the optimized streams, and we expect to complete the process in a couple of months. We continue to work on applying similar optimizations to our HDR streams.

Acknowledgements

We thank Lishan Zhu for help rendered during A/B testing.

This is a collective effort on the part of our larger team, known as Encoding Technologies, and various other teams that we have crucial partnerships with.

If you are passionate about video compression research and would like to contribute to this field, we have an open position.


Optimized shot-based encodes for 4K: Now streaming! was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.