When Netflix had just started offering online video content years ago, it didn’t consider piracy to be a major issue.
However, now that the company itself is one of the largest content producers, this outlook has changed drastically.
Like many other rightsholders, Netflix now keeps a close eye on pirate sites and services. The company has its own in-house anti-piracy team and also works with third-party companies, to issue takedown requests.
Increasingly, the streaming service is also teaming with other rightsholders to coordinate its enforcement efforts. Netflix is one of the founding members of the Alliance For Creativity and Entertainment (ACE) and earlier this year it joined the Motion Picture Association (MPA).
This week brings yet another expansion, one that takes it across borders to Italy. Local anti-piracy organization FAPAV just announced that Netflix and the Italian football league Serie A are its latest members.
“The addition of two new members, Lega Serie A and Netflix, will consolidate and further enrich the association’s membership base,” FAPAV announced at its end-of-year gathering.
The anti-piracy group has been very active in Europe over the past several years. A few months ago, it helped to bring down the pirate release group FREE/iNCOMiNG, for example, and it was also involved in the court case that brought down the famous torrent site TNTVillage.
FAPAV has also been a driving force behind local pirate site blockades. In 2019, more than 400 new sites were blocked in Italy, which is more than double compared to the year before.
With help from Netflix, FAPAV will keep its focus on various piracy sources in the coming year.
IPTV piracy will be one of its main priorities in 2020, the group says. In addition to enforcement actions, it will also continue its efforts to educate the public on the downsides of piracy through informational campaigns.
Hack Day at Netflix is an opportunity to build and show off a feature, tool, or quirky app. The goal is simple: experiment with new ideas/technologies, engage with colleagues across different disciplines, and have fun!
We know even the silliest idea can spur something more.
The most important value of our Hack Days is that they support a culture of innovation. We believe in this work, even if it never ships, and enjoy sharing the creativity and thought put into these ideas.
Below, you can find videos made by the hackers of some of our favorite hacks from this event.
Nostalgiflixis a chrome extension that transforms your Netflix web browser into an interactive TV time machine covering three decades (80’s, 90’s, and 00’s.) By dragging the UI slider around, you can view titles originally released within the selected year ( based on their historic box office and episode air dates.) More importantly you can also adjust the video filters in real-time to creatively downgrade the viewing experience, further enhancing the nostalgic effect. We think this feature could encourage our users to watch more of our older content while having fun reliving those moments of cinematic history.
This is a real time visualization of all contacts around the world. Each square on the map represent one of our global contact centers, spanning from Salt Lake City to Brazil, India, and Japan. The heatmap in the background is a historical trend of calls over the last hour, showing which countries are currently most active in contacting customer service. Every line you see is a live customer contact — starting at the customer’s country and ending at the contact center it was routed to. Four different types of contacts are represented in this visualization, white for regular phone calls, light blue for chats, green for calls that are initiated through our mobile apps on android and iOS, and red for contacts which are escalated from one representative to another.
Audio Descriptive tracks provide descriptive narration in addition to dialog, helping visually impaired and blind members enjoy our shows. For the Hack Day project, we explored using recent research¹ to automatically generate descriptions, then used our own internal authoring tools to refine the output. We then used synthetic audio and automated mixing techniques to deliver a final audio description track.
Despite a clear decrease in momentum in the UK in recent times, site-blocking remains a favored anti-piracy tool in many countries around the world.
Companies exploiting the Australian market seem convinced that the practice is good for business, as a brand new blocking application filed at the Federal Court shows. First reported by ComputerWorld, it features a broad coalition of movie, TV show, and anime companies, all of whom have previous blocking experience in Australia.
To keep the ‘feel’ of the application as local as possible, it’s no surprise that Roadshow Films is the lead applicant, despite having just one movie (The Lego Movie) listed in court documents. The remaining 11 include Disney, Paramount, Columbia, Universal, Warner and Netflix, plus Hong Kong-based broadcaster Television Broadcasts Limited and anime distributor Madman Anime.
With the companies involved having trod the blocking injunction path many times before, the application itself now takes a very familiar form. It demands that 50 local ISPs including Telstra, Optus, TPG and Vodafone block a wide range of ‘pirate’ sites. In terms of content, however, this is one of the broadest applications yet.
In Australia legal-speak, pirate sites of all kinds are referred to as “Target Online Locations” (TOL), of which there are 87 (identified by their domains) in the current application.
There are several categories of ‘TOL’ – streaming platforms, download platforms, linking sites (including torrent sites), sites that offer software that allows streaming or downloads, those that provide subtitles for copyright works, plus sites that offer proxy access to pirate sites.
Some notable inclusions are the community-resurrected KickassTorrents site operating from Katcr.co, plus some less than authentic Kickass clones operating from around half a dozen additional URLs.
The same goes for a range of domains trading on the SolarMovie, YIFY and YTS brands, without being connected to the original sites. In fact, many domains listed in the application follow this copycat theme, including those featuring 123movies, Primewire, CouchTuner, Putlocker, WatchFree, ProjectFreeTV, and YesMovies-style wording.
An interesting addition is that of getpopcorntime.is. This isn’t the original Popcorn Time app download site but does offer a variant of the software that can be used to gain access to movies and TV shows. However, the domain itself doesn’t offer any infringing content, or any links to the same.
Subtitle download sites, including TVSubtitles.net and MSubs.net, are included in the application. These types of platforms were previously the topic of debate in a previous application but the court eventually conceded they can indeed be blocked.
In a sign of how far the net is now being cast (most of the major pirate sites are already blocked in Australia), this application also features Russian torrent giant Rutor.info and China-focused btbtdy.me. Both of these sites have plenty of alternative domains so blocking just these two is unlikely to achieve much.
Finally, no blocking application would be complete without an effort to block all the ‘proxy’ sites that have the sole purpose of facilitating access to sites blocked as a result of previous injunctions. The problem in respect of these proxies seems to be considerable, with at least 13 of the 87 domains in this application falling into that category.
The full list of domains requested for blocking is as follows:
Infrastructure for Contextual Bandits and Reinforcement Learning — theme of the ML Platform meetup hosted at Netflix, Los Gatos on Sep 12, 2019.
Contextual and Multi-armed Bandits enable faster and adaptive alternatives to traditional A/B Testing. They enable rapid learning and better decision-making for product rollouts. Broadly speaking, these approaches can be seen as a stepping stone to full-on Reinforcement Learning (RL) with closed-loop, on-policy evaluation and model objectives tied to reward functions. At Netflix, we are running several such experiments. For example, one set of experiments is focussed on personalizing our artwork assets to quickly select and leverage the “winning” images for a title we recommend to our members.
As with other traditional machine learning and deep learning paths, a lot of what the core algorithms can do depends upon the support they get from the surrounding infrastructure and the tooling that the ML platform provides. Given the infrastructure space for RL approaches is still relatively nascent, we wanted to understand what others in the community are doing in this space.
This was the motivation for the meetup’s theme. It featured three relevant talks from LinkedIn, Netflix and Facebook, and a platform architecture overview talk from first time participant Dropbox.
After a brief introduction on the theme and motivation of its choice, the talks were kicked off by Kinjal Basu from LinkedIn who talked about Online Parameter Selection for Web-Based Ranking via Bayesian Optimization. In this talk, Kinjal used the example of the LinkedIn Feed, to demonstrate how they use bandit algorithms to solve for the optimal parameter selection problem efficiently.
He started by laying out some of the challenges around inefficiencies of engineering time when manually optimizing for weights/parameters in their business objective functions. The key insight was that by assuming a latent Gaussian Process (GP) prior on the key business metric actions like viral engagement, job applications, etc., they were able to reframe the problem as a straight-forward black-box optimization problem. This allowed them to use BayesOpt techniques to solve this problem.
The algorithm used to solve this reformulated optimization problem is a popular E/E technique known as Thompson Sampling. He talked about the infrastructure used to implement this. They have built an offline BayesOpt library, a parameter store to retrieve the right set of parameters, and an online serving layer to score the objective at serving time given the parameter distribution for a particular member.
He also described some practical considerations, like member-parameter stickiness, to avoid per session variance in a member’s experience. Their offline parameter distribution is recomputed hourly, so the member experience remains consistent within the hour. Some simulation results and some online A/B test results were shared, demonstrating substantial lifts in the primary business metrics, while keeping the secondary metrics above preset guardrails.
He concluded by stressing the efficiency their teams had achieved by doing online parameter exploration instead of the much slower human-in-the-loop manual explorations. In the future, they plan to explore adding new algorithms like UCB, considering formulating the problem as a grey-box optimization problem, and switching between the various business metrics to identify which is the optimal metric to optimize.
The second talk was by Netflix on our Bandit Infrastructure built for personalization use cases. Fernando Amat and Elliot Chow jointly gave this talk.
Fernando started the first part of the talk and described the core recommendation problem of identifying the top few titles in a large catalog that will maximize the probability of play. Using the example of evidence personalization — images, text, trailers, synopsis, all assets that come together to add meaning to a title — he described how the problem is essentially a slate recommendation task and is well suited to be solved using a Bandit framework.
If such a framework is to be generic, it must support different contexts, attributions and reward functions. He described a simple Policy API that models the Slate tasks. This API supports the selection of a state given a list of options using the appropriate algorithm and a way to quantify the propensities, so the data can be de-biased. Fernando ended his part by highlighting some of the Bandit Metrics they implemented for offline policy evaluation, like Inverse Propensity Scoring (IPS), Doubly Robust (DR), and Direct Method (DM).
For Bandits, where attribution is a critical part of the equation, it’s imperative to have a flexible and robust data infrastructure. Elliot started the second part of the talk by describing the real-time framework they have built to bring together all signals in one place making them accessible through a queryable API. These signals include member activity data (login, search, playback), intent-to-treat (what title/assets the system wants to impress to the member) and the treatment (impressions of images, trailers) that actually made it to the member’s device.
Elliot talked about what is involved in “Closing the loop”. First, the intent-to-treat needs to be joined with the treatment logging along the way, the policies in effect, the features used and the various propensities. Next, the reward function needs to be updated, in near real time, on every logged action (like a playback) for both short-term and long-term rewards. And finally each new observation needs to update the policy, compute offline policy evaluation metrics and then push the policy back to production so it can generate new intents to treat.
To be able to support this, the team had to standardize on several infrastructure components. Elliot talked about the three key components — a) Standardized Logging from the treatment services, b) Real-time stream processing over Apache Flink for member activity joins, and c) an Apache Spark client for attribution and reward computation. The team has also developed a few common attribution datasets as “out-of-the-box” entities to be used by the consuming teams.
Finally, Elliot ended by talking about some of the challenges in building this Bandit framework. In particular, he talked about the misattribution potential in a complex microservice architecture where often intermediary results are cached. He also talked about common pitfalls of stream-processed data like out of order processing.
This framework has been in production for almost a year now and has been used to support several A/B tests across different recommendation use cases at Netflix.
After a short break, the second session started with a talk from Facebook focussed on practical solutions to exploration problems. Sam Daulton described how the infrastructure and product use cases came along. He described how the adaptive experimentation efforts are aimed at enabling fast experimentation with a goal of adding varying degrees of automation for experts using the platform in an ad hoc fashion all the way to no-human-in-the-loop efforts.
He dived into a policy search problem they tried to solve: How many posts to load for a user depending upon their device’s connection quality. They modeled the problem as an infinite-arm bandit problem and used Gaussian Process (GP) regression. They used Bayesian Optimization to perform multi-metric optimization — e.g., jointly optimizing decrease in CPU utilization along with increase in user engagement. One of the challenges he described was how to efficiently choose a decision point, when the joint optimization search presented a Pareto frontier in the possible solution space. They used constraints on individual metrics in the face of noisy experiments to allow business decision makers to arrive at an optimal decision point.
Not all spaces can be efficiently explored online, so several research teams at Facebook use Simulations offline. For example, a ranking team would ingest live user traffic and subject it to a number of ranking configurations and simulate the event outcomes using predictive models running on canary rankers. The simulations were often biased and needed de-biasing (using multi-task GP regression) for them to be used alongside online results. They observed that by combining their online results with de-biased simulation results they were able to substantially improve their model fit.
To support these efforts, they developed and open sourced some tools along the way. Sam described Ax and BoTorch — Ax is a library for managing adaptive experiments and BoTorch is a library for Bayesian Optimization research. There are many applications already in production for these tools from both basic hyperparameter exploration to more involved AutoML use cases.
The final section of Sam’s talk focussed on Constrained Bayesian Contextual Bandits. They described the problem of video uploads to Facebook where the goal is to maximize the quality of the video without a decrease in reliability of the upload. They modeled it as a Thompson Sampling optimization problem using a Bayesian Linear model. To enforce the constraints, they used a modified algorithm, Constrained Thompson Sampling, to ensure a non-negative change in reliability. The reward function also similarly needed some shaping to align with the constrained objective. With this reward shaping optimization, Sam shared some results that showed how the Constrained Thompson Sampling algorithm surfaced many actions that satisfied the reliability constraints, where vanilla Thompson Sampling had failed.
The last talk of the event was a system architecture introduction by Dropbox’s Tsahi Glik. As a first time participant, their talk was more of an architecture overview of the ML Infra in place at Dropbox.
Tsahi started off by giving some ML usage examples at Dropbox like Smart Sync which predicts which file you will use on a particular device, so it’s preloaded. Some of the challenges he called out were the diversity and size of the disparate data sources that Dropbox has to manage. Data privacy is increasingly important and presents its own set of challenges. From an ML practice perspective, they also have to deal with a wide variety of development processes and ML frameworks, custom work for new use cases and challenges with reproducibility of training.
He shared a high level overview of their ML platform showing the various common stages of developing and deploying a model categorized by the online and offline components. He then dived into some individual components of the platform.
The first component he talked about was a user activity service to collect the input signals for the models. This service, Antenna, provides a way to query user activity events and summarizes the activity with various aggregations. The next component he dived deeper into was a content ingestion pipeline for OCR (optical character recognition). As an example, he explained how the image of a receipt is converted into contextual text. The pipeline takes the image through multiple models for various subtasks. The first classifies whether the image has some detectable text, the second does corner detection, the third does word box detection followed by deep LSTM neural net that does the core sequence based OCR. The final stage performs some lexicographical post processing.
He talked about the practical considerations of ingesting user content — they need to prevent malicious content from impacting the service. To enable this they have adopted a plugin based architecture and each task plugin runs in a sandbox jail environment.
Their offline data preparation ETLs run on Spark and they use Airflow as the orchestration layer. Their training infrastructure relies on a hybrid cloud approach. They have built a layer and command line tool called dxblearn that abstracts the training paths, allowing the researchers to train either locally or leverage AWS. dxblearn also allows them to fire off training jobs for hyperparameter tuning.
Published models are sent to a model store in S3 which are then picked up by their central model prediction service that does online inferencing for all use cases. Using a central inferencing service allows them to partition compute resources appropriately and having a standard API makes it easy to share and also run inferencing in the cloud.
They have also built a common “suggest backend” that is a generic predictive application that can be used by the various edge and production facing services that regularizes the data fetching, prediction and experiment configuration needed for a product prediction use case. This allows them to do live experimentation more easily.
The last part of Tsahi’s talk described a product use case leveraging their ML Platform. He used the example of a promotion campaign ranker, (eg “Try Dropbox business”) for up-selling. This is modeled as a multi-armed bandit problem, an example well in line with the meetup theme.
The biggest value of such meetups lies in the high bandwidth exchange of ideas from like-minded practitioners. In addition to some great questions after the talks, the 150+ attendees stayed well past 2 hours in the reception exchanging stories and lessons learnt solving similar problems at scale.
In the Personalization org at Netflix, we are always interested in exchanging ideas about this rapidly evolving ML space in general and the bandits and reinforcement learning space in particular. We are committed to sharing our learnings with the community and hope to discuss progress here, especially our work on Policy Evaluation and Bandit Metrics in future meetups. If you are interested in working on this exciting space, there are many open opportunities on both engineering and research endeavors.
In a microservice architecture such as Netflix’s, propagating datasets from a single source to multiple downstream destinations can be challenging. These datasets can represent anything from service configuration to the results of a batch job, are often needed in-memory to optimize access and must be updated as they change over time.
One example displaying the need for dataset propagation: at any given time Netflix runs a very large number of A/B tests. These tests span multiple services and teams, and the operators of the tests need to be able to tweak their configuration on the fly. There needs to be the ability to detect nodes that have failed to pick up the latest test configuration, and the ability to revert to older versions of configuration when things go wrong.
Another example of a dataset that needs to be disseminated is the result of a machine-learning model: the results of these models may be used by several teams, but the ML teams behind the model aren’t necessarily interested in maintaining high-availability services in the critical path. Rather than each team interested in consuming the model having to build in fallbacks to degrade gracefully, there is a lot of value in centralizing the work to allow multiple teams to leverage a single team’s effort.
Without infrastructure-level support, every team ends up building their own point solution to varying degrees of success. Datasets themselves are of varying size, from a few bytes to multiple gigabytes. It is important to build in observability and fault detection, and to provide tooling to allow operators to make quick changes without having to develop their own tools.
At Netflix we use an in-house dataset pub/sub system called Gutenberg. Gutenberg allows for propagating versioned datasets — consumers subscribe to data and are updated to the latest versions when they are published. Each version of the dataset is immutable and represents a complete view of the data — there is no dependency on previous versions of data. Gutenberg allows browsing older versions of data for use cases such as debugging, rapid mitigation of data related incidents, and re-training of machine-learning models. This post is a high level overview of the design and architecture of Gutenberg.
The top-level construct in Gutenberg is a “topic”. A publisher publishes to a topic and consumers consume from a topic. Publishing to a topic creates a new monotonically-increasing “version”. Topics have a retention policy that specifies a number of versions or a number of days of versions, depending on the use case. For example, you could configure a topic to retain 10 versions or 10 days of versions.
Each version contains metadata (keys and values) and a data pointer. You can think of a data pointer as special metadata that points to where the actual data you published is stored. Today, Gutenberg supports direct data pointers (where the payload is encoded in the data pointer value itself) and S3 data pointers (where the payload is stored in S3). Direct data pointers are generally used when the data is small (under 1MB) while S3 is used as a backing store when the data is large.
Gutenberg provides the ability to scope publishes to a particular set of consumers — for example by region, application, or cluster. This can be used to canary data changes with a single cluster, roll changes out incrementally, or constrain a dataset so that only a subset of applications can subscribe to it. Publishers decide the scope of a particular data version publish, and they can later add scopes to a previously published version. Note that this means that the concept of a latest version depends on the scope — two applications may see different versions of data as the latest depending on the publish scopes created by the publisher. The Gutenberg service matches the consuming application with the published scopes before deciding what to advertise as the latest version.
The most common use case of Gutenberg is to propagate varied sizes of data from a single publisher to multiple consumers. Often the data is held in memory by consumers and used as a “total cache”, where it is accessed at runtime by client code and atomically swapped out under the hood. Many of these use cases can be loosely grouped as “configuration” — for example Open Connect Appliance cache configuration, supported device type IDs, supported payment method metadata, and A/B test configuration. Gutenberg provides an abstraction between the publishing and consumption of this data — this allows publishers the freedom to iterate on their application without affecting downstream consumers. In some cases, publishing is done via a Gutenberg-managed UI, and teams do not need to manage their own publishing app at all.
Another use case for Gutenberg is as a versioned data store. This is common for machine-learning applications, where teams build and train models based on historical data, see how it performs over time, then tweak some parameters and run through the process again. More generally, batch-computation jobs commonly use Gutenberg to store and propagate the results of a computation as distinct versions of datasets. “Online” use cases subscribe to topics to serve real-time requests using the latest versions of topics’ data, while “offline” systems may instead use historical data from the same topics — for example to train machine-learned models.
An important point to note is that Gutenberg is not designed as an eventing system — it is meant purely for data versioning and propagation. In particular, rapid-fire publishes do not result in subscribed clients stepping through each version; when they ask for an update, they will be provided with the latest version, even if they are currently many versions behind. Traditional pub-sub or eventing systems are suited towards messages that are smaller in size and are consumed in sequence; consumers may build up a view of an entire dataset by consuming an entire (potentially compacted) feed of events. Gutenberg, however, is designed for publishing and consuming an entire immutable view of a dataset.
Design and architecture
Gutenberg consists of a service with gRPC and REST APIs as well as a Java client library that uses the gRPC API.
The Gutenberg client library handles tasks such as subscription management, S3 uploads/downloads, Atlas metrics, and knobs you can tweak using Archaius properties. It communicates with the Gutenberg service via gRPC, using Eureka for service discovery.
Publishers generally use high-level APIs to publish strings, files, or byte arrays. Depending on the data size, the data may be published as a direct data pointer or it may get uploaded to S3 and then published as an S3 data pointer. The client can upload a payload to S3 on the caller’s behalf or it can publish just the metadata for a payload that already exists in S3.
Direct data pointers are automatically replicated globally. Data that is published to S3 is uploaded to multiple regions by the publisher by default, although that can be configured by the caller.
The client library provides subscription management for consumers. This allows users to create subscriptions to particular topics, where the library retrieves data (eg from S3) before handing off to a user-provided listener. Subscriptions operate on a polling model — they ask the service for a new update every 30 seconds, providing the version with which they were last notified. Subscribed clients will never consume an older version of data than the one they are on unless they are pinned (see “Data resiliency” below). Retry logic is baked in and configurable — for instance, users can configure Gutenberg to try older versions of data if it fails to download or process the latest version of data on startup, often to deal with non-backwards-compatible data changes. Gutenberg also provides a pre-built subscription that holds on to the latest data and atomically swaps it out under the hood when a change comes in — this tackles a majority of subscription use cases, where callers only care about the current value at any given time. It allows callers to specify a default value — either for a topic that has never been published to (a good fit when the topic is used for configuration) or if there is an error consuming the topic (to avoid blocking service startup when there is a reasonable default).
Gutenberg also provides high-level client APIs that wrap the low-level gRPC APIs and provide additional functionality and observability. One example of this is to download data for a given topic and version — this is used extensively by components plugged into Netflix Hollow. Another example is a method to get the “latest” version of a topic at a particular time — a common use case when debugging and when training ML models.
Client resiliency and observability
Gutenberg was designed with a bias towards allowing consuming services to be able to start up successfully versus guaranteeing that they start with the freshest data. With this in mind, the client library was built with fallback logic for when it cannot communicate with the Gutenberg service. After HTTP request retries are exhausted, the client downloads a fallback cache of topic publish metadata from S3 and works based off of that. This cache contains all the information needed to decide whether an update needs to be applied, and from where data needs to be fetched (either from the publish metadata itself or from S3). This allows clients to fetch data (which is potentially stale, depending on how current that fallback cache is) without using the service.
Part of the benefit of providing a client library is the ability to expose metrics that can be used to alert on an infrastructure-wide issue or issues with specific applications. Today these metrics are used by the Gutenberg team to monitor our publish-propagation SLI and to alert in the event of widespread issues. Some clients also use these metrics to alert on app-specific errors, for example individual publish failures or a failure to consume a particular topic.
The Gutenberg service is a Governator/Tomcat application that exposes gRPC and REST endpoints. It uses a globally-replicated Cassandra cluster for persistence and to propagate publish metadata to every region. Instances handling consumer requests are scaled separately from those handling publish requests — there are approximately 1000 times more consumer requests than there are publish requests. In addition, this insulates publishing from consumption — a sudden spike in publishing will not affect consumption, and vice versa.
Each instance in the consumer request cluster maintains its own in-memory cache of “latest publishes”, refreshing it from Cassandra every few seconds. This is to handle the large volume of poll requests coming from subscribed clients without passing on the traffic to the Cassandra cluster. In addition, request-pooling low-ttl caches protect against large spikes in requests that could potentially burden Cassandra enough to affect entire region — we’ve had situations where transient errors coinciding with redeployments of large clusters have caused Gutenberg service degradation. Furthermore, we use an adaptive concurrency limiter bucketed by source application to throttle misbehaving applications without affecting others.
For cases where the data was published to S3 buckets in multiple regions, the server makes a decision on what bucket to send back to the client to download from based on where the client is. This also allows the service to provide the client with a bucket in the “closest” region, and to have clients fall back to another region if there is a region outage.
Before returning subscription data to consumers, the Gutenberg service first runs consistency checks on the data. If the checks fail and the polling client already has consumed some data the service returns nothing, which effectively means that there is no update available. If the polling client has not yet consumed any data (this usually means it has just started up), the service queries the history for the topic and returns the latest value that passes consistency checks. This is because we see sporadic replication delays at the Cassandra layer, where by the time a client polls for new data, the metadata associated with the most recently published version has only been partially replicated. This can result in incomplete data being returned to the client, which then manifests itself either as a data fetch failure or an obscure business-logic failure. Running these consistency checks on the server insulates consumers from the eventual-consistency caveats that come with the service’s choice of a data store.
Visibility on topic publishes and nodes that consume a topic’s data is important for auditing and to gather usage info. To collect this data, the service intercepts requests from publishers and consumers (both subscription poll requests and others) and indexes them in Elasticsearch by way of the Keystone data pipeline. This allows us to gain visibility into topic usage and decommission topics that are no longer in use. We expose deep-links into a Kibana dashboard from an internal UI to allow topic owners to get a handle on their consumers in a self-serve manner.
In addition to the clusters serving publisher and consumer requests, the Gutenberg service runs another cluster that runs periodic tasks. Specifically this runs two tasks:
Every few minutes, all the latest publishes and metadata are gathered up and sent to S3. This powers the fallback cache used by the client as detailed above.
A nightly janitor job purges topic versions which exceed their topic’s retention policy. This deletes the underlying data as well (e.g. S3 objects) and helps enforce a well-defined lifecycle for data.
In the world of application development bad deployments happen, and a common mitigation strategy there is to roll back the deployment. A data-driven architecture makes that tricky, since behavior is driven by data that changes over time.
Data propagated by Gutenberg influences — and in many cases drives — system behavior. This means that when things go wrong, we need a way to roll back to a last-known good version of data. To facilitate this, Gutenberg provides the ability to “pin” a topic to a particular version. Pins override the latest version of data and force clients to update to that version — this allows for quick mitigation rather than having an under-pressure operator attempt to figure out how to publish the last known good version. You can even apply a pin to a specific publish scope so that only consumers that match that scope are pinned. Pins also override data that is published while the pin is active, but when the pin is removed clients update to the latest version, which may be the latest version when the pin was applied or a version published while the pin was active.
When deploying new code, it’s often a good idea to canary new builds with a subset of traffic, roll it out incrementally, or otherwise de-risk a deployment by taking it slow. For cases where data drives behavior, a similar principle should be applied.
One feature Gutenberg provides is the ability to incrementally roll out data publishes via Spinnaker pipelines. For a particular topic, users configure what publish scopes they want their publish to go to and what the delay is between each one. Publishing to that topic then kicks off the pipeline, which publishes the same data version to each scope incrementally. Users are able to interact with the pipeline; for example they may choose to pause or cancel pipeline execution if their application starts misbehaving, or they may choose to fast-track a publish to get it out sooner. For example, for some topics we roll out a new dataset version one AWS region at a time.
Gutenberg has been at use at Netflix for the past three years. At present, Gutenberg stores low tens-of-thousands of topics in production, about a quarter of which have published at least once in the last six months. Topics are published at a variety of cadences — from tens of times a minute to once every few months — and on average we see around 1–2 publishes per second, with peaks and troughs about 12 hours apart.
In a given 24 hour period, the number of nodes that are subscribed to at least one topic is in the low six figures. The largest number of topics a single one of these nodes is subscribed to is north of 200, while the median is 7. In addition to subscribed applications, there are a large number of applications that request specific versions of specific topics, for example for ML and Hollow use cases. Currently the number of nodes that make a non-subscribe request for a topic is in the low hundreds of thousands, the largest number of topics requested is 60, and the median is 4.
Here’s a sample of work we have planned for Gutenberg:
Polyglot support: today Gutenberg only supports a Java client, but we’re seeing an increasing number of requests for Node.js and Python support. Some of these teams have cobbled together their own solutions built on top of the Gutenberg REST API or other systems. Rather than have different teams reinvent the wheel, we plan to provide first-class client libraries for Node.js and Python.
Encryption and access control: for sensitive data, Gutenberg publishers should be able to encrypt data and distribute decryption credentials to consumers out-of-band. Adding this feature opens Gutenberg up to another set of use-cases.
Better incremental rollout: the current implementation is in its pretty early days and needs a lot of work to support customization to fit a variety of use cases. For example, users should be able to customize the rollout pipeline to automatically accept or reject a data version based on their own tests.
Alert templates: the metrics exposed by the Gutenberg client are used by the Gutenberg team and a few teams that are power users. Instead, we plan to provide leverage to users by building and parameterizing templates they can use to set up alerts for themselves.
Topic cleanup: currently topics sit around forever unless they are explicitly deleted, even if no one is publishing to them or consuming from them. We plan on building an automated topic cleanup system based on the consumption trends indexed in Elasticsearch.
Data catalog integration: an ongoing issue at Netflix is the problem of cataloging data characteristics and lineage. There is an effort underway to centralize metadata around data sources and sinks, and once Gutenberg integrates with this, we can leverage the catalog to automate tools that message the owners of a dataset.
The effects that legal streaming services have on people’s motivation to pirate can be quite confusing.
On the one hand, legal services have been known to lower the piracy rates in some regions, but too many exclusive platforms could boost piracy again.
In South Africa, Internet providers are mostly noticing the first. Netflix first launched there in 2016, and since then, Netflix traffic has dwarfed BitTorrent traffic, ISPs say. While it’s not entirely clear to what degree torrent traffic decreased, if it did, the companies have all noticed a massive Netflix effect.
In a report published by Mybroadband, several ISPs were questioned about the topic. Without exception, they say that video services, Netflix in particular, have made BitTorrent traffic relatively insignificant.
RSAWEB, for example, noticed that Netflix traffic surged and that peak time data usage doubles every six months. This is in large part the result of a streaming boom. At the moment, the volume of Netflix traffic is 20x that of BitTorrent traffic.
“The ratios have significantly changed compared to a few years ago,” a RSAWEB spokesperson said. “The current ratio would be for every 50Mbps of aggregated torrent traffic we observe 1Gbps of aggregated Netflix streaming traffic.”
Webafrica noticed a similar trend, but perhaps even more pronounced. The company noted that during peak hours half of all traffic is generated by Netflix. BitTorrent traffic follows somewhere in the distance t the extent it’s not even worth tracking anymore.
“The growth of Netflix in recent years has been truly phenomenal, to the point where we no longer track torrent traffic separately,” Webafrica’s Greg Wright said. “Google (including Youtube) and Netflix are dominating the content currently,” he added.
Paul Butschi, co-founder of Internet provider Cool Ideas confirmed this trend. Netflix makes up roughly 30% of the company’s peak traffic and he believes that the increased popularity of online streaming had a pronounced impact on torrent traffic.
These opinions were largely shared by rival ISPs Cybersmart and Supersonic, with the latter noting that video streaming services have “completely overtaken the need for torrent sharing,” and that things will only get better if more competing services enter the market.
The last comment is something that’s up for debate, especially if new services all come with exclusive content. Looking at the relative traffic market share in North America over the past decade, a word of caution may be warranted.
Less than a decade ago nearly 20% of all traffic during peak hours was P2P related, mostly BitTorrent. As Netflix and other video streaming services grew, this relative share quickly dropped, but more recently it started to show signs of growth again.
It could be that people have started to pirate again because they can’t afford to pay for several paid streaming subscriptions.
This notion is supported by a recent survey which showed that piracy rates could potentially double again if the video streaming market continues to fragment. This can affect BitTorrent traffic, bus also pirate streaming sites and services, which the ISPs were asked about. These are the go-to piracy solution for most people nowadays.
Overall it’s safe to say that legal streaming services do indeed limit the demand for piracy, as the South African ISPs observe. This is particularly true if they’re convenient and affordable while providing access to a great content library.
At Netflix we prioritize innovation and velocity in pursuit of the best experience for our 150+ million global customers. This means that our microservices constantly evolve and change, but what doesn’t change is our responsibility to provide a highly available service that delivers 100+ million hours of daily streaming to our subscribers.
In order to achieve this level of availability, we leverage an N+1 architecture where we treat Amazon Web Services (AWS) regions as fault domains, allowing us to withstand single region failures. In the event of an isolated failure we first pre-scale microservices in the healthy regions after which we can shift traffic away from the failing one. This pre-scaling is necessary due to our use of autoscaling, which generally means that services are right-sized to handle their current demand, not the surge they would experience once we shift traffic.
Though this evacuation capability exists today, this level of resiliency wasn’t always the standard for the Netflix API. In 2013 we first developed our multi-regional availability strategy in response to a catalyst that led us to re-architect the way our service operates. Over the last 6 years Netflix has continued to grow and evolve along with our customer base, invalidating core assumptions built into the machinery that powers our ability to pre-scale microservices. Two such assumptions were that:
Regional demand for all microservices (i.e. requests, messages, connections, etc.) can be abstracted by our key performance indicator, stream starts per second (SPS).
Microservices within healthy regions can be scaled uniformly during an evacuation.
These assumptions simplified pre-scaling, allowing us to treat microservices uniformly, ignoring the uniqueness and regionality of demand. This approach worked well in 2013 due to the existence of monolithic services and a fairly uniform customer base, but became less effective as Netflix evolved.
Regional Microservice Demand
Most of our microservices are in some way related to serving a stream, so SPS seemed like a reasonable stand-in to simplify regional microservice demand. This was especially true for large monolithic services. For example, player logging, authorization, licensing, and bookmarks were initially handled by a single monolithic service whose demand correlated highly with SPS. However, in order to improve developer velocity, operability, and reliability, the monolith was decomposed into smaller, purpose-built services with dissimilar function-specific demand.
Our edge gateway (zuul) also sharded by function to achieve similar wins. The graph below captures the demand for each shard, the combined demand, and SPS. Looking at the combined demand and SPS lines, SPS roughly approximates combined demand for a majority of the day. Looking at individual shards however, the amount of error introduced by using SPS as a demand proxy varies widely.
Uniform Evacuation Scaling
Since we used SPS as a demand proxy, it also seemed reasonable to assume that we can uniformly pre-scale all microservices in the healthy regions. In order to illustrate the shortcomings of this approach, let’s look at playback licensing (DRM) & authorization.
DRM is closely aligned with device type, such that Consumer Electronics (CE), Android, & iOS use different DRM platforms. In addition, the ratio of CE to mobile streaming differs regionally; for example, mobile is more popular in South America. So, if we evacuate South American traffic to North America, demand for CE and Android DRM won’t grow uniformly.
On the other hand, playback authorization is a function used by all devices prior to requesting a license. While it does have some device specific behavior, demand during an evacuation is more a function of the overall change in regional demand.
Closing The Gap
In order to address the issues with our previous approach, we needed to better characterize microservice-specific demand and how it changes when we evacuate. The former requires that we capture regional demand for microservices versus relying on SPS. The latter necessitates a better understanding of microservice demand by device type as well as how regional device demand changes during an evacuation.
Microservice-Specific Regional Demand
Because of service decomposition, we understood that using a proxy demand metric like SPS wasn’t tenable and we needed to transition to microservice-specific demand. Unfortunately, due to the diversity of services, a mix of Java (Governator/Springboot with Ribbon/gRPC, etc.) and Node (NodeQuark), there wasn’t a single demand metric we could rely on to cover all use cases. To address this, we built a system that allows us to associate each microservice with metrics that represent their demand.
The microservice metrics are configuration-driven, self-service, and allows for scoping such that services can have different configurations across various shards and regions. Our system then queries Atlas, our time series telemetry platform, to gather the appropriate historical data.
Microservice Demand By Device Type
Since demand is impacted by regional device preferences, we needed to deconstruct microservice demand to expose the device-specific components. The approach we took was to partition a microservice’s regional demand by aggregated device types (CE, Android, PS4, etc.). Unfortunately, the existing metrics didn’t uniformly expose demand by device type, so we leveraged distributed tracing to expose the required details. Using this sampled trace data we can explain how a microservice’s regional device type demand changes over time. The graph below highlights how relative device demand can vary throughout the day for a microservice.
Device Type Demand
We can use historical device type traffic to understand how to scale the device-specific components of a service’s demand. For example, the graph below shows how CE traffic in us-east-1 changes when we evacuate us-west-2. The nominal and evacuation traffic lines are normalized such that 1 represents the max(nominal traffic) and the demand scaling ratio represents the relative change in demand during an evacuation (i.e. evacuation traffic/nominal traffic).
Microservice-Specific Demand Scaling Ratio
We can now combine microservice demand by device and device-specific evacuation scaling ratios to better represent the change in a microservice’s regional demand during an evacuation — i.e. the microservice’s device type weighted demand scaling ratio. To calculate this ratio (for a specific time of day) we take a service’s device type percentages, multiply by device type evacuation scaling ratios, producing each device type’s contribution to the service’s scaling ratio. Summing these components then yields a device type weighted evacuation scaling ratio for the microservice. To provide a concrete example, the table below shows the evacuation scaling ratio calculation for a fictional service.
The graph below highlights the impact of using a microservice-specific evacuation scaling ratio versus the simplified SPS-based approach used previously. In the case of Service A, the old approach would have done well in approximating the ratio, but in the case of Service B and Service C, it would have resulted in over and under predicting demand, respectively.
Understanding the uniqueness of demand across our microservices improved the quality of our predictions, leading to safer and more efficient evacuations at the cost of additional computational complexity. This new approach, however, is itself an approximation with its own set of assumptions. For example, it assumes all categories of traffic for a device type has similar shape, for example Android logging and playback traffic. As Netflix grows our assumptions will again be challenged and we will have to adapt to continue to provide our customers with the availability and reliability that they have come to expect.
If this article has piqued your interest and you have a passion for solving cross-discipline distributed systems problems, our small but growing team is hiring!
The rate at which ‘pirate’ sites are being blocked in various countries raises the question of how many more there are left to block.
The answer, it seems, is plenty more yet.
Back in May, yet another application filed in Australia’s Federal Court presented a unique feature – the inclusion of US-based streaming giant Netflix as one of the applicants.
This was the first time the company had appeared requesting a blocking application in the region, claiming infringement of its works Santa Clarita Diet and Stranger Things.
Netflix didn’t appear on its own. The application was headed by local movie giant Roadshow Films and supported by other prominent movie companies such as Disney Enterprises, Universal City Studios, Warner Bros., Television Broadcasts Limited, TVBO, and Madman Anime Group.
Together they demanded the blocking of over 130 domains related to close to 90 torrent, streaming, and similar sites by more than 50 local ISPs.
The claims were filed under Section 115a of Australia’s Copyright Act, which can grant injunctions to force local ISPs to prevent their subscribers from accessing overseas-based ‘infringing locations. It’s taken three months, but the content companies have now been successful.
This morning, Justice Thawley in the Federal Court ordered the respondents including Telstra, Optus, TPG, Vocus, and Vodafone, to take “reasonable steps to disable access to the Target Online Locations” within 15 business days. Each ISP will be handed AUS$50 per domain by the applicants to cover compliance costs.
In common with previous orders, the ISPs were given the option to utilize DNS, IP address, and/or URL blocking techniques (or any other technical means agreed in writing between them and the applications) to prevent access to the sites.
Of course, sites often decide to take countermeasures when orders such as this are handed down in order to circumvent blocking, so the order allows the studios to provide additional information so that these can be swiftly dealt with by the ISPs moving forward.
An updated/amended domain and URL list (there can be changes following an original application) is yet to appear in court records. However, the list of sites and domains in the original application can be viewed in our earlier report.
The order handed down this morning can be found here (pdf)
India is no stranger to blocking pirate sites. Just last week, a court ordered local Internet service providers to block more than 1,200 sites to prevent the spread of a single movie.
Now, however, it appears that there additional legal moves underway to ensure that sites are not only blocked temporarily but also on a more permanent basis.
Over the past several weeks the High Court in Delhi has been handling many separate applications for permanent injunction filed by US-based Warner Bros. Entertainment Inc.
In all cases, the company states that several of its copyrighted works – movies Aquaman, A Star is Born, Wonder Woman, plus TV show Arrow – were made available via a broad range of torrent, streaming, linking, and proxy-type sites.
The complaints also cite works by studios including Columbia, Paramount, Universal, and Netflix as further examples of content being infringed on the platforms.
In just one of the complaints the list of infringing domains runs to 124 and includes some very well known names including local giant Tamilrockers, TorrentDownload.ch, TorrentDownloads.me and EZTV, iStole.it, Zoink.it, Torrents.me, Torrents.io, Zooqle, MovieRulz, LimeTorrents, Bolly4u, KatMovie, Monova, and 9xMovies.
In many cases, multiple domains are listed for the above sites, including alternates, proxies and other variants that are accessible via various unblocking platforms. All are accused of infringing the rights of Warner Bros. by providing access to its movie and TV shows content without authorization.
“[D]efendant Websites are primarily and substantially engaged in communicating to the public, hosting, streaming and/or making available to the public Plaintiff’s original content without authorization, and/or facilitating the same,” one order reads.
The order covering the above sites notes that Warner investigated and then served legal notices on the platforms ordering them to cease-and-desist. However, it’s reported that none acted to prevent their infringing activities.
To boost its case, Warner also informed the Court that some of the sites have already been blocked in other jurisdictions (including the UK, Portugal, Malaysia, Australia, Belgium, Denmark, Russia, and Italy) for similar behavior.
After consideration, the Court found that there is a prima facie case and Warner should be awarded an interim injunction to prevent the sites from continuing their infringing activities. Furthermore, the sites should have their domains blocked by ISPs in India, to prevent further damage and losses.
The Court also addressed the issue of additional domains or platforms appearing to circumvent any blocking, by granting Warner permission to file additional updates with the Court that will allow for such mechanisms to be disabled by ISPs via an expedited process.
The example order detailed above is very specific, in that it orders ISPs to block the domain names of the sites plus a list of IP addresses. However, the vast majority appear to be using Cloudflare, so it remains to be seen whether the ISPs will use discretion or blindly block, which could cause considerable disruption to other sites using the same IP locations.
In some of the orders, it appears that domain registrars are also required to suspend domain names belonging or connected to various sites, including TamilRockers, Hindilinks4u, Otorrents, Filmlinks4u, Mp4Moviez, Series9.io, uWatchFree, OnlineWatchMovies, MovieRulzFree, and SkyMovies.
Several additional applications from Warner are on record at the Delhi High Court but are yet to be published as interim orders.
The main order detailed above can be found here (pdf), the rest here 2,3,4,5,6,7,8,9,10
Conductor is a workflow orchestration engine developed and open-sourced by Netflix. If you’re new to Conductor, this earlier blogpost and the documentation should help you get started and acclimatized to Conductor.
In the last two years since inception, Conductor has seen wide adoption and is instrumental in running numerous core workflows at Netflix. Many of the Netflix Content and Studio Engineering services rely on Conductor for efficient processing of their business flows. The Netflix Media Database (NMDB) is one such example.
In this blog, we would like to present the latest updates to Conductor, address some of the frequently asked questions and thank the community for their contributions.
How we’re using Conductor at Netflix
Conductor is one of the most heavily used services within Content Engineering at Netflix. Of the multitude of modules that can be plugged into Conductor as shown in the image below, we use the Jersey server module, Cassandra for persisting execution data, Dynomite for persisting metadata, DynoQueues as the queuing recipe built on top of Dynomite, Elasticsearch as the secondary datastore and indexer, and Netflix Spectator + Atlas for Metrics. Our cluster size ranges from 12–18 instances of AWS EC2 m4.4xlarge instances, typically running at ~30% capacity.
We do not maintain an internal fork of Conductor within Netflix. Instead, we use a wrapper that pulls in the latest version of Conductor and adds Netflix infrastructure components and libraries before deployment. This allows us to proactively push changes to the open source version while ensuring that the changes are fully functional and well-tested.
As of writing this blog, Conductor orchestrates 600+ workflow definitions owned by 50+ teams across Netflix. While we’re not (yet) actively measuring the nth percentiles, our production workloads speak for Conductor’s performance. Below is a snapshot of our Kibana dashboard which shows the workflow execution metrics over a typical 7-day period.
Some of the use cases served by Conductor at Netflix can be categorized under:
Content Ingest and Delivery
Content Quality Control
Encodes and Deployments
One of the key features in v2.0 was the introduction of the gRPC framework as an alternative/auxiliary to REST. This was contributed by our counterparts at GitHub, thereby strengthening the value of community contributions to Conductor.
Cassandra Persistence Layer
To enable horizontal scaling of the datastore for large volume of concurrent workflow executions (millions of workflows/day), Cassandra was chosen to provide elastic scaling and meet throughput demands.
External Payload Storage
External payload storage was implemented to prevent the usage of Conductor as a data persistence system and to reduce the pressure on its backend datastore.
Dynamic Workflow Executions
For use cases where the need arises to execute a large/arbitrary number of varying workflow definitions or to run a one-time ad hoc workflow for testing or analytical purposes, registering definitions first with the metadata store in order to then execute them only once, adds a lot of additional overhead. The ability to dynamically create and execute workflows removes this friction. This was another great addition that stemmed from our collaboration with GitHub.
Workflow Status Listener
Conductor can be configured to publish notifications to external systems or queues upon completion/termination of workflows. The workflow status listener provides hooks to connect to any notification system of your choice. The community has contributed an implementation that publishes a message on a dyno queue based on the status of the workflow. An event handler can be configured on these queues to trigger workflows or tasks to perform specific actions upon the terminal state of the workflow.
Bulk Workflow Management
There has always been a need for bulk operations at the workflow level from an operability standpoint. When running at scale, it becomes essential to perform workflow level operations in bulk due to bad downstream dependencies in the worker processes causing task failures or bad task executions. Bulk APIs enable the operators to have macro-level control on the workflows executing within the system.
Decoupling Elasticsearch from Persistence
This inter-dependency was removed by moving the indexing layer into separate persistence modules, exposing a property (workflow.elasticsearch.instanceType) to choose the type of indexing engine. Further, the indexer and persistence layer have been decoupled by moving this orchestration from within the primary persistence layer to a service layer through the ExecutionDAOFacade.
Support for Elasticsearch versions 5 and 6 have been added as part of the major version upgrade to v2.x. This addition also provides the option to use the Elasticsearch RestClient instead of the Transport Client which was enforced in the previous version. This opens the route to using a managed Elasticsearch cluster (a la AWS) as part of the Conductor deployment.
Task Rate Limiting & Concurrent Execution Limits
Task rate limiting helps achieve bounded scheduling of tasks. The task definition parameter rateLimitFrequencyInSeconds sets the duration window, while rateLimitPerFrequency defines the number of tasks that can be scheduled in a duration window. On the other hand, concurrentExecLimit provides unbounded scheduling limits of tasks. I.e the total of current scheduled tasks at any given time will be under concurrentExecLimit. The above parameters can be used in tandem to achieve desired throttling and rate limiting.
Validation was one of the core features missing in Conductor 1.x. To improve usability and operability, we added validations, which in practice has greatly helped find bugs during creation of workflow and task definitions. Validations enforce the user to create and register their task definitions before registering the workflow definitions using these tasks. It also ensures that the workflow definition is well-formed with correct wiring of inputs and outputs in the various tasks within the workflow. Any anomalies found are reported to the user with a detailed error message describing the reason for failure.
Developer Labs, Logging and Metrics
We have been continually improving logging and metrics, and revamped the documentation to reflect the latest state of Conductor. To provide a smooth on boarding experience, we have created developer labs, which guides the user through creating task and workflow definitions, managing a workflow lifecycle, configuring advanced workflows with eventing etc., and a brief introduction to Conductor API, UI and other modules.
New Task Types
System tasks have proven to be very valuable in defining the Workflow structure and control flow. As such, Conductor 2.x has seen several new additions to System tasks, mostly contributed by the community:
Terminate task is useful when workflow logic should terminate with a given output. For example, if a decision task evaluates to false, and we do not want to execute remaining tasks in the workflow, instead of having a DECISION task with a list of tasks in one case and an empty list in the other, this can scope the decide and terminate workflow execution.
Exclusive Join task helps capture task output from a DECISION task’s flow. This is useful to wire task inputs from the outputs of one of the cases within a decision flow. This data will only be available during workflow execution time and the ExclusiveJoin task can be used to collect the output from one of the tasks in any of decision branches.
For in-depth implementation details of the new additions, please refer the documentation.
There are a lot of features and enhancements we would like to add to Conductor. The below wish list could be considered as a long-term road map. It is by no means exhaustive, and we are very much welcome to ideas and contributions from the community. Some of these listed in no particular order are:
Advanced Eventing with Event Aggregation and Distribution
At the moment, event generation and processing is a very simple implementation. An event task can create only one message, and a task can wait for only one event.
We envision an Event Aggregation and Distribution mechanism that would open up Conductor to a multitude of use-cases. A coarse idea is to allow a task to wait for multiple events, and to progress several tasks based on one event.
While the current UI provides a neat way to visualize and track workflow executions, we would like to enhance this with features like:
Creating metadata objects from UI
Support for starting workflows
Visualize execution metrics
Admin dashboard to show outliers
New Task types like Goto, Loop etc.
Conductor has been using a Directed Acyclic Graph (DAG) structure to define a workflow. The Goto and Loop on tasks are valid use cases, which would deviate from the DAG structure. We would like to add support for these tasks without violating the existing workflow execution rules. This would help unlock several other use cases like streaming flow of data to tasks and others that require repeated execution of a set of tasks within a workflow.
Support for reusable commonly used tasks like Email, DatabaseQuery etc.
Similarly, we’ve seen the value of shared reusable tasks that does a specific thing. At Netflix internal deployment of Conductor, we’ve added tasks specific to services that users can leverage over recreating the tasks from scratch. For example, we provide a TitusTask which enables our users to launch a new Titus container as part of their workflow execution.
We would like to extend this idea such that Conductor can offer a repository of commonly used tasks.
Push based task scheduling interface
Current Conductor architecture is based on polling from a worker to get tasks that it will execute. We need to enhance the grpc modules to leverage the bidirectional channel to push tasks to workers as and when they are scheduled, thus reducing network traffic, load on the server and redundant client calls.
Validating Task inputKeys and outputKeys
This is to provide type safety for tasks and define a parameterized interface for task definitions such that tasks are completely re-usable within Conductor once registered. This provides a contract allowing the user to browse through available task definitions to use as part of their workflow where the tasks could have been implemented by another team/user. This feature would also involve enhancing the UI to display this contract.
Implementing MetadataDAO in Cassandra
As mentioned here, Cassandra module provides a partial implementation for persisting only the workflow executions. Metadata persistence implementation is not available yet and is something we are looking to add soon.
Pluggable Notifications on Task completion
Similar to the Workflow status listener, we would like to provide extensible interfaces for notifications on task execution.
Python client in Pypi
We have seen wide adoption of Python client within the community. However, there is no official Python client in Pypi, and lacks some of the newer additions to the Java client. We would like to achieve feature parity and publish a client from Conductor Github repository, and automate the client release to Pypi.
Removing Elasticsearch from critical path
While Elasticsearch is greatly useful in Conductor, we would like to make this optional for users who do not have Elasticsearch set-up. This means removing Elasticsearch from the critical execution path of a workflow and using it as an opt-in layer.
Pluggable authentication and authorization
Conductor doesn’t support authentication and authorization for API or UI, and is something that we feel would add great value and is a frequent request in the community.
Validations and Testing
Dry runs, i.e the ability to evaluate workflow definitions without actually running it through worker processes and all relevant set-up would make it much easier to test and debug execution paths.
If you would like to be a part of the Conductor community and contribute to one of the Wishlist items or something that you think would provide a great value add, please read through this guide for instructions or feel free to start a conversation on our Gitter channel, which is Conductor’s user forum.
We also highly encourage to polish, genericize and share any customizations that you may have built on top of Conductor with the community.
We really appreciate and are extremely proud of the community involvement, who have made several important contributions to Conductor. We would like to take this further and make Conductor widely adopted with a strong community backing.
Netflix Conductor is maintained by the Media Workflow Infrastructure team. If you like the challenges of building distributed systems and are interested in building the Netflix Content and Studio ecosystem at scale, connect with Charles Zhao to get the conversation started.
Thanks to Alexandra Pau, Charles Zhao, Falguni Jhaveri, Konstantinos Christidis and Senthil Sayeebaba.
Pirated copies of movies appear online every day in a variety of formats, such as CAM, DVDRip, WEBRip, and Web-DL.
The latter, which usually come from streaming and download services such as Netflix, Amazon, or iTunes, have proven to be a reliable source for pirates over the years.
In general, that doesn’t apply to 4K releases. These are protected by the highest encryption standards. In the case of Netflix, this is Widevine’s highest level DRM. Cracking this is seen as the holy grail by pirates.
While there is no confirmation that the keys have been cracked, a flurry of new 4K Netflix leaks suggests that there’s at least some type of vulnerability that allows outsiders to decrypt the original steams.
Over the past 24-hours several 4K releases of prominent Netflix titles spread across various pirate sites. It started with the entire third season of the Netflix exclusive “Stranger Things,” which came out yesterday.
The leaked episodes originate from the DEFLATE release group and are all marked as ‘INTERNAL’ releases, such as “Stranger Things S03E01 INTERNAL 2160p WEB H265-DEFLATE.”
In the past, we have seen several 4K videos being ripped from Netflix. In fact, the first rips came out four years ago. However, the WEB tags on today’s releases indicate that these files were directly decrypted from the original files, which means that there’s no loss in quality.
“Untouched releases must be considered as anything that has been losslessly downloaded by official (offered) or unofficial (backdoor) methods,” official Scene rules dictate.
These untouched releases are rare. We’ve only previously seen these types of Netflix leaks for a brief period in 2017. At the time, the releases stopped following a Widevine update, a source informed TorrentFreak.
Exactly how the release group was able to pull off these new leaks is unknown. TorrentFreak reached out to Netflix for a comment on the matter, but at the time of publication, we have yet to hear back.
The DEFLATE release group is no stranger to novel 4K leaks. Earlier this year the same group also released several movies from iTunes, including the entire James Bond collection. That was the first breach of its kind on iTunes.
The first Stranger Things leak was pointed out by Tarnkappe but several other titles have appeared online as well. The release group UHDCANDY, for example, also posted the first episodes from the latest seasons of Marvel’s Jessica Jones and Black Mirror.
The fact that two groups have been able to decrypt the 4K releases indicates that this ‘breach’ is widespread. It wouldn’t be a surprise to see more titles appear during the coming days, until the hole is patched again.
Bringing Rich Experiences to Memory-Constrained TV Devices
By Jason Munning, Archana Kumar, Kris Range
Netflix has over 148M paid members streaming on more than half a billion devices spanning over 1,900 different types. In the TV space alone, there are hundreds of device types that run the Netflix app. We need to support the same rich Netflix experience on not only high-end devices like the PS4 but also memory and processor-constrained consumer electronic devices that run a similar chipset as was used in an iPhone 3Gs.
In this post, we will discuss the development of the Rich Collection row and the iterations we went through to be able to support this experience across the majority of the TV ecosystem.
Rich Collection Row
One of our most ambitious UI projects to date on the TV app is the animated Rich Collection Row. The goal of this experience from a UX design perspective was to bring together a tightly-related set of original titles that, though distinct entities on their own, also share a connected universe. We hypothesized this design would net a far greater visual impact than if the titles were distributed individually throughout the page. We wanted the experience to feel less like scrolling through a row and more like exploring a connected world of stories.
For the collections below, the row is composed of characters representing each title in a collected universe overlaid onto a shared, full-bleed background image which depicts the shared theme for the collection. When the user first scrolls down to the row, the characters are grouped into a lineup of four. The name of the collection animates in along with the logos for each title while a sound clip plays which evokes the mood of the shared world. The characters slide off screen to indicate the first title is selected. As the user scrolls horizontally, characters slide across the screen and the shared backdrop scrolls with a parallax effect. For some of the collections, the character images themselves animate and a full-screen tint is applied using a color that is representative of the show’s creative (see “Character Images” below).
Once the user pauses on a title for more than two seconds, the trailer for that title cross-fades with the background image and begins playing.
As part of developing this type of UI experience on any platform, we knew we would need to think about creating smooth, performant animations with a balance between quality and download size for the images and video previews, all without degrading the performance of the app. Some of the metrics we use to measure performance on the Netflix TV app include animation frames per second (FPS), key input responsiveness (the amount of time before a member’s key press renders a change in the UI), video playback speed, and app start-up time.
UI developers on the Netflix TV app also need to consider some challenges that developers on other platforms often are able to take for granted. One such area is our graphics memory management. While web browsers and mobile phones have gigabytes of memory available for graphics, our devices are constrained to mere MBs. Our UI runs on top of a custom rendering engine which uses what we call a “surface cache” to optimize our use of graphics memory.
Surface cache is a reserved pool in main memory (or separate graphics memory on a minority of systems) that the Netflix app uses for storing textures (decoded images and cached resources). This benefits performance as these resources do not need to be re-decoded on every frame, saving CPU time and giving us a higher frame-rate for animations.
Each device running the Netflix TV application has a limited surface cache pool available so the rendering engine tries to maximize the usage of the cache as much as possible. This is a positive for the end experience because it means more textures are ready for re-use as a customer navigates around the app.
The amount of space a texture requires in surface cache is calculated as:
width * height * 4 bytes/pixel (for rgba)
Most devices currently run a 1280 x 720 Netflix UI. A full-screen image at this resolution will use 1280 * 720 * 4 = 3.5MB of surface cache. The majority of legacy devices run at 28MB of surface cache. At this size, you could fit the equivalent of 8 full-screen images in the cache. Reserving this amount of memory allows us to use transition effects between screens, layering/parallax effects, and to pre-render images for titles that are just outside the viewport to allow scrolling in any direction without images popping in. Devices in the Netflix TVUI ecosystem have a range of surface cache capacity, anywhere from 20MB to 96MB and we are able to enable/disable rich features based on that capacity.
When the limit of this memory pool is approached or exceeded, the Netflix TV app tries to free up space with resources it believes it can purge (i.e. images no longer in the viewport). If the cache is over budget with surfaces that cannot be purged, devices can behave in unpredictable ways ranging from application crashes, displaying garbage on the screen, or drastically slowing down animations.
Surface Cache and the Rich Collection Row
From developing previous rich UI features, we knew that surface cache usage was something to consider with the image-heavy design for the Rich Collection row. We made sure to test memory usage early on during manual testing and did not see any overages so we checked that box and proceeded with development. When we were approaching code-complete and preparing to roll out this experience to all users we ran our new code against our memory-usage automation suite as a sanity check.
The chart below shows an end-to-end automated test that navigates the Netflix app, triggering playbacks, searches, etc to simulate a user session. In this case, the test was measuring surface cache after every step. The red line shows a test run with the Rich Collection row and the yellow line shows a run without. The dotted red line is placed at 28MB which is the amount of memory reserved for surface cache on the test device.
Uh oh! We found some massive peaks (marked in red) in surface cache that exceeded our maximum recommended surface cache usage of 28MB and indicated we had a problem. Exceeding the surface cache limit can have a variety of impacts (depending on the device implementation) to the user from missing images to out of memory crashes. Time to put the brakes on the rollout and debug!
Assessing the Problem
The first step in assessing the problem was to drill down into our automation results to make sure they were valid. We re-ran the automation tests and found the results were reproducible. We could see the peaks were happening on the home screen where the Rich Collection row was being displayed. It was odd that we hadn’t seen the surface cache over budget (SCOB) errors while doing manual testing.
To close the gap we took a look at the configuration settings we were using in our automation and adjusted them to match the settings we use in production for real devices. We then re-ran the automation and still saw the peaks but in the process we discovered that the issue seemed to only present itself on devices running a version of our SDK from 2015. The manual testing hadn’t caught it because we had only been manually testing surface cache on more recent versions of the SDK. Once we did manual testing on our older SDK version we were able to reproduce the issue in our development environment.
During brainstorming with our platform team, we came across an internal bug report from 2017 that described a similar issue to what we were seeing — surfaces that were marked as purgeable in the surface cache were not being fully purged in this older version of our SDK. From the ticket we could see that the inefficiency was fixed in the next release of our SDK but, because not all devices get Netflix SDK updates, the fix could not be back-ported to the 2015 version that had this issue. Considering that a significant share of our actively-used TV devices are running this 2015 version and won’t be updated to a newer SDK, we knew we needed to find a fix that would work for this specific version — a similar situation to the pre-2000 world before browsers auto-updated and developers had to code to specific browser versions.
Finding a Solution
The first step was to take a look at what textures were in the surface cache (especially those marked as un-purgeable) at the time of the overage and see where we might be able to make gains by reducing the size of images. For this we have a debug port that allows us to inspect which images are in the cache. This shows us information about the images in the surface cache including url. The links can then be hovered over to show a small thumbnail of the image.
From snapshots such as this one we could see the Rich Collection row alone filled about 15.3MB of surface cache which is >50% of the 28MB total graphics memory available on devices running our 2015 SDK.
The largest un-purgeable images we found were:
Character images (6 * 1MB)
Background images for the parallax background (2 * 2.9MB)
Unknown — a full screen blank white rectangle (3.5MB)
Some of our rich collections featured the use of animated character assets to give an even richer experience. We created these assets using a Netflix-proprietary animation format called a Scriptable Network Graphic (SNG) which was first supported in 2017 and is similar to an animated PNG. The SNG files have a relatively large download size at ~1.5MB each. In order to ensure these assets are available at the time the rich collection row enters the viewport, we preload the SNGs during app startup and save them to disk. If the user relaunches the app in the future and receives the same collection row, the SNG files can be read from the disk cache, avoiding the need to download them again. Devices running an older version of the SDK fallback to a static character image.
At the time of the overage we found thatsix character images were present in the cache — four on the screen and two preloaded off of the screen. Our first savings came from only preloading one image for a total of five characters in the cache. Right off the bat this saved us almost 7% in surface cache with no observable impact to the experience.
Next we created cropped versions of the static character images that did away with extra transparent pixels (that still count toward surface cache usage!). This required modifications to the image pipeline in order to trim the whitespace but still maintain the relative size of the characters — so the relative heights of the characters in the lineup would still be preserved. The cropped character assets used only half of the surface cache memory of the full-size images and again had no visible impact to the experience.
In order to achieve the illusion of a continuously scrolling parallax background, we were using two full screen background images essentially placed side by side which together accounted for ~38% of the experience’s surface cache usage. We worked with design to create a new full-screen background image that could be used for a fallback experience (without parallax) on devices that couldn’t support loading both of the background images for the parallax effect. Using only one background image saved us 19% in surface cache for the fallback experience.
After trial and error removing React components from our local build and inspecting the surface cache we found that the unknown widget that showed as a full screen blank white rectangle in our debug tool was added by the full-screen tint effect we were using. In order to apply the tint, the graphics layer essentially creates a full screen texture that is colored dynamically and overlaid over the visible viewport. Removing the tint overlay saved us 23% in surface cache.
Removing the tint overlay and using a single background image gave us a fallback experience that used 42% less surface cache than the full experience.
When all was said and done, the surface cache usage of the fallback experience (including fewer preloaded characters, cropped character images, a single background, and no tint overlay) clocked in at about 5MB which gave us a total savings of almost 67% over our initial implementation.
We were able to target this fallback experience to devices running the 2015 and older SDK, while still serving the full rich experience (23% lower surface cache usage than the original implementation) to devices running the new SDKs.
At this point our automation was passing so we began slowly rolling out this experience to all members. As part of any rollout, we have a dashboard of near real-time metrics that we monitor. To our chagrin we saw that another class of devices — those running the 2017 SDK — also were reporting higher SCOB errors than the control.
Thanks to our work on the fallback experience we were able to change the configuration for this class of devices on the fly to serve the fallback experience (without parallax background and tint). We found if we used the fallback experience we could still get away with using the animated characters. So yet another flavor of the experience was born.
Improvements and Takeaways
At Netflix we strive to move fast in innovation and learn from all projects whether they are successes or failures. From this project, we learned that there were gaps in our understanding of how our underlying graphics memory worked and in the tooling we used to monitor that memory. We kicked off an effort to understand this graphics memory space at a low level and compiled a set of best practices for developers beginning work on a project. We also documented a set of tips and tools for debugging and optimizing surface cache should a problem arise.
As part of that effort, we expanded our suite of build-over-build automated tests to increase coverage across our different SDK versions on real and reference devices to detect spikes/regressions in our surface cache usage.
We began logging SCOB errors with more detail in production so we can target the specific areas of the app that we need to optimize. We also are now surfacing surface cache errors as notifications in the dev environment so developers can catch them sooner.
And we improved our surface cache inspector tool to be more user friendly and to integrate with our Chrome DevTools debugger:
As UI engineers on the TVUI platform at Netflix, we have the challenge of delivering ambitious UI experiences to a highly fragmented ecosystem of devices with a wide range of performance characteristics. It’s important for us to reach as many devices as possible in order to give our members the best possible experience.
The solutions we developed while scaling the Rich Collection row have helped inform how we approach ambitious UI projects going forward. With our optimizations and fallback experiences we were able to almost double the number of devices that were able to get the Rich Collection row.
We are now more thoughtful about designing fallback experiences that degrade gracefully as part of the initial design phase instead of just as a reaction to problems we encounter in the development phase. This puts us in a position of being able to scale an experience very quickly with a set of knobs and levers that can be used to tune an experience for a specific class of devices.
Most importantly, we received feedback that our members enjoyed our Rich Collection row experience — both the full and fallback experiences — when we rolled them out globally at the end of 2018.
If this interests you and want to help build the future UIs for discovering and watching shows and movies, join our team!
Hack Days are a big deal at Netflix. They’re a chance to bring together employees from all our different disciplines to explore new ideas and experiment with emerging technologies.
For the most recent hack day, we channeled our creative energy towards our studio efforts. The goal remained the same: team up with new colleagues and have fun while learning, creating, and experimenting. We know even the silliest idea can spur something more.
The most important value of hack days is that they support a culture of innovation. We believe in this work, even if it never ships, and love to share the creativity and thought put into these ideas.
Below, you can find videos made by the hackers of some of our favorite hacks from this event.
Project Rumble Pak
You’re watching your favorite episode of Voltron when, after a suspenseful pause, there’s a huge explosion — and your phone starts to vibrate in your hands.
The Project Rumble Pak hack day project explores how haptics can enhance the content you’re watching. With every explosion, sword clank, and laser blast, you get force feedback to amp up the excitement.
For this project, we synchronized Netflix content with haptic effects using Immersion Corporation technology.
Introducing The Voice of Netflix. We trained a neural net to spot words in Netflix content and reassemble them into new sentences on demand. For our stage demonstration, we hooked this up to a speech recognition engine to respond to our verbal questions in the voice of Netflix’s favorite characters. Try it out yourself at blogofsomeguy.com/v!
TerraVision re-envisions the creative process and revolutionizes the way our filmmakers can search and discover filming locations. Filmmakers can drop a photo of a look they like into an interface and find the closest visual matches from our centralized library of locations photos. We are using a computer vision model trained to recognize places to build reverse image search functionality. The model converts each image into a small dimensional vector, and the matches are obtained by computing the nearest neighbors of the query.
Have you ever found yourself needing to give the Evil Eye™ to colleagues who are hogging your conference room after their meeting has ended?
Our hack is a simple web application that allows employees to select a Netflix meeting room anywhere in the world, and press a button to kick people out of their meeting room if they have overstayed their meeting. First, the app looks up calendar events associated with the room and finds the latest meeting in the room that should have already ended. It then automatically calls in to that meeting and plays walk-off music similar to the Oscar’s to not-so-subtly encourage your colleagues to Get Out! We built this hack using Java (Springboot framework), the Google OAuth and Calendar APIs (for finding rooms) and Twilio API (for calling into the meeting), and deployed it on AWS.
By Pythonistas at Netflix, coordinated by Amjith Ramanujam and edited by Ellen Livengood
As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there.
Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens afterwards (i.e., video streaming) takes place in the Open Connect network. Content is placed on the network of servers in the Open Connect CDN as close to the end user as possible, improving the streaming experience for our customers and reducing costs for both Netflix and our Internet Service Provider (ISP) partners.
Various software systems are needed to design, build, and operate this CDN infrastructure, and a significant number of them are written in Python. The network devices that underlie a large portion of the CDN are mostly managed by Python applications. Such applications track the inventory of our network gear: what devices, of which models, with which hardware components, located in which sites. The configuration of these devices is controlled by several other systems including source of truth, application of configurations to devices, and back up. Device interaction for the collection of health and other operational data is yet another Python application. Python has long been a popular programming language in the networking space because it’s an intuitive language that allows engineers to quickly solve networking problems. Subsequently, many useful libraries get developed, making the language even more desirable to learn and use.
Demand Engineering is responsible for Regional Failovers, Traffic Distribution, Capacity Operations, and Fleet Efficiency of the Netflix cloud. We are proud to say that our team’s tools are built primarily in Python. The service that orchestrates failover uses numpy and scipy to perform numerical analysis, boto3 to make changes to our AWS infrastructure, rq to run asynchronous workloads and we wrap it all up in a thin layer of Flask APIs. The ability to drop into a bpython shell and improvise has saved the day more than once.
We are heavy users of Jupyter Notebooks and nteract to analyze operational data and prototype visualization tools that help us detect capacity regressions.
The CORE team uses Python in our alerting and statistical analytical work. We lean on many of the statistical and mathematical libraries (numpy, scipy, ruptures, pandas) to help automate the analysis of 1000s of related signals when our alerting systems indicate problems. We’ve developed a time series correlation system used both inside and outside the team as well as a distributed worker system to parallelize large amounts of analytical work to deliver results quickly.
Python is also a tool we typically use for automation tasks, data exploration and cleaning, and as a convenient source for visualization work.
Monitoring, alerting and auto-remediation
The Insight Engineering team is responsible for building and operating the tools for operational insight, alerting, diagnostics, and auto-remediation. With the increased popularity of Python, the team now supports Python clients for most of their services. One example is the Spectator Python client library, a library for instrumenting code to record dimensional time series metrics. We build Python libraries to interact with other Netflix platform level services. In addition to libraries, the Winston and Bolt products are also built using Python frameworks (Gunicorn + Flask + Flask-RESTPlus).
The information security team uses Python to accomplish a number of high leverage goals for Netflix: security automation, risk classification, auto-remediation, and vulnerability identification to name a few. We’ve had a number of successful Python open sources, including Security Monkey (our team’s most active open source project). We leverage Python to protect our SSH resources using Bless. Our Infrastructure Security team leverages Python to help with IAM permission tuning using Repokid. We use Python to help generate TLS certificates using Lemur.
Some of our more recent projects include Prism: a batch framework to help security engineers measure paved road adoption, risk factors, and identify vulnerabilities in source code. We currently provide Python and Ruby libraries for Prism. The Diffy forensics triage tool is written entirely in Python. We also use Python to detect sensitive data using Lanius.
We use Python extensively within our broader Personalization Machine Learning Infrastructure to train some of the Machine Learning models for key aspects of the Netflix experience: from our recommendation algorithms to artwork personalization to marketing algorithms. For example, some algorithms use TensorFlow, Keras, and PyTorch to learn Deep Neural Networks, XGBoost and LightGBM to learn Gradient Boosted Decision Trees or the broader scientific stack in Python (e.g. numpy, scipy, sklearn, matplotlib, pandas, cvxpy). Because we’re constantly trying out new approaches, we use Jupyter Notebooks to drive many of our experiments. We have also developed a number of higher-level libraries to help integrate these with the rest of our ecosystem (e.g. data access, fact logging and feature extraction, model evaluation, and publishing).
Machine Learning Infrastructure
Besides personalization, Netflix applies machine learning to hundreds of use cases across the company. Many of these applications are powered by Metaflow, a Python framework that makes it easy to execute ML projects from the prototype stage to production.
Metaflow pushes the limits of Python: We leverage well parallelized and optimized Python code to fetch data at 10Gbps, handle hundreds of millions of data points in memory, and orchestrate computation over tens of thousands of CPU cores.
But Python plays a huge role in how we provide those services. Python is a primary language when we need to develop, debug, explore, and prototype different interactions with the Jupyter ecosystem. We use Python to build custom extensions to the Jupyter server that allows us to manage tasks like logging, archiving, publishing, and cloning notebooks on behalf of our users. We provide many flavors of Python to our users via different Jupyter kernels, and manage the deployment of those kernel specifications using Python.
The Big Data Orchestration team is responsible for providing all of the services and tooling to schedule and execute ETL and Adhoc pipelines.
Many of the components of the orchestration service are written in Python. Starting with our scheduler, which uses Jupyter Notebooks with papermill to provide templatized job types (Spark, Presto, …). This allows our users to have a standardized and easy way to express work that needs to be executed. You can see some deeper details on the subject here. We have been using notebooks as real runbooks for situations where human intervention is required — for example: to restart everything that has failed in the last hour.
Internally, we also built an event-driven platform that is fully written in Python. We have created streams of events from a number of systems that get unified into a single tool. This allows us to define conditions to filter events, and actions to react or route them. As a result of this, we have been able to decouple microservices and get visibility into everything that happens on the data platform.
Our team also built the pygenie client which interfaces with Genie, a federated job execution service. Internally, we have additional extensions to this library that apply business conventions and integrate with the Netflix platform. These libraries are the primary way users interface programmatically with work in the Big Data platform.
Finally, it’s been our team’s commitment to contribute to papermill and scrapbook open source projects. Our work there has been both for our own and external use cases. These efforts have been gaining a lot of traction in the open source community and we’re glad to be able to contribute to these shared projects.
The scientific computing team for experimentation is creating a platform for scientists and engineers to analyze AB tests and other experiments. Scientists and engineers can contribute new innovations on three fronts, data, statistics, and visualizations.
The Metrics Repo is a Python framework based on PyPika that allows contributors to write reusable parameterized SQL queries. It serves as an entry point into any new analysis.
The Causal Models library is a Python & R framework for scientists to contribute new models for causal inference. It leverages PyArrow and RPy2 so that statistics can be calculated seamlessly in either language.
The Visualizations library is based on Plotly. Since Plotly is a widely adopted visualization spec, there are a variety of tools that allow contributors to produce an output that is consumable by our platforms.
The Partner Ecosystem group is expanding its use of Python for testing Netflix applications on devices. Python is forming the core of a new CI infrastructure, including controlling our orchestration servers, controlling Spinnaker, test case querying and filtering, and scheduling test runs on devices and containers. Additional post-run analysis is being done in Python using TensorFlow to determine which tests are most likely to show problems on which devices.
Video Encoding and Media Cloud Engineering
Our team takes care of encoding (and re-encoding) the Netflix catalog, as well as leveraging machine learning for insights into that catalog. We use Python for ~50 projects such as vmaf and mezzfs, we build computer vision solutions using a media map-reduce platform called Archer, and we use Python for many internal projects. We have also open sourced a few tools to ease development/distribution of Python projects, like setupmeta and pickley.
Netflix Animation and NVFX
Python is the industry standard for all of the major applications we use to create Animated and VFX content, so it goes without saying that we are using it very heavily. All of our integrations with Maya and Nuke are in Python, and the bulk of our Shotgun tools are also in Python. We’re just getting started on getting our tooling in the cloud, and anticipate deploying many of our own custom Python AMIs/containers.
Content Machine Learning, Science & Analytics
The Content Machine Learning team uses Python extensively for the development of machine learning models that are the core of forecasting audience size, viewership, and other demand metrics for all content.
Back in January, a coalition of companies and organizations with ties to the entertainment industries called on local telecoms regulator CRTC to implement a national website blocking regime.
Under the banner of Fairplay Canada, members including Bell, Cineplex, Directors Guild of Canada, Maple Leaf Sports and Entertainment, Movie Theatre Association of Canada, and Rogers Media, spoke of an industry under threat from marauding pirates. But just how serious is this threat?
The results of a new survey commissioned by Innovation Science and Economic Development Canada (ISED) in collaboration with the Department of Canadian Heritage (PCH) aims to shine light on the problem by revealing the online content consumption habits of citizens in the Great White North.
While there are interesting findings for those on both sides of the site-blocking debate, the situation seems somewhat removed from the Armageddon scenario predicted by the entertainment industries.
Carried out among 3,301 Canadians aged 12 years and over, the Kantar TNS study aims to cover copyright infringement in six key content areas – music, movies, TV shows, video games, computer software, and eBooks. Attitudes and behaviors are also touched upon while measuring the effectiveness of Canada’s copyright measures.
General Digital Content Consumption
In its introduction, the report notes that 28 million Canadians used the Internet in the three-month study period to November 27, 2017. Of those, 22 million (80%) consumed digital content. Around 20 million (73%) streamed or accessed content, 16 million (59%) downloaded content, while 8 million (28%) shared content.
Music, TV shows and movies all battled for first place in the consumption ranks, with 48%, 48%, and 46% respectively.
According to the study, the majority of Canadians do things completely by the book. An impressive 74% of media-consuming respondents said that they’d only accessed material from legal sources in the preceding three months.
The remaining 26% admitted to accessing at least one illegal file in the same period. Of those, just 5% said that all of their consumption was from illegal sources, with movies (36%), software (36%), TV shows (34%) and video games (33%) the most likely content to be consumed illegally.
Interestingly, the study found that few demographic factors – such as gender, region, rural and urban, income, employment status and language – play a role in illegal content consumption.
“We found that only age and income varied significantly between consumers who infringed by downloading or streaming/accessing content online illegally and consumers who did not consume infringing content online,” the report reads.
“More specifically, the profile of consumers who downloaded or streamed/accessed infringing content skewed slightly younger and towards individuals with household incomes of $100K+.”
Licensed services much more popular than pirate haunts
It will come as no surprise that Netflix was the most popular service with consumers, with 64% having used it in the past three months. Sites like YouTube and Facebook were a big hit too, visited by 36% and 28% of content consumers respectively.
Overall, 74% of online content consumers use licensed services for content while 42% use social networks. Under a third (31%) use a combination of peer-to-peer (BitTorrent), cyberlocker platforms, or linking sites. Stream-ripping services are used by 9% of content consumers.
“Consumers who reported downloading or streaming/accessing infringing content only are less likely to use licensed services and more likely to use peer-to-peer/cyberlocker/linking sites than other consumers of online content,” the report notes.
Attitudes towards legal consumption & infringing content
In common with similar surveys over the years, the Kantar research looked at the reasons why people consume content from various sources, both legal and otherwise.
Convenience (48%), speed (36%) and quality (34%) were the most-cited reasons for using legal sources. An interesting 33% of respondents said they use legal sites to avoid using illegal sources.
On the illicit front, 54% of those who obtained unauthorized content in the previous three months said they did so due to it being free, with 40% citing convenience and 34% mentioning speed.
Almost six out of ten (58%) said lower costs would encourage them to switch to official sources, with 47% saying they’d move if legal availability was improved.
Canada’s ‘Notice-and-Notice’ warning system
People in Canada who share content on peer-to-peer systems like BitTorrent without permission run the risk of receiving an infringement notice warning them to stop. These are sent by copyright holders via users’ ISPs and the hope is that the shock of receiving a warning will turn consumers back to the straight and narrow.
The study reveals that 10% of online content consumers over the age of 12 have received one of these notices but what kind of effect have they had?
“Respondents reported that receiving such a notice resulted in the following: increased awareness of copyright infringement (38%), taking steps to ensure password protected home networks (27%), a household discussion about copyright infringement (27%), and discontinuing illegal downloading or streaming (24%),” the report notes.
While these are all positives for the entertainment industries, Kantar reports that almost a quarter (24%) of people who receive a notice simply ignore them.
Once upon a time, people obtaining music via P2P networks was cited as the music industry’s greatest threat but, with the advent of sites like YouTube, so-called stream-ripping is the latest bogeyman.
According to the study, 11% of Internet users say they’ve used a stream-ripping service. They are most likely to be male (62%) and predominantly 18 to 34 (52%) years of age.
“Among Canadians who have used a service to stream-rip music or entertainment, nearly half (48%) have used stream-ripping sites, one-third have used downloader apps (38%), one-in-seven (14%) have used a stream-ripping plug-in, and one-in-ten (10%) have used stream-ripping software,” the report adds.
Set-Top Boxes and VPNs
Few general piracy studies would be complete in 2018 without touching on set-top devices and Virtual Private Networks and this report doesn’t disappoint.
More than one in five (21%) respondents aged 12+ reported using a VPN, with the main purpose of securing communications and Internet browsing (57%).
A relatively modest 36% said they use a VPN to access free content while 32% said the aim was to access geo-blocked content unavailable in Canada. Just over a quarter (27%) said that accessing content from overseas at a reasonable price was the main motivator.
One in ten (10%) of respondents reported using a set-top box, with 78% stating they use them to access paid-for content. Interestingly, only a small number say they use the devices to infringe.
“A minority use set-top boxes to access other content that is not legal or they are unsure if it is legal (16%), or to access live sports that are not legal or they are unsure if it is legal (11%),” the report notes.
“Individuals who consumed a mix of legal and illegal content online are more likely to use VPN services (42%) or TV set-top boxes (21%) than consumers who only downloaded or streamed/accessed legal content.”
Kantar says that the findings of the report will be used to help policymakers evaluate how Canada’s Copyright Act is coping with a changing market and technological developments.
“This research will provide the necessary information required to further develop copyright policy in Canada, as well as to provide a foundation to assess the effectiveness of the measures to address copyright infringement, should future analysis be undertaken,” it concludes.
In April 2017, the first episode of the brand new season of Netflix’s Orange is the New Black was uploaded to The Pirate Bay, months ahead of its official release date.
The leak was the work of a hacking entity calling itself TheDarkOverlord (TDO). One of its members had contacted TorrentFreak months earlier claiming that the content was in its hands but until the public upload, nothing could be confirmed.
TDO told us it had obtained the episodes after hacking the systems of Hollywood-based Larson Studios, an ADR (additional dialogue recorded) studio, back in 2016. TDO had attempted to blackmail the company into paying a bitcoin ransom but when it wasn’t forthcoming, TDO pressed the nuclear button.
Netflix responded by issuing a wave of takedown notices but soon TDO moved onto a new target. In June 2017, TDO followed up on an earlier threat to leak content owned by ABC.
But while TDO was perhaps best known for its video-leaking exploits, the group’s core ‘business’ was hacking what many perceived to be softer targets. TDO ruthlessly slurped confidential data from weakly protected computer systems at medical facilities, private practices, and businesses large and small.
In each case, the group demanded ransoms in exchange for silence and leaked sensitive data to the public if none were paid. With dozens of known targets, TDO found itself at the center of an international investigation, led by the FBI. That now appears to have borne some fruit, with the arrest of an individual in Serbia.
Serbian police say that members of its Ministry of Internal Affairs, Criminal Police Directorate (UCC), in coordination with the Special Prosecution for High-Tech Crime, have taken action against a suspected member of TheDarkOverlord group.
Police say they tracked down a Belgrade resident, who was arrested and taken into custody. Identified only by the initials “S.S”, police say the individual was born in 1980 but have released no further personal details. A search of his apartment and other locations led to the seizure of items of digital equipment.
“According to the order of the Special Prosecutor’s Office for High-Tech Crime, criminal charges will be brought against him because of the suspicion that he committed the criminal offense of unauthorized access to a protected computer, computer networks and electronic processing, and the criminal offense of extortion,” a police statement reads.
In earlier correspondence with TF, the TDO member always gave the impression of working as part of a team but we only had a single contact point which appeared to be the same person. However, Serbian authorities say the larger investigation is aimed at uncovering “a large number of people” who operate under the banner of “TheDarkOverlord”.
Since June 2016, the group is said to have targeted at least 50 victims while demanding bitcoin ransoms to avoid disclosure of their content. Serbian authorities say that on the basis of available data, TDO received payments of more than $275,000.
In 2018, dealing with copyright infringement claims is a daily issue for many content platforms. The law in many regions demands swift attention and in order to appease copyright holders, most platforms are happy to oblige.
While it’s not unusual for ‘pirate’ content and services to suddenly disappear in response to a DMCA or similar notice, the same is rarely true for entire legitimate services.
But that’s what appeared to happen on the Roku platform during the night, when YouTube, Netflix and other channels disappeared only to be replaced with an ominous anti-piracy warning.
As the embedded tweet shows, the message caused confusion among Roku users who were only using their devices to access legal content. Messages replacing Netflix and YouTube seemed to have caused the greatest number of complaints but many other services were affected.
But while users scratched their heads, with some misguidedly blaming Roku for not being diligent enough against piracy, Roku took to Twitter to reveal that rather than anti-piracy complaints against the channels in question, a technical hitch was to blame.
We are aware of a technical glitch that is affecting the Channel Store and channel playback on your Roku device. We hope to resolve the issue shortly. You can get a status update at https://t.co/mxCNxoG36N. We apologize for any inconvenience.
However, a subsequent statement to CNET suggested that while blacking out Netflix and YouTube might have been accidental, Roku appears to have been taking anti-piracy action against another channel or channels at the time, with the measures inadvertently spilling over to innocent parties.
“We use that warning when we detect content that has violated copyright,” Roku said in a statement.
“Some channels in our Channel Store displayed that message and became inaccessible after Roku implemented a targeted anti-piracy measure on the platform.”
The precise nature of the action taken by Roku is unknown but it’s clear that copyright infringement is currently a hot topic for the platform.
Roku is currently fighting legal action in Mexico which ordered its products off the shelves following complaints that its platform is used by pirates. That led to an FBI warning being shown for what was believed to be the first time against the XTV and other channels last year.
This March, Roku took action against the popular USTVNow channel following what was described as a “third party” copyright infringement complaint. Just a couple of weeks later, Roku followed up by removing the controversial cCloud channel.
With Roku currently fighting to have sales reinstated in Mexico against a backdrop of claims that up to 40% of its users are pirates, it’s unlikely that Roku is suddenly going to go soft on piracy, so more channel outages can be expected in the future.
In the meantime, the scary FBI warnings of last evening are beginning to fade away (for legitimate channels at least) after the company issued advice on how to fix the problem.
“The recent outage which affected some channels has been resolved. Go to Settings > System > System update > Check now for a software update. Some channels may require you to log in again. Thank you for your patience,” the company wrote in an update.
Информация от Австрия за начина, по който обществената телевизия е под надзора на местния регулатор за изпълнение на обществената мисия.
Регулаторът не е одобрил две предложения на ORF – за собствен канал в YouTube и за нова услуга – предавания/филми от програмите на ORF срещу абонамент.
Видимо в Австрия регулаторът има роля ex ante в по-широк обхват – което е в защита на аудиторията. БНТ е в YouYube – или поне в интернет се вижда “Официален канал на БНТ” в YouTube. Но това не е единствената илюстрация за ограничената роля на българския регулатор – не е ясно например какво става с политематичната БНТ2, нито може да се прочете мониторингов доклад как БНТ изпълнява лицензиите си.
ORF кандидатства за разрешение да добави канал в YouTube към своите медийни дейности. Този канал трябваше да предложи главно предавания на ORF, които поради правни ограничения понастоящем не могат да бъдат предоставени за повече от 7 дни чрез catch-up услугата ORF TVthek.
KommAustria твърди, че изключителното сътрудничество между ORF и YouTube би дискриминирало други, сравними компании.
Регулаторът също така взема предвид съществуващите услуги, когато одобрява нови услуги на ORF. KommAustria предполага намаляване на интереса към ORF TVthek, ако се създаде канал на ORF в YouTube. Освен това регулаторът твърди, че е възможно да се удължи общият период за предоставяне на програми на ORF TVthek (повече от 7 дни) чрез преразглеждане на правната рамка.
Според KommAustria по принцип не е забранено на ORF да предлага абонаментна услуга. В конкретния случай обаче нито икономически, нито правно искането не е обосновано, напр. остава “напълно неясно” колко голям ще бъде делът от таксата за тази услуга – доколкото разбирам, не е изяснена пропорцията между абонамент и обществено финансиране.
If, like us, you’ve been bingeflixing your way through Netflix’s new show, Lost in Space, you may have noticed a Raspberry Pi being used as futuristic space tech.
Danger, Will Robinson, that probably won’t work
This isn’t the first time a Pi has been used as a film or television prop. From Mr. Robot and Disney Pixar’s Big Hero 6 to Mr. Robot, Sense8, and Mr. Robot, our humble little computer has become quite the celeb.
Raspberry Pi Spy has been working hard to locate and document the appearance of the Raspberry Pi in some of our favourite shows and movies. He’s created this video covering 2010-2017:
Researchers at Princeton University have released IoT Inspector, a tool that analyzes the security and privacy of IoT devices by examining the data they send across the Internet. They’ve already used the tool to study a bunch of different IoT devices. From their blog post:
Finding #3: Many IoT Devices Contact a Large and Diverse Set of Third Parties
In many cases, consumers expect that their devices contact manufacturers’ servers, but communication with other third-party destinations may not be a behavior that consumers expect.
We have found that many IoT devices communicate with third-party services, of which consumers are typically unaware. We have found many instances of third-party communications in our analyses of IoT device network traffic. Some examples include:
Samsung Smart TV. During the first minute after power-on, the TV talks to Google Play, Double Click, Netflix, FandangoNOW, Spotify, CBS, MSNBC, NFL, Deezer, and Facebookeven though we did not sign in or create accounts with any of them.
Amcrest WiFi Security Camera. The camera actively communicates with cellphonepush.quickddns.com using HTTPS. QuickDDNS is a Dynamic DNS service provider operated by Dahua. Dahua is also a security camera manufacturer, although Amcrest’s website makes no references to Dahua. Amcrest customer service informed us that Dahua was the original equipment manufacturer.
Halo Smoke Detector. The smart smoke detector communicates with broker.xively.com. Xively offers an MQTT service, which allows manufacturers to communicate with their devices.
Geeni Light Bulb. The Geeni smart bulb communicates with gw.tuyaus.com, which is operated by TuYa, a China-based company that also offers an MQTT service.
We also looked at a number of other devices, such as Samsung Smart Camera and TP-Link Smart Plug, and found communications with third parties ranging from NTP pools (time servers) to video storage services.
Their first two findings are that “Many IoT devices lack basic encryption and authentication” and that “User behavior can be inferred from encrypted IoT device traffic.” No surprises there.
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.