Tag Archives: data-management

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/uda-unified-data-architecture-6a6aee261d8d

By Alex Hutter, Alexandre Bertails, Claire Wang, Haoyuan He, Kishore Banala, Peter Royal, Shervin Afshar

As Netflix’s offerings grow — across films, series, games, live events, and ads — so does the complexity of the systems that support it. Core business concepts like ‘actor’ or ‘movie’ are modeled in many places: in our Enterprise GraphQL Gateway powering internal apps, in our asset management platform storing media assets, in our media computing platform that powers encoding pipelines, to name a few. Each system models these concepts differently and in isolation, with little coordination or shared understanding. While they often operate on the same concepts, these systems remain largely unaware of that fact, and of each other.

Spider-Man Pointing meme with each Spider-Man labelled as: “it’s a movie”, “it’s a tv show”, “it’s a game”.

As a result, several challenges emerge:

  • Duplicated and Inconsistent Models — Teams re-model the same business entities in different systems, leading to conflicting definitions that are hard to reconcile.
  • Inconsistent Terminology — Even within a single system, teams may use different terms for the same concept, or the same term for different concepts, making collaboration harder.
  • Data Quality Issues — Discrepancies and broken references are hard to detect across our many microservices. While identifiers and foreign keys exist, they are inconsistently modeled and poorly documented, requiring manual work from domain experts to find and fix any data issues.
  • Limited Connectivity — Within systems, relationships between data are constrained by what each system supports. Across systems, they are effectively non-existent.

To address these challenges, we need new foundations that allow us to define a model once, at the conceptual level, and reuse those definitions everywhere. But it isn’t enough to just document concepts; we need to connect them to real systems and data. And more than just connect, we have to project those definitions outward, generating schemas and enforcing consistency across systems. The conceptual model must become part of the control plane.

These were the core ideas that led us to build UDA.

Introducing UDA

UDA (Unified Data Architecture) is the foundation for connected data in Content Engineering. It enables teams to model domains once and represent them consistently across systems — powering automation, discoverability, and semantic interoperability.

Using UDA, users and systems can:

Register and connect domain models — formal conceptualizations of federated business domains expressed as data.

  • Why? So everyone uses the same official definitions for business concepts, which avoids confusion and stops different teams from rebuilding similar models in conflicting ways.

Catalog and map domain models to data containers, such as GraphQL type resolvers served by a Domain Graph Service, Data Mesh sources, or Iceberg tables, through their representation as a graph.

  • Why? To make it easy to find where the actual data for these business concepts lives (e.g., in which specific database, table, or service) and understand how it’s structured there.

Transpile domain models into schema definition languages like GraphQL, Avro, SQL, RDF, and Java, while preserving semantics.

  • Why? To automatically create consistent technical data structures (schemas) for various systems directly from the domain models, saving developers manual effort and reducing errors caused by out-of-sync definitions.

Move data faithfully between data containers, such as from federated GraphQL entities to Data Mesh (a general purpose data movement and processing platform for moving data between Netflix systems at scale), Change Data Capture (CDC) sources to joinable Iceberg Data Products.

  • Why? To save developer time by automatically handling how data is moved and correctly transformed between different systems. This means less manual work to configure data movement, ensuring data shows up consistently and accurately wherever it’s needed.

Discover and explore domain concepts via search and graph traversal.

  • Why? So anyone can more easily find the specific business information they’re looking for, understand how different concepts and data are related, and be confident they are accessing the correct information.

Programmatically introspect the knowledge graph using Java, GraphQL, or SPARQL.

  • Why? So developers can build smarter applications that leverage this connected business information, automate more complex data-dependent workflows, and help uncover new insights from the relationships in the data.

This post introduces the foundations of UDA as a knowledge graph, connecting domain models to data containers through mappings, and grounded in an in-house metamodel, or model of models, called Upper. Upper defines the language for domain modeling in UDA and enables projections that automatically generate schemas and pipelines across systems.

Image of the UDA knowledge graph. A central node representing a domain model is connected to other nodes representing Data Mesh, GraphQL, and Iceberg data containers.
The same domain model can be connected to semantically equivalent data containers in the UDA knowledge graph.

This post also highlights two systems that leverage UDA in production:

Primary Data Management (PDM) is our platform for managing authoritative reference data and taxonomies. PDM turns domain models into flat or hierarchical taxonomies that drive a generated UI for business users. These taxonomy models are projected into Avro and GraphQL schemas, automatically provisioning data products in the Warehouse and GraphQL APIs in the Enterprise Gateway.

Sphere is our self-service operational reporting tool for business users. Sphere uses UDA to catalog and relate business concepts across systems, enabling discovery through familiar terms like ‘actor’ or ‘movie.’ Once concepts are selected, Sphere walks the knowledge graph and generates SQL queries to retrieve data from the warehouse, no manual joins or technical mediation required.

UDA is a Knowledge Graph

UDA needs to solve the data integration problem. We needed a data catalog unified with a schema registry, but with a hard requirement for semantic integration. Connecting business concepts to schemas and data containers in a graph-like structure, grounded in strong semantic foundations, naturally led us to consider a knowledge graph approach.

We chose RDF and SHACL as the foundation for UDA’s knowledge graph. But operationalizing them at enterprise scale surfaced several challenges:

  • RDF lacked a usable information model. While RDF offers a flexible graph structure, it provides little guidance on how to organize data into named graphs, manage ontology ownership, or define governance boundaries. Standard follow-your-nose mechanisms like owl:imports apply only to ontologies and don’t extend to named graphs; we needed a generalized mechanism to express and resolve dependencies between them.
  • SHACL is not a modeling language for enterprise data. Designed to validate native RDF, SHACL assumes globally unique URIs and a single data graph. But enterprise data is structured around local schemas and typed keys, as in GraphQL, Avro, or SQL. SHACL could not express these patterns, making it difficult to model and validate real-world data across heterogeneous systems.
  • Teams lacked shared authoring practices. Without strong guidelines, teams modeled their ontologies inconsistently breaking semantic interoperability. Even subtle differences in style, structure, or naming led to divergent interpretations and made transpilation harder to define consistently across schemas.
  • Ontology tooling lacked support for collaborative modeling. Unlike GraphQL Federation, ontology frameworks had no built-in support for modular contributions, team ownership, or safe federation. Most engineers found the tools and concepts unfamiliar, and available authoring environments lacked the structure needed for coordinated contributions.

To address these challenges, UDA adopts a named-graph-first information model. Each named graph conforms to a governing model, itself a named graph in the knowledge graph. This systematic approach ensures resolution, modularity, and enables governance across the entire graph. While a full description of UDA’s information infrastructure is beyond the scope of this post, the next sections explain how UDA bootstraps the knowledge graph with its metamodel and uses it to model data container representations and mappings.

Upper is Domain Modeling

Upper is a language for formally describing domains — business or system — and their concepts. These concepts are organized into domain models: controlled vocabularies that define classes of keyed entities, their attributes, and their relationships to other entities, which may be keyed or nested, within the same domain or across domains. Keyed concepts within a domain model can be organized in taxonomies of types, which can be as complex as the business or the data system needs them to be. Keyed concepts can also be extended from other domain models — that is, new attributes and relationships can be contributed monotonically. Finally, Upper ships with a rich set of datatypes for attribute values, which can also be customized per domain.

Visualization of the UDA graph representation of a One Piece character. The Character node in the graph is connected to a Devil Fruit node. The Devil Fruit node is connected to a Devil Fruit Type node.
The graph representation of the onepiece: domain model from our UI. Depicted here you can see how Characters are related to Devil Fruit, and that each Devil Fruit has a type.

Upper domain models are data. They are expressed as conceptual RDF and organized into named graphs, making them introspectable, queryable, and versionable within the UDA knowledge graph. This graph unifies not just the domain models themselves, but also the schemas they transpile to — GraphQL, Avro, Iceberg, Java — and the mappings that connect domain concepts to concrete data containers, such as GraphQL type resolvers served by a Domain Graph Service, Data Mesh sources, or Iceberg tables, through their representations. Upper raises the level of abstraction above traditional ontology languages: it defines a strict subset of semantic technologies from the W3C tailored and generalized for domain modeling. It builds on ontology frameworks like RDFS, OWL, and SHACL so domain authors can model effectively without even needing to learn what an ontology is.

Screenshot of UDA UI showing domain model for One Piece serialized as Turtle.
UDA domain model for One Piece. Link to full definition.

Upper is the metamodel for Connected Data in UDA — the model for all models. It is designed as a bootstrapping upper ontology, which means that Upper is self-referencing, because it models itself as a domain model; self-describing, because it defines the very concept of a domain model; and self-validating, because it conforms to its own model. This approach enables UDA to bootstrap its own infrastructure: Upper itself is projected into a generated Jena-based Java API and GraphQL schema used in GraphQL service federated into Netflix’s Enterprise GraphQL gateway. These same generated APIs are then used by the projections and the UI. Because all domain models are conservative extensions of Upper, other system domain models — including those for GraphQL, Avro, Data Mesh, and Mappings — integrate seamlessly into the same runtime, enabling consistent data semantics and interoperability across schemas.

Screenshot of an IDE. It shows Java code using the generated API from the Upper metamodel to traverse and print terms from a domain domain in the top while the bottom contains the output of an execution.
Traversing a domain model programmatically using the Java API generated from the Upper metamodel.

Data Container Representations

Data containers are repositories of information. They contain instance data that conform to their own schema languages or type systems: federated entities from GraphQL services, Avro records from Data Mesh sources, rows from Iceberg tables, or objects from Java APIs. Each container operates within the context of a system that imposes its own structural and operational constraints.

Screenshot of a UI showing details for a Data Mesh Source containing One Piece Characters.
A Data Mesh source is a data container.

Data container representations are data. They are faithful interpretations of the members of data systems as graph data. UDA captures the definition of these systems as their own domain models, the system domains. These models encode both the information architecture of the systems and the schemas of the data containers within. They provide a blueprint for translating the systems into graph representations.

Screenshot of an IDE showing two files open side by side. On the left is a system domain model for Data Mesh. On the right is a representation of a Data Mesh source containing One Piece Character data.
Side by side/super imposed image of data container schema and representation. Link to full data container representation.

UDA catalogs the data container representations into the knowledge graph. It records the coordinates and metadata of the underlying data assets, but unlike a traditional catalog, it only tracks assets that are semantically connected to domain models. This enables users and systems to connect concepts from domain models to the concrete locations where corresponding instance data can be accessed. Those connections are called Mappings.

Mappings

Mappings are data that connect domain models to data containers. Every element in a domain model is addressable, from the domain model itself down to specific attributes and relationships. Likewise, data container representations make all components addressable, from an Iceberg table to an individual column, or from a GraphQL type to a specific field. A Mapping connects nodes in a subgraph of the domain model to nodes in a subgraph of a container representation. Visually, the Mapping is the set of arcs that link those two graphs together.

Screenshot of UDA UI showing a mapping between a concept in UDA and a Data Mesh Source.
A mapping between a domain model and a Data Mesh Source from the UDA UI. Link to full mapping.

Mappings enable discovery. Starting from a domain concept, users and systems can walk the knowledge graph to find where that concept is materialized — in which data system, in which container, and even how a specific attribute or relationship is physically accessed. The inverse is also supported: given a data container, one can trace back to the domain concepts it participates in.

Mappings shape UDA’s approach to semantic data integration. Most existing schema languages are not expressive enough in capturing richer semantics of a domain to address requirements for data integration (for example, “accessibility of data, providing semantic context to support its interpretation, and establishing meaningful links between data”). A trivial example of this could be seen in the lack of built-in facilities in Avro to represent foreign keys, making it very hard to express how entities relate across Data Mesh sources. Mappings, together with the corresponding system domain models, allow for such relationships, and many other constraints, to be defined in the domain models and used programmatically in actual data systems.

Mappings enable intent-based automation. Data is not always available in the systems where consumers need it. Because Mappings encode both meaning and location, UDA can reason about how data should move, preserving semantics, without requiring the consumer to specify how it should be done. Beyond the cataloging use case, connecting to existing containers, UDA automatically derives canonical Mappings from registered domain models as part of the projection process.

Projections

A projection produces a concrete data container. These containers, such as a GraphQL schema or a Data Mesh source, implement the characteristics derived from a registered domain model. Each projection is a concrete realization of Upper’s denotational semantics, ensuring semantic interoperability across all containers projected from the same domain model.

Projections produce consistent public contracts across systems. The data containers generated by projections encode data contracts in the form of schemas, derived by transpiling a domain model into the target container’s schema language. UDA currently supports transpilation to GraphQL and Avro schemas.

The GraphQL transpilation produces a schema that adheres to the official GraphQL spec with the ability to generate all GraphQL types defined in the spec. Given that the UDA domain model can be federated, it also supports generating federated graphQL schemas. Below is an example of a transpiled GraphQL schema.

Screenshot of an IDE showing two files open side by side. On the left is the definition of a Character in UDA. On the right is transpiled GraphQL schema.
Domain model on the left, with transpiled GraphQL schema on the right. Link to full transpiled GraphQL schema.

The Avro transpilation produces a schema that is a Data Mesh flavor of Avro, which includes some customization on top of the official Avro spec. This schema is used to automatically create a Data Mesh source container. Below is an example of a transpiled Avro schema.

Screenshot of an IDE showing two files open side by side. On the left is the definition of a Devil Fruit in UDA. On the right is transpiled Avro schema.
Domain model on the left, with transpiled Avro schema on the right. Link to full transpiled Avro schema.

Projections can automatically populate data containers. Some projections, such as those to GraphQL schemas or Data Mesh sources produce empty containers that require developers to populate the data. This might be creating GraphQL APIs or pushing events onto Data Mesh sources. Conversely, other containers, like Iceberg Tables, are automatically created and populated by UDA. For Iceberg Tables, UDA leverages the Data Mesh platform to automatically create data streams to move data into tables. This process utilizes much of the same infrastructure detailed in this blog post here.

Projections have mappings. UDA automatically generates and manages mappings between the newly created data containers and the projected domain model.

Early Adopters

Controlled Vocabularies (PDM)

The full range of Netflix’s business activities relies on a sprawling data model that captures the details of our many business processes. Teams need to be able to coordinate operational activities to ensure that content production is complete, advertising campaigns are in place, and promotional assets are ready to deploy. We implicitly depend upon a singular definition of shared concepts, such as content production is complete. Multiple definitions create coordination challenges. Software (and humans) don’t know that the definitions mean the same thing.

We started the Primary Data Management (PDM) initiative to create unified and consistent definitions for the core concepts in our data model. These definitions form controlled vocabularies, standardized and governed lists for what values are permitted within certain fields in our data model.

Primary Data Management (PDM) is a single place where business users can manage controlled vocabularies. Our data model governance has been scattered across different tools and teams creating coordination challenges. This is an information management problem relating to the definition, maintenance and consistent use of reference data and taxonomies. This problem is not unique to Netflix, so we looked outward for existing solutions to this problem.

Screenshot of PDM UI
Managing the taxonomy of One Piece characters in PDM.

PDM uses the Simple Knowledge Organization System (SKOS) model. It is a W3C data standard designed for modeling knowledge. Its terminology is abstract, with Concepts that can be organized into ConceptSchemes and properties to describe various types of relationships. Every system is hardcoded against something, that’s how software knows how to manipulate data. We want a system that can work with a data model as its input, so we still need something concrete to build the software against. This is what SKOS provides, a generic basis for modeling knowledge that our system can understand.

PDM uses Domain Models to integrate SKOS into the rest of Content Engineering’s ecosystem. A core premise of the system is that it takes a domain model as input, and everything that can be derived is derived from that model. PDM builds a user interface based upon the model definition and leverages UDA to project this model into type-safe interfaces for other systems to use. The system will provision a Domain Graph Service (DGS) within our federated GraphQL API environment using a GraphQL schema that UDA projects from the domain model. UDA is also used to provision data movement pipelines which are able to feed our GraphSearch infrastructure as well as move data into the warehouse. The data movement systems use Avro schemas, and UDA creates a projection from the domain model to Avro.

Consumers of controlled vocabularies never know they’re using SKOS. Domain models use terms that fit in with the domain. SKOS’s generic notion of broader and narrower to define a hierarchy are hidden from consumers as super-properties within the model. This allows consumers to work with language that is familiar to them while enabling PDM to work with any model. The best of both worlds.

Operational Reporting (Sphere)

Operational reporting serves the detailed day-to-day activities and processes of a business domain. It is a reporting paradigm specialized in covering high-resolution, low-latency data sets.

Operational reporting systems should generate reports without relying on technical intermediaries. Operational reporting systems need to address the persistent challenge of empowering business users to explore and obtain the data they need, when they need it. Without such self-service systems, requests for new reports or data extracts often result in back-and-forth exchanges, where the initial query may not exactly meet business users’ expectations, requiring further clarification and refinement.

Data discovery and query generation are two relevant aspects of data integration. Supplying end-users with an accurate, contextual, and user-friendly data discovery experience provides a basis for query generation mechanism which produces syntactically correct and semantically reliable queries.

Operational reports are predominantly run on data hydrated from GraphQL services into the Data Warehouse. You can read about our journey from conventional data movement to streaming data pipelines based on CDC and GraphQL hydration in this blog post. Among the challenging byproducts of this approach was that a single, distinct data concept is now present in two places (GraphQL and data warehouse), with some disparity in semantic context to guide and support the interpretations and connectivity of that data. To address this, we formulate a mechanism to use the syntax and semantics captured in the federated schema from Netflix’s Enterprise GraphQL and populate representational domain models in UDA to preserve those details and add more.

Domain models enable the data discovery experience. Metadata aggregated from various data-producing systems is captured in UDA domain models using a unified vocabulary. This metadata is surfaced for the users’ search and discovery needs; instead of specifying exact tables and join keys, users simply can search for familiar business concepts such as ‘actors’ or ‘movies’. We use UDA models to disambiguate and resolve the intended concepts and their related data entities.

UDA knowledge graph is the data landscape for query generation. Once concepts are discovered and their mappings to corresponding data containers are identified and located in the knowledge graph, we use them to establish join strategies. Through graph traversal, we identify boundaries and islands within the data landscape. This ensures only feasible, joinable combinations are selected while weeding out semantically incorrect and non-executable query candidates.

Screenshot of Sphere’s UI
Generating a report in Sphere.

Sphere is a UDA-powered self-service operational reporting system. The solution based on knowledge graphs described above is called Sphere. Seeing self-service operational reporting through this lens, we can improve business users’ agency in access to operational data. They are empowered to explore, assemble, and refine reports at the conceptual level, while technical complexities are managed by the system.

Stay Tuned

UDA marks a fundamental shift in how we approach data modeling within Content Engineering. By providing a unified knowledge graph composed of what we know about our various data systems and the business concepts within them, we’ve made information more consistent, connected, and discoverable across our organization. We’re excited about future applications of these ideas such as:

  • Supporting additional projections like Protobuf/gRPC
  • Materializing the knowledge graph of instance data for querying, profiling, and management
  • Finally solving some of the initial challenges posed by Graph Search (that actually inspired some of this work)

If you’re interested in this space, we’d love to connect — whether you’re exploring new roles down the road or just want to swap ideas.

Expect to see future blog posts exploring PDM and Sphere in more detail soon!

Credits

Thanks to Andreas Legenbauer, Bernardo Gomez Palacio Valdes, Charles Zhao, Christopher Chong, Deepa Krishnan, George Pesmazoglou, Jessica Silva, Katherine Anderson, Malik Day, Rita Bogdanova, Ruoyun Zheng, Shawn Stedman, Suchita Goyal, Utkarsh Shrivastava, Yoomi Koh, Yulia Shmeleva


Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Data pipeline asset management with Dataflow

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/data-pipeline-asset-management-with-dataflow-86525b3e21ca

by Sam Setegne, Jai Balani, Olek Gorajek

Glossary

  • asset — any business logic code in a raw (e.g. SQL) or compiled (e.g. JAR) form to be executed as part of the user defined data pipeline.
  • data pipeline — a set of tasks (or jobs) to be executed in a predefined order (a.k.a. DAG) for the purpose of transforming data using some business logic.
  • Dataflow — Netflix homegrown CLI tool for data pipeline management.
  • job — a.k.a task, an atomic unit of data transformation logic, a non-separable execution block in the workflow chain.
  • namespace — unique label, usually representing a business subject area, assigned to a workflow asset to identify it across all other assets managed by Dataflow (e.g. security).
  • workflow — see “data pipeline”

Intro

The problem of managing scheduled workflows and their assets is as old as the use of cron daemon in early Unix operating systems. The design of a cron job is simple, you take some system command, you pick the schedule to run it on and you are done. Example:

0 0 * * MON /home/alice/backup.sh

In the above example the system would wake up every Monday morning and execute the backup.sh script. Simple right? But what if the script does not exist in the given path, or what if it existed initially but then Alice let Bob access her home directory and he accidentally deleted it? Or what if Alice wanted to add new backup functionality and she accidentally broke existing code while updating it?

The answers to these questions is something we would like to address in this article and propose a clean solution to this problem.

Let’s define some requirements that we are interested in delivering to the Netflix data engineers or anyone who would like to schedule a workflow with some external assets in it. By external assets we simply mean some executable carrying the actual business logic of the job. It could be a JAR compiled from Scala, a Python script or module, or a simple SQL file. The important thing is that this business logic can be built in a separate repository and maintained independently from the workflow definition. Keeping all that in mind we would like to achieve the following properties for the whole workflow deployment:

  1. Versioning: we want both the workflow definition and its assets to be versioned and we want the versions to be tied together in a clear way.
  2. Transparency: we want to know which version of an asset is running along with every workflow instance, so if there are any issues we can easily identify which version caused the problem and to which one we could revert, if necessary.
  3. ACID deployment: for every scheduler workflow definition change, we would like to have all the workflow assets bundled in an atomic, durable, isolated and consistent manner. This way, if necessary, all we need to know is which version of the workflow to roll back to, and the rest would be taken care of for us.

While all the above goals are our North Star, we also don’t want to negatively affect fast deployment, high availability and arbitrary life span of any deployed asset.

Previous solutions

The basic approach to pulling down arbitrary workflow resources during workflow execution has been known to mankind since the invention of cron, and with the advent of “infinite” cloud storage systems like S3, this approach has served us for many years. Its apparent flexibility and convenience can often fool us into thinking that by simply replacing the asset in the S3 location we can, without any hassle, introduce changes to our business logic. This method often proves very troublesome especially if there is more than one engineer working on the same pipeline and they are not all aware of the other folks’ “deployment process”.

The slightly improved approach is shown on the diagram below.

Figure 1. Manually constructed continuous delivery system.

In Figure 1, you can see an illustration of a typical deployment pipeline manually constructed by a user for an individual project. The continuous deployment tool submits a workflow definition with pointers to assets in fixed S3 locations. These assets are then separately deployed to these fixed locations. At runtime, the assets are retrieved from the defined locations in S3 and executed in the runtime container. Despite requiring users to construct the deployment pipeline manually, often by writing their own scripts from scratch, this design works and has been successfully used by many teams for years. That being said, it does have some drawbacks that are revealed as you try to add any amount of complexity to your deployment logic. Let’s discuss a few of them.

Does not consider branch/PR deployments

In any production pipeline, you want the flexibility of having a “safe” alternative deployment logic. For example, you may want to build your Scala code and deploy it to an alternative location in S3 while pushing a sandbox version of your workflow that points to this alternative location. Something this simple gets very complicated very quickly and requires the user to consider a number of things. Where should this alternative location be in S3? Is a single location enough? How do you set up your deployment logic to know when to deploy the workflow to a test or dev environment? Answers to these questions often end up being more custom logic inside of the user’s deployment scripts.

Cannot rollback to previous workflow versions

When you deploy a workflow, you really want it to encapsulate an atomic and idempotent unit of work. Part of the reason for that is the desire for the ability to rollback to a previous workflow version and knowing that it will always behave as it did in previous runs. There can be many reasons to rollback but the typical one is when you’ve recognized a regression in a recent deployment that was not caught during testing. In the current design, reverting to a previous workflow definition in your scheduling system is not enough! You have to rebuild your assets from source and move them to your fixed S3 location that your workflow points to. To enable atomic rollbacks, you can add more custom logic to your deployment scripts to always deploy your assets to a new location and generate new pointers for your workflows to use, but that comes with higher complexity that often just doesn’t feel worth it. More commonly, teams will opt to do more testing to try and catch regressions before deploying to production and will accept the extra burden of rebuilding all of their workflow dependencies in the event of a regression.

Runtime dependency on user-managed cloud storage locations

At runtime, the container must reach out to a user-defined storage location to retrieve the assets required. This causes the user-managed storage system to be a critical runtime dependency. If we zoom out to look at an entire workflow management system, the runtime dependencies can become unwieldy if it relies on various storage systems that are arbitrarily defined by the workflow developers!

Dataflow deployment with asset management

In the attempt to deliver a simple and robust solution to the managed workflow deployments we created a command line utility called Dataflow. It is a Python based CLI + library that can be installed anywhere inside the Netflix environment. This utility can build and configure workflow definitions and their assets during testing and deployment. See below diagram:

Figure 2. Dataflow asset management system.

In Figure 2, we show a variation of the typical manually constructed deployment pipeline. Every asset deployment is released to some newly calculated UUID. The workflow definition can then identify a specific asset by its UUID. Deploying the workflow to the scheduling system produces a “Deployment Bundle”. The bundle includes all of the assets that have been referenced by the workflow definition and the entire bundle is deployed to the scheduling system. At every scheduled runtime, the scheduling system can create an instance of your workflow without having to gather runtime dependencies from external systems.

The asset management system that we’ve created for Dataflow provides a strong abstraction over this deployment design. Deploying the asset, generating the UUID, and building the deployment bundle is all handled automatically by the Dataflow build logic. The user does not need to be aware of anything that’s happening on S3, nor that S3 is being used at all! Instead, the user is given a flexible UUID referencing system that’s layered on top of our scheduling system’s workflow DSL. Later in the article we’ll cover this referencing system in some detail. But first, let’s look at an example of deploying an asset and a workflow.

Deployment of an asset

Let’s walk through an example of a workflow asset build and deployment. Let’s assume we have a repository called stranger-data with the following structure:

.
├── dataflow.yaml
├── pyspark-workflow
│ ├── main.sch.yaml
│ └── hello_world
│ ├── ...
│ └── setup.py
└── scala-workflow
├── build.gradle
├── main.sch.yaml
└── src
├── main
│ └── ...
└── test
└── ...

Let’s now use Dataflow command to see what project components are visible:

stranger-data$ dataflow project list
Python Assets:
-> ./pyspark-workflow/hello_world/setup.py
Summary: 1 found.
Gradle Assets:
-> ./scala-workflow/build.gradle
Summary: 1 found.
Scheduler Workflows:
-> ./scala-workflow/main.sch.yaml
-> ./pyspark-workflow/main.sch.yaml
Summary: 2found.

Before deploying the assets, and especially if we made any changes to them, we can run unit tests to make sure that we didn’t break anything. In a typical Dataflow configuration this manual testing is optional because Dataflow continuous integration tests will do that for us on any pull-request.

stranger-data$ dataflow project test
Testing Python Assets:
-> ./pyspark-workflow/hello_world/setup.py... PASSED
Summary: 1 successful, 0 failed.
Testing Gradle Assets:
-> ./scala-workflow/build.gradle... PASSED
Summary: 1 successful, 0 failed.
Building Scheduler Workflows:
-> ./scala-workflow/main.sch.yaml... CREATED ./.workflows/scala-workflow.main.sch.rendered.yaml
-> ./pyspark-workflow/main.sch.yaml... CREATED ./.workflows/pyspark-workflow.main.sch.rendered.yaml
Summary: 2 successful, 0 failed.
Testing Scheduler Workflows:
-> ./scala-workflow/main.sch.yaml... PASSED
-> ./pyspark-workflow/main.sch.yaml... PASSED
Summary: 2 successful, 0 failed.

Notice that the test command we use above not only executes unit test suites defined in our Scala and Python sub-projects, but it also renders and statically validates all the workflow definitions in our repo, but more on that later…

Assuming all tests passed, let’s now use the Dataflow command to build and deploy a new version of the Scala and Python assets into the Dataflow asset registry.

stranger-data$ dataflow project deploy
Building Python Assets:
-> ./pyspark-workflow/hello_world/setup.py... CREATED ./pyspark-workflow/hello_world/dist/hello_world-0.0.1-py3.7.egg
Summary: 1 successful, 0 failed.
Deploying Python Assets:
-> ./pyspark-workflow/hello_world/setup.py... DEPLOYED AS dataflow.egg.hello_world.user.stranger-data.master.39206ee8.3
Summary: 1 successful, 0 failed.
Building Gradle Assets:
-> ./scala-workflow/build.gradle... CREATED ./scala-workflow/build/libs/scala-workflow-all.jar
Summary: 1 successful, 0 failed.
Deploying Gradle Assets:
-> ./scala-workflow/build.gradle... DEPLOYED AS dataflow.jar.scala-workflow.user.stranger-data.master.39206ee8.11
Summary: 1 successful, 0 failed.
...

Notice that the above command:

  • created a new version of the workflow assets
  • assigned the asset a “UUID” (consisting of the “dataflow” string, asset type, asset namespace, git repo owner, git repo name, git branch name, commit hash and consecutive build number)
  • and deployed them to a Dataflow managed S3 location.

We can check the existing assets of any given type deployed to any given namespace using the following Dataflow command:

stranger-data$ dataflow project list eggs --namespace hello_world --deployed
Project namespaces with deployed EGGS:
hello_world
-> dataflow.egg.hello_world.user.stranger-data.master.39206ee8.3
-> dataflow.egg.hello_world.user.stranger-data.master.39206ee8.2
-> dataflow.egg.hello_world.user.stranger-data.master.39206ee8.1

The above list could come in handy, for example if we needed to find and access an older version of an asset deployed from a given branch and commit hash.

Deployment of a workflow

Now let’s have a look at the build and deployment of the workflow definition which references the above assets as part of its pipeline DAG.

Let’s list the workflow definitions in our repo again:

stranger-data$ dataflow project list workflows
Scheduler Workflows:
-> ./scala-workflow/main.sch.yaml
-> ./pyspark-workflow/main.sch.yaml
Summary: 2 found.

And let’s look at part of the content of one of these workflows:

stranger-data$ cat ./scala-workflow/main.sch.yaml
...
dag:
- ddl -> write
- write -> audit
- audit -> publish
jobs:
- ddl: ...
- write:
spark:
script: ${dataflow.jar.scala-workflow}
class: com.netflix.spark.ExampleApp
conf: ...
params: ...
- audit: ...
- publish: ...
...

You can see from the above snippet that the write job wants to access some version of the JAR from the scala-workflow namespace. A typical workflow definition, written in YAML, does not need any compilation before it is shipped to the Scheduler API, but Dataflow designates a special step called “rendering” to substitute all of the Dataflow variables and build the final version.

The above expression ${dataflow.jar.scala-workflow} means that the workflow will be rendered and deployed with the latest version of the scala-workflow JAR available at the time of the workflow deployment. It is possible that the JAR is built as part of the same repository in which case the new build of the JAR and a new version of the workflow may be coming from the same deployment. But the JAR may be built as part of a completely different project and in that case the testing and deployment of the new workflow version can be completely decoupled.

We showed above how one would request the latest asset version available during deployment, but with Dataflow asset management we can distinguish two more asset access patterns. An obvious next one is to specify it by all its attributes: asset type, asset namespace, git repo owner, git repo name, git branch name, commit hash and consecutive build number. There is one more extra method for a middle ground solution to pick a specific build for a given namespace and git branch, which can help during testing and development. All of this is part of the user-interface for determining how the deployment bundle will be created. See below diagram for a visual illustration.

Figure 3. A closer at the Deployment Bundle

In short, using the above variables gives the user full flexibility and allows them to pick any version of any asset in any workflow.

An example of the workflow deployment with the rendering step is shown below:

stranger-data$ dataflow project deploy
...
Building Scheduler Workflows:
-> ./scala-workflow/main.sch.yaml... CREATED ./.workflows/scala-workflow.main.sch.rendered.yaml
-> ./pyspark-workflow/main.sch.yaml... CREATED ./.workflows/pyspark-workflow.main.sch.rendered.yaml
Summary: 2 successful, 0 failed.
Deploying Scheduler Workflows:
-> ./scala-workflow/main.sch.yaml… DEPLOYED AS https://hawkins.com/scheduler/sandbox:user.stranger-data.scala-workflow
-> ./pyspark-workflow/main.sch.yaml… DEPLOYED AS https://hawkins.com/scheduler/sandbox:user.stranger-data.pyspark-workflow
Summary: 2 successful, 0 failed.

And here you can see what the workflow definition looks like before it is sent to the Scheduler API and registered as the latest version. Notice the value of the script variable of the write job. In the original code says ${dataflow.jar.scala-workflow} and in the rendered version it is translated to a specific file pointer:

stranger-data$ cat ./scala-workflow/main.sch.yaml
...
dag:
- ddl -> write
- write -> audit
- audit -> publish
jobs:
- ddl: ...
- write:
spark:
script: s3://dataflow/jars/scala-workflow/user/stranger-data/master/39206ee8/1.jar
class: com.netflix.spark.ExampleApp
conf: ...
params: ...
- audit: ...
- publish: ...
...

User perspective

The Infrastructure DSE team at Netflix is responsible for providing insights into data that can help the Netflix platform and service scale in a secure and effective way. Our team members partner with business units like Platform, OpenConnect, InfoSec and engage in enterprise level initiatives on a regular basis.

One side effect of such wide engagement is that over the years our repository evolved into a mono-repo with each module requiring a customized build, testing and deployment strategy packaged into a single Jenkins job. This setup required constant upkeep and also meant every time we had a build failure multiple people needed to spend a lot of time in communication to ensure they did not step on each other.

Last quarter we decided to split the mono-repo into separate modules and adopt Dataflow as our asset orchestration tool. Post deployment, the team relies on Dataflow for automated execution of unit tests, management and deployment of workflow related assets.

By the end of the migration process our Jenkins configuration went from:

Figure 4. Real example of a deployment script.

to:

cd /dataflow_workspace
dataflow project deploy

The simplicity of deployment enabled the team to focus on the problems they set out to solve while the branch based customization gave us the flexibility to be our most effective at solving them.

Conclusions

This new method available for Netflix data engineers makes workflow management easier, more transparent and more reliable. And while it remains fairly easy and safe to build your business logic code (in Scala, Python, etc) in the same repository as the workflow definition that invokes it, the new Dataflow versioned asset registry makes it easier yet to build that code completely independently and then reference it safely inside data pipelines in any other Netflix repository, thus enabling easy code sharing and reuse.

One more aspect of data workflow development that gets enabled by this functionality is what we call branch-driven deployment. This approach enables multiple versions of your business logic and workflows to be running at the same time in the scheduler ecosystem, and makes it easy, not only for individual users to run isolated versions of the code during development, but also to define isolated staging environments through which the code can pass before it reaches the production stage. Obviously, in order for the workflows to be safely used in that configuration they must comply with a few simple rules with regards to the parametrization of their inputs and outputs, but let’s leave this subject for another blog post.

Credits

Special thanks to Peter Volpe, Harrington Joseph and Daniel Watson for the initial design review.


Data pipeline asset management with Dataflow was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.