<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data-engineering &#8211; Noise</title>
	<atom:link href="https://noise.getoto.net/tag/data-engineering/feed/" rel="self" type="application/rss+xml" />
	<link>https://noise.getoto.net</link>
	<description>The collective thoughts of the interwebz</description>
	<lastBuildDate>Fri, 17 Oct 2025 18:42:37 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.2</generator>
	<item>
		<title>How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data…</title>
		<link>https://noise.getoto.net/2025/10/17/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Fri, 17 Oct 2025 18:42:37 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[software-architecture]]></category>
		<guid isPermaLink="false">https://medium.com/p/80113e124acc</guid>

					<description><![CDATA[How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale. Authors: Adrian Taruc and James Dalton. This is the first entry of a multi-part blog series describing how we built a Real-Time Distr...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale</title>
		<link>https://noise.getoto.net/2025/09/23/scaling-muse-how-netflix-powers-data-driven-creative-insights-at-trillion-row-scale/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Mon, 22 Sep 2025 21:24:20 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[Druid]]></category>
		<guid isPermaLink="false">https://medium.com/p/aa9ad326fd77</guid>

					<description><![CDATA[By Andrew Pierce, Chris Thrailkill, Victor Chiapaikeo. At Netflix, we prioritize getting timely data and insights into the hands of the people who can act on them. One of our key internal applications for this purpose is Muse. Muse’s ultimate goal is to ...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix</title>
		<link>https://noise.getoto.net/2025/06/12/model-once-represent-everywhere-uda-unified-data-architecture-at-netflix/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Thu, 12 Jun 2025 14:56:32 +0000</pubDate>
				<category><![CDATA[Data Catalog]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-management]]></category>
		<category><![CDATA[Knowledge management]]></category>
		<category><![CDATA[rdf]]></category>
		<guid isPermaLink="false">https://medium.com/p/6a6aee261d8d</guid>

					<description><![CDATA[By Alex Hutter, Alexandre Bertails, Claire Wang, Haoyuan He, Kishore Banala, Peter Royal, Shervin Afshar. As Netflix’s offerings grow — across films, series, games, live events, and ads — so does the complexity of the systems that support it. Core busine...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Building a Spark observability product with StarRocks: Real-time and historical performance analysis</title>
		<link>https://noise.getoto.net/2025/03/06/building-a-spark-observability-product-with-starrocks-real-time-and-historical-performance-analysis/</link>
		
		<dc:creator><![CDATA[Grab Tech]]></dc:creator>
		<pubDate>Thu, 06 Mar 2025 00:00:10 +0000</pubDate>
				<category><![CDATA[Data Analytics]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[Engineering]]></category>
		<category><![CDATA[generative AI]]></category>
		<category><![CDATA[LLM]]></category>
		<category><![CDATA[Real-time Analytics]]></category>
		<category><![CDATA[Spark Observability]]></category>
		<category><![CDATA[StarRocks]]></category>
		<category><![CDATA[System Architecture]]></category>
		<guid isPermaLink="false">https://engineering.grab.com/building-a-spark-observability</guid>

					<description><![CDATA[Introduction

At Grab, we’ve been working to perfect our Spark observability tools. Our initial solution, Iris, was developed to provide a custom, in-depth observability tool for Spark jobs. As described in our previous blog post, Iris collects and ana...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Introducing Impressions at Netflix</title>
		<link>https://noise.getoto.net/2025/02/15/introducing-impressions-at-netflix/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Sat, 15 Feb 2025 01:13:20 +0000</pubDate>
				<category><![CDATA[data]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[distributed-systems]]></category>
		<guid isPermaLink="false">https://medium.com/p/e2b67c88c9fb</guid>

					<description><![CDATA[Part 1: Creating the Source of Truth for Impressions. By: Tulika Bhatt. Imagine scrolling through Netflix, where each movie poster or promotional banner competes for your attention. Every image you hover over isn’t just a visual placeholder; it’s a critica...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>A Recap of the Data Engineering Open Forum at Netflix</title>
		<link>https://noise.getoto.net/2024/06/20/a-recap-of-the-data-engineering-open-forum-at-netflix/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Thu, 20 Jun 2024 15:01:27 +0000</pubDate>
				<category><![CDATA[data]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[software engineering]]></category>
		<category><![CDATA[Technology]]></category>
		<guid isPermaLink="false">https://medium.com/p/6b4d4410b88f</guid>

					<description><![CDATA[A summary of sessions at the first Data Engineering Open Forum at Netflix on April 18th, 2024. The Data Engineering Open Forum at Netflix on April 18th, 2024. At Netflix, we aspire to entertain the world, and our data engineering teams play a crucial role...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Our First Netflix Data Engineering Summit</title>
		<link>https://noise.getoto.net/2023/12/14/our-first-netflix-data-engineering-summit/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Thu, 14 Dec 2023 16:54:11 +0000</pubDate>
				<category><![CDATA[data]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data-engineer]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-visualization]]></category>
		<guid isPermaLink="false">https://medium.com/p/f326b0589102</guid>

					<description><![CDATA[Holden Karau Elizabeth Stone Pedro Duarte Chris Stephens Pallavi Phadnis Lee Woodridge Mark Cho Guil Pires Sujay Jain Tristan Reid Senthilnathan Athinarayanan Bharath Mummadisetty Abhinaya Shetty Judit Lantos Amanuel Kahsay Dao Mi Mick Dreeling Chris C...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Ready-to-go sample data pipelines with Dataflow</title>
		<link>https://noise.getoto.net/2022/12/04/ready-to-go-sample-data-pipelines-with-dataflow/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Sun, 04 Dec 2022 00:10:21 +0000</pubDate>
				<category><![CDATA[Data Pipeline]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[dataflow]]></category>
		<category><![CDATA[etl]]></category>
		<category><![CDATA[standardization]]></category>
		<guid isPermaLink="false">https://medium.com/p/17440a9e141d</guid>

					<description><![CDATA[<p>by <a href="https://www.linkedin.com/in/ywnfm5/">Jasmine Omeke</a>, <a href="https://www.linkedin.com/in/onwoke/">Obi-Ike Nwoke</a>, <a href="https://www.linkedin.com/in/agorajek/">Olek Gorajek</a></p><h3>Intro</h3><p>This post is for all data practitioners who are interested in learning about bootstrapping, standardization and automation of batch data pipelines at Netflix.</p><p>You may remember Dataflow from the post we wrote last year titled <a href="https://netflixtechblog.com/data-pipeline-asset-management-with-dataflow-86525b3e21ca">Data pipeline asset management with Dataflow</a>. That article was a deep dive into one of the more technical aspects of Dataflow and didn’t properly introduce the tool in the first place. This time we’ll try to do justice to the intro, and then we will focus on one of the very first features Dataflow came with. That feature is called <strong>sample workflows</strong>, but before we dive in, let’s have a quick look at Dataflow in general.</p><figure><img alt="Dataflow" src="https://cdn-images-1.medium.com/max/1024/1*4IalrwbpzJovyfmA8lMtyA.png"></figure><h4>Dataflow</h4><p>Dataflow is a command line utility built to improve the experience of data pipeline development at Netflix and to streamline it. 
Check out the high-level Dataflow help command output below:</p><pre>$ dataflow --help<br>Usage: dataflow [OPTIONS] COMMAND [ARGS]...<br><br>Options:<br>  --docker-image TEXT  Url of the docker image to run in.<br>  --run-in-docker      Run dataflow in a docker container.<br>  -v, --verbose        Enables verbose mode.<br>  --version            Show the version and exit.<br>  --help               Show this message and exit.<br><br>Commands:<br>  migration  Manage schema migration.<br>  mock       Generate or validate mock datasets.<br>  project    Manage a Dataflow project.<br>  sample     Generate fully functional sample workflows.</pre><p>As you can see, the Dataflow CLI is divided into four main subject areas (or commands). The most commonly used one is <strong>dataflow project</strong>, which helps folks manage their data pipeline repositories through creation, testing, deployment and a few other activities.</p><p>The <strong>dataflow migration</strong> command is a special feature, developed single-handedly by <a href="https://www.linkedin.com/in/stephenhuenneke/">Stephen Huenneke</a>, to fully automate the communication and tracking of data warehouse table changes. Thanks to the Netflix internal lineage system (built by <a href="https://www.linkedin.com/in/girish-lingappa-309aa24/">Girish Lingappa</a>), Dataflow migration can help you identify downstream usage of the table in question. It can then help you craft a message to all the owners of these dependencies. After your migration has started, Dataflow will also keep track of its progress and help you communicate with the downstream users.</p><p>The <strong>dataflow mock</strong> command is another standalone feature. It lets you create YAML-formatted mock data files based on selected tables, columns and a few rows of data from the Netflix data warehouse. 
Its main purpose is to enable easy unit testing of your data pipelines, but it can technically be used in any other situation as a readable data format for small data sets.</p><p>All the above commands are very likely to be described in separate future blog posts, but right now let’s focus on the <strong>dataflow sample</strong> command.</p><h3>Sample workflows</h3><p>Dataflow <strong>sample workflows</strong> is a set of templates anyone can use to bootstrap their data pipeline project. And by “sample” we mean “an example”, like food samples in your local grocery store. One of the main reasons this feature exists is, just like with food samples, to give you “a taste” of the production quality ETL code that you could encounter inside the Netflix data ecosystem.</p><p>All the code you get with the Dataflow sample workflows is fully functional, adjusted to your environment and isolated from the sample workflows that others have generated. The pipeline is safe to run the moment it shows up in your directory. It will not only build a nice example aggregate table and fill it up with real data, but it will also present you with a complete set of recommended components:</p><ul><li>clean DDL code,</li><li>proper table metadata settings,</li><li>transformation job (in a language of choice) wrapped in an optional WAP (Write, Audit, Publish) pattern,</li><li>sample set of data audits for the generated data,</li><li>and a fully functional unit test for your transformation logic.</li></ul><p>And last, but not least, these sample workflows are being tested continuously as part of the Dataflow code change protocol, so you can be sure that what you get is working. This is one way to build trust with our internal user base.</p><p>Next, let’s have a look at the actual business logic of these sample workflows.</p><h4>Business Logic</h4><p>There are several variants of the sample workflow you can get from Dataflow, but all of them share the same business logic. 
This was a conscious decision in order to clearly illustrate the difference between the various languages in which your ETL could be written. Obviously not all tools are made with the same use case in mind, so we are planning to add more code samples for other (than classical batch ETL) data processing purposes, e.g. Machine Learning model building and scoring.</p><p>The example business logic we use in our template computes, on a daily basis, the top hundred movies/shows in every country where Netflix operates. This is not an actual production pipeline running at Netflix, because the code is highly simplified, but it serves well the purpose of illustrating a batch ETL job with various transformation stages. Let’s review the transformation steps below.</p><p><strong>Step 1:</strong> on a daily basis, incrementally, sum up all viewing time of all movies and shows in every country</p><pre>WITH STEP_1 AS (<br>   SELECT<br>       title_id<br>       , country_code<br>       , SUM(view_hours) AS view_hours<br>   FROM some_db.source_table<br>   WHERE playback_date = CURRENT_DATE<br>   GROUP BY<br>       title_id<br>       , country_code<br>)</pre><p><strong>Step 2:</strong> rank all titles from most watched to least in every country</p><pre>WITH STEP_2 AS (<br>   SELECT<br>       title_id<br>       , country_code<br>       , view_hours<br>       , RANK() OVER (<br>          PARTITION BY country_code <br>          ORDER BY view_hours DESC<br>       ) AS title_rank<br>   FROM STEP_1<br>)</pre><p><strong>Step 3:</strong> filter all titles to the top 100</p><pre>WITH STEP_3 AS (<br>   SELECT<br>       title_id<br>       , country_code<br>       , view_hours<br>       , title_rank<br>   FROM STEP_2<br>   WHERE title_rank &#60;= 100<br>)</pre><p>Now, using the above simple 3-step transformation we will produce data that can be written to the following Iceberg table:</p><pre>CREATE TABLE IF NOT EXISTS ${TARGET_DB}.dataflow_sample_results (<br>  title_id INT COMMENT "Title ID of the 
movie or show."<br>  , country_code STRING COMMENT "Country code of the playback session."<br>  , title_rank INT COMMENT "Rank of a given title in a given country."<br>  , view_hours DOUBLE COMMENT "Total viewing hours of a given title in a given country."<br>)<br>COMMENT<br>  "Example dataset brought to you by Dataflow. For more information on this<br>   and other examples please visit the Dataflow documentation page."<br>PARTITIONED BY (<br>  date DATE COMMENT "Playback date."<br>)<br>STORED AS ICEBERG;</pre><p>As you can infer from the above table structure we are going to load about <a href="https://help.netflix.com/en/node/14164">19,000</a> rows into this table on a daily basis. And they will look something like this:</p><pre> sql&#62; SELECT * FROM foo.dataflow_sample_results <br>      WHERE date = 20220101 and country_code = 'US' <br>      ORDER BY title_rank LIMIT 5;<br><br> title_id &#124; country_code &#124; title_rank &#124; view_hours &#124; date<br>----------+--------------+------------+------------+----------<br> 11111111 &#124; US           &#124;          1 &#124;   123      &#124; 20220101<br> 44444444 &#124; US           &#124;          2 &#124;   111      &#124; 20220101<br> 33333333 &#124; US           &#124;          3 &#124;   98       &#124; 20220101<br> 55555555 &#124; US           &#124;          4 &#124;   55       &#124; 20220101<br> 22222222 &#124; US           &#124;          5 &#124;   11       &#124; 20220101<br>(5 rows)</pre><p>With the business logic out of the way, we can now start talking about the components, or the boiler-plate, of our sample workflows.</p><h4>Components</h4><p>Let’s have a look at the most common workflow components that we use at Netflix. These components may not fit into every ETL use case, but are used often enough to be included in every template (or sample workflow). The workflow author, after all, has the final word on whether they want to use all of these patterns or keep only some. 
Either way they are here to start with, ready to go, if needed.</p><p><strong>Workflow Definitions</strong></p><p>Below you can see a typical file structure of a sample workflow package written in SparkSQL.</p><pre>.<br>├── <strong>backfill.sch.yaml</strong><br>├── <strong>daily.sch.yaml</strong><br>├── <strong>main.sch.yaml</strong><br>├── ddl<br>│   └── dataflow_sparksql_sample.sql<br>└── src<br>    ├── mocks<br>    │   ├── dataflow_pyspark_sample.yaml<br>    │   └── some_db.source_table.yaml<br>    ├── sparksql_write.sql<br>    └── test_sparksql_write.py</pre><p>The bolded files above define a series of steps (a.k.a. jobs), their cadence, dependencies, and the sequence in which they should be executed.</p><p>This is one way we can tie components together into a cohesive workflow. In every sample workflow package there are three workflow definition files that work together to provide flexible functionality. The sample workflow code assumes a daily execution pattern, but it is very easy to adjust it to run at a different cadence. For the workflow orchestration we use Netflix’s homegrown <a href="https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c">Maestro</a> scheduler.</p><p>The <strong><em>main</em></strong> workflow definition file holds the logic of a single run, in this case one day’s worth of data. This logic consists of the following parts: <a href="https://docs.google.com/document/d/1iaJPpEGRqiS3Cdxjxhzteup_h5mX5Sk_zEhFDtIxjOE/edit#heading=h.fvqhw2dhwp00">DDL</a> code, table <a href="https://docs.google.com/document/d/1iaJPpEGRqiS3Cdxjxhzteup_h5mX5Sk_zEhFDtIxjOE/edit#heading=h.wugc38fpk98s">metadata</a> information, data <a href="https://docs.google.com/document/d/1iaJPpEGRqiS3Cdxjxhzteup_h5mX5Sk_zEhFDtIxjOE/edit#heading=h.xvk461x30z3o">transformation</a> and a few <a href="https://docs.google.com/document/d/1iaJPpEGRqiS3Cdxjxhzteup_h5mX5Sk_zEhFDtIxjOE/edit#heading=h.1tum3rt1qfhc">audit</a> steps. 
It’s designed to run for a single date, and meant to be called from the <em>daily</em> or <em>backfill</em> workflows. This <em>main</em> workflow can also be called manually during development with arbitrary run-time parameters to get a feel for the workflow in action.</p><p>The <strong><em>daily</em></strong> workflow executes the <em>main</em> one on a daily basis for the predefined number of previous days. This is sometimes necessary for the purpose of catching up on some late arriving data. This is where we define a trigger schedule, notifications schemes, and update the <a href="https://docs.google.com/document/d/1iaJPpEGRqiS3Cdxjxhzteup_h5mX5Sk_zEhFDtIxjOE/edit#heading=h.lmy247srr96y">“high water mark” timestamps</a> on our target table.</p><p>The <strong><em>backfill</em></strong> workflow executes the <em>main</em> for a specified range of days. This is useful for restating data, most often because of a transformation logic change, but sometimes as a response to upstream data updates.</p><p><strong>DDL</strong></p><p>Often, the first step in a data pipeline is to define the target table structure and column metadata via a DDL statement. We understand that some folks choose to have their output schema be an implicit result of the transform code itself, but the explicit statement of the output schema is not only useful for adding table (and column) level comments, but also serves as one way to validate the transform logic.</p><pre>.<br>├── backfill.sch.yaml<br>├── daily.sch.yaml<br>├── main.sch.yaml<br>├── ddl<br>│   └── <strong>dataflow_sparksql_sample.sql</strong><br>└── src<br>    ├── mocks<br>    │   ├── dataflow_pyspark_sample.yaml<br>    │   └── some_db.source_table.yaml<br>    ├── sparksql_write.sql<br>    └── test_sparksql_write.py</pre><p>Generally, we prefer to execute DDL commands as part of the workflow itself, instead of running outside of the schedule, because it simplifies the development process. 
See below example of hooking the table creation SQL file into the <em>main</em> workflow definition.</p><pre>      - job:<br>          id: ddl<br>          type: Spark<br>          spark:<br>              script: $S3{./ddl/dataflow_sparksql_sample.sql}<br>              parameters:<br>                  TARGET_DB: ${TARGET_DB}</pre><p><strong>Metadata</strong></p><p>The metadata step provides context on the output table itself as well as the data contained within. Attributes are set via <a href="https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520">Metacat</a>, which is a Netflix internal metadata management platform. Below is an example of plugging that metadata step in the <em>main</em> workflow definition</p><pre>     - job:<br>          id: metadata<br>          type: Metadata<br>          metacat:<br>              tables:<br>                - ${CATALOG}/${TARGET_DB}/${TARGET_TABLE}<br>              owner: ${username}<br>              tags:<br>                - dataflow<br>                - sample<br>              lifetime: 123<br>              column_types:<br>                date: pk<br>                country_code: pk<br>                rank: pk</pre><p><strong>Transformation</strong></p><p>The transformation step (or steps) can be executed in the developer’s language of choice. The example below is using SparkSQL.</p><pre>.<br>├── backfill.sch.yaml<br>├── daily.sch.yaml<br>├── main.sch.yaml<br>├── ddl<br>│   └── dataflow_sparksql_sample.sql<br>└── src<br>    ├── mocks<br>    │   ├── dataflow_pyspark_sample.yaml<br>    │   └── some_db.source_table.yaml<br>    ├── <strong>sparksql_write.sql</strong><br>    └── test_sparksql_write.py</pre><p>Optionally, this step can use the Write-Audit-Publish <a href="https://www.dremio.com/subsurface/write-audit-publish-pattern-via-apache-iceberg/">pattern</a> to ensure that data is correct before it is made available to the rest of the company. 
See example below:</p><pre>      - template:<br>          id: wap<br>          type: wap<br>          tables:<br>              - ${CATALOG}/${DATABASE}/${TABLE}<br>          write_jobs:<br>            - job:<br>                id: write<br>                type: Spark<br>                spark:<br>                    script: $S3{./src/sparksql_write.sql}</pre><p><strong>Audits</strong></p><p>Audit steps can be defined to verify data quality. If a “blocking” audit fails, the job will halt and the write step is not committed, so invalid data will not be exposed to users. This step is optional and configurable, see a partial example of an audit from the <em>main</em> workflow below.</p><pre>         data_auditor:<br>            audits:<br>              - function: columns_should_not_have_nulls<br>                blocking: true<br>                params:<br>                    table: ${TARGET_TABLE}<br>                    columns:<br>                      - title_id<br>                      …</pre><p><strong>High-Water-Mark Timestamp</strong></p><p>A successful write will typically be followed by a metadata call to set the valid time (or high-water mark) of a dataset. This allows other processes, consuming our table, to be notified and start their processing. See an example high water mark job from the <em>main</em> workflow definition.</p><pre>      - job:<br>         id: hwm<br>         type: HWM<br>         metacat:<br>           table: ${CATALOG}/${TARGET_DB}/${TARGET_TABLE}<br>           hwm_datetime: ${EXECUTION_DATE}<br>           hwm_timezone: ${EXECUTION_TIMEZONE}</pre><p><strong>Unit Tests</strong></p><p>Unit test artifacts are also generated as part of the sample workflow structure. They consist of data mocks, the actual test code, and a simple execution harness depending on the workflow language. 
See the bolded files below.</p><pre>.<br>├── backfill.sch.yaml<br>├── daily.sch.yaml<br>├── main.sch.yaml<br>├── ddl<br>│   └── dataflow_sparksql_sample.sql<br>└── src<br>    ├── mocks<br>    │   ├── <strong>dataflow_pyspark_sample.yaml</strong><br>    │   └── <strong>some_db.source_table.yaml</strong><br>    ├── sparksql_write.sql<br>    └── <strong>test_sparksql_write.py</strong></pre><p>These unit tests are intended to test one “unit” of data transform in isolation. They can be run during development to quickly catch code typos and syntax issues, or during the automated testing/deployment phase, to make sure that code changes have not broken any tests.</p><p>We want unit tests to run quickly so that we can have continuous feedback and fast iterations during the development cycle. Running code against a production database can be slow, especially with the overhead required for distributed data processing systems like Apache Spark. Mocks allow you to run tests locally against a small sample of “real” data to validate your transformation code functionality.</p><h4>Languages</h4><p>Over time, the extraction of data from Netflix’s source systems has grown to encompass a wider range of end-users, such as engineers, data scientists, analysts, marketers, and other stakeholders. Focusing on convenience, Dataflow allows these differing personas to go about their work seamlessly. A large number of our data users employ SparkSQL, PySpark, and Scala. A small but growing contingent of data scientists and analytics engineers use R, backed by the Sparklyr interface or other data processing tools, like <a href="https://docs.metaflow.org/introduction/what-is-metaflow">Metaflow</a>.</p><p>With an understanding that the data landscape and the technologies employed by end-users are not homogeneous, Dataflow creates a malleable path forward. It solidifies different recipes or repeatable templates for data extraction. 
Within this section, we’ll preview a few methods, starting with how data pipelines are created with Dataflow in SparkSQL and Python. Then we’ll segue into the Scala and R use cases.</p><p>To begin, after installing Dataflow, a user can run the following command to understand how to get started.</p><pre>$ dataflow sample workflow --help<br>Dataflow (0.6.16)<br><br>Usage: dataflow sample workflow [OPTIONS] RECIPE [TARGET_PATH]<br><br>Create a sample workflow based on selected RECIPE and land it in the <br>specified TARGET_PATH.<br><br>Currently supported workflow RECIPEs are: spark-sql, pyspark, <br>scala and sparklyr.<br><br>  If TARGET_PATH:<br>  - if not specified, current directory is assumed<br>  - points to a directory, it will be used as the target location<br><br>Options:<br>  --source-path TEXT         Source path of the sample workflows.<br>  --workflow-shortname TEXT  Workflow short name.<br>  --workflow-id TEXT         Workflow ID.<br>  --skip-info                Skip the info about the workflow sample.<br>  --help                     Show this message and exit.</pre><p>Let’s assume we have a directory called <em>stranger-data</em> in which the user creates workflow templates in all four languages that Dataflow offers. 
To better illustrate how to generate the sample workflows using Dataflow, let’s look at the full command one would use to create one of these workflows, e.g.:</p><pre>$ cd stranger-data<br>$ dataflow sample workflow spark-sql ./sparksql-workflow</pre><p>By repeating the above command for each type of transformation language we can arrive at the following directory structure:</p><pre>.<br>├── <strong>pyspark-workflow</strong><br>│   ├── main.sch.yaml<br>│   ├── daily.sch.yaml<br>│   ├── backfill.sch.yaml<br>│   ├── ddl<br>│   │   └── ...<br>│   ├── src<br>│   │   └── ...<br>│   └── tox.ini<br>├── <strong>scala-workflow</strong><br>│   ├── build.gradle<br>│   └── ...<br>├── <strong>sparklyR-workflow</strong><br>│   └── ...<br>└── <strong>sparksql-workflow</strong><br>    └── ...</pre><p>Earlier we talked about the business logic of these sample workflows and we showed the Spark SQL version of that example data transformation. Now let’s discuss different approaches to writing the data in other languages.</p><p><strong>PySpark</strong></p><p>The partial <strong>PySpark</strong> code below has the same functionality as the SparkSQL example above, but it uses the Spark DataFrame Python interface.</p><pre>from pyspark.sql import Window<br>from pyspark.sql import functions as F<br>from pyspark.sql.functions import col, lit, rank<br><br># some_db, source_table, date and target_table are assumed to be<br># derived from args; that part of the code is omitted here.<br>def main(args, spark):<br>   <br>    source_table_df = spark.table(f"{some_db}.{source_table}")<br><br>    viewing_by_title_country = (<br>        source_table_df.select("title_id", "country_code", "view_hours")<br>        .filter(col("date") == date)<br>        .filter("title_id IS NOT NULL AND view_hours &#62; 0")<br>        .groupBy("title_id", "country_code")<br>        .agg(F.sum("view_hours").alias("view_hours"))<br>    )<br><br>    window = Window.partitionBy(<br>        "country_code"<br>    ).orderBy(col("view_hours").desc())<br><br>    ranked_viewing_by_title_country = viewing_by_title_country.withColumn(<br>        "title_rank", rank().over(window)<br>    )<br><br>    ranked_viewing_by_title_country.filter(<br>        col("title_rank") 
&#60;= 100<br>    ).withColumn(<br>        "date", lit(int(date))<br>    ).select(<br>        "title_id",<br>        "country_code",<br>        "title_rank",<br>        "view_hours",<br>        "date",<br>    ).repartition(1).write.byName().insertInto(<br>        target_table, overwrite=True<br>    )</pre><p><strong>Scala</strong></p><p>Scala is another Dataflow supported recipe that offers the same business logic in a sample workflow out of the box.</p><pre>package com.netflix.spark<br><br>import org.apache.spark.sql.{DataFrame, functions =&#62; F}<br>import org.apache.spark.sql.expressions.Window<br><br>object ExampleApp {<br>  // `spark` is the active SparkSession, assumed to be provided by the surrounding application<br>  import spark.implicits._<br><br>  def readSourceTable(sourceDb: String, dataDate: String): DataFrame =<br>    spark<br>      .table(s"$sourceDb.source_table")<br>      .filter($"playback_start_date" === dataDate)<br><br>  def viewingByTitleCountry(sourceTableDF: DataFrame): DataFrame = {<br>    sourceTableDF<br>      .select($"title_id", $"country_code", $"view_hours")<br>      .filter($"title_id".isNotNull)<br>      .filter($"view_hours" &#62; 0)<br>      .groupBy($"title_id", $"country_code")<br>      .agg(F.sum($"view_hours").as("view_hours"))<br>  }<br><br>  def addTitleRank(viewingDF: DataFrame): DataFrame = {<br>    viewingDF.withColumn(<br>      "title_rank", F.rank().over(<br>        Window.partitionBy($"country_code").orderBy($"view_hours".desc)<br>      )<br>    )<br>  }<br><br>  def writeViewing(viewingDF: DataFrame, targetTable: String, dataDate: String): Unit = {<br>    viewingDF<br>      .select($"title_id", $"country_code", $"title_rank", $"view_hours")<br>      .filter($"title_rank" &#60;= 100)<br>      .repartition(1)<br>      .withColumn("date", F.lit(dataDate.toInt))<br>      .writeTo(targetTable)<br>      .overwritePartitions()<br>  }<br><br>  def main(args: Array[String]): Unit = {<br>    // Placeholder values for illustration; the target table name is hypothetical<br>    val dataDate = "20200101"<br>    val sourceTableDF = readSourceTable("some_db", dataDate)<br>    val viewingDF = viewingByTitleCountry(sourceTableDF)<br>    val titleRankedDF = addTitleRank(viewingDF)<br>    writeViewing(titleRankedDF, "some_db.target_table", dataDate)<br>  }<br>}</pre><p><strong>R / sparklyR</strong></p><p>As Netflix has a growing cohort of R users, R 
is the latest recipe available in Dataflow.</p><pre>suppressPackageStartupMessages({<br>  library(sparklyr)<br>  library(dplyr)<br>})<br><br>...<br><br>main &#60;- function(args, spark) {<br>  title_df &#60;- tbl(spark, g("{some_db}.{source_table}"))<br><br>  title_activity_by_country &#60;- title_df &#124;&#62;<br>    filter(title_date == date) &#124;&#62;<br>    filter(!is.null(title_id) &#38; event_count &#62; 0) &#124;&#62;<br>    select(title_id, country_code, event_type) &#124;&#62;<br>    group_by(title_id, country_code) &#124;&#62;<br>    summarize(event_count = sum(event_type, na.rm = TRUE))<br><br>  ranked_title_activity_by_country &#60;- title_activity_by_country &#124;&#62;<br>    group_by(country_code) &#124;&#62;<br>    mutate(title_rank = rank(desc(event_count)))<br><br>  top_25_title_by_country &#60;- ranked_title_activity_by_country &#124;&#62;<br>    ungroup() &#124;&#62;<br>    filter(title_rank &#60;= 25) &#124;&#62;<br>    mutate(date = as.integer(date)) &#124;&#62;<br>    select(<br>      title_id,<br>      country_code,<br>      title_rank,<br>      event_count,<br>      date<br>    )<br><br>  top_25_title_by_country &#124;&#62;<br>    sdf_repartition(partitions = 1) &#124;&#62;<br>    spark_insert_table(target_table, mode = "overwrite")<br>}<br>  main(args = args, spark = spark)<br>}</pre><h3>Conclusions</h3><p>As you can see, we try to make life easier for Netflix data engineers by offering paved paths and suggestions on how to structure their code, while keeping the variety of options wide enough that they can pick and choose what works best for them in any particular case.</p><p>Having a well-defined set of defaults for data pipeline creation across Netflix makes onboarding easier, promotes standardization, and centralizes best practices. 
Let’s review them below.</p><h4>Onboarding</h4><p>Ramping up on a new team or a business vertical always takes some effort, especially in a “highly aligned, loosely coupled” <a href="https://jobs.netflix.com/culture">culture</a>. Having a well-documented starting point removes some of the struggle that comes with starting from scratch and considerably speeds up the first iteration of the development cycle.</p><h4>Standardization</h4><p>Standardization makes life easier for new team members as well as those already familiar with the domain and tech stack.</p><p>Some transfer of work between people or teams is inevitable. Having a standardized layout and standardized patterns removes friction from this exchange. Also, code reviews and suggestions are easier to manage when working from a similar baseline.</p><p>Standardization also makes project layout more intuitive and minimizes the risk of human error as the codebase evolves.</p><h4>Centralized Best Practices</h4><p>Data infrastructure evolves continually. Having easy access to a centralized set of good defaults is critical to ensure that best practices evolve along with the technology, and that users are aware of the latest additions to the tech-stack menu.</p><p>Even better, Dataflow offers <strong>executable</strong> best practices, which present these concepts in the context of an actual use case. 
Instead of reading documentation, you can initialize a “real” project, change it as needed, and iterate from there.</p><h3>Credits</h3><p>Special thanks to <a href="https://www.linkedin.com/in/danielbwatson/">Daniel Watson</a>, <a href="https://www.linkedin.com/in/jim-hester/">Jim Hester</a>, <a href="https://www.linkedin.com/in/stephenhuenneke/">Stephen Huenneke</a>, and <a href="https://www.linkedin.com/in/girish-lingappa-309aa24/">Girish Lingappa</a> for their contributions to Dataflow sample workflows and to <a href="https://www.linkedin.com/in/andreahairston/">Andrea Hairston</a> for the Dataflow logo design.</p><h3>Next Episode</h3><p>Hopefully you won’t need to wait another year to read about other features of Dataflow. Here are a few topics that we could write about next. Please have a look at the subjects below and, if you feel strongly about any of them, let us know in the comments section:</p><ul><li><strong>Branch driven deployment</strong> — to explain how Dataflow lets anyone customize their CI/CD jobs based on the git branch for easy testing in isolated environments.</li><li><strong>Local SparkSQL unit testing</strong> — to clarify how Dataflow helps in making robust unit tests for Spark SQL transform code, with ease.</li><li><strong>Data migrations made easy</strong> — to show how Dataflow can be used to plan a table migration, support the communication with downstream users and help in monitoring it to completion.</li></ul><hr><p><a href="https://netflixtechblog.com/ready-to-go-sample-data-pipelines-with-dataflow-17440a9e141d">Ready-to-go sample data pipelines with Dataflow</a> was originally published in <a href="https://netflixtechblog.com/">Netflix TechBlog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Data pipeline asset management with Dataflow</title>
		<link>https://noise.getoto.net/2022/02/09/data-pipeline-asset-management-with-dataflow/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Wed, 09 Feb 2022 17:33:22 +0000</pubDate>
				<category><![CDATA[automation]]></category>
		<category><![CDATA[ci-cd-pipeline]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-management]]></category>
		<category><![CDATA[deployment]]></category>
		<guid isPermaLink="false">https://medium.com/p/86525b3e21ca</guid>

					<description><![CDATA[by Sam Setegne, Jai Balani, Olek Gorajek. Glossary: asset — any business logic code in a raw (e.g. SQL) or compiled (e.g. JAR) form to be executed as part of the user-defined data pipeline. data pipeline — a set of tasks (or jobs) to be executed in a predef...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Data Engineers of Netflix — Interview with Pallavi Phadnis</title>
		<link>https://noise.getoto.net/2021/10/28/data-engineers-of-netflix%e2%80%8a-%e2%80%8ainterview-with-pallavi-phadnis/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Thu, 28 Oct 2021 17:30:19 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[culture]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[Technology]]></category>
		<guid isPermaLink="false">https://medium.com/p/a1fcc5f64906</guid>

					<description><![CDATA[This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Pallavi Phadnis is a Senior Software Engine...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Data Engineers of Netflix — Interview with Kevin Wylie</title>
		<link>https://noise.getoto.net/2021/07/15/data-engineers-of-netflix%e2%80%8a-%e2%80%8ainterview-with-kevin-wylie/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Thu, 15 Jul 2021 14:15:59 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[culture]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data-engineering]]></category>
		<guid isPermaLink="false">https://medium.com/p/7fb9113a01ea</guid>

					<description><![CDATA[This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Kevin Wylie is a Data Engineer on the Content D...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Data Engineers of Netflix — Interview with Dhevi Rajendran</title>
		<link>https://noise.getoto.net/2021/06/02/data-engineers-of-netflix%e2%80%8a-%e2%80%8ainterview-with-dhevi-rajendran/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Wed, 02 Jun 2021 01:17:13 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[culture]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[Remote Working]]></category>
		<guid isPermaLink="false">https://medium.com/p/a9ab7c7b36e5</guid>

					<description><![CDATA[This post is part of our “Data Engineers of Netflix” interview series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Dhevi Rajendran is...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Data Engineers of Netflix — Interview with Samuel Setegne</title>
		<link>https://noise.getoto.net/2021/06/02/data-engineers-of-netflix%e2%80%8a-%e2%80%8ainterview-with-samuel-setegne/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Wed, 02 Jun 2021 01:16:43 +0000</pubDate>
				<category><![CDATA[Big Data]]></category>
		<category><![CDATA[culture]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-tools]]></category>
		<category><![CDATA[developer-productivity]]></category>
		<guid isPermaLink="false">https://medium.com/p/f3027f58c2e2</guid>

					<description><![CDATA[This post is part of our “Data Engineers of Netflix” interview series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix. Samuel Setegne is a ...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Mythbusting the Analytics Journey</title>
		<link>https://noise.getoto.net/2020/12/18/mythbusting-the-analytics-journey/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Fri, 18 Dec 2020 20:30:57 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-visualization]]></category>
		<category><![CDATA[Netflix]]></category>
		<guid isPermaLink="false">https://medium.com/p/58d692ea707e</guid>

					<description><![CDATA[Part of our series on who works in Analytics at Netflix — and what the role entails. By Alex Diamond. This Q&#38;A aims to mythbust some common misconceptions about succeeding in analytics at a big tech company. This isn’t your typical recruiting story. I w...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>A Day in the Life of a Content Analytics Engineer</title>
		<link>https://noise.getoto.net/2020/10/31/a-day-in-the-life-of-a-content-analytics-engineer/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Fri, 30 Oct 2020 23:08:34 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-visualization]]></category>
		<category><![CDATA[Netflix]]></category>
		<guid isPermaLink="false">https://medium.com/p/eb0250b993be</guid>

					<description><![CDATA[Part of our series on who works in Analytics at Netflix — and what the role entails. By Rocio Ruelas. Back when we were all working in offices, my favorite days were Monday, Wednesday, and Friday. Those were the days with the best hot breakfast, and I’ve a...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>How Our Paths Brought Us to Data and Netflix</title>
		<link>https://noise.getoto.net/2020/09/18/how-our-paths-brought-us-to-data-and-netflix/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Fri, 18 Sep 2020 18:17:49 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[Netflix]]></category>
		<guid isPermaLink="false">https://medium.com/p/4eced44a6872</guid>

					<description><![CDATA[Part of our series on who works in Analytics at Netflix — and what the role entails. By Julie Beckley &#38; Chris Pham. This Q&#38;A provides insights into the diverse set of skills, projects, and culture within Data Science and Engineering (DSE) at Netfli...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
		<item>
		<title>Analytics at Netflix: Who we are and what we do</title>
		<link>https://noise.getoto.net/2020/09/18/analytics-at-netflix-who-we-are-and-what-we-do/</link>
		
		<dc:creator><![CDATA[Netflix Technology Blog]]></dc:creator>
		<pubDate>Fri, 18 Sep 2020 17:54:32 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data-engineering]]></category>
		<category><![CDATA[data-visualization]]></category>
		<category><![CDATA[Netflix]]></category>
		<guid isPermaLink="false">https://medium.com/p/7d9c08fe6965</guid>

					<description><![CDATA[An Introduction to Analytics and Visualization Engineering at Netflix, by Molly Jackman &#38; Meghana Reddy. Explained: Season 1 (Photo Credit: Netflix). Across nearly every industry, there is recognition that d...]]></description>
		
		
		<enclosure url="" length="0" type="" />

			</item>
	</channel>
</rss>
