The other day I had a conversation with a scientist friend who said something alone the lines of “yes, I work in that general field, but I’m not an expert in your question in particular”. IT is not science, of course, but I asked myself whether I am a real expert in the things that I do. And while it’s nearly impossible to hit exactly the fine line between impostor syndrome and boasting, this post is neither and has a point, so bear with me.
I’ve been doing a lot of things in the general IT field – from general purpose software engineering, IT architecture, information security, applications of cryptography, blockchain, e-government, algorithmic music composition, data analysis. And I’ve seen myself as having relatively expert knowledge. I even occasionally give TV and radio interviews, where I’m labelled as “Expert in X”. But…
Am I a real expert in software engineering and software architecture? I’ve been doing that for 15+ years, and I follow and somethings define or clarify best practices, I’m familiar with different methodologies and have been part of teams that implemented some of them correctly and efficiently. I have taken part in the decision making process of building large systems with their architectural implications. But I’ve never been formally assigned as an “architect” (not that I insist), my UML skills are rather basic and I’ve never had to integrate dozens of legacy systems. I’ve never used formal methods for assessing software, I’ve made mistakes in selecting technologies, I’ve never done proper TDD and I have only a basic understanding of networking. Maybe just the sheer amount of years of experience positions me as an expert, maybe the variety of projects I’ve worked on.
Am I a real expert in cryptography? Almost certainly not. Yes, I’m using cryptographic building blocks regularly, I know what an initialization vector is and I’ve code reviewed a merkle tree implementation. I’ve read dozens of papers on cryptography and understood many of them. But some papers are greek to me – I have no clue about the math behind cryptography. Sure, RSA is easy, but I have just a basic understanding of how elliptic curves work. On the other hand, I probably know more than 99% of the software industry, where the average person barely differentiates symmetric and asymmetric cryptography, IV is a roman numeral, and cryptography boils down to disabling TLS 1.0 on a web server.
Am I a real expert in information security? I’ve given talks on it, I’m in the infosec business, I know and follow best security practices, I know about sqlmap and I’ve even used Wireshark; I understand DEFCON talks and I’ve even decompiled several apps to find (and report) security vulnerabilities. But I’m no Mr. Robot-level hacker, nor I’m a CISO in a large organization who has to plan and implement security measures on hundreds of systems. I haven’t been part of red-teaming exercises and I haven’t built or operated a security operations center (SOC). But maybe in an industry where even having heard of OWASP puts you in the top 10% and actively thinking about the security aspects of each new piece of code puts you in the top 1%, I’m an expert.
Am I a real blockchain expert? I know Bitcoin’s and Ethereum’s implementations, I have implemented something similar to bitcoin’s data structures, I know what a Patricia merkle tree is and I’ve built and pushed raw Ethereum transactions. But I’m no Vitalik Buterin, I can’t build something like Ethereum, I’m only vaguely familiar with distributed consensus algorithms and their pitfalls, and I haven’t written a smart contract more complicated than a tutorial example. I haven’t run a production deployment of Hyperledger (only a test one), and I largely ignore most of the new networks. You may say that one doesn’t need to be Vitalik or Satoshi to be an expert, and with most people seeing blockchain as “that thing that stores data in an unmodifiable way”, one could be an expert by just writing a Hello world smart contract.
Am I a real e-government expert? Sure, I’ve been an e-government advisor to a deputy prime minister, I’ve co-authored legislation and strategic documents and understand how and why e-government works in several EU countries, most notably Estonia, but do I have a holistic view? I have almost no idea of how the e-government is structured in South Korea, Singapore or UAE, for example, I haven’t written a single paper, and I haven’t measured the impact of legislative, organizational and technical measures that we proposed and applied. There are questions that I don’t know the answer to – e.g. how to make the pan-European eID framework actually work.
So the question is – what does it mean to be an expert anyway? We will always be somewhere on an “expert spectrum”. And in many cases our industry doesn’t apply even basic good practices, so even basic expertise can be very valuable. There are always people that know more than you on a given sub-sub-field, and there are always people that are better than you at most of the things that you do. The reputation of “expert” is something important, yet something so vague. Individually it’s good to know where one stands (Dunning-Kruger and everything), and to be aware of the limits of one’s knowledge and understanding. Knowing the things that you don’t know is a good start.
But in a broader context, who’s an expert? Imagine that after we recover from the COVID-19 crisis, there’s a cyber crisis. Who will be the IT experts to advise governments on the measures to be taken? University professors? Senior silicon valley technical people? Who will be on TV to discuss the cyber crisis in the role of “expert” – a senior engineer at a big bank, a junior developer, or someone that took CS 101 in university and happens to know the host? Who will drive the agenda and public opinion?
The level of our expertise is primary for our careers, but it also has other aspects outside of our immediate field. When a crisis hits (and even if it doesn’t) it’s important that we have real experts, that we listen to them and that we trust them. But also to realize no expert knows everything about everything, and that many questions don’t have absolute answers, even for experts. That knowledge decays if not utilized and “published a paper 30 years ago” may be irrelevant today.
I promised that the article would have a point. And it’s two-fold. First, make sure you know what you don’t know, so that you can explore it if needed. Second, we need to value expertise with its imperfections. There is no absolute expert in anything, there are only relative experts. And forgive me for going through my skills (or lack thereof), but that was the best way I could think of for illustrating my point in depth.
Finally, I hope there isn’t a global IT-related crisis. But as some consider it inevitable, we may think about the perception of expertise in our field and who can we trust with certain aspects. There is no “full stack” expert, as the field is too broad.
In a microservice architecture such as Netflix’s, propagating datasets from a single source to multiple downstream destinations can be challenging. These datasets can represent anything from service configuration to the results of a batch job, are often needed in-memory to optimize access and must be updated as they change over time.
One example displaying the need for dataset propagation: at any given time Netflix runs a very large number of A/B tests. These tests span multiple services and teams, and the operators of the tests need to be able to tweak their configuration on the fly. There needs to be the ability to detect nodes that have failed to pick up the latest test configuration, and the ability to revert to older versions of configuration when things go wrong.
Another example of a dataset that needs to be disseminated is the result of a machine-learning model: the results of these models may be used by several teams, but the ML teams behind the model aren’t necessarily interested in maintaining high-availability services in the critical path. Rather than each team interested in consuming the model having to build in fallbacks to degrade gracefully, there is a lot of value in centralizing the work to allow multiple teams to leverage a single team’s effort.
Without infrastructure-level support, every team ends up building their own point solution to varying degrees of success. Datasets themselves are of varying size, from a few bytes to multiple gigabytes. It is important to build in observability and fault detection, and to provide tooling to allow operators to make quick changes without having to develop their own tools.
At Netflix we use an in-house dataset pub/sub system called Gutenberg. Gutenberg allows for propagating versioned datasets — consumers subscribe to data and are updated to the latest versions when they are published. Each version of the dataset is immutable and represents a complete view of the data — there is no dependency on previous versions of data. Gutenberg allows browsing older versions of data for use cases such as debugging, rapid mitigation of data related incidents, and re-training of machine-learning models. This post is a high level overview of the design and architecture of Gutenberg.
The top-level construct in Gutenberg is a “topic”. A publisher publishes to a topic and consumers consume from a topic. Publishing to a topic creates a new monotonically-increasing “version”. Topics have a retention policy that specifies a number of versions or a number of days of versions, depending on the use case. For example, you could configure a topic to retain 10 versions or 10 days of versions.
Each version contains metadata (keys and values) and a data pointer. You can think of a data pointer as special metadata that points to where the actual data you published is stored. Today, Gutenberg supports direct data pointers (where the payload is encoded in the data pointer value itself) and S3 data pointers (where the payload is stored in S3). Direct data pointers are generally used when the data is small (under 1MB) while S3 is used as a backing store when the data is large.
Gutenberg provides the ability to scope publishes to a particular set of consumers — for example by region, application, or cluster. This can be used to canary data changes with a single cluster, roll changes out incrementally, or constrain a dataset so that only a subset of applications can subscribe to it. Publishers decide the scope of a particular data version publish, and they can later add scopes to a previously published version. Note that this means that the concept of a latest version depends on the scope — two applications may see different versions of data as the latest depending on the publish scopes created by the publisher. The Gutenberg service matches the consuming application with the published scopes before deciding what to advertise as the latest version.
The most common use case of Gutenberg is to propagate varied sizes of data from a single publisher to multiple consumers. Often the data is held in memory by consumers and used as a “total cache”, where it is accessed at runtime by client code and atomically swapped out under the hood. Many of these use cases can be loosely grouped as “configuration” — for example Open Connect Appliance cache configuration, supported device type IDs, supported payment method metadata, and A/B test configuration. Gutenberg provides an abstraction between the publishing and consumption of this data — this allows publishers the freedom to iterate on their application without affecting downstream consumers. In some cases, publishing is done via a Gutenberg-managed UI, and teams do not need to manage their own publishing app at all.
Another use case for Gutenberg is as a versioned data store. This is common for machine-learning applications, where teams build and train models based on historical data, see how it performs over time, then tweak some parameters and run through the process again. More generally, batch-computation jobs commonly use Gutenberg to store and propagate the results of a computation as distinct versions of datasets. “Online” use cases subscribe to topics to serve real-time requests using the latest versions of topics’ data, while “offline” systems may instead use historical data from the same topics — for example to train machine-learned models.
An important point to note is that Gutenberg is not designed as an eventing system — it is meant purely for data versioning and propagation. In particular, rapid-fire publishes do not result in subscribed clients stepping through each version; when they ask for an update, they will be provided with the latest version, even if they are currently many versions behind. Traditional pub-sub or eventing systems are suited towards messages that are smaller in size and are consumed in sequence; consumers may build up a view of an entire dataset by consuming an entire (potentially compacted) feed of events. Gutenberg, however, is designed for publishing and consuming an entire immutable view of a dataset.
Design and architecture
Gutenberg consists of a service with gRPC and REST APIs as well as a Java client library that uses the gRPC API.
The Gutenberg client library handles tasks such as subscription management, S3 uploads/downloads, Atlas metrics, and knobs you can tweak using Archaius properties. It communicates with the Gutenberg service via gRPC, using Eureka for service discovery.
Publishers generally use high-level APIs to publish strings, files, or byte arrays. Depending on the data size, the data may be published as a direct data pointer or it may get uploaded to S3 and then published as an S3 data pointer. The client can upload a payload to S3 on the caller’s behalf or it can publish just the metadata for a payload that already exists in S3.
Direct data pointers are automatically replicated globally. Data that is published to S3 is uploaded to multiple regions by the publisher by default, although that can be configured by the caller.
The client library provides subscription management for consumers. This allows users to create subscriptions to particular topics, where the library retrieves data (eg from S3) before handing off to a user-provided listener. Subscriptions operate on a polling model — they ask the service for a new update every 30 seconds, providing the version with which they were last notified. Subscribed clients will never consume an older version of data than the one they are on unless they are pinned (see “Data resiliency” below). Retry logic is baked in and configurable — for instance, users can configure Gutenberg to try older versions of data if it fails to download or process the latest version of data on startup, often to deal with non-backwards-compatible data changes. Gutenberg also provides a pre-built subscription that holds on to the latest data and atomically swaps it out under the hood when a change comes in — this tackles a majority of subscription use cases, where callers only care about the current value at any given time. It allows callers to specify a default value — either for a topic that has never been published to (a good fit when the topic is used for configuration) or if there is an error consuming the topic (to avoid blocking service startup when there is a reasonable default).
Gutenberg also provides high-level client APIs that wrap the low-level gRPC APIs and provide additional functionality and observability. One example of this is to download data for a given topic and version — this is used extensively by components plugged into Netflix Hollow. Another example is a method to get the “latest” version of a topic at a particular time — a common use case when debugging and when training ML models.
Client resiliency and observability
Gutenberg was designed with a bias towards allowing consuming services to be able to start up successfully versus guaranteeing that they start with the freshest data. With this in mind, the client library was built with fallback logic for when it cannot communicate with the Gutenberg service. After HTTP request retries are exhausted, the client downloads a fallback cache of topic publish metadata from S3 and works based off of that. This cache contains all the information needed to decide whether an update needs to be applied, and from where data needs to be fetched (either from the publish metadata itself or from S3). This allows clients to fetch data (which is potentially stale, depending on how current that fallback cache is) without using the service.
Part of the benefit of providing a client library is the ability to expose metrics that can be used to alert on an infrastructure-wide issue or issues with specific applications. Today these metrics are used by the Gutenberg team to monitor our publish-propagation SLI and to alert in the event of widespread issues. Some clients also use these metrics to alert on app-specific errors, for example individual publish failures or a failure to consume a particular topic.
The Gutenberg service is a Governator/Tomcat application that exposes gRPC and REST endpoints. It uses a globally-replicated Cassandra cluster for persistence and to propagate publish metadata to every region. Instances handling consumer requests are scaled separately from those handling publish requests — there are approximately 1000 times more consumer requests than there are publish requests. In addition, this insulates publishing from consumption — a sudden spike in publishing will not affect consumption, and vice versa.
Each instance in the consumer request cluster maintains its own in-memory cache of “latest publishes”, refreshing it from Cassandra every few seconds. This is to handle the large volume of poll requests coming from subscribed clients without passing on the traffic to the Cassandra cluster. In addition, request-pooling low-ttl caches protect against large spikes in requests that could potentially burden Cassandra enough to affect entire region — we’ve had situations where transient errors coinciding with redeployments of large clusters have caused Gutenberg service degradation. Furthermore, we use an adaptive concurrency limiter bucketed by source application to throttle misbehaving applications without affecting others.
For cases where the data was published to S3 buckets in multiple regions, the server makes a decision on what bucket to send back to the client to download from based on where the client is. This also allows the service to provide the client with a bucket in the “closest” region, and to have clients fall back to another region if there is a region outage.
Before returning subscription data to consumers, the Gutenberg service first runs consistency checks on the data. If the checks fail and the polling client already has consumed some data the service returns nothing, which effectively means that there is no update available. If the polling client has not yet consumed any data (this usually means it has just started up), the service queries the history for the topic and returns the latest value that passes consistency checks. This is because we see sporadic replication delays at the Cassandra layer, where by the time a client polls for new data, the metadata associated with the most recently published version has only been partially replicated. This can result in incomplete data being returned to the client, which then manifests itself either as a data fetch failure or an obscure business-logic failure. Running these consistency checks on the server insulates consumers from the eventual-consistency caveats that come with the service’s choice of a data store.
Visibility on topic publishes and nodes that consume a topic’s data is important for auditing and to gather usage info. To collect this data, the service intercepts requests from publishers and consumers (both subscription poll requests and others) and indexes them in Elasticsearch by way of the Keystone data pipeline. This allows us to gain visibility into topic usage and decommission topics that are no longer in use. We expose deep-links into a Kibana dashboard from an internal UI to allow topic owners to get a handle on their consumers in a self-serve manner.
In addition to the clusters serving publisher and consumer requests, the Gutenberg service runs another cluster that runs periodic tasks. Specifically this runs two tasks:
Every few minutes, all the latest publishes and metadata are gathered up and sent to S3. This powers the fallback cache used by the client as detailed above.
A nightly janitor job purges topic versions which exceed their topic’s retention policy. This deletes the underlying data as well (e.g. S3 objects) and helps enforce a well-defined lifecycle for data.
In the world of application development bad deployments happen, and a common mitigation strategy there is to roll back the deployment. A data-driven architecture makes that tricky, since behavior is driven by data that changes over time.
Data propagated by Gutenberg influences — and in many cases drives — system behavior. This means that when things go wrong, we need a way to roll back to a last-known good version of data. To facilitate this, Gutenberg provides the ability to “pin” a topic to a particular version. Pins override the latest version of data and force clients to update to that version — this allows for quick mitigation rather than having an under-pressure operator attempt to figure out how to publish the last known good version. You can even apply a pin to a specific publish scope so that only consumers that match that scope are pinned. Pins also override data that is published while the pin is active, but when the pin is removed clients update to the latest version, which may be the latest version when the pin was applied or a version published while the pin was active.
When deploying new code, it’s often a good idea to canary new builds with a subset of traffic, roll it out incrementally, or otherwise de-risk a deployment by taking it slow. For cases where data drives behavior, a similar principle should be applied.
One feature Gutenberg provides is the ability to incrementally roll out data publishes via Spinnaker pipelines. For a particular topic, users configure what publish scopes they want their publish to go to and what the delay is between each one. Publishing to that topic then kicks off the pipeline, which publishes the same data version to each scope incrementally. Users are able to interact with the pipeline; for example they may choose to pause or cancel pipeline execution if their application starts misbehaving, or they may choose to fast-track a publish to get it out sooner. For example, for some topics we roll out a new dataset version one AWS region at a time.
Gutenberg has been at use at Netflix for the past three years. At present, Gutenberg stores low tens-of-thousands of topics in production, about a quarter of which have published at least once in the last six months. Topics are published at a variety of cadences — from tens of times a minute to once every few months — and on average we see around 1–2 publishes per second, with peaks and troughs about 12 hours apart.
In a given 24 hour period, the number of nodes that are subscribed to at least one topic is in the low six figures. The largest number of topics a single one of these nodes is subscribed to is north of 200, while the median is 7. In addition to subscribed applications, there are a large number of applications that request specific versions of specific topics, for example for ML and Hollow use cases. Currently the number of nodes that make a non-subscribe request for a topic is in the low hundreds of thousands, the largest number of topics requested is 60, and the median is 4.
Here’s a sample of work we have planned for Gutenberg:
Polyglot support: today Gutenberg only supports a Java client, but we’re seeing an increasing number of requests for Node.js and Python support. Some of these teams have cobbled together their own solutions built on top of the Gutenberg REST API or other systems. Rather than have different teams reinvent the wheel, we plan to provide first-class client libraries for Node.js and Python.
Encryption and access control: for sensitive data, Gutenberg publishers should be able to encrypt data and distribute decryption credentials to consumers out-of-band. Adding this feature opens Gutenberg up to another set of use-cases.
Better incremental rollout: the current implementation is in its pretty early days and needs a lot of work to support customization to fit a variety of use cases. For example, users should be able to customize the rollout pipeline to automatically accept or reject a data version based on their own tests.
Alert templates: the metrics exposed by the Gutenberg client are used by the Gutenberg team and a few teams that are power users. Instead, we plan to provide leverage to users by building and parameterizing templates they can use to set up alerts for themselves.
Topic cleanup: currently topics sit around forever unless they are explicitly deleted, even if no one is publishing to them or consuming from them. We plan on building an automated topic cleanup system based on the consumption trends indexed in Elasticsearch.
Data catalog integration: an ongoing issue at Netflix is the problem of cataloging data characteristics and lineage. There is an effort underway to centralize metadata around data sources and sinks, and once Gutenberg integrates with this, we can leverage the catalog to automate tools that message the owners of a dataset.
At Netflix we prioritize innovation and velocity in pursuit of the best experience for our 150+ million global customers. This means that our microservices constantly evolve and change, but what doesn’t change is our responsibility to provide a highly available service that delivers 100+ million hours of daily streaming to our subscribers.
In order to achieve this level of availability, we leverage an N+1 architecture where we treat Amazon Web Services (AWS) regions as fault domains, allowing us to withstand single region failures. In the event of an isolated failure we first pre-scale microservices in the healthy regions after which we can shift traffic away from the failing one. This pre-scaling is necessary due to our use of autoscaling, which generally means that services are right-sized to handle their current demand, not the surge they would experience once we shift traffic.
Though this evacuation capability exists today, this level of resiliency wasn’t always the standard for the Netflix API. In 2013 we first developed our multi-regional availability strategy in response to a catalyst that led us to re-architect the way our service operates. Over the last 6 years Netflix has continued to grow and evolve along with our customer base, invalidating core assumptions built into the machinery that powers our ability to pre-scale microservices. Two such assumptions were that:
Regional demand for all microservices (i.e. requests, messages, connections, etc.) can be abstracted by our key performance indicator, stream starts per second (SPS).
Microservices within healthy regions can be scaled uniformly during an evacuation.
These assumptions simplified pre-scaling, allowing us to treat microservices uniformly, ignoring the uniqueness and regionality of demand. This approach worked well in 2013 due to the existence of monolithic services and a fairly uniform customer base, but became less effective as Netflix evolved.
Regional Microservice Demand
Most of our microservices are in some way related to serving a stream, so SPS seemed like a reasonable stand-in to simplify regional microservice demand. This was especially true for large monolithic services. For example, player logging, authorization, licensing, and bookmarks were initially handled by a single monolithic service whose demand correlated highly with SPS. However, in order to improve developer velocity, operability, and reliability, the monolith was decomposed into smaller, purpose-built services with dissimilar function-specific demand.
Our edge gateway (zuul) also sharded by function to achieve similar wins. The graph below captures the demand for each shard, the combined demand, and SPS. Looking at the combined demand and SPS lines, SPS roughly approximates combined demand for a majority of the day. Looking at individual shards however, the amount of error introduced by using SPS as a demand proxy varies widely.
Uniform Evacuation Scaling
Since we used SPS as a demand proxy, it also seemed reasonable to assume that we can uniformly pre-scale all microservices in the healthy regions. In order to illustrate the shortcomings of this approach, let’s look at playback licensing (DRM) & authorization.
DRM is closely aligned with device type, such that Consumer Electronics (CE), Android, & iOS use different DRM platforms. In addition, the ratio of CE to mobile streaming differs regionally; for example, mobile is more popular in South America. So, if we evacuate South American traffic to North America, demand for CE and Android DRM won’t grow uniformly.
On the other hand, playback authorization is a function used by all devices prior to requesting a license. While it does have some device specific behavior, demand during an evacuation is more a function of the overall change in regional demand.
Closing The Gap
In order to address the issues with our previous approach, we needed to better characterize microservice-specific demand and how it changes when we evacuate. The former requires that we capture regional demand for microservices versus relying on SPS. The latter necessitates a better understanding of microservice demand by device type as well as how regional device demand changes during an evacuation.
Microservice-Specific Regional Demand
Because of service decomposition, we understood that using a proxy demand metric like SPS wasn’t tenable and we needed to transition to microservice-specific demand. Unfortunately, due to the diversity of services, a mix of Java (Governator/Springboot with Ribbon/gRPC, etc.) and Node (NodeQuark), there wasn’t a single demand metric we could rely on to cover all use cases. To address this, we built a system that allows us to associate each microservice with metrics that represent their demand.
The microservice metrics are configuration-driven, self-service, and allows for scoping such that services can have different configurations across various shards and regions. Our system then queries Atlas, our time series telemetry platform, to gather the appropriate historical data.
Microservice Demand By Device Type
Since demand is impacted by regional device preferences, we needed to deconstruct microservice demand to expose the device-specific components. The approach we took was to partition a microservice’s regional demand by aggregated device types (CE, Android, PS4, etc.). Unfortunately, the existing metrics didn’t uniformly expose demand by device type, so we leveraged distributed tracing to expose the required details. Using this sampled trace data we can explain how a microservice’s regional device type demand changes over time. The graph below highlights how relative device demand can vary throughout the day for a microservice.
Device Type Demand
We can use historical device type traffic to understand how to scale the device-specific components of a service’s demand. For example, the graph below shows how CE traffic in us-east-1 changes when we evacuate us-west-2. The nominal and evacuation traffic lines are normalized such that 1 represents the max(nominal traffic) and the demand scaling ratio represents the relative change in demand during an evacuation (i.e. evacuation traffic/nominal traffic).
Microservice-Specific Demand Scaling Ratio
We can now combine microservice demand by device and device-specific evacuation scaling ratios to better represent the change in a microservice’s regional demand during an evacuation — i.e. the microservice’s device type weighted demand scaling ratio. To calculate this ratio (for a specific time of day) we take a service’s device type percentages, multiply by device type evacuation scaling ratios, producing each device type’s contribution to the service’s scaling ratio. Summing these components then yields a device type weighted evacuation scaling ratio for the microservice. To provide a concrete example, the table below shows the evacuation scaling ratio calculation for a fictional service.
The graph below highlights the impact of using a microservice-specific evacuation scaling ratio versus the simplified SPS-based approach used previously. In the case of Service A, the old approach would have done well in approximating the ratio, but in the case of Service B and Service C, it would have resulted in over and under predicting demand, respectively.
Understanding the uniqueness of demand across our microservices improved the quality of our predictions, leading to safer and more efficient evacuations at the cost of additional computational complexity. This new approach, however, is itself an approximation with its own set of assumptions. For example, it assumes all categories of traffic for a device type has similar shape, for example Android logging and playback traffic. As Netflix grows our assumptions will again be challenged and we will have to adapt to continue to provide our customers with the availability and reliability that they have come to expect.
If this article has piqued your interest and you have a passion for solving cross-discipline distributed systems problems, our small but growing team is hiring!
This is the story about how the Content Setup Engineering team used Hollow, a Netflix OSS technology, to re-architect and simplify an essential component in our content pipeline — delivering a large amount of business value in the process.
Each movie and show on the Netflix service is carefully curated to ensure an optimal viewing experience. The team responsible for this curation is Title Operations. Title Operations will confirm, among other things:
We are in compliance with the contracts — date ranges and places where we can show a video are set up correctly for each title
Video with captions, subtitles, and secondary audio “dub” assets are sourced, translated, and made available to the right populations around the world
Title name and synopsis are available and translated
The appropriate maturity ratings are available for each country
When a title meets all of the minimum above requirements, then it is allowed to go live on the service. Gatekeeper is the system at Netflix responsible for evaluating the “liveness” of videos and assets on the site. A title doesn’t become visible to members until Gatekeeper approves it — and if it can’t validate the setup, then it will assist Title Operations by pointing out what’s missing from the baseline customer experience.
Gatekeeper accomplishes its prescribed task by aggregating data from multiple upstream systems, applying some business logic, then producing an output detailing the status of each video in each country.
Hollow, an OSS technology we released a few years ago, has been best described as a total high-density near cache:
Total: The entire dataset is cached on each node — there is no eviction policy, and there are no cache misses.
High-Density: encoding, bit-packing, and deduplication techniques are employed to optimize the memory footprint of the dataset.
Near: the cache exists in RAM on any instance which requires access to the dataset.
One exciting thing about the total nature of this technology — because we don’t have to worry about swapping records in-and-out of memory, we can make assumptions and do some precomputation of the in-memory representation of the dataset which would not otherwise be possible. The net result is, for many datasets, vastly more efficient use of RAM. Whereas with a traditional partial-cache solution you may wonder whether you can get away with caching only 5% of the dataset, or if you need to reserve enough space for 10% in order to get an acceptable hit/miss ratio — with the same amount of memory Hollow may be able to cache 100% of your dataset and achieve a 100% hit rate.
And obviously, if you get a 100% hit rate, you eliminate all I/O required to access your data — and can achieve orders of magnitude more efficient data access, which opens up many possibilities.
Until very recently, Gatekeeper was a completely event-driven system. When a change for a video occurred in any one of its upstream systems, that system would send an event to Gatekeeper. Gatekeeper would react to that event by reaching into each of its upstream services, gathering the necessary input data to evaluate the liveness of the video and its associated assets. It would then produce a single-record output detailing the status of that single video.
This model had several problems associated with it:
This process was completely I/O bound and put a lot of load on upstream systems.
Consequently, these events would queue up throughout the day and cause processing delays, which meant that titles may not actually go live on time.
Worse, events would occasionally get missed, meaning titles wouldn’t go live at all until someone from Title Operations realized there was a problem.
The mitigation for these issues was to “sweep” the catalog so Videos matching specific criteria (e.g., scheduled to launch next week) would get events automatically injected into the processing queue. Unfortunately, this mitigation added many more events into the queue, which exacerbated the problem.
Clearly, a change in direction was necessary.
We decided to employ a total high-density near cache (i.e., Hollow) to eliminate our I/O bottlenecks. For each of our upstream systems, we would create a Hollow dataset which encompasses all of the data necessary for Gatekeeper to perform its evaluation. Each upstream system would now be responsible for keeping its cache updated.
With this model, liveness evaluation is conceptually separated from the data retrieval from upstream systems. Instead of reacting to events, Gatekeeper would continuously process liveness for all assets in all videos across all countries in a repeating cycle. The cycle iterates over every video available at Netflix, calculating liveness details for each of them. At the end of each cycle, it produces a complete output (also a Hollow dataset) representing the liveness status details of all videos in all countries.
We expected that this continuous processing model was possible because a complete removal of our I/O bottlenecks would mean that we should be able to operate orders of magnitude more efficiently. We also expected that by moving to this model, we would realize many positive effects for the business.
A definitive solution for the excess load on upstream systems generated by Gatekeeper
A complete elimination of liveness processing delays and missed go-live dates.
A reduction in the time the Content Setup Engineering team spends on performance-related issues.
Improved debuggability and visibility into liveness processing.
Hollow can also be thought of like a time machine. As a dataset changes over time, it communicates those changes to consumers by breaking the timeline down into a series of discrete data states. Each data state represents a snapshot of the entire dataset at a specific moment in time.
Usually, consumers of a Hollow dataset are loading the latest data state and keeping their cache updated as new states are produced. However, they may instead point to a prior state — which will revert their view of the entire dataset to a point in the past.
The traditional method of producing data states is to maintain a single producer which runs a repeating cycle. During that cycle, the producer iterates over all records from the source of truth. As it iterates, it adds each record to the Hollow library. Hollow then calculates the differences between the data added during this cycle and the data added during the last cycle, then publishes the state to a location known to consumers.
The problem with this total-source-of-truth iteration model is that it can take a long time. In the case of some of our upstream systems, this could take hours. This data-propagation latency was unacceptable — we can’t wait hours for liveness processing if, for example, Title Operations adds a rating to a movie that needs to go live imminently.
What we needed was a faster time machine — one which could produce states with a more frequent cadence, so that changes could be more quickly realized by consumers.
To achieve this, we created an incremental Hollow infrastructure for Netflix, leveraging work which had been done in the Hollow library earlier, and pioneered in productionusage by the Streaming Platform Team at Target (and is now a public non-beta API).
With this infrastructure, each time a change is detected in a source application, the updated record is encoded and emitted to a Kafka topic. A new component that is not part of the source application, the Hollow Incremental Producer service, performs a repeating cycle at a predefined cadence. During each cycle, it reads all messages which have been added to the topic since the last cycle and mutates the Hollow state engine to reflect the new state of the updated records.
If a message from the Kafka topic contains the exact same data as already reflected in the Hollow dataset, no action is taken.
To mitigate issues arising from missed events, we implement a sweep mechanism that periodically iterates over an entire source dataset. As it iterates, it emits the content of each record to the Kafka topic. In this way, any updates which may have been missed will eventually be reflected in the Hollow dataset. Additionally, because this is not the primary mechanism by which updates are propagated to the Hollow dataset, this does not have to be run as quickly or frequently as a cycle must iterate the source in traditional Hollow usage.
The Hollow Incremental Producer is capable of reading a great many messages from the Kafka topic and mutating its Hollow state internally very quickly — so we can configure its cycle times to be very short (we are currently defaulting this to 30 seconds).
This is how we built a faster time machine. Now, if Title Operations adds a maturity rating to a movie, within 30 seconds, that data is available in the corresponding Hollow dataset.
The Tangible Result
With the data propagation latency issue solved, we were able to re-implement the Gatekeeper system to eliminate all I/O boundaries. With the prior implementation of Gatekeeper, re-evaluating all assets for all videos in all countries would have been unthinkable — it would tie up the entire content pipeline for more than a week (and we would then still be behind by a week since nothing else could be processed in the meantime). Now we re-evaluate everything in about 30 seconds — and we do that every minute.
There is no such thing as a missed or delayed liveness evaluation any longer, and the disablement of the prior Gatekeeper system reduced the load on our upstream systems — in some cases by up to 80%.
In addition to these performance benefits, we also get a resiliency benefit. In the prior Gatekeeper system, if one of the upstream services went down, we were unable to evaluate liveness at all because we were unable to retrieve any data from that system. In the new implementation, if one of the upstream systems goes down then it does stop publishing — but we still gate stale data for its corresponding dataset while all others make progress. So for example, if the translated synopsis system goes down, we can still bring a movie on-site in a region if it was held back for, and then receives, the correct subtitles.
The Intangible Result
Perhaps even more beneficial than the performance gains has been the improvement in our development velocity in this system. We can now develop, validate, and release changes in minutes which might have before taken days or weeks — and we can do so with significantly increased release quality.
The time-machine aspect of Hollow means that every deterministic process which uses Hollow exclusively as input data is 100% reproducible. For Gatekeeper, this means that an exact replay of what happened at time X can be accomplished by reverting all of our input states to time X, then re-evaluating everything again.
We use this fact to iterate quickly on changes to the Gatekeeper business logic. We maintain a PREPROD Gatekeeper instance which “follows” our PROD Gatekeeper instance. PREPROD is also continuously evaluating liveness for the entire catalog, but publishing its output to a different Hollow dataset. At the beginning of each cycle, the PREPROD environment will gather the latest produced state from PROD, and set each of its input datasets to the exact same versions which were used to produce the PROD output.
When we want to make a change to the Gatekeeper business logic, we do so and then publish it to our PREPROD cluster. The subsequent output state from PREPROD can be diffed with its corresponding output state from PROD to view the precise effect that the logic change will cause. In this way, at a glance, we can validate that our changes have precisely the intended effect, and zero unintended consequences.
This, coupled with some iteration on the deployment process, has resulted in the ability for our team to code, validate, and deploy impactful changes to Gatekeeper in literally minutes — at least an order of magnitude faster than in the prior system — and we can do so with a higher level of safety than was possible in the previous architecture.
This new implementation of the Gatekeeper system opens up opportunities to capture additional business value, which we plan to pursue over the coming quarters. Additionally, this is a pattern that can be replicated to other systems within the Content Engineering space and elsewhere at Netflix — already a couple of follow-up projects have been launched to formalize and capitalize on the benefits of this n-hollow-input, one-hollow-output architecture.
Content Setup Engineering is an exciting space right now, especially as we scale up our pipeline to produce more content with each passing quarter. We have many opportunities to solve real problems and provide massive value to the business — and to do so with a deep focus on computer science, using and often pioneering leading-edge technologies. If this kind of work sounds appealing to you, reach out to Ivan to get the ball rolling.
I was thinking the other days – why writing good code is so hard? Why the industry still hasn’t got to producing quality software, despite years of efforts, best practices, methodologies, tools. And the answer to those questions is anything but simple. It involves economic incentives, market realities, deadlines, formal education, industry standards, insufficient number of developers on the market, etc. etc.
As an organization, in order to produce quality software, you have to do a lot. Setup processes, get your recruitment right, be able to charge the overhead of quality to your customers, and actually care about that.
But even with all the measures taken, you can’t guarantee quality code. First, because that’s subjective, but second, because it always comes down to the individual developers. And not simply whether they are capable of writing quality software, but also whether they are actually doing it.
And as a developer, you may fit the process and still produce mediocre code. This is why my thoughts took me to the code from the eyes of the developer, but in the context of software as a whole. Tools can automatically catch code styles issues, cyclomatic complexity, large methods, too many method parameters, circular dependencies, etc. etc. But even if you cover those, you are still not guaranteed to have produced quality software.
So I came up with seven questions that we as developers should ask ourselves each time we commit code.
Is it correct? – does the code implement the specification. If there is no clear specification, did you do a sufficient effort to find out the expected behaviour. And is that behaviour tested somehow – by automated tests preferably, or at least by manual testing,.
Is it complete? – does it take care of all the edge cases, regardless of whether they are defined in the spec or not. Many edge cases are technical (broken connections, insufficient memory, changing interfaces, etc.).
Is it secure? – does it prevent misuse, does it follow best security practices, does it validate its input, does it prevent injections, etc. Is it tested to prove that it is secure against these known attacks. Security is much more than code, but the code itself can introduce a lot of vulnerabilities.
Is it readable and maintainable? – does it allow other people to easily read it, follow it and understand it? Does it have proper comments, describing how a certain piece of code fits into the big picture, does it break down code in small, readable units.
Is it extensible? – does it allow being extended with additional use cases, does it use the appropriate design patterns that allow extensibility, is it parameterizable and configurable, does it allow writing new functionality without breaking old one, does it cover a sufficient percentage of the existing functionality with tests so that changes are not “scary”.
Is it efficient? – does work well under high load, does it care about algorithmic complexity (without being prematurely optimized), does it use batch processing, does it read avoid loading big chunks of data in memory at once, does it make proper use of asynchronous processing.
Is it something to be proud of? – does it represent every good practice that your experience has taught you? Not every piece of code is glorious, as most perform mundane tasks, but is the code something to be proud of or something you’d hope nobody sees? Would you be okay to put it on GitHub?
I think we can internalize those questions. Will asking them constantly make a difference? I think so. Would we magically get quality software If every developer asked themselves these questions about their code? Certainly not. But we’d have better code, when combined with existing tools, processes and practices.
Quality software depends on many factors, but developers are one of the most important ones. Bad software is too often our fault, and by asking ourselves the right questions, we can contribute to good software as well.
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. Devices and sensors with GPS can provide a spatial context to nearly all mobile data.
In this post, we introduce the WIND toolkit, a huge (500 TB), open weather model dataset that’s available to the world on Amazon’s cloud services. We walk through how to access this data and some of the open-source software developed to make it easily accessible. Our solution considers a subset of geospatial data that exist on a grid (raster) and explores ways to provide access to large-scale raster data from weather models. The solution uses foundational AWS services and the Hierarchical Data Format (HDF), a well adopted format for scientific data.
The approach developed here can be extended to any data that fit in an HDF5 file, which can describe sparse and dense vectors and matrices of arbitrary dimensions. This format is already popular within the physical sciences for both experimental and simulation data. We discuss solutions to gridded data storage for a massive dataset of public weather model outputs called the Wind Integration National Dataset (WIND) toolkit. We also highlight strategies that are general to other large geospatial data management problems.
Wind Integration National Dataset
As variable renewable power penetration levels increase in power systems worldwide, the importance of renewable integration studies to ensure continued economic and reliable operation of the power grid is also increasing. The WIND toolkit is the largest freely available grid integration dataset to date.
The WIND toolkit was developed by 3TIER by Vaisala. They were under a subcontract to the National Renewable Energy Laboratory (NREL) to support studies on integration of wind energy into the existing US grid. NREL is a part of a network of national laboratories for the US Department of Energy and has a mission to advance the science and engineering of energy efficiency, sustainable transportation, and renewable power technologies.
The toolkit has been used by consultants, research groups, and universities worldwide to support grid integration studies. Less traditional uses also include resource assessments for wind plants (such as those powering Amazon data centers), and studying the effects of weather on California condor migrations in the Baja peninsula.
The diversity of applications highlights the value of accessible, open public data. Yet, there’s a catch: the dataset is huge. The WIND toolkit provides simulated atmospheric (weather) data at a two-km spatial resolution and five-minute temporal resolution at multiple heights for seven years. The entire dataset is half a petabyte (500 TB) in size and is stored in the NREL High Performance Computing data center in Golden, Colorado. Making this dataset publicly available easily and in a cost-effective manner is a major challenge.
As other laboratories and public institutions work to release their data to the world, they may face similar challenges to those that we experienced. Some prior, well-intentioned efforts to release huge datasets as-is have resulted in data resources that are technically available but fundamentally unusable. They may be stored in an unintuitive format or indexed and organized to support only a subset of potential uses. Downloading hundreds of terabytes of data is often impractical. Most users don’t have access to a big data cluster (or super computer) to slice and dice the data as they need after it’s downloaded.
We aim to provide a large amount of data (50 terabytes) to the public in a way that is efficient, scalable, and easy to use. In many cases, researchers can access these huge cloud-located datasets using the same software and algorithms they have developed for smaller datasets stored locally. Only the pieces of data they need for their individual analysis must be downloaded. To make this work in practice, we worked with the HDF Group and have built upon their forthcoming Highly Scalable Data Service.
In the rest of this post, we discuss how the HSDS software was developed to use Amazon EC2 and Amazon S3 resources to provide convenient and scalable access to these huge geospatial datasets. We describe how the HSDS service has been put to work for the WIND Toolkit dataset and demonstrate how to access it using the h5pyd Python library and the REST API. We conclude with information about our ongoing work to release more ‘open’ datasets to the public using AWS services, and ways to improve and extend the HSDS with newer Amazon services like Amazon ECS and AWS Lambda.
Developing a scalable service for big geospatial data
The HDF5 file format and API have been used for many years and is an effective means of storing large scientific datasets. For example, NASA’s Earth Observing System (EOS) satellites collect more than 16 TBs of data per day using HDF5.
With the rise of the cloud, there are new challenges and opportunities to rethink how HDF5 can be enhanced to work effectively as a component in a cloud-native architecture. For the HDF Group, working with NREL has been a great opportunity to put ideas into practice with a production-size dataset.
An HDF5 file consists of a directed graph of group and dataset objects. Datasets can be thought of as a multidimensional array with support for user-defined metadata tags and compression. Typical operations on datasets would be reading or writing data to a regular subregion (a hyperslab) or reading and writing individual elements (a point selection). Also, group and dataset objects may each contain an arbitrary number of the user-defined metadata elements known as attributes.
Many people have used the HDF library in applications developed or ported to run on EC2 instances, but there are a number of constraints that often prove problematic:
The HDF5 library can’t read directly from HDF5 files stored as S3 objects. The entire file (often many GB in size) would need to be copied to local storage before the first byte can be read. Also, the instance must be configured with the appropriately sized EBS volume)
The HDF library only has access to the computational resources of the instance itself (as opposed to a cluster of instances), so many operations are bottlenecked by the library.
Any modifications to the HDF5 file would somehow have to be synchronized with changes that other instances have made to same file before writing back to S3.
Using a pattern common to many offerings from AWS, the solution to these constraints is to develop a service framework around the HDF data model. Using this model, the HDF Group has created the Highly Scalable Data Service (HSDS) that provides all the functionality that traditionally was provided by the HDF5 library. By using the service, you don’t need to manage your own file volumes, but can just read and write whatever data that you need.
Because the service manages the actual data persistence to a durable medium (S3, in this case), you don’t need to worry about disk management. Simply stream the data you need from the service as you need it. Secondly, putting the functionality behind a service allows some tricks to increase performance (described in more detail later). And lastly, HSDS allows any number of clients to access the data at the same time, enabling HDF5 to be used as a coordination mechanism for multiple readers and writers.
In designing the HSDS architecture, we gave much thought to how to achieve scalability of the HSDS service. For accessing HDF5 data, there are two different types of scaling to consider:
Multiple clients making many requests to the service
Single requests that require a significant amount of data processing
To deal with the first scaling challenge, as with most services, we considered how the service responds as the request rate increases. AWS provides some great tools that help in this regard:
Auto Scaling groups
Elastic Load Balancing load balancers
The ability of S3 to handle large aggregate throughput rates
By using a cluster of EC2 instances behind a load balancer, you can handle different client loads in a cost-effective manner.
The second scaling challenge concerns single requests that would take significant processing time with just one compute node. One example of this from the WIND toolkit would be extracting all the values in the seven-year time span for a given geographic point and dataset.
In HDF5, large datasets are typically stored as “chunks”; that is, a regular partition of the array. In HSDS, each chunk is stored as a binary object in S3. The sequential approach to retrieving the time series values would be for the service to read each chunk needed from S3, extract the needed elements, and go on to the next chunk. In this case, that would involve processing 2557 chunks, and would be quite slow.
Fortunately, with HSDS, you can speed this up quite a bit by exploiting the compute and I/O capabilities of the cluster. Upon receiving the request, the receiving node can use other nodes in the cluster to read different portions of the selection. With multiple nodes reading from S3 in parallel, performance improves as the cluster size increases.
The diagram below illustrates how this works in simplified case of four chunks and four nodes.
This architecture has worked in well in practice. In testing with the WIND toolkit and time series extraction, we observed a request latency of ~60 seconds using four nodes vs. ~5 seconds with 40 nodes. Performance roughly scales with the size of the cluster.
A planned enhancement to this is to use AWS Lambda for the worker processing. This enables 1000-way parallel reads at a reasonable cost, as you only pay for the milliseconds of CPU time used with AWS Lambda.
Public access to atmospheric data using HSDS and AWS
An early challenge in releasing the WIND toolkit data was in deciding how to subset the data for different use cases. In general, few researchers need access to the entire 0.5 PB of data and a great deal of efficiency and cost reduction can be gained by making directed constituent datasets.
NREL grid integration researchers initially extracted a 2-TB subset by selecting 120,000 points where the wind resource seemed appropriate for development. They also chose only those data important for wind applications (100-m wind speed, converted to power), the most interesting locations for those performing grid studies. To support the remaining users who needed more data resolution, we down-sampled the data to a 60-minute temporal resolution, keeping all the other variables and spatial resolution intact. This reduced dataset is 50 TB of data describing 30+ atmospheric variables of data for 7 years at a 60-minute temporal resolution.
Programmatic access is possible using the h5pyd Python library, a distributed analog to the widely used h5py library. Users interact with the datasets (variables) and slice the data from its (time x longitude x latitude) cube form as they see fit.
Examples and use cases are described in a set of Jupyter notebooks and available on GitHub:
Now you have a Jupyter notebook server running on your EC2 server.
From your laptop, create an SSH tunnel:
$ ssh –L 8888:localhost:8888 (IP address of the EC2 server)
Now, you can browse to localhost:8888 using the correct token, and interact with the notebooks as if they were local. Within the directory, there are examples for accessing the HSDS API and plotting wind and weather data using matplotlib.
Controlling access and defraying costs
A final concern is rate limiting and access control. Although the HSDS service is scalable and relatively robust, we had a few practical concerns:
How can we protect from malicious or accidental use that may lead to high egress fees (for example, someone who attempts to repeatedly download the entire dataset from S3)?
How can we keep track of who is using the data both to document the value of the data resource and to justify the costs?
If costs become too high, can we charge for some or all API use to help cover the costs?
To approach these problems, we investigated using Amazon API Gateway and its simplified integration with the AWS Marketplace for SaaS monetization as well as third-party API proxies.
In the end, we chose to use API Umbrella due to its close involvement with http://data.gov. While AWS Marketplace is a compelling option for future datasets, the decision was made to keep this dataset entirely open, at least for now. As community use and associated costs grow, we’ll likely revisit Marketplace. Meanwhile, API Umbrella provides controls for rate limiting and API key registration out of the box and was simple to implement as a front-end proxy to HSDS. Those applications that may want to charge for API use can accomplish a similar strategy using Amazon API Gateway and AWS Marketplace.
Ongoing work and other resources
As NREL and other government research labs, municipalities, and organizations try to share data with the public, we expect many of you will face similar challenges to those we have tried to approach with the architecture described in this post. Providing large datasets is one challenge. Doing so in a way that is affordable and convenient for users is an entirely more difficult goal. Using AWS cloud-native services and the existing foundation of the HDF file format has allowed us to tackle that challenge in a meaningful way.
Dr. Caleb Phillips is a senior scientist with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Caleb comes from a background in computer science systems, applied statistics, computational modeling, and optimization. His work at NREL spans the breadth of renewable energy technologies and focuses on applying modern data science techniques to data problems at scale.
Dr. Caroline Draxl is a senior scientist at NREL. She supports the research and modeling activities of the US Department of Energy from mesoscale to wind plant scale. Caroline uses mesoscale models to research wind resources in various countries, and participates in on- and offshore boundary layer research and in the coupling of the mesoscale flow features (kilometer scale) to the microscale (tens of meters). She holds a M.S. degree in Meteorology and Geophysics from the University of Innsbruck, Austria, and a PhD in Meteorology from the Technical University of Denmark.
John Readey has been a Senior Architect at The HDF Group since he joined in June 2014. His interests include web services related to HDF, applications that support the use of HDF and data visualization.Before joining The HDF Group, John worked at Amazon.com from 2006–2014 where he developed service-based systems for eCommerce and AWS.
Jordan Perr-Sauer is an RPP intern with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Jordan hopes to use his professional background in software engineering and his academic training in applied mathematics to solve the challenging problems facing America and the world.
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2, robust and reliable object storage at just $0.005/gb/mo. We offer the lowest price of any of the big players and are still profitable.
Backblaze has a culture of openness. The hardware designs for our storage pods are open source. Key parts of the software, including the Reed-Solomon erasure coding are open-source. Backblaze is the only company that publishes hard drive reliability statistics.
We’ve managed to nurture a team-oriented culture with amazingly low turnover. We value our people and their families. The team is distributed across the U.S., but we work in Pacific Time, so work is limited to work time, leaving evenings and weekends open for personal and family time. Check out our “About Us” page to learn more about the people and some of our perks.
We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.
Our engineering team is 10 software engineers, and 2 quality assurance engineers. Most engineers are experienced, and a couple are more junior. The team will be growing as the company grows to meet the demand for our products; we plan to add at least 6 more engineers in 2018. The software includes the storage systems that run in the data center, the web APIs that clients access, the web site, and client programs that run on phones, tablets, and computers.
As the Director of Engineering, you will be:
managing the software engineering team
ensuring consistent delivery of top-quality services to our customers
collaborating closely with the operations team
directing engineering execution to scale the business and build new services
transforming a self-directed, scrappy startup team into a mid-size engineering organization
A successful director will have the opportunity to grow into the role of VP of Engineering. Backblaze expects to continue our exponential growth of our storage services in the upcoming years, with matching growth in the engineering team..
This position is located in San Mateo, California.
We are a looking for a director who:
has a good understanding of software engineering best practices
has experience scaling a large, distributed system
gets energized by creating an environment where engineers thrive
understands the trade-offs between building a solid foundation and shipping new features
has a track record of building effective teams
Required for all Backblaze Employees:
Good attitude and willingness to do whatever it takes to get the job done
Strong desire to work for a small fast-paced company
Desire to learn and adapt to rapidly changing technologies and work environment
Rigorous adherence to best practices
Relentless attention to detail
Excellent interpersonal skills and good oral/written communication
Excellent troubleshooting and problem solving skills
Some Backblaze Perks:
Competitive healthcare plans
Competitive compensation and 401k
All employees receive Option grants
Unlimited vacation days
Fully stocked Micro kitchen
Catered breakfast and lunches
Awesome people who work on awesome projects
New Parent Childcare bonus
Normal work hours
Get to bring your pets into the office
San Mateo Office — located near Caltrain and Highways 101 & 280.
Product security is an interesting animal: it is a uniquely cross-disciplinary endeavor that spans policy, consulting, process automation, in-depth software engineering, and cutting-edge vulnerability research. And in contrast to many other specializations in our field of expertise – say, incident response or network security – we have virtually no time-tested and coherent frameworks for setting it up within a company of any size.
In my previous post, I shared some thoughts on nurturing technical organizations and cultivating the right kind of leadership within. Today, I figured it would be fitting to follow up with several notes on what I learned about structuring product security work – and about actually making the effort count.
The “comfort zone” trap
For security engineers, knowing your limits is a sought-after quality: there is nothing more dangerous than a security expert who goes off script and starts dispensing authoritatively-sounding but bogus advice on a topic they know very little about. But that same quality can be destructive when it prevents us from growing beyond our most familiar role: that of a critic who pokes holes in other people’s designs.
The role of a resident security critic lends itself all too easily to a sense of supremacy: the mistaken belief that our cognitive skills exceed the capabilities of the engineers and product managers who come to us for help – and that the cool bugs we file are the ultimate proof of our special gift. We start taking pride in the mere act of breaking somebody else’s software – and then write scathing but ineffectual critiques addressed to executives, demanding that they either put a stop to a project or sign off on a risk. And hey, in the latter case, they better brace for our triumphant “I told you so” at some later date.
Of course, escalations of this type have their place, but they need to be a very rare sight; when practiced routinely, they are a telltale sign of a dysfunctional team. We might be failing to think up viable alternatives that are in tune with business or engineering needs; we might be very unpersuasive, failing to communicate with other rational people in a language they understand; or it might be that our tolerance for risk is badly out of whack with the rest of the company. Whatever the cause, I’ve seen high-level escalations where the security team spoke of valiant efforts to resist inexplicably awful design decisions or data sharing setups; and where product leads in turn talked about pressing business needs randomly blocked by obstinate security folks. Sometimes, simply having them compare their notes would be enough to arrive at a technical solution – such as sharing a less sensitive subset of the data at hand.
To be effective, any product security program must be rooted in a partnership with the rest of the company, focused on helping them get stuff done while eliminating or reducing security risks. To combat the toxic us-versus-them mentality, I found it helpful to have some team members with software engineering backgrounds, even if it’s the ownership of a small open-source project or so. This can broaden our horizons, helping us see that we all make the same mistakes – and that not every solution that sounds good on paper is usable once we code it up.
Getting off the treadmill
All security programs involve a good chunk of operational work. For product security, this can be a combination of product launch reviews, design consulting requests, incoming bug reports, or compliance-driven assessments of some sort. And curiously, such reactive work also has the property of gradually expanding to consume all the available resources on a team: next year is bound to bring even more review requests, even more regulatory hurdles, and even more incoming bugs to triage and fix.
Being more tractable, such routine tasks are also more readily enshrined in SDLs, SLAs, and all kinds of other official documents that are often mistaken for a mission statement that justifies the existence of our teams. Soon, instead of explaining to a developer why they should fix a particular problem right away, we end up pointing them to page 17 in our severity classification guideline, which defines that “severity 2” vulnerabilities need to be resolved within a month. Meanwhile, another policy may be telling them that they need to run a fuzzer or a web application scanner for a particular number of CPU-hours – no matter whether it makes sense or whether the job is set up right.
To run a product security program that scales sublinearly, stays abreast of future threats, and doesn’t erect bureaucratic speed bumps just for the sake of it, we need to recognize this inherent tendency for operational work to take over – and we need to reign it in. No matter what the last year’s policy says, we usually don’t need to be doing security reviews with a particular cadence or to a particular depth; if we need to scale them back 10% to staff a two-quarter project that fixes an important API and squashes an entire class of bugs, it’s a short-term risk we should feel empowered to take.
As noted in my earlier post, I find contingency planning to be a valuable tool in this regard: why not ask ourselves how the team would cope if the workload went up another 30%, but bad financial results precluded any team growth? It’s actually fun to think about such hypotheticals ahead of the time – and hey, if the ideas sound good, why not try them out today?
Living for a cause
It can be difficult to understand if our security efforts are structured and prioritized right; when faced with such uncertainty, it is natural to stick to the safe fundamentals – investing most of our resources into the very same things that everybody else in our industry appears to be focusing on today.
I think it’s important to combat this mindset – and if so, we might as well tackle it head on. Rather than focusing on tactical objectives and policy documents, try to write down a concise mission statement explaining why you are a team in the first place, what specific business outcomes you are aiming for, how do you prioritize it, and how you want it all to change in a year or two. It should be a fluid narrative that reads right and that everybody on your team can take pride in; my favorite way of starting the conversation is telling folks that we could always have a new VP tomorrow – and that the VP’s first order of business could be asking, “why do you have so many people here and how do I know they are doing the right thing?”. It’s a playful but realistic framing device that motivates people to get it done.
In general, a comprehensive product security program should probably start with the assumption that no matter how many resources we have at our disposal, we will never be able to stay in the loop on everything that’s happening across the company – and even if we did, we’re not going to be able to catch every single bug. It follows that one of our top priorities for the team should be making sure that bugs don’t happen very often; a scalable way of getting there is equipping engineers with intuitive and usable tools that make it easy to perform common tasks without having to worry about security at all. Examples include standardized, managed containers for production jobs; safe-by-default APIs, such as strict contextual autoescaping for XSS or type safety for SQL; security-conscious style guidelines; or plug-and-play libraries that take care of common crypto or ACL enforcement tasks.
Of course, not all problems can be addressed on framework level, and not every engineer will always reach for the right tools. Because of this, the next principle that I found to be worth focusing on is containment and mitigation: making sure that bugs are difficult to exploit when they happen, or that the damage is kept in check. The solutions in this space can range from low-level enhancements (say, hardened allocators or seccomp-bpf sandboxes) to client-facing features such as browser origin isolation or Content Security Policy.
The usual consulting, review, and outreach tasks are an important facet of a product security program, but probably shouldn’t be the sole focus of your team. It’s also best to avoid undue emphasis on vulnerability showmanship: while valuable in some contexts, it creates a hypercompetitive environment that may be hostile to less experienced team members – not to mention, squashing individual bugs offers very limited value if the same issue is likely to be reintroduced into the codebase the next day. I like to think of security reviews as a teaching opportunity instead: it’s a way to raise awareness, form partnerships with engineers, and help them develop lasting habits that reduce the incidence of bugs. Metrics to understand the impact of your work are important, too; if your engagements are seen mostly as a yet another layer of red tape, product teams will stop reaching out to you for advice.
The other tenet of a healthy product security effort requires us to recognize at a scale and given enough time, every defense mechanism is bound to fail – and so, we need ways to prevent bugs from turning into incidents. The efforts in this space may range from developing product-specific signals for the incident response and monitoring teams; to offering meaningful vulnerability reward programs and nourishing a healthy and respectful relationship with the research community; to organizing regular offensive exercises in hopes of spotting bugs before anybody else does.
Oh, one final note: an important feature of a healthy security program is the existence of multiple feedback loops that help you spot problems without the need to micromanage the organization and without being deathly afraid of taking chances. For example, the data coming from bug bounty programs, if analyzed correctly, offers a wonderful way to alert you to systemic problems in your codebase – and later on, to measure the impact of any remediation and hardening work.
This column is from The MagPi issue 59. You can download a PDF of the full issue for free, or subscribe to receive the print edition through your letterbox or the digital edition on your tablet. All proceeds from the print and digital editions help the Raspberry Pi Foundation achieve our charitable goals.
“Hey, world!” Estefannie exclaims, a wide grin across her face as the camera begins to roll for another YouTube tutorial video. With a growing number of followers and wonderful support from her fans, Estefannie is building a solid reputation as an online maker, creating unique, fun content accessible to all.
It’s as if she was born into performing and making for an audience, but this fun, enjoyable journey to social media stardom came not from a desire to be in front of the camera, but rather as a unique approach to her own learning. While studying, Estefannie decided the best way to confirm her knowledge of a subject was to create an educational video explaining it. If she could teach a topic successfully, she knew she’d retained the information. And so her YouTube channel, Estefannie Explains It All, came into being.
Her first videos featured pages of notes with voice-over explanations of data structure and algorithm analysis. Then she moved in front of the camera, and expanded her skills in the process.
But YouTube isn’t her only outlet. With nearly 50000 followers, Estefannie’s Instagram game is strong, adding to an increasing number of female coders taking to the platform. Across her Instagram grid, you’ll find insights into her daily routine, from programming on location for work to behind-the-scenes troubleshooting as she begins to create another tutorial video. It’s hard work, with content creation for both Instagram and YouTube forever on her mind as she continues to work and progress successfully as a software engineer.
As a thank you to her Instagram fans for helping her reach 10000 followers, Estefannie created a free game for Android and iOS called Gravitris — imagine Tetris with balance issues!
Estefannie was born and raised in Mexico, with ambitions to become a graphic designer and animator. However, a documentary on coding at Pixar, and the beauty of Merida’s hair in Brave, opened her mind to the opportunities of software engineering in animation. She altered her career path, moved to the United States, and switched to a Computer Science course.
With a constant desire to make and to learn, Estefannie combines her software engineering profession with her hobby to create fun, exciting content for YouTube.
While studying, Estefannie started a Computer Science Girls Club at the University of Houston, Texas, and she found herself eager to put more time and effort into the movement to increase the percentage of women in the industry. The club was a success, and still is to this day. While Estefannie has handed over the reins, she’s still very involved in the cause.
Through her YouTube videos, Estefannie continues the theme of inclusion, with every project offering a warm sense of approachability for all, regardless of age, gender, or skill. From exploring Scratch and Makey Makey with her young niece and nephew to creating her own Disney ‘Made with Magic’ backpack for a trip to Disney World, Florida, Estefannie’s videos are essentially a documentary of her own learning process, produced so viewers can learn with her — and learn from her mistakes — to create their own tech wonders.
Estefannie’s automated gingerbread house project was a labour of love, with electronics, wires, and candy strewn across both her living room and kitchen for weeks before completion. While she already was a skilled programmer, the world of physical digital making was still fairly new for Estefannie. Having ditched her hot glue gun in favour of a soldering iron in a previous video, she continued to experiment and try out new, interesting techniques that are now second nature to many members of the maker community. With the gingerbread house, Estefannie was able to research and apply techniques such as light controls, servos, and app making, although the latter was already firmly within her skill set. The result? A fun video of ups and downs that resulted in a wonderful, festive treat. She even gave her holiday home its own solar panel!
1,910 Likes, 43 Comments – Estefannie Explains It All (@estefanniegg) on Instagram: “A DAY AT RASPBERRY PI TOWERS!! LINK IN BIO @raspberrypifoundation”
And that’s just the beginning of her adventures with Pi…but we won’t spoil her future plans by telling you what’s coming next. Sorry! However, since this article was written last year, Estefannie has released a few more Pi-based project videos, plus some awesome interviews and live-streams with other members of the maker community such as Simone Giertz. She even made us an awesome video for our Raspberry Pi YouTube channel! So be sure to check out her latest releases.
2,264 Likes, 56 Comments – Estefannie Explains It All (@estefanniegg) on Instagram: “Best day yet!! I got to hangout, play Jenga with a huge arm robot, and have afternoon tea with…”
While many wonderful maker videos show off a project without much explanation, or expect a certain level of skill from viewers hoping to recreate the project, Estefannie’s videos exist almost within their own category. We can’t wait to see where Estefannie Explains It All goes next!
Cerberus Technologies, in their own words: Cerberus is a company founded in 2017 by a team of visionary iGaming veterans. Our mission is simple – to offer the best tech solutions through a data-driven and a customer-first approach, delivering innovative solutions that go against traditional forms of working and process. This mission is based on the solid foundations of reliability, flexibility and security, and we intend to fundamentally change the way iGaming and other industries interact with technology.
Over the years, I have developed and created a number of data warehouses from scratch. Recently, I built a data warehouse for the iGaming industry single-handedly. To do it, I used the power and flexibility of Amazon Redshift and the wider AWS data management ecosystem. In this post, I explain how I was able to build a robust and scalable data warehouse without the large team of experts typically needed.
In two of my recent projects, I ran into challenges when scaling our data warehouse using on-premises infrastructure. Data was growing at many tens of gigabytes per day, and query performance was suffering. Scaling required major capital investment for hardware and software licenses, and also significant operational costs for maintenance and technical staff to keep it running and performing well. Unfortunately, I couldn’t get the resources needed to scale the infrastructure with data growth, and these projects were abandoned. Thanks to cloud data warehousing, the bottleneck of infrastructure resources, capital expense, and operational costs have been significantly reduced or have totally gone away. There is no more excuse for allowing obstacles of the past to delay delivering timely insights to decision makers, no matter how much data you have.
With Amazon Redshift and AWS, I delivered a cloud data warehouse to the business very quickly, and with a small team: me. I didn’t have to order hardware or software, and I no longer needed to install, configure, tune, or keep up with patches and version updates. Instead, I easily set up a robust data processing pipeline and we were quickly ingesting and analyzing data. Now, my data warehouse team can be extremely lean, and focus more time on bringing in new data and delivering insights. In this post, I show you the AWS services and the architecture that I used.
Handling data feeds
I have several different data sources that provide everything needed to run the business. The data includes activity from our iGaming platform, social media posts, clickstream data, marketing and campaign performance, and customer support engagements.
To handle the diversity of data feeds, I developed abstract integration applications using Docker that run on Amazon EC2 Container Service (Amazon ECS) and feed data to Amazon Kinesis Data Streams. These data streams can be used for real time analytics. In my system, each record in Kinesis is preprocessed by an AWS Lambda function to cleanse and aggregate information. My system then routes it to be stored where I need on Amazon S3 by Amazon Kinesis Data Firehose. Suppose that you used an on-premises architecture to accomplish the same task. A team of data engineers would be required to maintain and monitor a Kafka cluster, develop applications to stream data, and maintain a Hadoop cluster and the infrastructure underneath it for data storage. With my stream processing architecture, there are no servers to manage, no disk drives to replace, and no service monitoring to write.
Setting up a Kinesis stream can be done with a few clicks, and the same for Kinesis Firehose. Firehose can be configured to automatically consume data from a Kinesis Data Stream, and then write compressed data every N minutes to Amazon S3. When I want to process a Kinesis data stream, it’s very easy to set up a Lambda function to be executed on each message received. I can just set a trigger from the AWS Lambda Management Console, as shown following.
Regardless of the format I receive the data from our partners, I can send it to Kinesis as JSON data using my own formatters. After Firehose writes this to Amazon S3, I have everything in nearly the same structure I received but compressed, encrypted, and optimized for reading.
This data is automatically crawled by AWS Glue and placed into the AWS Glue Data Catalog. This means that I can immediately query the data directly on S3 using Amazon Athena or through Amazon Redshift Spectrum. Previously, I used Amazon EMR and an Amazon RDS–based metastore in Apache Hive for catalog management. Now I can avoid the complexity of maintaining Hive Metastore catalogs. Glue takes care of high availability and the operations side so that I know that end users can always be productive.
Working with Amazon Athena and Amazon Redshift for analysis
I found Amazon Athena extremely useful out of the box for ad hoc analysis. Our engineers (me) use Athena to understand new datasets that we receive and to understand what transformations will be needed for long-term query efficiency.
For our data analysts and data scientists, we’ve selected Amazon Redshift. Amazon Redshift has proven to be the right tool for us over and over again. It easily processes 20+ million transactions per day, regardless of the footprint of the tables and the type of analytics required by the business. Latency is low and query performance expectations have been more than met. We use Redshift Spectrum for long-term data retention, which enables me to extend the analytic power of Amazon Redshift beyond local data to anything stored in S3, and without requiring me to load any data. Redshift Spectrum gives me the freedom to store data where I want, in the format I want, and have it available for processing when I need it.
To load data directly into Amazon Redshift, I use AWS Data Pipeline to orchestrate data workflows. I create Amazon EMR clusters on an intra-day basis, which I can easily adjust to run more or less frequently as needed throughout the day. EMR clusters are used together with Amazon RDS, Apache Spark 2.0, and S3 storage. The data pipeline application loads ETL configurations from Spring RESTful services hosted on AWS Elastic Beanstalk. The application then loads data from S3 into memory, aggregates and cleans the data, and then writes the final version of the data to Amazon Redshift. This data is then ready to use for analysis. Spark on EMR also helps with recommendations and personalization use cases for various business users, and I find this easy to set up and deliver what users want. Finally, business users use Amazon QuickSight for self-service BI to slice, dice, and visualize the data depending on their requirements.
Each AWS service in this architecture plays its part in saving precious time that’s crucial for delivery and getting different departments in the business on board. I found the services easy to set up and use, and all have proven to be highly reliable for our use as our production environments. When the architecture was in place, scaling out was either completely handled by the service, or a matter of a simple API call, and crucially doesn’t require me to change one line of code. Increasing shards for Kinesis can be done in a minute by editing a stream. Increasing capacity for Lambda functions can be accomplished by editing the megabytes allocated for processing, and concurrency is handled automatically. EMR cluster capacity can easily be increased by changing the master and slave node types in Data Pipeline, or by using Auto Scaling. Lastly, RDS and Amazon Redshift can be easily upgraded without any major tasks to be performed by our team (again, me).
In the end, using AWS services including Kinesis, Lambda, Data Pipeline, and Amazon Redshift allows me to keep my team lean and highly productive. I eliminated the cost and delays of capital infrastructure, as well as the late night and weekend calls for support. I can now give maximum value to the business while keeping operational costs down. My team pushed out an agile and highly responsive data warehouse solution in record time and we can handle changing business requirements rapidly, and quickly adapt to new data and new user requests.
Stephen Borg is the Head of Big Data and BI at Cerberus Technologies. He has a background in platform software engineering, and first became involved in data warehousing using the typical RDBMS, SQL, ETL, and BI tools. He quickly became passionate about providing insight to help others optimize the business and add personalization to products. He is now the Head of Big Data and BI at Cerberus Technologies.
One of the technology areas I thoroughly enjoy is the Internet of Things (IoT). Even as a child I used to infuriate my parents by taking apart the toys they would purchase for me to see how they worked and if I could somehow put them back together. It seems somehow I was destined to end up the tough and ever-changing world of technology. Therefore, it’s no wonder that I am really enjoying learning and tinkering with IoT devices and technologies. It combines my love of development and software engineering with my curiosity around circuits, controllers, and other facets of the electrical engineering discipline; even though an electrical engineer I can not claim to be.
Despite all of the information that is collected by the deployment of IoT devices and solutions, I honestly never really thought about the need to analyze, search, and process this data until I came up against a scenario where it became of the utmost importance to be able to search and query through loads of sensory data for an anomaly occurrence. Of course, I understood the importance of analytics for businesses to make accurate decisions and predictions to drive the organization’s direction. But it didn’t occur to me initially, how important it was to make analytics an integral part of my IoT solutions. Well, I learned my lesson just in time because this re:Invent a service is launching to make it easier for anyone to process and analyze IoT messages and device data.
Hello, AWS IoT Analytics! AWS IoT Analytics is a fully managed service of AWS IoT that provides advanced data analysis of data collected from your IoT devices. With the AWS IoT Analytics service, you can process messages, gather and store large amounts of device data, as well as, query your data. Also, the new AWS IoTAnalytics service feature integrates with Amazon Quicksight for visualization of your data and brings the power of machine learning through integration with Jupyter Notebooks.
Benefits of AWS IoT Analytics
Helps with predictive analysis of data by providing access to pre-built analytical functions
Provides ability to visualize analytical output from service
Provides tools to clean up data
Can help identify patterns in the gathered data
Be In the Know: IoT Analytics Concepts
Channel: archives the raw, unprocessed messages and collects data from MQTT topics.
Pipeline: consumes messages from channels and allows message processing.
Activities: perform transformations on your messages including filtering attributes and invoking lambda functions advanced processing.
Data Store: Used as a queryable repository for processed messages. Provide ability to have multiple datastores for messages coming from different devices or locations or filtered by message attributes.
Data Set: Data retrieval view from a data store, can be generated by a recurring schedule.
Getting Started with AWS IoT Analytics
First, I’ll create a channel to receive incoming messages. This channel can be used to ingest data sent to the channel via MQTT or messages directed from the Rules Engine. To create a channel, I’ll select the Channels menu option and then click the Create a channel button.
I’ll name my channel, TaraIoTAnalyticsID and give the Channel a MQTT topic filter of Temperature. To complete the creation of my channel, I will click the Create Channel button.
Now that I have my Channel created, I need to create a Data Store to receive and store the messages received on the Channel from my IoT device. Remember you can set up multiple Data Stores for more complex solution needs, but I’ll just create one Data Store for my example. I’ll select Data Stores from menu panel and click Create a data store.
I’ll name my Data Store, TaraDataStoreID, and once I click the Create the data store button and I would have successfully set up a Data Store to house messages coming from my Channel.
Now that I have my Channel and my Data Store, I will need to connect the two using a Pipeline. I’ll create a simple pipeline that just connects my Channel and Data Store, but you can create a more robust pipeline to process and filter messages by adding Pipeline activities like a Lambda activity.
To create a pipeline, I’ll select the Pipelines menu option and then click the Create a pipeline button.
I will not add an Attribute for this pipeline. So I will click Next button.
As we discussed there are additional pipeline activities that I can add to my pipeline for the processing and transformation of messages but I will keep my first pipeline simple and hit the Next button.
The final step in creating my pipeline is for me to select my previously created Data Store and click Create Pipeline.
All that is left for me to take advantage of the AWS IoT Analytics service is to create an IoT rule that sends data to an AWS IoT Analytics channel. Wow, that was a super easy process to set up analytics for IoT devices.
If I wanted to create a Data Set as a result of queries run against my data for visualization with Amazon Quicksight or integrate with Jupyter Notebooks to perform more advanced analytical functions, I can choose the Analyze menu option to bring up the screens to create data sets and access the Juypter Notebook instances.
As you can see, it was a very simple process to set up the advanced data analysis for AWS IoT. With AWS IoT Analytics, you have the ability to collect, visualize, process, query and store large amounts of data generated from your AWS IoT connected device. Additionally, you can access the AWS IoT Analytics service in a myriad of different ways; the AWS Command Line Interface (AWS CLI), the AWS IoT API, language-specific AWS SDKs, and AWS IoT Device SDKs.
AWS IoT Analytics is available today for you to dig into the analysis of your IoT data. To learn more about AWS IoT and AWS IoT Analytics go to the AWS IoT Analytics product page and/or the AWS IoT documentation.
Good developers are good problem solvers. They turn each task into a series of problems they have to solve. They don’t necessarily know how to solve them in advance, but they have their toolbox of approaches, shortcuts and other tricks that lead to the solution. I have outlined one such set of steps for identifying problems, but you can’t easily formalize the problem-solving approach.
But is really turning a task into a set of problems a good idea? Programming can be seen as a creative exercise, rather than a problem solving one -- you think, you ponder, you deliberate, then you make something out of nothing and it’s beautiful, because it works. And sometimes programming is that, but that is almost always interrupted by a series of problems that stop you from getting the task completed. That process is best visualized with the following short video:
That’s because most things in software break. They either break because there are unknowns, or because of a lot of unsuspected edge cases, or because the abstraction that we use leaks, or because the tools that we use are poorly documented or have poor APIs/UIs, or simply because of bugs. Or in many cases -- all of the above.
So inevitably, we have to learn to solve problems. And solving them quickly and properly is in fact, one might argue, the most important skill when doing software. One should learn, though, not to just patch things up with duct tape, but to come up with the best possible solution with the constraints at hand. The library that you are using is missing a feature you really need? Ideally, you should propose the feature and wait for it to be implemented. Too often that’s not an option. Quick and dirty fix -- copy-paste a bunch of code. Proper, elegant solution -- use design patterns to adapt the library to your needs, or come up with a generic (but not time-wasting) way of patching libraries. Or there’s a memory leak? Just launch a bigger instance? No. Spend a week live-profiling the application? To slow. Figure out how to simulate the leaking scenario in a local setup and fix it in a day? Sounds ideal, but it’s not trivial.
Sometimes there are not too many problems and development goes smoothly. Then the good problem solver identifies problems proactively -- this implementation is slow, this is too memory-consuming, this is overcomplicated and should be refactored. And these can (and should) be small steps that don’t interfere with the development process, leaving you 2 days in deep refactoring for no apparent reason. The skill is to know the limit between gradual improvement and spotting problems before they occur, and wasting time in problems that don’t exist or you’ll never hit.
And finally, solving problems is not a solo exercise. In fact I think one of the most important aspects of problem solving is answering questions. If you want to be a good developer, you have to answer the questions of others. Your colleagues in most cases, but sometimes -- total strangers on Stackoverflow. I myself found that answering stackoverflow questions actually turned me into a better problem solver -- I could solve others problems in a limited time, with limited information. So in many case I was the go-to person on the team when a problem arises, even though I wasn’t the most senior or the most familiar with the project. But one could reasonably expect that I’ll be able to figure out a proper solution quickly. And then the loop goes on -- you answer more questions and get better at problem solving, and so on, and so forth. By the way, we shouldn’t assume we are good unless we are able to solver others’ problems in addition to ours.
Problem-solving is a transferable skill. We might not be developers forever, but our approach to problems, the tenacity in fixing them, and the determination to get things done properly, is useful in many contexts. You could, in fact, view each task, not just programming ones, as a problem-solving exercise. And having the confidence that you can fix it, even though you have never encountered it before, is often priceless.
What’s my ultimate point? We should see ourselves as problem solvers and constantly improve our problem solving toolbox. Which, among other things, includes helping others. Otherwise we are tied to our knowledge of a particular technology or stack, and that’s frankly boring.
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.