Tag Archives: Technology

Querying a Decade of Drive Stats Data

Post Syndicated from Pat Patterson original https://www.backblaze.com/blog/querying-a-decade-of-drive-stats-data/

Last week, we published Backblaze Drive Stats for Q3 2022, sharing the metrics we’ve gathered on our fleet of over 230,000 hard drives. In this blog post, I’ll explain how we’re now using the Trino open source SQL query engine to ensure the integrity of Drive Stats data, and how we plan to use Trino in the future to generate the Drive Stats result set for publication.

Converting Zipped CSV Files into Parquet

In his blog post Storing and Querying Analytical Data in Backblaze B2, my colleague Greg Hamer explained how we started using Trino to analyze Drive Stats data earlier this year. We quickly discovered that formatting the data set as Apache Parquet minimized the amount of data that Trino needed to download from Backblaze B2 Cloud Storage to process queries, resulting in a dramatic improvement in query performance over the original CSV-formatted data.

As Greg mentioned in the earlier post, Drive Stats data is published quarterly to Backblaze B2 as a single .zip file containing a CSV file for each day of the quarter. Each CSV file contains a record for each drive that was operational on that day (see this list of the fields in each record).

When Greg and I started working with the Parquet-formatted Drive Stats data, we took a simple, but somewhat inefficient, approach to converting the data from zipped CSV to Parquet:

  • Download the existing zip files to local storage.
  • Unzip them.
  • Run a Python script to read the CSV files and write Parquet-formatted data back to local storage.
  • Upload the Parquet files to Backblaze B2.

We were keen to automate this process, so we reworked the script to use the Python ZipFile module to read the zipped CSV data directly from its Backblaze B2 Bucket and write Parquet back to another bucket. We’ve shared the script in this GitHub gist.
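If you’d rather not open the gist, here’s a minimal sketch of the approach, assuming boto3 against B2’s S3-compatible API and pandas with pyarrow for the Parquet output; the bucket names, endpoint, and object key below are placeholders rather than the values the actual script uses:

# Sketch only: stream a quarter's zipped CSV files out of one B2 bucket and
# write them back to another as Parquet. Requires boto3, pandas, and pyarrow;
# bucket names, endpoint, and keys are illustrative placeholders.
import io
import zipfile

import boto3
import pandas as pd

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.us-west-004.backblazeb2.com",  # your B2 endpoint
)

SOURCE_BUCKET = "drivestats-zipped-csv"   # placeholder: bucket holding the .zip files
DEST_BUCKET = "drivestats-parquet"        # placeholder: bucket for Parquet output

def convert_quarter(zip_key: str) -> None:
    # Each quarter is published as a single .zip object containing one CSV per day.
    body = s3.get_object(Bucket=SOURCE_BUCKET, Key=zip_key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        for name in archive.namelist():
            if not name.endswith(".csv"):
                continue
            # Convert one day's CSV to Parquet and upload the result.
            df = pd.read_csv(archive.open(name))
            buf = io.BytesIO()
            df.to_parquet(buf, index=False)
            s3.put_object(
                Bucket=DEST_BUCKET,
                Key=name.replace(".csv", ".parquet"),
                Body=buf.getvalue(),
            )

convert_quarter("data_Q3_2022.zip")  # placeholder key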

After running the script, the drivestats table now contains data up until the end of Q3 2022:

trino:ds> SELECT DISTINCT year, month, day 
FROM drivestats ORDER BY year DESC, month DESC, day DESC LIMIT 1;
year | month | day 
------+-------+-----
 2022 |     9 |  30 
(1 row)

In the last article, we were working with data running until the end of Q1 2022. On March 31, 2022, the Drive Stats dataset comprised 296 million records, and there were 211,732 drives in operation. Let’s see what the current situation is:

trino:ds> SELECT COUNT(*) FROM drivestats;
   _col0 
-----------
 346006813 
(1 row) 

trino:ds> SELECT COUNT(*) FROM drivestats 
    WHERE year = 2022 AND month = 9 AND day = 30;
   _col0 
--------
 230897 
(1 row)

So, since the end of March, we’ve added 50 million rows to the dataset, and Backblaze is now spinning nearly 231,000 drives—over 19,000 more than at the end of March 2022. Put another way, we’ve added more than 100 drives per day to the Backblaze Cloud Storage Platform in the past six months. Finally, how many exabytes of raw data storage does Backblaze now manage?

trino:ds> SELECT ROUND(SUM(CAST(capacity_bytes AS bigint))/1e+18, 2)
FROM drivestats WHERE year = 2022 AND month = 9 AND day = 30;
 _col0 
-------
  2.62 
(1 row)

Will we cross the three exabyte mark this year? Stay tuned to find out.

Ensuring the Integrity of Drive Stats Data

As Andy Klein, the Drive Stats supremo, collates each quarter’s data, he looks for instances of healthy drives being removed and then returned to service. This can happen for a variety of operational reasons, but it shows up in the data as the drive having failed, then later revived. This subset of data shows the phenomenon:

trino:ds> SELECT year, month, day, failure FROM drivestats WHERE 
serial_number = 'ZHZ4VLNV' AND year >= 2021 ORDER BY year, month, 
day;
 year | month | day | failure 
------+-------+-----+---------
...
 2021 |    12 |  26 |       0 
 2021 |    12 |  27 |       0 
 2021 |    12 |  28 |       0 
 2021 |    12 |  29 |       1 
 2022 |     1 |   3 |       0 
 2022 |     1 |   4 |       0 
 2022 |     1 |   5 |       0 
...

This drive appears to have failed on Dec 29, 2021, but was returned to service on Jan 3, 2022.

Since these spurious “failures” would skew the reliability statistics, Andy searches for and removes them from each quarter’s data. However, even Andy can’t see into the future, so, when a drive is taken offline at the end of one quarter and then returned to service in the next quarter, as in the above case, there is a bit of a manual process to find anomalies and clean up past data.

With the entire dataset in a single location, we can now write a SQL query to find drives that were removed, then returned to service, no matter when it occurred. Let’s build that query up in stages.

We start by finding the serial numbers and failure dates for each drive failure:

trino:ds> SELECT serial_number, DATE(FORMAT('%04d-%02d-%02d', year, 
month, day)) AS date 
FROM drivestats 
WHERE failure = 1;
  serial_number  |    date    
-----------------+------------
 ZHZ3KMX4        | 2021-04-01 
 ZA12RBBM        | 2021-04-01 
 S300Z52X        | 2017-03-01 
 Z3051FWK        | 2017-03-01 
 Z304JQAE        | 2017-03-02 
...
(17092 rows)

Now we find the most recent record for each drive:

trino:ds> SELECT serial_number, MAX(DATE(FORMAT('%04d-%02d-%02d', 
year, month, day))) AS date
    FROM drivestats 
    GROUP BY serial_number;
  serial_number   |    date    
------------------+------------
 ZHZ65F2W         | 2022-09-30 
 ZLW0GQ82         | 2022-09-30 
 ZLW0GQ86         | 2022-09-30 
 Z8A0A057F97G     | 2022-09-30 
 ZHZ62XAR         | 2022-09-30 
...
(329908 rows)

We then join the two result sets to find spurious failures; that is, failures where the drive was later returned to service. Note the join condition—we select records whose serial numbers match and where the most recent record is later than the failure:

trino:ds> SELECT f.serial_number, f.failure_date
FROM (
    SELECT serial_number, DATE(FORMAT('%04d-%02d-%02d', year, month, 
day)) AS failure_date
    FROM drivestats 
    WHERE failure = 1
) AS f
INNER JOIN (
    SELECT serial_number, MAX(DATE(FORMAT('%04d-%02d-%02d', year, 
month, day))) AS last_date
    FROM drivestats 
    GROUP BY serial_number
) AS l
ON f.serial_number = l.serial_number AND l.last_date > f.failure_date;
  serial_number  | failure_date 
-----------------+--------------
 2003261ED34D    | 2022-06-09 
 W300STQ5        | 2022-06-11 
 ZHZ61JMQ        | 2022-06-17 
 ZHZ4VL2P        | 2022-06-21 
 WD-WX31A2464044 | 2015-06-23 
(864 rows)

As you can see, the current schema makes date comparisons a little awkward, pointing the way to a schema optimization: adding a DATE-typed column alongside the existing year, month, and day columns. This kind of denormalization is common in analytical data.
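As a rough illustration of what that denormalization could look like at conversion time (a pandas sketch with toy records standing in for Drive Stats rows, not our production code), the new column can be assembled directly from the existing fields:

import pandas as pd

# Toy records standing in for Drive Stats rows (values are illustrative only).
df = pd.DataFrame({
    "serial_number": ["ZHZ4VLNV", "ZHZ65F2W"],
    "year": [2022, 2022],
    "month": [9, 9],
    "day": [30, 30],
    "failure": [0, 0],
})

# Assemble a single DATE-typed column from the existing year/month/day columns,
# so queries can compare dates directly instead of formatting strings.
df["date"] = pd.to_datetime(df[["year", "month", "day"]])
df.to_parquet("drivestats_with_date.parquet", index=False)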

Calculating the Quarterly Failure Rates

In calculating failure rates per drive model for each quarter, Andy loads the quarter’s data into MySQL and defines a set of views. We additionally define the current_quarter view to restrict the failure rate calculation to data in July, August, and September 2022:

CREATE VIEW current_quarter AS 
    SELECT * FROM drivestats
    WHERE year = 2022 AND month in (7, 8, 9);

CREATE VIEW drive_days AS 
    SELECT model, COUNT(*) AS drive_days 
    FROM current_quarter
    GROUP BY model;

CREATE VIEW failures AS
    SELECT model, COUNT(*) AS failures
    FROM current_quarter
    WHERE failure = 1
    GROUP BY model
UNION
    SELECT DISTINCT(model), 0 AS failures
    FROM current_quarter
    WHERE model NOT IN
    (
        SELECT model
        FROM current_quarter
        WHERE failure = 1
        GROUP BY model
    );

CREATE VIEW failure_rates AS
    SELECT drive_days.model AS model,
           drive_days.drive_days AS drive_days,
           failures.failures AS failures, 
           100.0 * (1.0 * failures) / (drive_days / 365.0) AS 
annual_failure_rate
    FROM drive_days, failures
    WHERE drive_days.model = failures.model;

Running the above statements in Trino, then querying the failure_rates view, yields a superset of the data that we published in the Q3 2022 Drive Stats report. The difference is that this result set includes drives that Andy excludes from the Drive Stats report: SSD boot drives, drives that were used for testing purposes, and drive models which did not have at least 60 drives in service:

trino:ds> SELECT * FROM failure_rates ORDER BY model;
        model         | drive_days | failures | annual_failure_rate 
----------------------+------------+----------+---------------------
 CT250MX500SSD1       |      32171 |        2 |                2.27 
 DELLBOSS VD          |      33706 |        0 |                0.00 
 HGST HDS5C4040ALE630 |       2389 |        0 |                0.00 
 HGST HDS724040ALE640 |         92 |        0 |                0.00 
 HGST HMS5C4040ALE640 |     341509 |        3 |                0.32 
 ...
 WDC WD60EFRX         |        276 |        0 |                0.00 
 WDC WDS250G2B0A      |       3867 |        0 |                0.00 
 WDC WUH721414ALE6L4  |     765990 |        5 |                0.24 
 WDC WUH721816ALE6L0  |     242954 |        0 |                0.00 
 WDC WUH721816ALE6L4  |     308630 |        6 |                0.71 
(74 rows)

Query 20221102_010612_00022_qscbi, FINISHED, 1 node
Splits: 139 total, 139 done (100.00%)
8.63 [82.4M rows, 5.29MB] [9.54M rows/s, 628KB/s]
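As a quick sanity check on the annual_failure_rate formula, take the first row above: 32,171 drive days is 32,171 / 365 ≈ 88.1 drive years, and 100 × 2 / 88.1 ≈ 2.27, matching the figure reported for the CT250MX500SSD1.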

Optimizing the Drive Stats Production Process

Now that we have shown that we can derive the required statistics by querying the Parquet-formatted data with Trino, we can streamline the Drive Stats process. Starting with the Q4 2022 report, rather than wrangling each quarter’s data with a mixture of tools on his laptop, Andy will use Trino to both clean up the raw data and produce the Drive Stats result set for publication.

Accessing the Drive Stats Parquet Dataset

When Greg and I started experimenting with Trino, our starting point was Brian Olsen’s Trino Getting Started GitHub repository, in particular, the Hive connector over MinIO file storage tutorial. Since MinIO and Backblaze B2 both have S3-compatible APIs, it was easy to adapt the tutorial’s configuration to target the Drive Stats data in Backblaze B2, and Brian was kind enough to accept my contribution of a new tutorial showing how to use the Hive connector over Backblaze B2 Cloud Storage. This tutorial will get you started using Trino with data stored in Backblaze B2 Buckets, and includes a section on accessing the Drive Stats dataset.

You might be interested to know that Backblaze is sponsoring this year’s Trino Summit, taking place virtually and in person in San Francisco, on November 10. Registration is free; if you do attend, come say hi to Greg and me at the Backblaze booth and see Trino in action, querying data stored in Backblaze B2.


A Survey of Causal Inference Applications at Netflix

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f

At Netflix, we want to entertain the world through creating engaging content and helping members discover the titles they will love. Key to that is understanding causal effects that connect changes we make in the product to indicators of member joy.

To measure causal effects we rely heavily on AB testing, but we also leverage quasi-experimentation in cases where AB testing is limited. Many scientists across Netflix have contributed to the way that Netflix analyzes these causal effects.

To celebrate that impact and learn from each other, Netflix scientists recently came together for an internal Causal Inference and Experimentation Summit. The weeklong conference brought speakers from across the content, product, and member experience teams to learn about methodological developments and applications in estimating causal effects. We covered a wide range of topics including difference-in-difference estimation, double machine learning, Bayesian AB testing, and causal inference in recommender systems among many others.

We are excited to share a sneak peek of the event with you in this blog post through selected examples of the talks, giving a behind the scenes look at our community and the breadth of causal inference at Netflix. We look forward to connecting with you through a future external event and additional blog posts!

Incremental Impact of Localization

Yinghong Lan, Vinod Bakthavachalam, Lavanya Sharan, Marie Douriez, Bahar Azarnoush, Mason Kroll

At Netflix, we are passionate about connecting our members with great stories that can come from anywhere, and be loved everywhere. In fact, we stream in more than 30 languages and 190 countries and strive to localize the content, through subtitles and dubs, that our members will enjoy the most. Understanding the heterogeneous incremental value of localization to member viewing is key to these efforts!

In order to estimate the incremental value of localization, we turned to causal inference methods using historical data. Running large scale, randomized experiments has both technical and operational challenges, especially because we want to avoid withholding localization from members who might need it to access the content they love.

Conceptual overview of using double machine learning to control for confounders and compare similar titles to estimate incremental impact of localization

We analyzed the data across various languages and applied double machine learning methods to properly control for measured confounders. We not only studied the impact of localization on overall title viewing but also investigated how localization adds value at different parts of the member journey. As a robustness check, we explored various simulations to evaluate the consistency and variance of our incrementality estimates. These insights have played a key role in our decisions to scale localization and delight our members around the world.
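For readers unfamiliar with the technique, here is a generic, self-contained sketch of the residual-on-residual (“partialling out”) form of double machine learning on simulated data; it is meant only to illustrate the idea and is not Netflix’s code, data, or model choice:

# Illustrative double machine learning on simulated data (not Netflix data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 5))                               # observed confounders
treatment = (X[:, 0] + rng.normal(size=n) > 0) * 1.0      # confounded treatment flag
outcome = 2.0 * treatment + X[:, 0] + rng.normal(size=n)  # true effect is 2.0

# Cross-fitted nuisance models: predict outcome and treatment from confounders.
outcome_hat = cross_val_predict(GradientBoostingRegressor(), X, outcome, cv=5)
treatment_hat = cross_val_predict(GradientBoostingRegressor(), X, treatment, cv=5)

# Regressing outcome residuals on treatment residuals recovers the effect.
effect = LinearRegression(fit_intercept=False).fit(
    (treatment - treatment_hat).reshape(-1, 1), outcome - outcome_hat
).coef_[0]
print(round(effect, 2))  # close to 2.0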

A related application of causal inference methods to localization arose when some dubs were delayed due to pandemic-related shutdowns of production studios. To understand the impact of these dub delays on title viewing, we simulated viewing in the absence of delays using the method of synthetic control. We compared simulated viewing to observed viewing at title launch (when dubs were missing) and after title launch (when dubs were added back).

To control for confounders, we used a placebo test to repeat the analysis for titles that were not affected by dub delays. In this way, we were able to estimate the incremental impact of delayed dub availability on member viewing for impacted titles. Should there be another shutdown of dub productions, this analysis enables our teams to make informed decisions about delays with greater confidence.

Holdback Experiments for Product Innovation

Travis Brooks, Cassiano Coria, Greg Nettles, Molly Jackman, Claire Lackner

At Netflix, there are many examples of holdback AB tests, which show some users an experience without a specific feature. They have substantially improved the member experience by measuring long term effects of new features or re-examining old assumptions. However, when the topic of holdback tests is raised, it can seem too complicated in terms of experimental design and/or engineering costs.

We aimed to share best practices we have learned about holdback test design and execution in order to create more clarity around holdback tests at Netflix, so they can be used more broadly across product innovation teams by:

  1. Defining the types of holdbacks and their use cases with past examples
  2. Suggesting future opportunities where holdback testing may be valuable
  3. Enumerating the challenges that holdback tests pose
  4. Identifying future investments that can reduce the cost of deploying and maintaining holdback tests for product and engineering teams

Holdback tests have clear value in many product areas to confirm learnings, understand long term effects, retest old assumptions on newer members, and measure cumulative value. They can also serve as a way to test simplifying the product by removing unused features, creating a more seamless user experience. In many areas at Netflix they are already commonly used for these purposes.

Overview of how holdback tests work where we keep the current experience for a subset of members over the long term in order to gain valuable insights for improving the product

We believe by unifying best practices and providing simpler tools, we can accelerate our learnings and create the best product experience for our members to access the content they love.

Causal Ranker: A Causal Adaptation Framework for Recommendation Models

Jeong-Yoon Lee, Sudeep Das

Most machine learning algorithms used in personalization and search, including deep learning algorithms, are purely associative. They learn from the correlations between features and outcomes how to best predict a target.

In many scenarios, going beyond the purely associative nature to understanding the causal mechanism between taking a certain action and the resulting incremental outcome becomes key to decision making. Causal inference gives us a principled way of learning such relationships, and when coupled with machine learning, becomes a powerful tool that can be leveraged at scale.

Compared to machine learning, causal inference allows us to build a robust framework that controls for confounders in order to estimate the true incremental impact to members

At Netflix, many surfaces today are powered by recommendation models like the personalized rows you see on your homepage. We believe that many of these surfaces can benefit from additional algorithms that focus on making each recommendation as useful to our members as possible, beyond just identifying the title or feature someone is most likely to engage with. Adding this new model on top of existing systems can help improve recommendations to those that are right in the moment, helping find the exact title members are looking to stream now.

This led us to create a framework that applies a light, causal adaptive layer on top of the base recommendation system called the Causal Ranker Framework. The framework consists of several components: impression (treatment) to play (outcome) attribution, true negative label collection, causal estimation, offline evaluation, and model serving.

We are building this framework in a generic way with reusable components so that any interested team within Netflix can adopt this framework for their use case, improving our recommendations throughout the product.

Bellmania: Incremental Account Lifetime Valuation at Netflix and its Applications

Reza Badri, Allen Tran

Understanding the value of acquiring or retaining subscribers is crucial for any subscription business like Netflix. While customer lifetime value (LTV) is commonly used to value members, simple measures of LTV likely overstate the true value of acquisition or retention because there is always a chance that potential members may join in the future on their own without any intervention.

We establish a methodology and necessary assumptions to estimate the monetary value of acquiring or retaining subscribers based on a causal interpretation of incremental LTV. This requires us to estimate both on-Netflix and off-Netflix LTV.

To overcome the lack of data for off-Netflix members, we use an approach based on Markov chains that recovers off-Netflix LTV from minimal data on non-subscriber transitions between being a subscriber and canceling over time.

Through Markov chains we can estimate the incremental value of a member and non member that appropriately captures the value of potential joins in the future
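As a toy illustration of that idea (the transition probabilities, revenue, and discount factor below are made-up numbers, not Netflix figures), a two-state Markov chain is enough to see why a non-member still has positive value, and why the incremental value of a member is therefore smaller than a naive LTV:

# Hypothetical two-state chain: 0 = subscriber, 1 = non-subscriber.
import numpy as np

P = np.array([[0.97, 0.03],      # monthly transition probabilities (made up)
              [0.02, 0.98]])
revenue = np.array([15.0, 0.0])  # monthly revenue earned in each state (made up)
gamma = 0.995                    # monthly discount factor

# Expected discounted lifetime value per state solves v = revenue + gamma * P @ v.
v = np.linalg.solve(np.eye(2) - gamma * P, revenue)
incremental_ltv = v[0] - v[1]    # value of a subscriber over a non-subscriber
print(v, incremental_ltv)

Because a non-subscriber still has some chance of joining on their own, v[1] is positive, and the incremental value v[0] - v[1] is smaller than the naive lifetime value v[0].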

Furthermore, we demonstrate how this methodology can be used to (1) forecast aggregate subscriber numbers that respect both addressable market constraints and account-level dynamics, (2) estimate the impact of price changes on revenue and subscription growth, and (3) provide optimal policies, such as price discounting, that maximize expected lifetime revenue of members.

Measuring causality is a large part of the data science culture at Netflix, and we are proud to have so many stunning colleagues leverage both experimentation and quasi-experimentation to drive member impact. The conference was a great way to celebrate each other’s work and highlight the ways in which causal methodology can create value for the business.

We look forward to sharing more about our work with the community in upcoming posts. To stay up to date on our work, follow the Netflix Tech Blog, and if you are interested in joining us, we are currently looking for new stunning colleagues to help us entertain the world!



Data Engineers of Netflix — Interview with Pallavi Phadnis

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/data-engineers-of-netflix-interview-with-pallavi-phadnis-a1fcc5f64906

Data Engineers of Netflix — Interview with Pallavi Phadnis

This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix.

Pallavi Phadnis is a Senior Software Engineer at Netflix.

Pallavi Phadnis is a Senior Software Engineer on the Product Data Science and Engineering team. In this post, Pallavi talks about her journey to Netflix and the challenges that keep the work interesting.

Pallavi received her master’s degree from Carnegie Mellon. Before joining Netflix, she worked in the advertising and e-commerce industries as a backend software engineer. In her free time, she enjoys watching Netflix and traveling.

Her favorite shows: Stranger Things, Gilmore Girls, and Breaking Bad.

Pallavi, what’s your journey to data engineering at Netflix?

Netflix’s unique work culture and petabyte-scale data problems are what drew me to Netflix.

During earlier years of my career, I primarily worked as a backend software engineer, designing and building the backend systems that enable big data analytics. I developed many batch and real-time data pipelines using open source technologies for AOL Advertising and eBay. I also built online serving systems and microservices powering Walmart’s e-commerce.

Those years of experience solving technical problems for various businesses have taught me that data has the power to maximize the potential of any product.

Before I joined Netflix, I was always fascinated by my experience as a Netflix member which left a great impression of Netflix engineering teams on me. When I read Netflix’s culture memo for the first time, I was impressed by how candid, direct and transparent the work culture sounded. These cultural points resonated with me most: freedom and responsibility, high bar for performance, and no hiring of brilliant jerks.

Over the years, I followed the big data open-source community and Netflix tech blogs closely, and learned a lot about Netflix’s innovative engineering solutions and active contributions to the open-source ecosystem. In 2017, I attended the Women in Big Data conference at Netflix and met with several amazing women in data engineering, including our VP. At this conference, I also got to know my current team: Consolidated Logging (CL).

CL provides an end-to-end solution for logging, processing, and analyzing user interactions on Netflix apps from all devices. It is critical for fast-paced product innovation at Netflix since CL provides foundational data for personalization, A/B experimentation, and performance analytics. Moreover, its petabyte scale also brings unique engineering challenges. The scope of work, business impact, and engineering challenges of CL were very exciting to me. Plus, the roles on the CL team require a blend of data engineering, software engineering, and distributed systems skills, which align really well with my interests and expertise.

What is your favorite project?

The project I am most proud of is building the Consolidated Logging V2 platform, which processes and enriches data at a massive scale (5 million+ events per sec at peak) in real time. I re-architected our legacy data pipelines and built a new platform on top of Apache Flink and Iceberg. The scale brought some interesting learnings and involved working closely with the Apache Flink and Kafka communities to fix core issues. Thanks to the migration to V2, we were able to improve data availability and usability for our consumers significantly. The efficiency achieved in processing and storage layers brought us big savings on computing and storage costs. You can learn more about it from my talk at the Flink Forward conference. Over the last couple of years, we have been continuously enhancing the V2 platform to support more logging use cases beyond Netflix streaming apps and provide further analytics capabilities. For instance, I am currently working on a project to build a common analytics solution for 100s of Netflix studio and internal apps.

How’s data engineering similar and different from software engineering?

The data engineering role at Netflix is similar to the software engineering role in many aspects. Both roles involve designing and developing large-scale solutions using various open-source technologies. In addition to the business logic, they need a good understanding of the framework internals and infrastructure in order to ensure production stability, for example, maintaining SLA to minimize the impact on the upstream and downstream applications. At Netflix, it is fairly common for data engineers and software engineers to collaborate on the same projects.

In addition, data engineers are responsible for designing data logging specifications and optimized data models to ensure that the desired business questions can be answered. Therefore, they have to understand both the product and the business use cases of the data in depth.

In other words, data engineers bridge the gap between data producers (such as client UI teams) and data consumers (such as data analysts and data scientists).

Learning more

Interested in learning more about data roles at Netflix? You’re in the right place! Keep an eye out for our open roles in Data Science and Engineering by visiting our jobs site here. Our culture is key to our impact and growth: read about it here. To learn more about our Data Engineers, check out our chats with Dhevi Rajendran, Samuel Setegne, and Kevin Wylie.



The Show Must Go On: Securing Netflix Studios At Scale

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/the-show-must-go-on-securing-netflix-studios-at-scale-19b801c86479

Written by Jose Fernandez, Arthur Gonigberg, Julia Knecht, and Patrick Thomas


In 2017, Netflix Studios was hitting an inflection point from a period of merely rapid growth to the sort of explosive growth that throws “how do we scale?” into every conversation. The vision was to create a “Studio in the Cloud”, with applications supporting every part of the business from pitch to play. The security team was working diligently to support this effort, faced with two apparently contradictory priorities:

  1. streamline any security processes so that we could get applications built and deployed to the public internet faster
  2. raise the overall security bar so that the accumulated risk of this giant and growing portfolio of newly internet-facing, high-sensitivity assets didn’t exceed its value

The journey to resolve that contradiction has been a collaboration that we’re proud of, and that we think exemplifies how Netflix approaches infrastructure product development and product security partnerships. You’ll hear from two teams here: first Application Security, and then Cloud Gateway.

Julia & Patrick (Netflix Application Security): In deciding how to address this, we focused on two observations. The first was that there were too many security things that each software team needed to think about — things like TLS certificates, authentication, security headers, request logging, rate limiting, among many others. There were security checklists for developers, but they were lengthy and mostly manual, neither of which contributed to the goal of accelerating development. Adding to the complexity, many of the checklist items themselves had a variety of different options to fulfill them (“new apps do this, but legacy apps do that”; “Java apps should use this approach, but Ruby apps should try one of these four things”… yes, there were flowcharts inside checklists. Ouch.). For development teams, just working through the flowcharts of requirements and options was a monumental task. Supporting developers through those checklists for edge cases, and then validating that each team’s choices resulted in an architecture with all the desired security properties, was similarly not scalable for our security engineers.

Our second observation centered on strong authentication as our highest-leverage control. Missing or incomplete authentication in an application was the most critical type of issue we regularly faced, while at the same time, an application that had a bulletproof authentication story was an application we considered to be lower risk. Concepts like Zero Trust, Beyond Corp, and Identity Aware Proxies all seemed to point the same way: there is powerful assurance in making 100% authentication a property of the architecture of the application rather than an implementation detail within an application.

With both of these observations in hand, we looked at the challenge through a lens that we have found incredibly valuable: how do we productize it? Netflix engineers talk a lot about the concept of a “Paved Road”. One especially attractive part of a Paved Road approach for security teams with a large portfolio is that it helps turn lots of questions into a boolean proposition: Instead of “Tell me how your app does this important security thing?”, it’s just “Are you using this paved road product that handles that?”. So, what would a product look like that could tackle most of the security checklist for a team, and that also could give us that architectural property of guaranteed authentication? With these lofty goals in mind, we turned to our central engineering teams to help get us there.

Partnering to Productize Security

Jose & Arthur (Netflix Cloud Gateway): The Cloud Gateway team develops and operates Netflix’s “Front Door”. Historically we have been responsible for connecting, routing, and steering internet traffic from Netflix subscribers to services in the cloud. Our gateways are powered by our flagship open-source technology Zuul. When Netflix Studios and our security partners approached us, the proposal was conceptually simple and a good fit for our modular, filter-based approach. To try it out, we deployed a custom Zuul build (which we named “API Wall” and eventually, more affectionately, “Wall-E”) with a new filter for Netflix’s Single-Sign-On provider, enabled it for all requests, and boom! — an application deployment strategy that guarantees authentication for services behind it.

Wall-E logical diagram showing a proxy with distinct filters

Killing the Checklist

Once we worked together to integrate our SSO with Wall-E, we had established a pretty exciting pattern of adding security requirements as filters. We thought back to our checklist through the lens of: which of these things are consistent enough across applications to add as a required filter? Our web application firewall (WAF), DDoS prevention, security header validation, and durable logging all fit the bill. One by one, we saw our checklists’ requirements bite the dust, and shift from ‘individual app developer-owned’ to ‘Wall-E owned’ (and consistently implemented!).

By this point, it was clear that we had achieved the vision in the AppSec team’s original request. We eventually were able to add so much security leverage into Wall-E that the bulk of the “going internet-facing” checklist for Studio applications boiled down to one item: Will you use Wall-E?

A small section of our go-external security questionnaire and checklist for studio apps before Wall-E and after Wall-E.

The Early Adopter Challenge

Wall-E’s early adopters were handpicked and nudged along by the Application Security team. Back then, the Cloud Gateway team had to work closely with application developers to provide a seamless migration without disrupting users. These joint efforts took several weeks for both parties. During our initial consultations, it was clear that developers preferred prioritizing product work over security or infrastructure improvements. Our meetings usually ended like this: “Security suggested we talk to you, and we like the idea of improving our security posture, but we have product goals to meet. Let’s talk again next quarter”. These conversations surfaced a couple of problems we knew we had to overcome to address this early adopter challenge:

  1. Setting up Wall-E for an application took too much time and effort, and the hands-on approach would not scale.
  2. Security improvements alone were not enough to drive organic adoption in Netflix’s “context not control” culture.

We were under pressure to improve our adoption numbers and decided to focus first on the setup friction by improving the developer experience and automating the onboarding process.

Scaling With Developer Experience

Developers in the Netflix streaming world compose the customer-facing Netflix experience out of hundreds of microservices, reachable by complex routing rules. On the Netflix Studio side, in Content Engineering, each team develops distinct products with simpler routing needs. To support that very different model, we did another thing that seemed simple at the time but has had an outsized impact over the years: we asked app teams to integrate with us by creating a version-controlled YAML file. Originally this was intended as a simplified and developer-friendly way to help collect domain names and some routing rules into a versionable package, but we quickly realized we had stumbled into a powerful model: we were harvesting developer intent.

An interactive Wall-E configuration wizard, and a concise declarative format for an application’s routing, resource, and authentication decisions

This small change was a kind of magic, and completely flipped our relationship with development teams: since we had a concise, standardized definition of the app they intended to expose, we could proactively automate a lot of the setup. Specify a domain name? Wall-E can ensure that it automagically exists, with DNS and TLS configured correctly. Iterating on this experience eventually led to other intent-based streamlining, like asking about intended user populations and related applications (to select OAuth configs and claims). We could now tell developers that setting up Wall-E would only take a few minutes and that our tooling would automate everything.

Going Faster, Faster

As all of these pieces came together, app teams outside Studio took notice. For a typical paved road application with no unusual security complications, a team could go from “git init” to a production-ready, fully authenticated, internet accessible application in a little less than 10 minutes. The automation of the infrastructure setup, combined with reducing risk enough to streamline security review saves developers days, if not weeks, on each application. Developers didn’t necessarily care that the original motivating factor was about security: what they saw in practice was that apps using Wall-E could get in front of users sooner, and iterate faster.

This created that virtuous cycle that core engineering product teams get incredibly excited about: more users make the amortized platform investment more valuable, but they also bring more ideas and clarity for feature ideas, which in turn attract more users. This set the tone for the next year of development, along two tracks: fixing adoption blockers, and turning more “developer intent” into product features to just handle things for them.

For adoption, both the security team and our team were asking the same question of developers: Is there anything that prevents you from using Wall-E? Each time we got an answer to that question, we tried to figure out how we could address it. Nearly all of the blockers related to systems in which (usually for historical reasons) some application team was solving both authentication and application routing in a custom way. Examples include legacy mTLS and various webhook schemes​. With Wall-E as a clear, durable, paved road choice, we finally had enough of a carrot to move these teams away from supporting unique, potentially risky features. The value proposition wasn’t just “let us help you migrate and you’ll only ever have to deal with incoming traffic that is already properly authenticated”, it was also “you can throw away the services and manual processes that handled your custom mechanisms and offload any responsibility for authentication, WAF integration and monitoring, and DDoS protection to the platform”. Overall, we cannot overstate the value of organizationally committing to a single paved road product to handle these kinds of concerns. It creates an amazing clarity and strategic pressure that helps align actual services that teams operate to the charters and expertise that define them. The difference between 2–4 “right-ish” ways and a single paved road one is powerful.

Also, with fewer exceptions and clearer criteria for apps that should adopt this paved road, our AppSec Engineering and User Focused Security Engineering (UFSE) teams could automate security guidance to give more appropriate automated nudges for adoption. Every leader’s security risk dashboard now includes a Wall-E adoption metric, and roughly ⅔ of recommended apps have chosen to adopt it. Wall-E now fronts over 350 applications, and is adding roughly 3 new production applications (mostly internet-facing) per week.

Automated guidance data, showing the percentage of applications recommended to use Wall-E which have taken it up. The jumpiness in the number of apps recommended for adoption is real: as adoption blockers were discovered then eventually solved, and as we standardized guidance across the company, our automated recommendations reflected these developments.

As adoption continued to increase, we looked at various signals of developer intent for good functionality to move from development-team-owned to platform-owned. One particularly pleasing example turned out to be UI hosting: it popped up over and over again as both an awkward exception to our “full authentication” goal, and also oftentimes the only thing that required Single Page App (SPA) UI teams to run actual cloud instances and have to be on-call for infrastructure. This eventually matured into an opinionated, declarative asset service that abstracts static file hosting for application teams: developers get fast static asset deployments, security gets strong guardrails around UI applications, and Netflix overall has fewer cloud instances to manage (and pay for!). Wall-E became a requirement for the best UI developer experience, and that drove even more adoption.

A productized approach also meant that we could efficiently enable lots of complex but “nice to have” features to enhance the developer experience, like Atlas metrics for free, and integration with our request tracing tool, Edgar.

From Product to Platform

You may have noticed a word sneak into the conversation up there… “platform”. Netflix has a Developer Productivity organization: teams dedicated to helping other developers be more effective. A big part of their work is this idea of harvesting developer intent and automating the necessary touchpoints across our systems. As these teams came to see Wall-E as the clear answer for many of their customers, they started integrating their tools to configure Wall-E from the even higher level developer intents they were harvesting. In effect, this moves authentication and traffic routing (and everything else that Wall-E handles) from being a specific product that developers need to think about and make a choice about, to just a fact that developers can trust and generally ignore. In 2019, essentially 100% of the Wall-E app configuration was done manually by developers. In 2021, that interaction has changed dramatically: now more than 50% of app configuration in Wall-E is done by automated tools (which are acting on higher-level abstractions on behalf of developers).

This scale and standardization again multiplies value: our internal risk quantification forecasts show compelling annualized savings in risk and incident response costs across the Wall-E portfolio. These applications have fewer, less severe, and less exploitable bugs compared to non-Wall-E apps, and we rarely need an urgent response from app owners (we call this not-getting-paged-at-midnight-as-a-service). Developer time saved on initial application setup and unneeded services additionally adds up on the order of team-months of productivity per year.

Looking back to the core need that started us down this road (“streamline any security processes […]” and “raise the overall security bar […]”), Wall-E’s evolution to being part of the platform cements and extends the initial success. Going forward, more and more apps and developers can benefit from these security assurances while needing to think less and less about them. It’s an outcome we’re quite proud of.

Let’s Do More Of That

To briefly recap, here’s a few of the things that we take away from this journey:

  • If you can do one thing to manage a large product security portfolio, do bulletproof authentication; preferably as a property of the architecture
  • Security teams and central engineering teams can and should have a collaborative, mutually supportive partnership
  • “Productizing” a capability (e.g., clearly articulated; defined value proposition; branded; measured), even for internal tools, is useful to drive adoption and find further value
  • A specific product makes the “paved road” clearer; a boolean “uses/doesn’t use” is strongly preferable to various options with subtle caveats
  • Hitch the security wagon to developer productivity
  • Harvesting intent is powerful; it lets many teams add value

What’s Next

We see incredible power in this kind of security/infrastructure partnership work, and we’re excited to leverage these wins into our next goal: to truly become an infrastructure-as-service provider by building a full-fledged Gateway API, thereby handing off ownership of the developer experience to our partner teams in the Developer Productivity organization. This will allow us to focus on the challenges that will come on our way to the next milestone: 1000 applications behind Wall-E.

If this kind of thing is exciting to you, we are hiring for both of these teams: Senior Software Engineer and Engineering Manager on Application Networking; and Senior Security Partner and Appsec Senior Software Engineer.

With special thanks to Cloud Gateway and InfoSec team members past and present, especially Sunil Agrawal, Mikey Cohen, Will Rose, Dilip Kancharla, our partners on Studio & Developer Productivity, and the early Wall-E adopters that provided valuable feedback and ideas. And also to Queen for the song references we slipped in; tell us if you find ’em all.



Growth Engineering at Netflix- Creating a Scalable Offers Platform

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/growth-engineering-at-netflix-creating-a-scalable-offers-platform-69330136dd87

by Eric Eiswerth

Background

Netflix has been offering streaming video-on-demand (SVOD) for over 10 years. Throughout that time we’ve primarily relied on 3 plans (Basic, Standard, & Premium), combined with the 30-day free trial to drive global customer acquisition. The world has changed a lot in this time. Competition for people’s leisure time has increased, the device ecosystem has grown phenomenally, and consumers want to watch premium content whenever they want, wherever they are, and on whatever device they prefer. We need to be constantly adapting and innovating as a result of this change.

The Growth Engineering team is responsible for executing growth initiatives that help us anticipate and adapt to this change. In particular, it’s our job to design and build the systems and protocols that enable customers from all over the world to sign up for Netflix with the plan features and incentives that best suit their needs. For more background on Growth Engineering and the signup funnel, please have a look at our previous blog post that covers the basics. Alternatively, here’s a quick review of what the typical user journey for a signup looks like:

Signup Funnel Dynamics

There are 3 steps in a basic Netflix signup. We refer to these steps that comprise a user journey as a signup flow. Each step of the flow serves a distinct purpose.

  1. Introduction and account creation
    Highlight our value propositions and begin the account creation process.
  2. Plans & offers
    Highlight the various types of Netflix plans, along with any potential offers.
  3. Payment
    Highlight the various payment options we have so customers can choose what suits their needs best.

The primary focus for the remainder of this post will be step 2: plans & offers. In particular, we’ll define plans and offers, review the legacy architecture and some of its shortcomings, and dig into our new architecture and some of its advantages.

Plans & Offers

Definitions

Let’s define what a plan and an offer are at Netflix. A plan is essentially a set of features with a price.

An offer is an incentive that typically involves a monetary discount or superior product features for a limited amount of time. Broadly speaking, an offer consists of one or more incentives and a set of attributes.

When we merge these two concepts together and present them to the customer, we have the plan selection page (shown above). Here, you can see that we have 3 plans and a 30-day free trial offer, regardless of which plan you choose. Let’s take a deeper look at the architecture, protocols, and systems involved.

Legacy Architecture

As previously mentioned, Netflix has had a relatively static set of plans and offers since the inception of streaming. As a result of this simple product offering, the architecture was also quite straightforward. It consisted of a small set of XML files that were loaded at runtime and stored in local memory. This was a perfectly sufficient design for many years. However, there are some downsides as the company continues to grow and the product continues to evolve. To name a few:

  • Updating XML files is error-prone and manual in nature.
  • A full deployment of the service is required whenever the XML files are updated.
  • Updating the XML files requires engaging domain experts from the backend engineering team that owns these files. This pulls them away from other business-critical work and can be a distraction.
  • A flat domain object structure that forced client-side logic to extract the relevant plan and offer information and render the UI. For example, consider the data structure for a 30-day free trial on the Basic plan.
{
  "offerId": 123,
  "planId": 111,
  "price": "$8.99",
  "hasSD": true,
  "hasHD": false,
  "hasFreeTrial": true,
  etc…
}
  • As the company matures and our product offering adapts to our global audience, all of the above issues are exacerbated further.

Below is a visual representation of the various systems involved in retrieving plan and offer data. Moving forward, we’ll refer to the combination of plan and offer data simply as SKU (Stock Keeping Unit) data.

New Architecture

If you recall from our previous blog post, Growth Engineering owns the business logic and protocols that allow our UI partners to build lightweight and flexible applications for almost any platform. This implies that the presentation layer should be void of any business logic and should simply be responsible for rendering data that is passed to it. In order to accomplish this we have designed a microservice architecture that emphasizes the Separation of Concerns design principle. Consider the updated system interaction diagram below:

There are 2 noteworthy changes that are worth discussing further. First, notice the presence of a dedicated SKU Eligibility Service. This service contains specialized business logic that used to be part of the Orchestration Service. By migrating this logic to a new microservice we simplify the Orchestration Service, clarify ownership over the domain, and unlock new use cases since it is now possible for other services not shown in this diagram to also consume eligible SKU data.

Second, notice that the SKU Service has been extended to a platform, which now leverages a rules engine and SKU catalog DB. This platform unlocks tremendous business value since product-oriented teams are now free to use the platform to experiment with different product offerings for our global audience, with little to no code changes required. This means that engineers can spend less time doing tedious work and more time designing creative solutions to better prepare us for future needs. Let’s take a deeper look at the role of each service involved in retrieving SKU data, starting from the visitor’s device and working our way down the stack.

Step 1 — Device sends a request for the plan selection page
As discussed in our previous Growth Engineering blog post, we use a custom JSON protocol between our client UIs and our middle-tier Orchestration Service. Here’s what this protocol might look like for a browser request to retrieve the plan selection page shown above:

GET /plans
{
  "flow": "browser",
  "mode": "planSelection"
}

As you can see, there are 2 critical pieces of information in this request:

  • Flow — The flow is a way to identify the platform. This allows the Orchestration Service to route the request to the appropriate platform-specific request handling logic.
  • Mode — This is essentially the name of the page being requested.

Given the flow and mode, the Orchestration Service can then process the request.

Step 2 — Request is routed to the Orchestration Service for processing
The Orchestration Service is responsible for validating upstream requests, orchestrating calls to downstream services, and composing JSON responses during a signup flow. For this particular request the Orchestration Service needs to retrieve the SKU data from the SKU Eligibility Service and build the JSON response that can be consumed by the UI layer.

The JSON response for this request might look something like the example below. Notice the difference in data structures from the legacy implementation. This new contextual representation facilitates greater reuse, as well as potentially supporting offers other than a 30-day free trial:

{
  "flow": "browser",
  "mode": "planSelection",
  "fields": {
    "skus": [
      {
        "id": 123,
        "incentives": ["FREE_TRIAL"],
        "plan": {
          "name": "Basic",
          "quality": "SD",
          "price": "$8.99",
          ...
        }
        ...
      },
      {
        "id": 456,
        "incentives": ["FREE_TRIAL"],
        "plan": {
          "name": "Standard",
          "quality": "HD",
          "price": "$13.99",
          ...
        }
        ...
      },
      {
        "id": 789,
        "incentives": ["FREE_TRIAL"],
        "plan": {
          "name": "Premium",
          "quality": "UHD",
          "price": "$17.99",
          ...
        }
        ...
      }
    ],
    "selectedSku": {
      "type": "Numeric",
      "value": 789
    },
    "nextAction": {
      "type": "Action",
      "withFields": [
        "selectedSku"
      ]
    }
  }
}

As you can see, the response contains a list of SKUs, the selected SKU, and an action. The action corresponds to the button on the page, and the withFields array specifies which fields the server expects to be sent back when the button is clicked.

Step 3 & 4 — Determine eligibility and retrieve eligible SKUs from SKU Eligibility Service
Netflix is a global company and we often have different SKUs in different regions. This means we need to distinguish between availability of SKUs and eligibility for SKUs. You can think of eligibility as something that is applied at the user level, while availability is at the country level. The SKU Platform contains the global set of SKUs and as a result, is said to control the availability of SKUs. Eligibility for SKUs is determined by the SKU Eligibility Service. This distinction creates clear ownership boundaries and enables the Growth Engineering team to focus on surfacing the correct SKUs for our visitors.

This centralization of eligibility logic in the SKU Eligibility Service also enables innovation in different parts of the product that have traditionally been ignored. Different services can now interface directly with the SKU Eligibility Service in order to retrieve SKU data.

Step 5 — Retrieve eligible SKUs from SKU Platform
The SKU Platform consists of a rules engine, a database, and application logic. The database contains the plans, prices and offers. The rules engine provides a means to extract available plans and offers when certain conditions within a rule match. Let’s consider a simple example where we attempt to retrieve offers in the US.
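To make that concrete, here is a rough, purely hypothetical sketch of the idea in Python; the rule shape, field names, and the second set of SKU IDs are invented for illustration and are not the actual SKU Platform implementation:

# Hypothetical sketch of matching customer context against SKU rules.
SKU_RULES = [
    {"conditions": {"country": "US", "eligible_for_free_trial": True},
     "skus": [123, 456, 789]},   # SKUs that bundle the free trial incentive
    {"conditions": {"country": "US"},
     "skus": [111, 222, 333]},   # plain plan SKUs without the incentive (invented IDs)
]

def match_skus(context: dict) -> list:
    """Return the SKUs of the first rule whose conditions all hold for the context."""
    for rule in SKU_RULES:
        if all(context.get(field) == value for field, value in rule["conditions"].items()):
            return rule["skus"]
    return []

# Eligibility is computed upstream (by the SKU Eligibility Service) and arrives
# as part of the customer context.
print(match_skus({"country": "US", "eligible_for_free_trial": True}))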

Keeping the Separation of Concerns in mind, notice that the SKU Platform has only one core responsibility. It is responsible for managing all Netflix SKUs. It provides access to these SKUs via a simple API that takes customer context and attempts to match it against the set of SKU rules. SKU eligibility is computed upstream and is treated just as any other condition would be in the SKU ruleset. By not coupling the concepts of eligibility and availability into a single service, we enable increased developer productivity since each team is able to focus on their core competencies and any change in eligibility does not affect the SKU Platform. One of the core tenets of a platform is the ability to support self-service. This negates the need to engage the backend domain experts for every desired change. The SKU Platform supports this via lightweight configuration changes to rules that do not require a full deployment. The next step is to invest further into self-service and support rule changes via a SKU UI. Stay tuned for more details on this, as well as more details on the internals of the new SKU Platform in one of our upcoming blog posts.

Conclusion

This work was a large cross-functional effort. We rebuilt our offers and plans from the ground up. It resulted in systems changes, as well as interaction changes between teams. Where there was once ambiguity, we now have clearly defined ownership over SKU availability and eligibility. We are now capable of introducing new plans and offers in various markets around the globe in order to meet our customers’ needs, with minimal engineering effort.

Let’s review some of the advantages the new architecture has over the legacy implementation. To name a few:

  • Domain objects that have a more reusable and extensible “shape”. This shape facilitates code reuse at the UI layer as well as the service layers.
  • A SKU Platform that enables product innovation with minimal engineering involvement. This means engineers can focus on more challenging and creative solutions for other problems. It also means fewer engineering teams are required to support initiatives in this space.
  • Configuration instead of code for updating SKU data, which improves innovation velocity.
  • Lower latency as a result of fewer service calls, which means fewer errors for our visitors.

The world is constantly changing. Device capabilities continue to improve. How, when, and where people want to be entertained continues to evolve. With these types of continued investments in infrastructure, the Growth Engineering team is able to build a solid foundation for future innovations that will allow us to continue to deliver the best possible experience for our members.

Join Growth Engineering and help us build the next generation of services that will allow the next 200 million subscribers to experience the joy of Netflix.


Growth Engineering at Netflix: Creating a Scalable Offers Platform was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

On blogging, and something more

Post Syndicated from Yovko Lambrev original https://yovko.net/za-blogarstvaneto/

On blogging, and something more

This text was prompted by a post from Ivan who, overcome by nostalgia for the good old days of blogging, nudged me to share some thoughts on the subject too. And, probably, it gave me a reason to write something here. Three months have passed since my last post. I probably won't write another before the end of this peculiar 2020, but I may use some of the days off around Christmas to reorganize my site and my online presence. Yet again. In other words, as long as I still see a point in doing it, I haven't given up for good.

Even though I'm fully aware that it can never be the way it used to be.

It's an interesting coincidence that somewhere around this time last year I finally killed my Facebook profile. And I didn't just deactivate it, I deleted it. With sincere and undisguised loathing for that sinister platform.

Of course, that doesn't mean I've been writing here any more often.

Blogging had more value when Google Reader was alive and was a platform for communication and dialogue: who is reading what, who is sharing what from the things they have read. Today that is still possible, in Inoreader for example. But it is much harder now to gather everyone in one place for that purpose (unless you are one of the hi-tech giants), and whether that is even a good idea is a separate topic. A distributed RSS reader with similar functionality would be the best possible solution.

RSS is, fortunately, still alive, but the pressure to kill off this wonderful protocol for sharing content is enormous. On one side, the big enemy is everyone who wants to tie users, data, and content to themselves; on the other, the carelessness of web developers and digital marketers who preach that RSS is dead and not worth the effort. More and more often, sites ship with their RSS feeds deliberately removed or disabled.

Without such things, blogs today are like a megaphone with the batteries taken out: little islands that still carry meaning and significance, but, as in the legend of Khan Kubrat, the strength of the sticks bound together into a bundle is gone.

In truth, the blame is shared. Because we slip, like children chasing something shiny, into every new plaything on the net without weighing its pros and cons, let alone the damage it could do. The internet is now a platform of greater social than technological significance, and an impact assessment ought to be a mandatory step in testing every new idea or project on the net. And not so much by whoever launches it, because the author is usually blinded by dreams and a thirst for fame (and often undisguised greed), but by the first users, and by those who follow.

We bloggers betrayed blogging, because "look how cool it is to act interesting in 140 characters on Twitter" or "how to become a mega-hyper-sexy influencer with Facebook and Instagram in three lame steps". Never mind that the social networks may amplify you (up to a point), but then they simply steal your traffic. And they turn into a middleman of dubious added value, one that always takes its cut.

That was really just the tip of the iceberg, because from the very beginning there was too much techno-optimism, in many directions and in many respects. And if that was justified, even useful, in the romantic years of building, what we need now is techno-realism and a critical eye so that we can fix what is broken.

The internet today is not what it was meant to be. The gravity of the giants has sucked everything in, and it keeps doing so. If we don't want to end up in a black hole, a collective effort to decentralize it is needed. That won't be easy, because this time the builders won't have many allies on the business side. A hard war with the giants and their influence lies ahead. Our only powerful ally is us, the people, if we manage to explain to one another why it is necessary to resist.

So far, we are not succeeding.

P.S. That is, psst, I'm tossing the ball over to Yasen and Maria.

Varta Portable PowerBank Slim

Post Syndicated from Yovko Lambrev original https://yovko.net/varta-portable-powerbank-slim/

As someone who uses a lot of tech gadgets, I constantly need a portable battery from which I can charge a phone or a tablet when I'm on the move, outdoors, working away from the office, or have simply forgotten to charge some device. The most critical thing is my smartphone, because when I'm out I also use it as a Wi-Fi access point that provides internet for my laptop or tablet. I avoid public Wi-Fi networks of any kind. That way of working drains the phone's battery fairly quickly, so being able to recharge on the go matters to me. Even if I never end up needing it, having a power bank nearby calms my spirit and my chakras. 🙂

For the past six or seven years I used a no-name Chinese lithium-ion (Li-ion) power bank with a 12000mAh capacity, and to this day I'm happy with it. It still works decently, even though it has lost part of its capacity, which is perfectly normal. And although I can still comfortably charge my phone twice with it, or my iPad at least partway once, six or seven years is a long time for a lithium-ion battery, so I recently started looking around for a new one. To the reasons for replacing it I should add its weight, which is by no means negligible, and the fact that its outputs are USB-A only, which was fine until a few years ago, but lately I increasingly need USB-C without extra cables or adapters. And the biggest problem with my old battery was that its maximum output current is just 2A, which makes charging two devices at once far too slow, and often ineffective if one of them is my tablet, which demands that much on its own.

And while I was mulling over which newer battery to get, Varta Bulgaria offered to let me test their Varta Portable PowerBank Slim, and specifically the 12000mAh model. I won't hide that I agreed immediately.

Varta PowerBank Slim 12000mAh and MacBook Air Space Gray

As the photo shows, its colour is almost identical to that of my MacBook (Space Gray), just a shade darker. What made me take an interest, though, were a few other things: first of all the decent capacity of 12000mAh (44Wh), then the metal casing (my old battery was plastic and I was very careful not to knock or crack it, which is not something lithium-ion cells forgive), and, given all that, the quite reasonable weight of 390 grams.

The Varta Portable PowerBank is actually a lithium-polymer battery, and everyone has probably heard that this is a newer technology than lithium-ion, but what matters are a few practical differences. The electrolyte in lithium-polymer batteries is a dry or gel-like substance, which reduces the risk of leakage if something goes wrong with the battery. Beyond that, lithium-polymer cells are more flexible and robust, and at the same time they can be made very thin. It's no accident that Varta chose lithium-polymer technology when designing their slim battery.

Classic lithium-ion batteries don't take knocks or overheating well, they age over time, and they carry a higher risk of leaking electrolyte, swelling, or even catching fire. That last one is very rare and is mostly down to corners cut in manufacturing rather than to the technology itself. But it is why it's important to buy batteries from a reputable manufacturer, even when they cost a little more.

Otherwise, despite its decent 12000mAh capacity, the Varta PowerBank Slim really is thin: a little over a centimetre. In exchange it is on the wide side. If you're after something closer to the size of a phone, or something that slips easily into a jeans pocket, the smaller 6000mAh model is the more sensible choice, as long as that capacity is enough (for an iPad it isn't).

Varta PowerBank Slim 12000mAh size

For the 12000mAh model, Varta claims it can charge a phone five times or a tablet twice. That's true, as long as the phone's battery is no bigger than 2200-2300mAh and the tablet's no bigger than 6000mAh. My iPhone's battery is 2700mAh and my iPad's is nearly 9000mAh, so with a fully topped-up Varta PowerBank Slim 12000mAh I can fully recharge my phone four times, or recharge it once and my iPad once. I've already tried it; that's how it worked out.

What makes me doubly happy is that the maximum output current of my Varta PowerBank Slim is 3A, enough to charge my phone and my tablet at the same time. The power bank has two outputs that deliver 5V: one is a classic USB-A port with up to 2.4A, the other a USB-C port with 3A. That last port can also be used to recharge the Varta PowerBank Slim itself, which is in fact the faster way to do it, given a decent charger and a USB-C cable. The battery also has a MicroUSB port intended only for charging it, and the USB-to-MicroUSB cable needed for that comes in the box.

Varta PowerBank Slim 12000mAh ports

The battery doesn't come with its own charger; the assumption is that everyone already has one. It's worth using a more powerful one, as it saves time. Varta says that at 1500mA through the MicroUSB port the battery recharges in about eight and a half hours. I haven't timed it with a stopwatch, but I can confirm it charges in about 8 hours (one night) with such a charger, and you can shave 2-3 hours off that by using a good USB-C charger (the one that comes with the newer MacBooks works too).

What do I miss? Nothing much, really. I'm tempted to grumble about the lack of PD/IQ (so I could top up my laptop as well), but the capacity wouldn't be enough for that anyway, and it's hardly fair to whine about such a thing when the retail price of the Varta Portable PowerBank 12000mAh Slim is just 80 leva for this excellent, and rather stylish, device that reliably saves me when there are no power outlets around. 🙂

Some PSAs for NUC owners

Post Syndicated from esr original http://esr.ibiblio.org/?p=8725

I’ve written before, in Contemplating the Cute Brick, that I’m a big fan of Intel’s NUC line of small-form-factor computers. Over the last week I’ve been having some unpleasant learning experiences around them. I’m still a fan, but I’m shipping this post where the search engines can see it in support of future NUC owners in trouble.

Two years ago I bought an NUC for my wife Cathy to replace her last tower-case PC – the NUC8i3BEH1. This model was semi-obsolete even then, but I didn’t want one of the newer i5 or i7 NUCs because I didn’t think it would fit my wife’s needs as well.

What my wife does with her computer doesn’t tax it much. Web browsing, office work, a bit of gaming that does not extend to recent AAA titles demanding the latest whizzy graphics card. I thought her needs would be best served by a small, quiet, low-power-consumption machine that was cheap enough to be considered readily disposable at the end of its service life. The exact opposite of my Great Beast…

The NUC was an experiment that made Cathy and me happy. She especially likes the fact that it’s small and light enough to be mounted on the back of her monitor, so it effectively takes up no desk space or floor area in her rather crowded office. I like the NUC’s industrial design and engineering – lots of nice little details like the four case screws being captive to the baseplate so you cannot lose them during disassembly.

Also. Dammit, NUCs are pretty. I say dammit because I feel like this shouldn’t matter to me and am a bit embarrassed to discover that it does. I like the color and shape and feel of these devices. Someone did an amazing job of making them unobtrusively attractive.

However…

Last week, Cathy registered a complaint that her NUC was making a funny noise. I went and listened and, alas, it was clearly the sound of the fan bearing in the NUC, screaming. That sound means you have worn or dirty bearing surfaces and the fan could fail at any time, forcing the device to shut down before it roasts its own components.

PSA #1: If you web-search for “NUC fan replacement”, you may well land at the website of a company specializing in NUC sales and support, named “Simply NUC”; I did. Do not buy from these people; they are lazy jerks.

First reason I know this: the “Fans” subpage in their Accessories section carries a link to exactly one model of fan. No indication of the range of NUC variants it matches, and not even a general warning that there are NUC models that require a different-sized fan. I had to find this out the hard way by pulling out the innards of Cathy’s NUC and sitting the fan I bought from Simply NUC next to it.

Two fans side by side

Second reason I know this: Simply NUC tech support was unhelpful, telling me they only carry that one fan and suggesting that I RMA Cathy’s machine back to Intel for repair, because obviously there could be no conceivable problem with it being out of service for an indefinite amount of time.

When I asked if Simply NUC knew of a source for a fan that would fit my 8i3BEH1 – a reasonable question, I think, to ask a company that loudly claims to be a one-stop shop for all NUC needs – the reply email told me I’d have to do “personal research” on that.

It turns out that if the useless drone who was Simply NUC "service" had cared about doing his actual job, he could have read the fan's model number off the image I had sent him, typed it into a search box, and found multiple sources within seconds, because that's what I then did. Of course this would have required caring that a customer was unhappy, which apparently they don't do at Simply NUC.

Third reason I know this: My request for a refund didn’t even get refused; it wasn’t even answered.

It actually took some work to get the NUC board and fan out of its case. I watched some YouTube videos purporting to illuminate the process; none of them quite matched the hardware I was looking at and none told me the One Weird Trick I actually needed to know. Therefore:

PSA #2: If you’ve taken out both hold-down screws and the board still seems mechanically locked in place, it may well be because the NUC case is designed like that. On some NUCs you need to flex the two case walls with connector ports outwards by about a millimeter on each side so the connectors will pop out of their exit holes. The case is made of thin, springy metal; thumb pressure will do it.

So now I’m waiting on a second replacement fan to arrive. But there is good news; while I had the thing disassembled I blew out all the dust I could see with a can of air, playing it liberally over the fan. And since I reassembled it, it hasn’t screamed once. So:

PSA #3: Your NUC fan noise problem might be solvable just by blowing out the moondust under and around the fan bearings.

We’ll see. If I’m feeling lazy when the new fan arrives, I’ll leave it in the parts drawer until and unless the one now in the NUC fails. If I’m feeling energetic, I’ll swap in the new one, then disassemble and thoroughly clean and oil the old one before putting it in the drawer.

SSDs: A gift and a curse

Post Syndicated from Laurie Denness original https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/

Artur Bergman, founder of a CDN exclusively powered by super fast SSDs, has made many compelling cases over the years to use them. He was definitely ahead of the curve here, but he’s right. Nowadays, they’re denser, 100x faster and as competitively priced as hard disks in most server configurations.

At Etsy, we’ve been trying to get on this bandwagon for the last 5 years too. It’s got a lot better value for money in the last year, so we’ve gone from “dipping our toes in the water” to “ORDER EVERYTHING WITH SSDs!” pretty rapidly.

This isn’t a post about how great SSDs are though: Seriously, they’re amazing. The new Dell R630 allows for 24x 960GB 1.8″ SSDs in a 1U chassis. That’s 19TB usable ludicrously fast, sub millisecond latency storage after RAID6, that will blow away anything you can get on spinning rust, use less power, and is actually reasonably priced per GB.

Picture of Dell R630, 24x 960GB SSDs in 1U chassis
Plus, they look amazing.

So if this post isn’t “GO BUY ALL THE SSDs NOW”, what is it? Well, it’s a cautionary tale that it’s not all unicorns and IOPs.

The problem(s) with SSDs

When SSDs first started to come out, people were concerned that these drives “only” handled a certain number of operations or data during their lifetime, and they’d be changing SSDs far more frequently than conventional spinning rust. Actually, that’s totally not the case and we haven’t experienced that at all. We have thousands of SSDs, and we’ve lost maybe one or two to old age, and it probably wasn’t wear related.

Spoiler alert: SSD firmware is buggy

When was the last time your hard disk failed because the firmware did something whacky? Well, Seagate had a pretty famous case back in 2009 where the drives may not ever power on again if you power them off. Whoops.

But the majority of times, the issue is the physical hardware… The infamous “spinning rust” that is in the drive.

So, SSDs solve this forever right? No moving parts.. Measured mean time to failure of hundreds of years before the memory wears out? Perfect!

Here’s the run down of the firmware issues we’ve had over 5 or so years:

Intel

Okay, bad start, we've actually had no issues with Intel. This seems to be common across other companies we've spoken to. We started putting single 160GB drives in our web servers about 4 years ago, because it gave us low power, fast, reliable storage and the space requirements for web servers and utility boxes were low anyway. No more waiting for the metal to seize up! We have SSDs that have long outlived the servers.

OCZ

Outside of the 160GB Intel drives, our search (Solr) stack was the first to benefit from denser, fast storage. Search indexes were getting big; too big for memory. In addition, getting them off disk and serving search results to users was limited by the random disk latency.

Rather than many expensive, relatively fast but low capacity spinning rust drives in a RAID array, we opted for OCZ Talos 960GB disks. These weren’t too bad; we had a spate of initial failures in what seemed like a bad batch, but we were able to learn from this and make the app more resilient to failures.

However, they had poor SMART info (none) so predicting failures was hard.

Unfortunately, the company later went bankrupt, and Toshiba rescued them from the dead. They were unavailable for long enough that we simply ditched them and moved on.

HP SSDs

We briefly tried running third party SSDs on our older (HP) Graphite boxes… This was a quick, fairly cheap win as it got us a tonne of performance for relatively little money (back then we needed much less Graphite storage). This worked fine until the drives started to fail.

Unfortunately, HP have proprietary RAID controllers, and they don't support SMART. Or rather, they refuse to talk to non-HP drives using off-the-shelf technology; they have their own methods.

Slot an unsupported disk or SSD into the controller, and you have no idea how that drive is performing or failing. We quickly learnt this after running for a while on these boxes, and performance randomly tanked. The SSDs underlying the RAID array seemed to be dying and slowing down, and we had no way of knowing which one (or ones), or how to fix it. Presumably the drives were not being issued TRIM commands either.

When we had to purchase a new box for our primary database this left us with no choice: We have to pay HP for SSDs. 960GB SSDs direct from HP, properly supported, cost us around $7000 each. Yes, each. We had to buy 4 of them to get the storage we needed.

On the upside, they do have fancy detailed stats (like wear levelling) exposed via the controller and ILO, and none have failed yet almost 3 years on (in fact, they’re all showing 99% health). You get what you pay for, luckily.

Samsung

Samsung saved the day and picked up from OCZ with a ludicrously cheap 960GB offering, the 840 EVO. A consumer drive, so very limited warranty, but for the price (~$400-500) you got great IOPS and they were reliable. They had better SMART info, and seemed to play nicely with our hardware.

We have a lot of these drives:

[~/chef-repo (master)] $ knife search node block_device_sda_model:'Samsung' -a block_device.sda.model

117 items found

That’s 117 hosts with those drives, most of them have 6 each, and doesn’t include hosts that have them behind RAID controllers (for example, our Graphite boxes). In particular, they’ve been awesome for our ELK logging cluster

Then BB6Q happened…

I hinted that we used these for Graphite. They worked great! Who wouldn’t want thousands and thousands of IOPs for relatively little money? Buying SSDs from OEMs is still expensive, and they give you those darn fancy “enterprise” level drives. Pfft. Redundancy at the app level, right?

We had started buying Dell, who use a rebranded LSI RAID controller so they happily talked to the drives including providing full SMART info. We had 16 of those Samsung drives behind the Dell controller giving us 7.3TB of super fast storage.

Given the already proven pattern, we ordered the same spec box for a Ganglia hardware refresh. And they didn't work. The RAID controller hung on startup trying to initialise the drives for so long that the Boot ROM was never loaded, so it was impossible to boot from an array created using them.

What had changed?! A quick

"MegaCli -AdpAllInfo -a0 | diff"

on the two boxes, revealed: The firmware on the drive had changed. (shout out to those of you who know the MegaCli parameters by heart now…)
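
The quoted one-liner is shorthand; to actually diff the two you need the output from both boxes side by side. Something along these lines does it (the hostnames are placeholders):

# Dump the controller and drive details from the working box and the broken one, then compare.
ssh ganglia-old 'MegaCli -AdpAllInfo -a0' > /tmp/adpinfo-old.txt
ssh ganglia-new 'MegaCli -AdpAllInfo -a0' > /tmp/adpinfo-new.txt
diff /tmp/adpinfo-old.txt /tmp/adpinfo-new.txt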

Weeks of debugging and back and forth with both Dell (who were very nice given these drives were unsupported) and Samsung revealed there were definitely firmware issues with this particular BB6Q release.

It was soon released publicly, that not only did this new firmware somehow break compatibility with Dell RAID controllers (by accident), but they also had a crippling performance bug… They got slower and slower over time, because they had messed up their block allocation algorithm.

In the end, behind LSI controllers, it was the controller sending particular ATA commands to the drives that would make them hang and not respond, so the RAID controller would have to wait for them to time out.

Samsung put out a firmware updater and “fixer” tool for this, but it needed to move your data around so only ran on Windows with NTFS.

With hundreds of these things that are in production and working, but have a crippling performance issue, we had to figure out how they would get flashed. An awesome contractor for Samsung agreed that if we drove over batches of drives (luckily, they are incredibly close to our datacenter) they would flash them and return them the next day.

This story has a relatively happy ending then; our drives are getting fixed, and we’re still buying their drives; now the 960GB 850 PRO model, as they remain a great value for money high performance drive.

Talking with other companies, we’re not alone with Samsung issues like this, even the 840 PRO has some issues that require hard power cycles to fix. But the price is hard to beat, especially now the 850 range is looking more solid.

LiteOn

LiteOn were famously known for making CD writers back when CD writers were new and exciting.

But they’re also a chosen OEM partner of Dell’s for their official “value” SSDs. Value is a relative term here, but they’re infinitely cheaper than HP’s offerings, enterprise level, fully supported and for all that, “only” twice the price of Samsung (~$940)

We decided to buy new SSD based database boxes, because SSDs were too hard to resist for these use cases; crazy performance and at 1TB capacity, not too much more expensive per GB than spinning rust. We had to buy many many 15,000rpm drives to even get near the performance, and they were expensive at 300GB capacity. We could spend a bit more money and save power, rack space, and get more disk space and IOPs.

For similar reasons to HP, we thought best to pay the premium for a fully supported solution, especially as Samsung had just caused all these issue with their firmware issues.

With that in mind, we ordered some R630s hot off the production line with 960GB LiteOns, tested performance, and it was great: 30,000 random write IOPs across 4 SSDs in RAID6 (5.5TB usable space).

We put them live, and they promptly blew up spectacularly. (Yes, we had a postmortem about this). The RAID controller claimed that two drives had died simultaneously, with another being reset by the adapter. Did we really get two disks to die at once?

This took months of working closely with Dell to figure out. We replaced drives, the backplane, and then the whole box, but the problem persisted. Just a few short hours of intense IO, especially on a box with only 4 SSDs, would cause them to flip out. And in the meantime, we'd ordered 50+ of these boxes with varying numbers of SSDs installed, since they had tested so well initially.

Eventually it transpires that, like most good problems, it was a combination of many factors that caused these issues. The SSDs were having extended garbage collection periods, exacerbated by a smaller amount of SSDs with higher IO, in RAID6. This caused the controller to kick the drive out of the array… and unfortunately due to the write levelling across the drives, at least two of them were garbage collecting at the same time, destroying the array integrity.

The fix was no small deal; Dell and LiteOn together identified and fixed weaknesses in their RAID controller, the backplane and the SSD firmware. It was great to see the companies working together rather than just pointing fingers here, and the fixes for all sizes except 960GB was out within a month.

The story here continues for us though; the 960GB drive remains unsolved, as it caused more issues, and we had almost exclusively purchased those. For systems that weren't fully loaded, Dell kindly provided us with 800GB replacements and extra drives to make up the space. For the rest, the stress is spread across 22 drives, so garbage collection isn't as intense and they remain in operation until a firmware fix arrives.

Summary

I’m hesitant to recommend any one particular brand, because I’m sure as with the hard disk phenomenon (Law where each person has their preferred brand that they’ve never had issues with but everyone else has), people’s experiences will have varied.

We should probably collect some real data on this as an industry and share it around; I've always been of the mindset that we're sometimes weirdly secretive about what hardware/software we use, but we should share, so if anyone wants to contribute, let me know.

But: you can probably continue to buy Intel and Samsung, depending on your use case/budget, and as usual, own your own availability and add resiliency to your apps and hardware, because things always fail in ways you can’t imagine.

systemd: Using ExecStop to depool nodes for fun and profit

Post Syndicated from Laurie Denness original https://laur.ie/blog/2015/02/systemd-using-execstop-to-depool-nodes-for-fun-and-profit/

Preface: There is a tonne of drama about systemd on the internets; it won't take you long to find it, if you're curious. Despite that, I'm largely a fan, focusing on all the cool stuff I can finally do as an ops person without basically re-writing crappy bash scripts for a living (cough, sys-v init).

Process Supervision

Without going into the basics about systemd too much (I quite enjoy this post as an intro), you tell systemd to run your executable using the “ExecStart” part of the config, and it will go and run that command and make sure it keeps running. Wonderful! In this case, we wanted to keep HHVM running all the time, so we told systemd to do it, in 3 lines. Waaaay easier than sys-v init.

ExecStop

By default when you tell systemd to stop a process, and you haven’t told it how to stop the process, it’s just going to gracefully kill the process and any other processes it spawned.

However, there is also the ExecStop configuration option that will be executed before systemd kills your processes, adding a new “deactivating” step to the process. It takes any executable name (or many) as an argument, so you can abuse this to do literally anything as cleanup before your processes get killed.

Systemd will also continue to do its regular killing of processes if, by the end of running your ExecStop script, the processes are not all dead.

Load balancer health checks

We have a load balancer that uses a bunch of health checks to ensure that the node that it’s asking to do work can actually still do work before it sends it there.

One of these is hitting an HTTP endpoint we set up, let’s call it “status.php” which just contains the text “Status:OK”. This way, if the server dies, or PHP breaks, or Apache breaks, that node will be automatically depooled and we don’t serve garbage to the user. Yay!

Example: automatic depooling using ExecStop

Armed with my new ExecStop super power, I realised we were able to let the load balancer know this node was no longer available before killing the process.

I wrote a simple bash script (sketched after the list below) that:

  • Moves the status.php file to status.php.disabled
  • Starts pinging the built in HHVM “load” endpoint (which tells you how many requests are in flight in HHVM) to see if the load has hit 0
  • if the curl to the “load” endpoint fails, we try again after 1 second
  • If we hit 30 seconds and the load isn’t 0 or we still can’t reach the endpoint, we just carry on anyway; something is wrong.
  • Once the load is “0”, we can continue
  • use `pidof` to kill the HHVM process
  • Move status.php.disabled back to status.php
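
A minimal sketch of such a script might look like the following (the paths and HHVM admin endpoint are assumptions for illustration; the author's actual version is gisted at the end of this post):

#!/bin/bash
# Hypothetical paths and HHVM admin endpoint; adjust for your environment.
STATUS=/var/www/status.php
LOAD_URL="http://localhost:9001/load"

# Depool: the load balancer health check now fails for this node.
mv "$STATUS" "$STATUS.disabled"

# Wait up to 30 seconds for in-flight requests to drain.
for i in $(seq 1 30); do
    load=$(curl -sf "$LOAD_URL")
    if [ "$load" = "0" ]; then
        echo "Load was 0 after $i seconds, now we can kill HHVM."
        break
    fi
    echo "Waiting another second (currently up to $i) because the load is still ${load:-unreachable}"
    sleep 1
done

echo "Killing HHVM"
kill $(pidof hhvm)

# Repool: put the health check page back.
mv "$STATUS.disabled" "$STATUS"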

And now, I can reference this in our HHVM systemd unit file:

[Unit]
Description=HHVM HipHop Virtual Machine (FCGI)

[Service]
Restart=always
ExecStart=/usr/bin/hhvm -c <snip>
ExecStop=/usr/local/bin/hhvm_stop.sh

Now when I call service hhvm stop, it takes 6-10 seconds for the stop to complete, because the traffic is gracefully removed.

Logging

Another thing I personally love about systemd is the increased visibility the operator gets into what's going on. In sys-v init, if you're lucky, someone put a "status" action in their bash script and it might tell you if the pid exists.

In systemd, you get a tonne of information about what’s going on; the processes that have been launched (including child processes), the PIDs, logs associated with that process, and in the case of something like Apache, the process can report information back:

Apache systemd status output showing requests per second

In this case, our ExecStop script output gets shown when you look at the status output of systemd:

[root@hhvm01 ~]# systemctl status hhvm -l
hhvm.service - HHVM HipHop Virtual Machine (FCGI)
   Loaded: loaded (/usr/lib/systemd/system/hhvm.service; enabled)
   Active: inactive (dead) since Tue 2015-02-17 22:00:52 UTC; 48s ago
  Process: 23889 ExecStop=/usr/local/bin/hhvm_stop.sh (code=exited, status=0/SUCCESS)
  Process: 37601 ExecStart=/usr/bin/hhvm <snip> (code=killed, signal=TERM)
  Main PID: 37601 (code=killed, signal=TERM)

Feb 17 22:00:45 hhvm01 hhvm_stop.sh[23889]: Moving status.php to status.php.disabled
Feb 17 22:00:47 hhvm01 hhvm_stop.sh[23889]: Waiting another second (currently up to 8) because the load is still 16
Feb 17 22:00:48 hhvm01 hhvm_stop.sh[23889]: Waiting another second (currently up to 9) because the load is still 10
Feb 17 22:00:49 hhvm01 hhvm_stop.sh[23889]: Waiting another second (currently up to 10) because the load is still 10
Feb 17 22:00:50 hhvm01 hhvm_stop.sh[23889]: Load was 0 after 11 seconds, now we can kill HHVM.
Feb 17 22:00:50 hhvm01 hhvm_stop.sh[23889]: Killing HHVM
Feb 17 22:00:52 hhvm01 hhvm_stop.sh[23889]: Flipping status.php.disabled to status.php
Feb 17 22:00:52 hhvm01 systemd[1]: Stopped HHVM HipHop Virtual Machine (FCGI).

Now all the information about what happened during the ExecStop process is captured for debugging later! No more having no idea what happened during the shut down.

When the script is in the process of running, the systemd status output will show as “deactivating” so you know it’s still ongoing.

 

Summary

This is just one example of how you might use/abuse the ExecStop to do work before killing processes. Whilst this was technically possible before, IMO the ease of use and the added introspection means this is actually feasible for production systems.

I’ve gisted a copy of the script here, if you want to steal it and modify it for your own use.

Why I’ll be letting Nagios live on a bit longer, thank you very much

Post Syndicated from Laurie Denness original https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/

My my, hasn’t @supersheep stirred up a bit of controversy over Nagios over the last week?

In case you missed it, he brought up an excellent topic that’s close my heart: Nagios. In his words, we should stop using it, so we can let it die. (I read about this in DevOpsWeekly, which you should absolutely sign up to if you haven’t already, it’s fantastic)

Mr Sheep (Andy) brought up some excellent points, and when I read them I must admit getting fairly triggered and angry that someone would speak about one of my favourite things in such a horrible way! Then maybe I started thinking I had a problem. Was I blindly in love with this thing? Naive to the alternatives, a fan boy? Do I need help? Luckily I could reach out to my wonderful coworkers, and @benjammingh was quick to confirm that yes, I do need help, but then again don’t we all. That’s a separate issue.

Anyway, the folks at reddit had plenty to say about this too. Some of the answers are sane, some are… Not so. Other people were seemingly very angry too. I don’t blame them.. It’s a bold move to stand up and say a perfectly good piece of software “sucks” and “you shouldn’t use it”. Which was the intention, of course, to make us talk about it.

Now the dust has settled slightly, I'm going to tell you why I still love Nagios, and why it will continue to be used at Etsy, addressing the points Andy brought up individually.

“Doesn’t scale at all”

Yeah, that Gearman thing freaks me out too. I don’t think I’d want to use it, even though we use Gearman extremely heavily at Etsy for the site (we even invited the creator in for our Code as Craft speaker series).

But what scale are people taking here? Is it really that hard?

We “only” have 10,000 checks in our primary datacenter, all active, usually on 2-3 minute check intervals with a bunch on 30 seconds. I’m honestly not sure if that’s impressive or embarrassing, but the machine is 80% idle, so it’s not like there isn’t overhead for more. And this isn’t a super-duper spec box by any means. In fact, it’s actually one of the oldest servers we have.

use_large_installation_tweaks

We had to enable use_large_installation_tweaks  to get the latency down, but that made absolutely no difference to our Nagios operation. Our check latency is currently 2.324 seconds.

I’m not sure how familiar people are with this flag… Our latency crept up to minutes without it, and it’s not massively well documented online that you can probably enable it with almost no effect to anything except… Making Nagios not suck quite so much.

It’s literally a “go faster” flag.

Disable CPU scaling

Our Nagios boxes are HP or Dell servers that by default have a "dynamic" CPU scaling setting enabled. Great for power saving, but for some reason the intelligence built into this system is absolutely horrible with Nagios. Because Nagios generates extremely high context-switch rates but relatively low CPU load, it causes a lot of problems for the intelligent management. If you're still having latency issues, set the server to "static high performance mode" or equivalent.

We’ve tested this in a bunch of other places, and the only other place it helped was syslog-ng. Normally it’s pretty smart, but there *are* a few cases that trip it up.

Horizontal Scaling

The reason we’ve ended up with 10,000 checks on that single box is because that datacenter is now full, and we’ve moved onto another one, so we’ve started scaling Nagios horizontally rather than vertically. It makes a lot more sense to have a Nagios instance in each network/datacenter location so you can get a “clean view” of what’s going on inside that datacenter rather than letting a network link show half the hosts as dead. If you lose cross-DC connectivity, how will you ever know what really happened in that DC when it comes back?

This does present some small annoyances, for example we needed to come up with a solution for aggregating status together into one place. We use Nagdash for that. It uses the nagios-api, which I’ll come onto more later. We also use nagios-api to allow us to downtime hosts quickly and easily via irccat in IRC, regardless of the datacenter.

We’ve done the same with Ganglia and FITB too, for the same reasons. Much easier to scale things by adding more boxes, once you get over the hurdles of managing multiple instances. As long as you’re using configuration management.

“Second most horrible configuration”

After sendmail. Fair enough… m4 anyone? Some people like it though, it’s called preference.

Anyway, those are some strong feelings. Ever used XML based configuration? ini files? Yaml? Hadoop? In *my opinion* they’re worse. Maybe you’re a fan.

Regardless, if you spend your day picking through Nagios config files, then you probably either love it anyway, you’re doing a huge rewrite of your old config, or you’re probably doing it wrong. You can easily automate this.

We came up with a pretty simple solution for the split NRPE/Nagios configs thing at Etsy: Stop worrying about the NRPE configs and put every check on every host. The entire directory is 3MB, and does it matter if you have a check on a system you never use? No. Now you only have one config to worry about.
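
In other words, one NRPE config shipped everywhere, with every command defined whether or not a given host uses it. Purely as an illustration, entries look like this (the paths and thresholds are examples, not our actual config):

command[check_disk_root]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_ntp_time]=/usr/lib64/nagios/plugins/check_ntp_time -H pool.ntp.org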

Andy acknowledges Chef/Puppet automation later where he calls using them to manage your Nagios configuration a "band aid". Is managing your Apache config a "band aid"? How about your resolv.conf? Depending on your philosophy, you could basically call configuration management in general a giant band aid. Is that a bad thing? No! That's what makes it awesome. Our job is tying together components to construct a functioning system, at many many levels. At the highest level, at Etsy we're here to make a shopping website. There are a bunch more systems tied together to make that possible lower down.

This is actually the Unix philosophy. Many small parts, applications that do a small specific thing, which you tie together using "|". A pipe. You pipe data into one application, and you manipulate it how you want on the way out. Which brings me onto:

“No programmatic interfaces”

At this point I am threatened with “If I catch you parsing status.dat I will beat your ass”. Bring it on!

We’re using the wonderful nagios-api project extremely heavily at Etsy because it provides a fantastic REST API for anything you’ve ever wanted in Nagios. And it does so by parsing status.dat. So sue me. Call me crazy, but isn’t parsing one machine readable output into another machine readable output basically computers? Where exactly is the issue in that?

Not only that, but it works really really well. We’ve contributed bits back to extend the functionality, and now our entire day to day workflow depends on it.

Would it be cool if it was built in? Maybe. Does it matter that it’s not? No. Again, pipes people. We’re using Chef as “echo” into Nagios, and then piping Nagios output into nagios-api for the output.

“Horrendous interface”

Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912. I wouldn’t argue if it was modernised slightly.

“Stupid wire format for clients”

I don’t think I’ve ever looked. Why are you looking? When was the last time NRPE broke? Maybe you have a good reason. I don’t.

“Throws away perfdata”

Again with the pipes! As Nagios logs this, we throw it into Splunk and Logstash. I admit we don’t bother doing much with it from there, as I like my graphs powered by something that was designed to graph, but a couple of times I’ve parsed the perfdata output in one of those two to get data I need.

All singing all dancing!

In the end though, I think the theme we're coming onto here is that Andy really wants a big monolithic thing to handle everything for him, whereas actually I'm a massive fan of using the right tool for the job. You can buy a clock radio that is also an iPod dock, mp3 player, torch, battery charger, and cheese grater, but it does all those things terribly.

For example, I don’t often need the perfdata because we have Ganglia for system level metrics, Graphite for our app level metrics, and we alert on data from both of those using Nagios.

In the end, Nagios is an extremely stable, extremely customisable piece of software, which does the job of scheduling and running shell scripts and then taking that and running other shell scripts to tell someone about it incredibly well. No it doesn’t do everything. Is that a bad thing?

Murphy said this excellently:

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

(As a side note, yes all of our Nagios instances monitor each other, no they’ve never crashed)

I will be honest; I haven’t used Sensu, because I’m in a happy place right now, but just the architectural diagram of how it works scares the shit out of me. When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable.

Your mileage may vary

The nice thing about this world is people have choices. You may read everything I just wrote and still think Nagios is rubbish. No problem!

Certainly for us, things are working out pretty great, so Nagios will be with us for some time (drama involving monitoring plugins aside…). When we’ve hit a limit, that’ll be the next thing out the window or re-worked. But for now, long live Nagios. And it’s far from being on life support.

And, the best thing is, that doesn’t even stop Andy making something awesome. Hell, if it’s really good, maybe we’ll use it and contribute to it. But declaring Nagios as dead isn’t going to help that effort, actually. It will just alienate people. But I’m sure there are many of you who are sick of it, so please, don’t let us stop you.

Follow me on Twitter: @lozzd

Easy image sharing on OS X using Scrup

Post Syndicated from Laurie Denness original https://laur.ie/blog/2012/03/easy-image-sharing-scrup/

Nowadays, people are really into this whole “Skitch” thing, and being able to send images/screenshots to each other quickly. I’d been doing the same thing with TinyGrab for a long time, but I like to host things myself. Yes, TinyGrab has the ability to upload to your own server… but it uses FTP. This was causing me no end of issues, so I sought out something else.

I found Scrup, and I’ve been using it for the last year or so very happily. It’s open source, and has been hanging around on Github for 2 years now. There are some pretty sweet forks of it, including one that has support for a sound on upload completion, and Growl notifications too.

What does Scrup do?

So, you need to share an image, or a screenshot really quickly? Using the standard OS X screenshot features (Command + Shift + 4, and so on), you can hit one button and upload the image to your webserver and put the link to it into your clipboard ready for pasting anywhere.

It does also have the ability to edit the screenshot pre-upload (such as adding arrows to point to important, or awesome things)

What you need

  • A server somewhere with some disk space
  • PHP 5
  • A webserver

How?

  1. Install the Scrup.app onto your Mac. I have pre-compiled a version with the sound and Growl patch included. You don't have to use mine, you can compile it yourself using Xcode if you wish. (The source is on github here: https://github.com/rsms/scrup)
  2. Create a folder on your webserver that you want to store your images in. I call mine “grb” (short for “grab”) because I like short URLs. (/var/www/grb/ -> http://laur.ie/grb/)
  3. In that folder, put a script that will receive the files, and then return the URL to where it stored the file. You can view the one I use here (modified from the one that ships with Scrup) which names files something like “1s-euobfpq1xcwos.png” and has no authentication (so make sure you go the security-by-obscurity route of naming the script something random or add auth yourself)
  4. Open Scrup, and point it at your upload script. For example, http://yourhost/grb/receiver.php?name={filename}
  5. Take screenshots! They should get uploaded and you should see a green tick in the menu bar. The URL of the uploaded image should also be in your clipboard, ready for pasting wherever.

The best thing about Scrup is that it has a simple, fast UI for just uploading things quickly, and because it uses a regular HTTP POST, it works on whatever weird internet connection you may be on.
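
If you want to sanity-check your receiver script before pointing Scrup at it, a plain HTTP POST from the command line will do (this assumes your receiver reads the raw request body; adjust if yours expects a multipart upload):

curl -s --data-binary @screenshot.png \
  "http://yourhost/grb/receiver.php?name=screenshot.png"
# A working receiver responds with the public URL of the uploaded image.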

 

timeout waiting for input from local during Draining Input

Post Syndicated from Laurie Denness original https://laur.ie/blog/2011/09/timeout-waiting-for-input-from-local-during-draining-input/

I experienced this today; very frustrating. Sendmail was completely locked up, outputting this bizarre "timeout waiting for input from local during Draining Input" error into the logs.

tl;dr: figure out what sendmail is waiting for.

In my case, it was stuck on procmail. But why? Turns out the local mailbox (a user that runs a lot of crons) had hit 3GB, at which point it didn’t seem to be accepting any more email into that inbox. Moving that file out of the way and allowing a new one to be created caused the queue to be flushed instantly.
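
If you hit the same wall, it's quick to check whether any local mailboxes have grown to a suspicious size (adjust the spool path for your distribution):

find /var/spool/mail -type f -size +1G -exec ls -lh {} \;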

PagerdutyPHP: Scripts for the Pagerduty API

Post Syndicated from Laurie Denness original https://laur.ie/blog/2011/08/pagerdutyphp-scripts-for-the-pagerduty-api/

As much as most of us would love to not have to do it, most people reading this now will have to be on call at some point. It sucks, but Pagerduty makes it a little easier to manage when your team starts to grow.

Whilst we still have Nagios sending to all contacts directly (a personal preference), we rely on Pagerduty for emergency pages from the rest of the company, and to arrange who is on call when (their calendar works well for us and allows for exceptions, etc.).

We’re also a user of the IRC bot “irccat” which, briefly explained, allows input/output to scripts from an IRC chat.

I had wanted to combine the two for a long time, and when Pagerduty released their API to access schedule data it wasn't long before we had a command that allows anyone in the company to ask irccat who is on call and until when.

I’ve finally got around to releasing this today, a “library” of useful Pagerduty API functions (pagerduty.php) (note currently it has just two, to see who is on call for a given schedule. Pull requests for additional useful functions please!) and more importantly, pagerdutycron.php – A script to run on an interval that will then either broadcast in IRC a new person on call, and/or send an email.

As usual, I’ve stuck the code on Github: https://github.com/lozzd/PagerdutyPHP

Hadoop and Ganglia 3.1

Post Syndicated from Laurie Denness original https://laur.ie/blog/2011/03/hadoop-and-ganglia-3-1/

A quick note to anyone setting up a new Hadoop cluster and hoping to quickly use the built in Ganglia metrics collection (which you should! If it moves, graph it!): This works out of the box with Ganglia 3.0, but the protocol changed with Ganglia 3.1.

The official GangliaMetrics page talks about this, and talks about patching (which is already available if you use the Cloudera releases) but doesn't go into more detail than that. I recently set up a new cluster, and remembered there was something I had to change in the default config to make it work out of the box… After inquiring (and finding the comment I left in my old config file!) I remembered: you must change the default class to have "31" (i.e. Ganglia 3.1) on the end.

For example, the default config file: (Replacing @GANGLIA@ with your multicast address)

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=@GANGLIA@:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=@GANGLIA@:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=@GANGLIA@:8649

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext
rpc.period=10
rpc.servers=@GANGLIA@:8649

Is changed to this:

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=@GANGLIA@:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=@GANGLIA@:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=@GANGLIA@:8649

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=@GANGLIA@:8649
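
If you'd rather not edit by hand, a one-liner over the stock metrics file achieves the same thing (the path below assumes a Cloudera-style install; the Apache tarball keeps it in conf/hadoop-metrics.properties):

sed -i 's/GangliaContext$/GangliaContext31/' /etc/hadoop/conf/hadoop-metrics.properties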

Restart the cluster, and the graphs will appear under each host in the Ganglia interface.

There is a LOT of detail in these graphs, with metrics ranging from DFS (things like bytes written, and how many operations were transferred from other nodes) to the JVM (monitor those heap memory sizes!)

This is probably old news to most people I’m sure, but I have a rule that if I didn’t find it within 30 minutes, maybe this will help someone in the same boat as me 🙂

Handy binaries for Thecus NAS boxes

Post Syndicated from Laurie Denness original https://laur.ie/blog/2010/11/handy-binaries-for-thecus-nas-boxes/

I recently took delivery of the rather splendid Thecus N5500 which I love; it’s the perfect mix between “it just works” and “oh, let’s stick SSH on there and poke around”. With 5 hot swap disk shelves, and 2TB hard drives you’ve got a serious amount of storage.

For your money you get a very nice little piece of hardware in a pretty nice shell (it strikes me as a touch tacky in places but then again it’s hardly going on show) with software that gets the job done. NFS, AFP, Samba, iSCSI, iTunes DAAP support, and plenty of modules to tickle your fancy (Logitech Squeezecenter, for instance).

But who am I kidding, I'm a sysadmin. 10 minutes after powering the thing on I was dying to log in using SSH so I could watch /proc/mdstat to see the RAID build. Luckily, the modules from the Thecus N5200 work fine, which means you're a couple of clicks away from a root terminal.

  1. Grab the SSH and SYSUSER N5200 modules, and unzip them (not doing so was a mistake I made; how embarrassing).
  2. Upload them using the webinterface, and enable them.
  3. SSH to the NAS box using the user “sys” and the password “sys”
  4. Enjoy your shell, and remember to run `passwd sys` to change the password to something else.

Now, you’ve got yourself a pretty handy, albeit it BusyBox-ridden install of Linux. The whole point of this post, is so I can pimp a few statically compiled binaries that might come in useful to you; they did to me anyway.

(You may wish to install the UTILITIES module, which gives you a proper version of top and ps, amongst other things, available here)

You can simply untar and drop the binaries into /raid/data/modules/bin folder so that they’re in your path, and stored on your disks rather than the flash units which are rather limited in space. By the way, these modules should also work fine on the Thecus N5200 NAS boxes too.

The binaries are available here: http://denness.net/thecus/binaries/
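
For example, to pull one straight down onto the NAS (the archive name below is a placeholder; check the directory listing for the real filenames):

cd /raid/data/modules/bin
wget http://denness.net/thecus/binaries/iftop.tar.gz
tar -xzf iftop.tar.gz
TERM=vt100 ./iftop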

The list includes (all the latest versions as of the date of this blog post):

  • ethtool, handy for network interface prodding
  • iftop, a very useful “GUI” app that shows incoming/outgoing network bandwidth (let’s face it, this is fun on a NAS. NOTE: you may need to execute this one using `TERM=vt100; iftop`)
  • iostat, for hard core disk stats porn. Run it with `iostat -mx 1` and watch the megabytes fly
  • rsync, particularly handy if you want to synchronise/backup data from one place to another, so particularly handy on a NAS.
  • vim, just in case you were planning on writing a lot of code on the Thecus 🙂
  • GNU screen, a nice place to store your terminals and detach and come back later. (NOTE: you may need to execute this one using `TERM=vt100; screen`)
  • The command line version of PHP, in case you were planning on writing any scripts in PHP to run on the Thecus.

Any suggestions/comments, let me know.