Tag Archives: Engineering

Improving how we deploy GitHub

Post Syndicated from Julian Nadeau original https://github.blog/2021-01-25-improving-how-we-deploy-github/

Over the last year, GitHub has doubled the number of developers contributing to the main GitHub.com application. While this seems like a purely positive thing on the surface, the 2x increase in folks contributing to the core software exposed problems in our tooling: tools that worked for us a year ago no longer functioned in the same capacity. While GitHub itself has been a fantastic vehicle to drive change for GitHub, our tooling and coordination have not enjoyed the same level of success. One of those areas was our deployments.

GitHub.com is deployed primarily through chatops, using a branch deploy model (we deploy branches before merging into the main branch). Developers could add changes to a queue, check the status of the queue, and organize groups of pull requests to be deployed and merged, all through chatops in a Slack room called #dotcom-ops. While this is a very simple system, it started to cause confusion when monitoring a deploy, because a single chat room managed the queue, the deploys, and many other tasks for many people at the same time. All of a sudden, a channel that was once a central part of a crucial information system was overwhelmed by the amount of information being pushed through it. Developers could no longer track their changes through the system, which resulted in reduced capacity for those developers and an increased risk profile for GitHub.

This is just one step of about a dozen spread across hundreds of messages; it is hard to keep track of and validate the state of a deploy.

The Build

At the beginning of this summer, we set out to completely revamp the way we monitored deploys for GitHub.com. With the problem in mind (namely, the multi-step, information-dense deploy process via chatops), we set out with a few main goals:

Simplify the Deploy for Developers

One of the main issues we experienced with the previous system was that deploys were tracked across a number of different messages within the Slack channel. This made it very difficult to piece together the different messages that made up a single deploy. Sometimes there could be a few hundred messages in between subsequent messages from the deploy system.

Second Canary Stage

The second main issue was that our canary stage only deployed to up to 2% of GitHub.com traffic. This meant that a whole slew of problems would never get caught in the canary stage before we rolled out to 100% of production; instead, we would have to open an incident and roll back. Availability and uptime are of the utmost concern to GitHub, so this risk became crucial to address as GitHub continued to grow. With this in mind, we set out to introduce a second canary stage at a higher percentage so that we could catch more issues in earlier stages and reduce the impact of future incidents.

At the end of this project, we had two canary stages. The first is at 2% and aims to capture the majority of issues. This low percentage keeps the risk profile at a tolerable level, as only a small amount of traffic would actually be impacted by an issue. A second canary stage was introduced at 20%, which lets us direct a much larger share of traffic to canary builds. This has a higher risk profile, but it is mitigated by the initial 2% canary stage and allows us to transition to the 100% production stage with less risk.

Automate

The last issue was that developers had to sit with the deploy and poke it along at every step by running independent chatops commands to queue up, deploy canary, and deploy production, making judgement calls every step of the way. This meant that developers were fumbling with chatops commands multiple times in every deployment, and often getting them wrong.

We asked ourselves, “What if there was one command and everything else just happened?” So we did just that: we automated the entire process.

The solution

We already had internal deploy software that was capable of tracking a single part of a deploy. This system had a proven track record and solid foundations, so we aimed to add the ability to automate the entire deploy sequence from a single chatops command, rather than running multiple commands per deploy, and to link these records together within the existing deployment infrastructure. Moreover, this new solution would provide an easy-to-use interface for a quick overview of any given deployment, rather than requiring developers to piece together multiple chatops messages.

We came up with two basic concepts:

  1. A deploy is made up of many stages (canary, production, etc.)
  2. There are gates between different stages that perform some sort of check to validate we can progress to the next stage

These concepts allowed us to model the entire system in a state-machine-like fashion:

This diagram shows an automatic progression between a 2% canary, a 20% canary, a production deploy, and a ready-to-merge stage, separated by automated 5-minute timers. A pointer automatically progresses across the data model as each timer gate completes and each stage is deployed.
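
To make the stage-and-gate model concrete, here is a minimal Go sketch of what such a data model might look like. The types, names, and timer gate below are illustrative assumptions, not GitHub's actual implementation.

package deploys

import "time"

// Gate is a check that must pass before the deploy can advance to the next
// stage. In the simplest case it is just a timer; other gates could query
// monitors or wait for human approval.
type Gate interface {
	Wait() error
}

// TimerGate holds the deploy for a fixed duration (for example, the 5-minute
// pauses between canary stages described above).
type TimerGate struct{ Duration time.Duration }

func (g TimerGate) Wait() error { time.Sleep(g.Duration); return nil }

// Stage is one step of the deploy (2% canary, 20% canary, production, merge).
type Stage struct {
	Name   string
	Deploy func() error // rolls the branch out to this stage
	Gate   Gate         // must pass before moving to the next stage
}

// Run walks a pointer across the stages, deploying each one and then waiting
// on its gate, which is roughly the automatic progression shown in the diagram.
func Run(stages []Stage) error {
	for _, s := range stages {
		if err := s.Deploy(); err != nil {
			return err // a failure here is where a rollback would kick in
		}
		if s.Gate != nil {
			if err := s.Gate.Wait(); err != nil {
				return err
			}
		}
	}
	return nil
}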

What resulted was a state-machine-backed deploy system with a first-party UI component, combined with the traditional chatops our developers were already familiar with. Below, you can see an overview of recent deploys to GitHub.com. Rather than tracking down various messages in a noisy Slack channel, you can go to a consolidated UI. You can see the state machine progression mentioned above in the overview on the right.

Drilling down into a specific deployment, you can see everything that has happened during a specific deployment. You can see that each stage of this deployment has an automatic 5-minute timer between them, and the ability to pause the deployment to give a developer more time to test. We also made sure that, in the case something is wrong, we have a quick way to rollback or revert the changes with the dropdown in the top right corner.

Finally, the entire system could be monitored and started from Slack – just like before. We found that this is how developers typically want to start their deploys, followed by monitoring in the UI component:

The results

These changes have revolutionized the way we deploy GitHub.com. Where confusion and frustration had once set in, we now have joy and contentment in using an automated system. We have received overwhelmingly positive feedback internally, and even attracted some attention from our very own Actions team. Our learnings, in this case, helped inform and influence decisions in the recent GitHub Actions CD offering announced at GitHub Universe. Our focus on our own developers means we can apply our learnings to continue creating the best possible system for the 56M+ developers around the world.

Our work this past summer would not have been possible without the dedication and work of many people and teams. We’d like to give a special shoutout to the GitHub internal deploy team: their existing system, advice, and ongoing help were crucial in making sure our deploys were successful.

Part of the Building GitHub blog series.

The best of Changelog • 2020 Edition

Post Syndicated from Michelle Mannering original https://github.blog/2021-01-21-changelog-2020-edition/

If you haven’t seen it, the GitHub Changelog helps you keep up-to-date with all the latest features and updates to GitHub. We shipped a tonne of changes last year, and it’s impossible to blog about every feature. In fact, we merged over 90,000 pull requests into the GitHub codebase in the past 12 months!

Here’s a quick recap of the top changes made to GitHub in 2020. We hope these changes are helping you build cooler things better and faster. Let us know what your favourite feature of the past year has been.

GitHub wherever you are

While we haven’t exactly been travelling a lot recently, one of the things we love is the flexibility to work wherever we want, however we want. Whether you want to work on your couch, in the terminal, or check your notifications on the go, we’ve shipped some updates for you.

GitHub CLI

Do you like to work in the command line? In September, we brought GitHub to your terminal. Having GitHub available in the command line reduces the need to switch between applications or various windows and helps simplify a bunch of automation scenarios.

The GitHub CLI allows you to run your entire GitHub workflow directly from the terminal. You can clone a repo, create, view, and review PRs, open issues, assign tasks, and so much more. The CLI is available on Windows, macOS, and Linux. Best of all, the GitHub CLI is open source. Download the CLI today, check out the repo, and view the Docs for a full list of the CLI commands.

GitHub for Mobile

It doesn’t stop there. Now you can also have GitHub in your pocket with GitHub for Mobile!

This new native app makes it easy to create, view, and comment on issues, check your notifications, merge a pull request, explore, organise your tasks, and more. One of the most used features of GitHub for Mobile is push notification support. Mobile alerts mean you’ll never miss a mention or review again and can help keep your team unblocked.

GitHub for Mobile is available on iOS and Android. Download it today if you’re not already carrying the world’s development platform in your pocket.

Oh and did you know, GitHub for Mobile isn’t just in English? It’s also available in Brazilian Portuguese, Japanese, Simplified Chinese, and Spanish.

 

GitHub Enterprise Server

With the release of GitHub Enterprise Server 2.21 in 2020, there was a host of amazing new features. There are new features for PRs, a new notification experience, and changes to issues. These are all designed to make it easier to connect, communicate, and collaborate within your organisation.

And now we’ve made Enterprise Server even better with GitHub Enterprise Server 3.0 RC. That means GitHub Actions, Packages, Code Scanning, Mobile Support, and Secret Scanning are now available in your Enterprise Server. This is the biggest release we’ve done of GitHub Enterprise Server in years, and you can install it now with full support.

Working better with automation

GitHub Actions was launched at the end of 2019 and is already the most popular CI/CD service on GitHub. Our team has continued adding features and improving ways for you to automate common tasks in your repository. GitHub Actions is so much more than simply CI/CD. Our community has really stepped up to help you automate all the things with over 6,500 open source Actions available in the GitHub Marketplace.

Some of the enhancements to GitHub Actions in 2020 include:

Workflow visualisation

We made it easy for you to see what’s happening with your Actions automation. With Workflow visualisation, you can now see a visual graph of your workflow.

This workflow visualisation allows you to easily view and understand your workflows no matter how complex they are. You can also track the progress of your workflow in real time and easily monitor what’s happening so you can access deployment targets.

On top of workflow visualisation, you can also create workflow templates. This makes it easier to promote best practices and consistency across your organisation. It also cuts down time when using the same or similar workflows. You can even define rules for these templates that work across your repo.

Self-hosted runners

Right at the end of 2019, we announced that GitHub Actions supports self-hosted runners, offering developers maximum flexibility and control over their workflows. Last year, we made updates to self-hosted runners, making them shareable across some or all of your GitHub organisations.

In addition, you can separate your runners into groups, and add custom labels to the runners in your groups. Read more about these Enterprise self-hosted runners and groups over on our GitHub Docs.

Environments & Environment Secrets

Last year we added environment protection rules and environment secrets across our CD capabilities in GitHub Actions. This update ensures there is a separation between the concerns of deployment and the concerns of development, helping teams meet compliance and security requirements.

Manual Approvals

With Environments, we also added the ability to pause a job that’s trying to deploy to the protected environment and request manual approval before that job continues. This unleashes a whole new raft of continuous deployment workflows, and we are very excited to see how you make use of these new features.

Other Actions Changes

Yes, there are all the big updates, but we’re committed to making small improvements too. Among other changes, we now have better support for whatever default branch name you choose: we updated all our starter workflows to use a new $default-branch macro.

We also added the ability to re-run all jobs after a successful run, as well as change the retention days for artifacts and logs. Speaking of logs, we updated how the logs are displayed. They are now much easier to read, have better searching, auto-scrolling, clickable URLs, support for more colours, and full screen mode. You can now disable or delete workflow runs in the Actions tab as well as manually trigger Actions runs with the workflow_dispatch trigger.

While having access to all 6,500+ actions in the marketplace helps integrate with different tools, some enterprises want to restrict which actions can be invoked to a trusted subset. You can now fine-tune access to external actions by limiting use to GitHub-verified authors, or even to a specific list of actions.

There were so many amazing changes and updates to GitHub Actions that we couldn’t possibly include them all here. Check out the Changelog for all our GitHub Actions updates.

Working better with Security

Keeping your code safe and secure is one of the most important things for us at GitHub. That’s why we made a number of improvements to GitHub Advanced Security for 2020.

You can read all about these improvements in the special Security Highlights from 2020. There are new features such as code scanning, secret scanning, Dependabot updates, Dependency review, and NPM advisory information.

If you missed the talk at GitHub Universe on the state of security in the software industry, don’t forget to check it out. Justin Hutchings, the Staff Product Manager for Security, walks through the latest trends in security and all things DevSecOps. It’s definitely worth carving out some time over the weekend to watch this:

Working better with your communities

GitHub is about building code together. That’s why we’re always making improvements to the way you work with your team and your community.

Issues improvements

Issues are important for keeping track of your project, so we have been busy making issues work better and faster on GitHub.

You can now also link issues and PRs via the sidebar, and issues now have list autocompletion. When you’re looking for an issue to reference, you can use multiple words to search for that issue inline.

Sometimes when creating an issue, you might like to add a GIF or short video to demo a bug or new feature. Now you can do it natively by adding an *.mp4 or *.mov file to your issue.

GitHub Discussions

Issues are a great place to talk about feature updates and bug fixes, but what about when you want to have an open-ended conversation or have your community help answer common questions?

GitHub Discussions is a place for you and your community to come together and collaborate, chat, or discuss something in a separate space, away from your issues. Discussions allows you to have threaded conversations. You can even convert Issues to Discussions, mark questions as answered, categorise your topics, and pin your Discussions. These features help you provide a welcoming space to new people as well as quick access to the most common discussion points.

If you are an admin or maintainer of a public repo you can enable Discussions via repo settings today. Check out our Docs for more info.

Speaking of Docs, did you know we recently published all our documentation as an open source project? Check it out and get involved today.

GitHub Sponsors

We launched GitHub Sponsors in 2019, and people have been loving this program. It’s a great way to contribute to open source projects. In 2020, we made GitHub Sponsors available in even more countries. Last year, GitHub Sponsors became available in Mexico, Czech Republic, Malta, and Cyprus.

We also added some other fancy features to GitHub Sponsors. This includes the ability to export a list of your sponsors. You can also set up webhooks for events in your sponsored account and easily keep track of everything that’s happening via your activity feed.

At GitHub Universe, we also announced Sponsors for Companies. This means organisations can now invest in open source projects via their billing arrangement with GitHub. Now is a great time to consider supporting your company’s most critical open source dependencies.

Working better with code

We’re always finding ways to help developers. As Nat said in his GitHub Universe keynote, the thing we care about the most is helping developers build amazing things. That’s why we’re always trying to make it quicker and easier to collaborate on code.

Convert pull requests to drafts

Draft pull requests are a great way to let your team know you are working on a feature. It helps start the conversation about how it should be built without worrying about someone thinking it’s ready to merge into main. We recently made it easy to convert an existing PR into a draft anytime.

Multi-line code suggestions

Not only can you do multi-line comments, you can now suggest a specific change to multiple lines of code when you’re reviewing a pull request. Simply click and drag and then edit text within the suggestion block.

Default branch naming

Alongside the entire Git community, we’ve been trying to make it easier for teams wanting to use more inclusive naming for their default branch. This also gives teams much more flexibility around branch naming. We’ve added first-tier support for renaming branches in the GitHub UI.

This helps take care of retargeting pull requests and updating branch protection rules. Furthermore, it provides instructions to people who have forked or cloned your repo to make it easier for them to update to your new branch names.

Re-directing to the new default branch

We provided redirects so links to deleted branch names now point to the new default branch. In addition, we updated GitHub Pages to allow publishing from any branch. We also added a preference so you can set the default branch name for your organization. If you need to stay with ‘master’ for compatibility with your existing tooling and automation, or if you prefer a different default branch, such as ‘development’, you can now set this in a single place.

For organizations new to GitHub, we also updated the default to ‘main’ to reflect the new consensus in the Git community. Existing repos are not affected by any of these changes. Hopefully we’ve helped make it easier for the people who do want to move away from the old ‘master’ terminology in Git.

Design updates for repos and GitHub UI

In mid-2020, we launched a fresh new look for the GitHub UI. The way repos are shown on the homepage and the overall look and feel of GitHub is super sleek. There’s a responsive layout, improved UX in the mobile web experience, and more. We also made lots of small improvements. For example, the way commits are shown in the pull request timeline has changed: in the past they were ordered by author date, and now they show up in chronological order in the head branch.

If you’ve been following a lot of our socials, you’ll know we’ve also got a brand new look and feel to GitHub.com. Check out these changes, and we hope it gives you fresh vibes for the future.

Go to the Dark Side

Speaking of fresh vibes, you’ve asked for it, and now it’s here! No longer will you be blinded by the light. Now you can go to the dark side with dark mode for the web.

Changelog 2020

These are just some of the highlights for 2020. We’re all looking forward to bringing you more great updates in 2021.

Keep an eye on the Changelog to stay informed and ensure you don’t miss out on any cool updates. You can also follow our changes with @GHChangelog on Twitter and see what’s coming soon by checking out the GitHub Roadmap. Tweet us your favourite changes for 2020, and tell us what you’re most excited to see in 2021.

Trident – Real-time event processing at scale

Post Syndicated from Grab Tech original https://engineering.grab.com/trident-real-time-event-processing-at-scale

Ever wondered what goes on behind the scenes when you receive advisory messages on a confirmed booking? Or perhaps how you are awarded rewards or points after completing a GrabPay payment transaction? At Grab, thousands of such campaigns targeting millions of users are operated daily by a backbone service called Trident. In this post, we share how Trident supports Grab’s daily business, the engineering challenges behind it, and how we solved them.

60-minute GrabMart delivery guarantee campaign operated via Trident

What is Trident?

Trident is essentially Grab’s in-house real-time if this, then that (IFTTT) engine, which automates various types of business workflows. The nature of these workflows could either be to create awareness or to incentivize users to use other Grab services.

If you are an active Grab user, you might have noticed new rewards or messages that appear in your Grab account. Most likely, these originate from a Trident campaign. Here are a few examples of types of campaigns that Trident could support:

  • After a user makes a GrabExpress booking, Trident sends the user a message that says something like “Try out GrabMart too”.
  • After a user makes multiple ride bookings in a week, Trident sends the user a food reward as a GrabFood incentive.
  • After a user is dropped off at his office in the morning, Trident awards the user a ride reward to use on the way back home on the same evening.
  • If a GrabMart order delivery takes over an hour of waiting time, Trident awards the user a free-delivery reward as compensation.
  • If the driver cancels the booking, then Trident awards points to the user as compensation.
  • With the current COVID pandemic, when a user makes a ride booking, Trident sends a message to both the passenger and the driver reminding them about COVID protocols.

Trident processes events based on campaigns, which are basically a logic configuration on what event should trigger what actions under what conditions. To illustrate this better, let’s take a sample campaign as shown in the image below. This mock campaign setup is taken from the Trident Internal Management portal.

Trident process flow

This sample setup basically translates to: for each user, count his/her number of completed GrabMart orders. Once he/she reaches 2 orders, send him/her a message saying “Make one more order to earn a reward”. And if the user reaches 3 orders, award him/her the reward and send a congratulatory message. 😁

Other than the basic event, condition, and action, Trident also allows more fine-grained configurations, such as setting an overall budget for a campaign, adding limits to avoid over-awarding, running A/B testing experiments, delaying actions, and so on.
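
As an illustration of the kind of configuration involved, a campaign could be modelled roughly like the Go struct below. The field names and types are hypothetical; Grab's actual schema is not described in this post.

package trident

import (
	"encoding/json"
	"time"
)

// Campaign sketches the configuration described above; the real Trident
// schema is internal to Grab and will differ.
type Campaign struct {
	ID          int64
	EventType   string          // e.g. "grabmart_order_completed" (hypothetical name)
	Rule        json.RawMessage // AND/OR condition tree, as in the JSON sample later in the post
	Actions     []Action        // what to do when the rule matches
	Budget      int64           // overall campaign budget
	PerUserCap  int             // per-user limit to avoid over-awarding
	ABSplit     float64         // fraction of users placed in the treatment group
	ActionDelay time.Duration   // optional delay before actions fire
}

// Action is whatever the campaign performs when triggered.
type Action struct {
	Kind    string // "message", "reward", "points", ...
	Payload map[string]string
}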

An IFTTT engine is nothing new or fancy, but building a high-throughput real-time IFTTT system poses a challenge due to the scale that Grab operates at. We need to handle billions of events and run thousands of campaigns on an average day. The amount of actions triggered by Trident is also massive.

In the month of October 2020, more than 2,000 events were processed every single second during peak hours. Across the entire month, we awarded nearly half a billion rewards, and sent over 2.5 billion communications to our end-users.

Now that we covered the importance of Trident to the business, let’s drill down on how we designed the Trident system to handle events at a massive scale and overcame the performance hurdles with optimization.

Architecture design

We designed the Trident architecture with the following goals in mind:

  • Independence: It must run independently of other services, and must not bring performance impacts to other services.
  • Robustness: All events must be processed exactly once (i.e. no event is missed and no event gets processed twice).
  • Scalability: It must be able to scale up processing power when the event volume surges and withstand the load when popular campaigns run.

The following diagram depicts what the overall system architecture looks like.

Trident architecture

Trident consumes events from multiple Kafka streams published by various backend services across Grab (e.g. GrabFood orders, Transport rides, GrabPay payment processing, GrabAds events). Given the nature of Kafka streams, Trident is completely decoupled from all other upstream services.

Each processed event is given a unique event key and stored in Redis for 24 hours. For any event that triggers an action, its key is persisted in MySQL as well. Before storing records in both Redis and MySQL, we make sure any duplicate event is filtered out. Together with the at-least-once delivery guaranteed by Kafka, we achieve exactly-once event processing.
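
Here is a minimal sketch of how this deduplication can work, assuming a go-redis client and a made-up key scheme; the post does not describe Trident's actual client or key format.

package trident

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// seenBefore records the event key in Redis with a 24-hour TTL and reports
// whether it had already been processed. SETNX only succeeds for the first
// writer, so replays delivered by Kafka's at-least-once semantics are dropped.
func seenBefore(ctx context.Context, rdb *redis.Client, eventKey string) (bool, error) {
	ok, err := rdb.SetNX(ctx, "trident:event:"+eventKey, 1, 24*time.Hour).Result()
	if err != nil {
		return false, err
	}
	return !ok, nil // ok == false means the key already existed
}

func handleEvent(ctx context.Context, rdb *redis.Client, eventKey string) error {
	dup, err := seenBefore(ctx, rdb, eventKey)
	if err != nil || dup {
		return err // on error the consumer can retry; duplicates are skipped
	}
	// ... evaluate campaigns and trigger actions; keys for events that
	// trigger actions are also persisted to MySQL, as described above.
	return nil
}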

Scalability is a key challenge for Trident. To achieve high performance under massive event volume, we needed to scale on both the server level and data store level. The following mind map shows an outline of our strategies.

Outline of Trident’s scale strategy

Scale servers

Our source of events is Kafka streams. There are mainly three factors that could affect the load on our system:

  1. Number of events produced in the streams (more rides, food orders, etc. results in more events for us to process).
  2. Number of campaigns running.
  3. Nature of campaigns running. The campaigns that trigger actions for more users cause higher load on our system.

There are naturally two types of approaches to scale up server capacity:

  • Distribute workload among server instances.
  • Reduce load (i.e. reduce the amount of work required to process each event).

Distribute load

Distributing the workload seems trivial with the load balancing and automatic horizontal scaling based on CPU usage that cloud providers offer. However, an additional server instance sits idle unless it can consume from a Kafka partition.

Each Kafka partition can only be consumed by one consumer within the same consumer group (our auto-scaling server group in this case). Therefore, any scaling in or out requires matching the Kafka partition configuration with the server auto-scaling configuration.

Here’s an example of a bad case of load distribution:

Kafka partitions config mismatches server auto-scaling config

And here’s an example of a good load distribution where the configurations for the Kafka partitions and the server auto-scaling match:

Kafka partitions config matches server auto-scaling config

Within each server instance, we also tried to increase processing throughput while keeping the resource utilization rate in check. Each Kafka partition consumer has multiple goroutines processing events, and the number of active goroutines is dynamically adjusted according to the event volume from the partition and the time of day (peak/off-peak).
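
A simplified sketch of that per-partition worker pattern is shown below. The names are hypothetical and the Kafka consumption itself is elided; the pool size here is static, whereas a production version would also resize it by event volume and time of day.

package trident

// Event stands in for a decoded message from the stream.
type Event struct {
	Type   string
	UserID int64
}

// partitionWorkers fans events from a single Kafka partition out to a bounded
// pool of goroutines.
func partitionWorkers(events <-chan Event, workers int, handle func(Event)) {
	sem := make(chan struct{}, workers) // bounds the number of in-flight handlers
	for ev := range events {
		sem <- struct{}{} // blocks when all workers are busy
		go func(ev Event) {
			defer func() { <-sem }()
			handle(ev)
		}(ev)
	}
	// Wait for any outstanding handlers to finish before returning.
	for i := 0; i < cap(sem); i++ {
		sem <- struct{}{}
	}
}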

Reduce load

You may ask how we reduced the amount of processing work for each event. First, we needed to see where we spent most of the processing time. After performing some profiling, we identified that the rule evaluation logic was the major time consumer.

What is rule evaluation?

Recall that Trident needs to operate thousands of campaigns daily. Each campaign has a set of rules defined. When Trident receives an event, it needs to check through the rules for all the campaigns to see whether there is any match. This checking process is called rule evaluation.

More specifically, a rule consists of one or more conditions combined by AND/OR Boolean operators. A condition consists of an operator with a left-hand side (LHS) and a right-hand side (RHS). The left-hand side is the name of a variable, and the right-hand side a value. A sample rule in JSON:

Country is Singapore and taxi type is either JustGrab or GrabCar.
  {
    "operator": "and",
    "conditions": [
      {
        "operator": "eq",
        "lhs": "var.country",
        "rhs": "sg"
      },
      {
        "operator": "or",
        "conditions": [
          {
            "operator": "eq",
            "lhs": "var.taxi",
            "rhs": <taxi-type-id-for-justgrab>
          },
          {
            "operator": "eq",
            "lhs": "var.taxi",
            "rhs": <taxi-type-id-for-grabcar>
          }
        ]
      }
    ]
  }

When evaluating the rule, our system loads the value of the LHS variable, evaluates it against the RHS value, and returns a Boolean result (true/false) indicating whether the rule passed.

To reduce the resources spent on rule evaluation, there are two types of strategies:

  • Avoid unnecessary rule evaluation
  • Evaluate “cheap” rules first

We implemented these two strategies with event prefiltering and weighted rule evaluation.

Event prefiltering

Just like a DB index helps speed up data look-up, having a pre-built map helped us narrow down the range of campaigns to evaluate. We loaded active campaigns from the DB every few minutes and organized them into an in-memory hash map, with the event type as the key and the list of corresponding campaigns as the value. The reason we picked event type as the key is that it is very fast to determine (most of the time just a type assertion), and it distributes events in a reasonably even way.

When processing events, we just looked up the map, and only ran rule evaluation on the campaigns in the matching hash bucket. This saved us at least 90% of the processing time.
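
In code, that prefilter is essentially a periodically rebuilt map from event type to campaigns. The following is a sketch, not Grab's actual implementation.

package trident

import "sync"

// Campaign is trimmed here to the one field the prefilter needs.
type Campaign struct {
	EventType string
	// ... rule, actions, limits
}

// prefilter holds active campaigns bucketed by event type. It is rebuilt from
// the database every few minutes and read on every event.
type prefilter struct {
	mu      sync.RWMutex
	byEvent map[string][]Campaign
}

// rebuild swaps in a fresh index of active campaigns.
func (p *prefilter) rebuild(active []Campaign) {
	idx := make(map[string][]Campaign, len(active))
	for _, c := range active {
		idx[c.EventType] = append(idx[c.EventType], c)
	}
	p.mu.Lock()
	p.byEvent = idx
	p.mu.Unlock()
}

// candidates returns only the campaigns whose event type matches, so rule
// evaluation runs on a small bucket instead of every active campaign.
func (p *prefilter) candidates(eventType string) []Campaign {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.byEvent[eventType]
}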

Event prefiltering

Weighted rule evaluation

Evaluating different rules comes with different costs. This is because different variables (i.e. LHS) in the rule can have different sources of values:

  1. The value is already available in memory (already consumed from the event stream).
  2. The value is the result of a database query.
  3. The value is the result of a call to an external service.

These three sources are ranked by cost:

In-memory < database < external service

We aimed to maximally avoid evaluating expensive rules (i.e. those that require calling an external service or querying a DB) while ensuring the correctness of evaluation results.

First optimization – Lazy loading

Lazy loading is a common performance optimization technique, which literally means “don’t do it until it’s necessary”.

Take the following rule as an example:

A & B

If we load the variable values for both A and B before evaluation, then we have loaded B unnecessarily whenever A is false. Since rule evaluation often fails early (for example, the transaction amount is less than the given minimum amount), there is no point in loading all the data beforehand. So we do lazy loading, i.e. load data only when evaluating that part of the rule.

Second optimization – Add weight

Let’s take the same example as above, but in a different order.

B & A
The source of data for A is memory, and for B it is an external service.

Now, even with lazy loading, in this case we always load the external data for B, even though the evaluation may fail at the next condition, A, whose data is already in memory.

Since most of our campaigns are targeted, a popular condition is to check whether a user is in a certain segment, and it is usually the first condition that a campaign creator sets. This data resides in another service, so it becomes quite expensive to evaluate this condition first even though the next condition’s data may already be in memory (e.g. whether the taxi type is JustGrab).

So we did the next phase of optimization here: sorting the conditions based on the weight of their data source (low weight if the data is in memory, higher if it is in our database, and highest if it is in an external system). If AND were the only logical operator we supported, this would have been quite simple, but the presence of OR made it complex. We came up with an algorithm that orders the evaluation by weight while respecting the AND/OR structure. Here’s what the flowchart looks like:

Event flowchart

An example:

Conditions: A & ( B | C ) & ( D | E )

Actual result: true & ( false | false ) & ( true | true ) --> false

Weight: B < D < E < C < A

Expected check order: B, D, C

First, we validate B, which is false. We cannot skip its sibling conditions here, since B and C are connected by |. Next, we check D. D is true and its only sibling E is connected by |, so we can mark E as “skip”. Then we reach E, but since it has been marked “skip”, we just skip it. We still cannot determine the final result, so we continue by validating C, which is false. Now we know (B | C) is false, so the whole condition is false too, and we can stop.
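
The sketch below captures the spirit of this approach with a recursive evaluator that sorts each AND/OR node's children by the cheapest data source and short-circuits as soon as the result is decided. Treat it as a simplification: the real algorithm, as described above, orders checks across the whole tree and tracks explicit "skip" marks.

package trident

import "sort"

// Source describes where a condition's left-hand-side value comes from;
// cheaper sources get lower weights.
type Source int

const (
	InMemory Source = iota // already on the consumed event
	Database               // needs a DB query
	External               // needs a call to another service
)

// Condition is either a leaf (Eval loads its LHS lazily and compares it to
// the RHS) or an AND/OR node over child conditions.
type Condition struct {
	Op       string // "and", "or", or a leaf operator such as "eq"
	Children []*Condition
	Source   Source
	Eval     func() (bool, error) // leaf evaluation; loads data only when called
}

// weight returns the cost of the cheapest path needed to start evaluating c.
func (c *Condition) weight() Source {
	if len(c.Children) == 0 {
		return c.Source
	}
	w := c.Children[0].weight()
	for _, ch := range c.Children[1:] {
		if cw := ch.weight(); cw < w {
			w = cw
		}
	}
	return w
}

// Evaluate checks the cheapest children first and stops as soon as the AND/OR
// result is decided, so expensive lookups are often never made at all.
func (c *Condition) Evaluate() (bool, error) {
	if len(c.Children) == 0 {
		return c.Eval()
	}
	kids := append([]*Condition(nil), c.Children...)
	sort.Slice(kids, func(i, j int) bool { return kids[i].weight() < kids[j].weight() })
	for _, ch := range kids {
		ok, err := ch.Evaluate()
		if err != nil {
			return false, err
		}
		if c.Op == "and" && !ok {
			return false, nil // short-circuit: one false child decides an AND
		}
		if c.Op == "or" && ok {
			return true, nil // short-circuit: one true child decides an OR
		}
	}
	return c.Op == "and", nil
}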

Sub-streams

After investigation, we learned that we were consuming a particular stream that produced terabytes of data per hour, which caused our CPU usage to shoot up by 30%. It turned out that we process only a handful of event types from that stream, so we introduced a sub-stream in between that contains only the event types we want to support. This sub-stream is populated from the main stream by another server, thereby reducing the load on Trident.

Protect downstream

While we scaled up our servers wildly, we needed to keep in mind that there were many downstream services that received more traffic. For example, we call the GrabRewards service for awarding rewards or the LocaleService for checking the user’s locale. It is crucial for us to have control over our outbound traffic to avoid causing any stability issues in Grab.

Therefore, we implemented rate limiting. There is a total rate limit configured for calling each downstream service, and the limit varies across time ranges (e.g. a tighter limit for calling a critical service during peak hours).
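
A minimal version of such an outbound limiter, sketched with the golang.org/x/time/rate package; the service names and numbers below are illustrative only, not Trident's actual configuration.

package trident

import (
	"context"

	"golang.org/x/time/rate"
)

// downstreamLimiters holds one rate limiter per downstream service. The limit
// for a critical service could be swapped for a tighter one during peak hours.
var downstreamLimiters = map[string]*rate.Limiter{
	"grabrewards":   rate.NewLimiter(rate.Limit(100), 20), // ~100 req/s, small bursts
	"localeservice": rate.NewLimiter(rate.Limit(200), 50),
}

// callDownstream blocks until the per-service limiter allows the request,
// keeping outbound traffic within the configured ceiling.
func callDownstream(ctx context.Context, service string, do func(context.Context) error) error {
	if lim, ok := downstreamLimiters[service]; ok {
		if err := lim.Wait(ctx); err != nil {
			return err // context cancelled or deadline exceeded while waiting
		}
	}
	return do(ctx)
}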

Scale data store

We have two types of storage in Trident: cache storage (Redis) and persistent storage (MySQL and others).

Scaling cache storage is straightforward, since Redis Cluster already offers everything we need:

  • High performance: Known to be fast and efficient.
  • Scaling capability: New shards can be added at any time to spread out the load.
  • Fault tolerance: Data replication makes sure that data does not get lost when any single Redis instance fails, and auto election mechanism makes sure the cluster can always auto restore itself in case of any single instance failure.

All we needed to make sure of was that our cache keys could be hashed evenly across the different shards.

As for scaling persistent data storage, we tackled it in two ways just like we did for servers:

  • Distribute load
  • Reduce load (both overall and per query)

Distribute load

There are two levels of load distribution for persistent storage: infra level and DB level. On the infra level, we split data with different access patterns into different types of storage. Then on the DB level, we further distributed read/write load onto different DB instances.

Infra level

Just like any typical online service, Trident has two types of data in terms of access pattern:

  • Online data: Frequent access. Requires quick access. Medium size.
  • Offline data: Infrequent access. Tolerates slow access. Large size.

For online data, we need to use a high-performance database, while for offline data we can just use cheap storage. The following table shows Trident’s online/offline data and the corresponding storage.

Trident’s online/offline data and storage

Writing of offline data is done asynchronously to minimize performance impact as shown below.

Online/offline data split

For the APIs that retrieve this offline data for users, we configure high timeouts.

DB level

We further distributed load on the MySQL DB level, mainly by introducing replicas, and redirecting all read queries that can tolerate slightly outdated data to the replicas. This relieved more than 30% of the load from the master instance.

Going forward, we plan to segregate the single MySQL database into multiple databases, based on table usage, to further distribute load if necessary.

Reduce load

To reduce the load on the DB, we reduced the overall number of queries and removed unnecessary ones. We also optimized the schema and queries so that they complete faster.

Query reduction

We needed to track the usage of a campaign. The tracking is just incrementing a value against a unique key in the MySQL database. For a popular campaign, it’s possible that multiple increment queries (write queries) are made to the database for the same key, which can cause an IOPS burst. So we came up with the following algorithm to reduce the number of queries (a sketch of the batching approach follows the list):

  • Have a fixed number of threads per instance that can make such queries to the DB.
  • The increment queries are queued onto these threads.
  • If a thread is idle (not busy querying the database), it writes to the database right away.
  • If the thread is busy, the increment is accumulated in memory.
  • When the thread becomes free, it increments the value in the database by the accumulated sum.
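
Here is a rough sketch of that batching idea using database/sql, shown with a single writer goroutine for brevity (the post describes a fixed pool of threads); the table and column names are made up.

package trident

import (
	"database/sql"
	"log"
)

// usageWriter coalesces campaign-usage increments. Increments queue up in a
// buffered channel; while the writer goroutine is busy with one UPDATE, further
// increments for the same key accumulate and are flushed as a single query.
type usageWriter struct {
	db  *sql.DB
	inc chan string // campaign usage keys, one message per +1
}

func newUsageWriter(db *sql.DB) *usageWriter {
	w := &usageWriter{db: db, inc: make(chan string, 10000)}
	go w.loop()
	return w
}

// Increment queues a single usage increment without touching the database.
func (w *usageWriter) Increment(key string) { w.inc <- key }

func (w *usageWriter) loop() {
	for key := range w.inc {
		pending := map[string]int{key: 1}
		// Drain whatever accumulated while previous writes were in flight.
		for {
			select {
			case k := <-w.inc:
				pending[k]++
				continue
			default:
			}
			break
		}
		for k, n := range pending {
			_, err := w.db.Exec(
				`INSERT INTO campaign_usage (usage_key, value) VALUES (?, ?)
				 ON DUPLICATE KEY UPDATE value = value + ?`, k, n, n)
			if err != nil {
				log.Printf("usage increment failed for %s: %v", k, err)
			}
		}
	}
}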

To prevent accidentally over-awarding benefits (rewards, points, etc.), we require campaign creators to set limits. However, some campaigns don’t need a limit, so the campaign creators just specify a large number. Such popular campaigns can cause very high QPS to our database. We had a brilliant trick to address this issue: we just don’t track usage if the limit is very high. Do you think people really want to limit usage when they set the per-user limit to 100,000? 😉

Query optimization

One of our requirements was to track the usage of a campaign – overall as well as per user (and more like daily overall, daily per user, etc). We used the following query for this purpose:

INSERT INTO … ON DUPLICATE KEY UPDATE value = value + inc

The table had a unique key index (combining multiple columns) along with a usual auto-increment integer primary key. We encountered performance issues arising from MySQL gap locks when high write QPS hit this table (i.e. when popular campaigns ran). After testing out a few approaches, we ended up making the following changes to solve the problem:

  1. Removed the auto-increment integer primary key.
  2. Converted the secondary unique key to the primary key.

Conclusion

Trident is Grab’s in-house real-time IFTTT engine, which processes events and operates business mechanisms on a massive scale. In this article, we discussed the strategies we implemented to achieve large-scale high-performance event processing. The overall ideas of distributing and reducing load may be straightforward, but there were lots of thoughts and learnings shared in detail. If you have any comments or questions about Trident, feel free to leave a comment below.

All the examples of campaigns given in this article are for demonstration purposes only; they are not real live campaigns.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving Southeast Asia forward, apply to join our team today.

GitHub Availability Report: December 2020

Post Syndicated from Keith Ballinger original https://github.blog/2021-01-06-github-availability-report-december-2020/

Introduction

In December, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will provide a summary and follow-up details on how we addressed an incident mentioned in November’s report.

Follow-up to November 27 16:04 UTC (lasting one hour and one minute)

Upon further investigation around one of the incidents mentioned in November’s Availability Report, we discovered an edge case that triggered a large number of GitHub App token requests. This caused abnormal levels of replication lag within one of our MySQL clusters, specifically affecting the GitHub Actions service. This particular scenario resulted in amplified queries and increased the database lag, which impacted the database nodes that process GitHub App token requests.

When a GitHub Action is invoked, the Action is passed a GitHub App token to perform tasks on GitHub. In this case, the database lag resulted in the failure of some of those token requests because the database replicas did not have up to date information.

To help avoid this class of failure, we are updating the queries to prevent large quantities of token requests from overloading the database servers in the future.

In summary

Whether we’re introducing a system to manage flaky tests or improving our CI workflow, we’ve continued to invest in our engineering systems and overall reliability. To learn more about what we’re working on, visit GitHub’s engineering blog.

Building On-Call Culture at GitHub

Post Syndicated from Mary Moore-Simmons original https://github.blog/2021-01-06-building-on-call-culture-at-github/

As GitHub grows in size and our product offerings grow in number and complexity, we need to constantly evolve our on-call strategy so we can continue to be the trusted home for all developers. Expanding upon our Building GitHub blog series, this post gives you a window into one of the major steps along our continuous journey for operational excellence at GitHub.

Monolithic On-Call

Most of the GitHub products you interact with are in a large Ruby on Rails monolith. Monolithic codebases are common for many high-growth startups, and it’s a difficult situation to detangle yourself from. One of the pain points we had was problems with the on-call system for our monolith.

The biggest pain points with our monolithic on-call structure were:

  • The main GitHub monolith spans a huge number of products and features. Most engineers (understandably) weren’t familiar with enough of the codebase to feel confident in their ability to respond to incidents when on-call. Often, the engineer who got paged would escalate to another team, which made them feel more like a switchboard operator than an engineer.
  • The on-call rotation was large and engineers were on-call for 24 hours at a time. As a result, engineers were only on-call ~4 times per year for one day, and most never gained the context they needed to feel confident while on-call.
  • Since the monitoring and alerting for this on-call rotation were spread across most engineering teams at GitHub, and people only experienced on-call for 24 hours at a time, the monitoring and documentation were not well maintained. As a result, we had noisy alerts and poor runbooks.
  • Because most engineers didn’t feel confident with the monolithic on-call shift, the same 5-10 people who knew the platform best were involved in every production incident, which caused an imbalance in on-call responsibilities.

New On-Call Culture

To address these pain points, we made large changes to our on-call structure. We split up the monolith on-call rotation so every engineering team was on-call for the code they maintain. We had a number of logistical difficulties to overcome, but most of the hurdles we faced were cultural and educational.

Logistical Hurdles

File Ownership

The GitHub monolith contains over 16,000 files, and file ownership was often unclear or undocumented. To solve this, we rolled out a new system that associates files with services and then assigns services to teams, which made it much easier to change ownership as teams changed. Here’s a snippet of the file-to-service mappings we use in the monolith:

## Apps
### API
app/api/github_apps.rb     :apps
app/api/grants.rb          :apps
### Components             :apps
app/component/apps*        :apps
test/app/component/apps*   :apps

## Authzd
**/*authzd*                :authzd
app/models/permissions.rb  :authzd

You’ll see that each file in the monolith maps to a service, such as “apps” or “authzd”. We then have another file in the monolith that lists every service and maps them to teams. Here’s a snippet of that code for the apps service:

apps:
  name: GitHub Apps
  description: Allows users to build 'GitHub Apps' on top of GitHub.
  team: ecosystem-apps

This information is automatically pulled into a service we developed in-house, called the Service Catalog, which then displays the information to GitHub employees. You can search for services in the Service Catalog, which allows GitHub engineers, support, and product to track down which engineering team owns which services. Here’s an example of what it looks like for the apps service:

We asked all teams to move to this new service ownership model, implemented a linter that doesn’t allow files in the monolith to be updated or added unless the ownership information was completed, and worked with engineering and leadership to assign ownership for major services that were still unowned.

Monitoring and Alerting

Monitoring and alerting were set up for the monolith as a whole, so we asked teams to create monitoring specific to their areas of responsibility. Once this was mostly complete, a group of more senior engineers reviewed all remaining monolith-wide alerts, split them up, assigned them to teams, and decommissioned alerts that were no longer necessary.

Large Scope of Teams Involved

There were over 50 engineering teams that needed to make changes to their processes in order to complete this effort. To coordinate this, we opened a GitHub issue for every team with clear checklists for the work needed. From there we did regular check-ins with teams and provided help and information so we didn’t leave anyone behind.

Cultural and Educational Hurdles

  • Adjusting to the pandemic: This was a huge change for our engineers, and seven months into the project a global pandemic hit. This caused significant compounding anxiety that negatively impacted people’s ability to think critically. As a result, the project required a more high-touch, empathy-first approach than originally expected.
  • Training touchpoints: Many of our engineers had never been on-call before and didn’t have experience with operational best practices. We designed and delivered three rounds of training, set up office hours with on-call experts, created significant tooling and documentation for engineering teams, and opened Slack channels where people could ask questions and get help.
  • Instilling work-life balance while on-call: Several engineers were anxious about the impact on-call would have on their lives. Once you have been on-call for months or years, going to the grocery store or going for a bike ride while on-call is a normal activity that you can plan for. However, for those who haven’t been on-call before, it can be daunting to design personal strategies that allow you to respond to a page within minutes and try to do everyday tasks like shower or go grocery shopping. We worked with teams to understand their anxieties, document tips and tricks from people who had experience being on-call, and work with people 1:1 who had further concerns. We also reinforced that team members are there to support each other by taking on-call for a couple hours if someone wants to go for a run or handle childcare, by being a back-up if someone misses a page, or designing a follow-the-sun rotation if their team is in different time zones. This is yet another situation where GitHub’s global remote team is a major asset – we can lean on our team members in other parts of the world to take on-call when we’re not working. Many people will not become comfortable with how to balance on-call with having a life until they spend months or even years being on-call regularly, so some of this comfort is a matter of time and not training.
  • Cultivating a blameless culture: Many engineers were anxious about performing well when they are on-call. They had concerns that they would miss a page or make a mistake and let down their team. We worked with the organization to reinforce the message that mistakes are okay and outages happen no matter how well you perform your on-call duties, but we still have a lot of work to do in this area. We are undergoing a long-term effort to continue fostering a blameless and supportive on-call culture at GitHub, which includes nurturing safe spaces to learn about on-call and celebrating people publicly who bravely worked on something they weren’t familiar with while on-call. It will be a long journey to create a sense of safety about making mistakes on the engineering teams at GitHub, but we can improve over time.
  • Meeting the criticality needs for each team: Different GitHub products are at different levels of criticality, which means some engineering teams have to respond within 5 minutes if they are paged, and some don’t have to respond until the next business day if their service breaks. Some engineers are concerned that this causes an unfair imbalance between engineering teams. However, different engineers have different interests. Some would prefer to work on more business-critical, technically complex systems with stricter uptime requirements, while others value work/life balance more than the technical complexity needed for products with strict operational requirements. Over time, engineers will select teams that have a level of operational rigor they identify most with, and this will correct itself.
  • Dedicating time to focus on long-term success: As we rolled out the change, several teams expressed concerns that they are not able to spend enough time making their on-call experience better. We provided clear documentation reinforcing that the person on-call should be focused on improving the on-call experience when they are not responding to pages. This includes updating runbooks, tuning noisy alerts, scripting/automating on-call tasks to eliminate or streamline on-call tasks for sleep-deprived engineers, and fixing underlying technical debt that makes the on-call experience worse. We communicated that teams might spend ~20% of their time on technical debt and ~20% of their time on improving the on-call experience if needed for the stability of the product, customer experience, and engineering experience. These guidelines require long-term focus from leadership. We need to continuously spread the message from Senior VPs all the way to line managers that a sustainable engineering on-call experience is critical to the success of GitHub.

Continuing the Journey

GitHub’s incident resolution time improved after the bulk of this initiative was completed, but our journey is never done. Organizations need to constantly improve their operational best practices or they’ll fall behind.

There are several long-term cultural changes discussed above that we must continue to promote for years at GitHub. We are conducting a retrospective in January to learn how we can improve rolling out large changes like these in the future, and how we can continue to improve the on-call experience for our engineers and GitHub’s stability for our customers. In addition, we are sending out regular surveys to engineers about their on-call experience and we continue to monitor GitHub’s uptime. We will continue to meet with engineering teams to discuss their on-call pain points and how to improve so that we can encourage a growth mindset in our engineering teams and help each other learn operational best practices. Everyone at GitHub is in this journey together, and we need to support each other in our drive for excellence so we can continue to be the trusted home for all developers.

Git clone: a data-driven study on cloning behaviors

Post Syndicated from Solmaz Abbaspoursani original https://github.blog/2020-12-22-git-clone-a-data-driven-study-on-cloning-behaviors/

@derrickstolee recently discussed several different git clone options, but how do those options actually affect your Git performance? Which option is fastest for your client experience? Which option is fastest for your build machines? How can these options impact server performance? If you are a GitHub Enterprise Server administrator it’s important that you understand how the server responds to these options under the load of multiple simultaneous requests.

Here at GitHub, we use a data-driven approach to answer these questions. We ran an experiment to compare these different clone options and measured the client and server behavior. It is not enough to just compare git clone times, because that is only the start of your interaction with a Git repository. In particular, we wanted to determine how these clone options change the behavior of future Git operations such as git fetch.

In this experiment, we aimed to answer the below questions:

  1. How fast are the various git clone commands?
  2. Once we have cloned a repository, what kind of impact do future git fetch commands have on the server and client?
  3. What impact do full, shallow and partial clones have on a Git server? This is mostly important for our GitHub Enterprise Server Admins.
  4. Will the repository shape and size make any difference in the overall performance?

It is worth special emphasis that these results come from simulations that we performed in our controlled environments and do not simulate complex workflows that might be used by many Git users. Depending on your workflows and repository characteristics, these results may change. Perhaps this experiment provides a framework that you could follow to measure how your workflows are affected by these options. If you would like help analyzing your workflows, feel free to engage with GitHub’s Professional Services team.

For a summary of our findings, feel free to jump to our conclusions and recommendations.

Experiment design

To maximize the repeatability of our experiment, we use open source repositories for our sample data. This way, you can compare your repository shape to the tested repositories to see which is most applicable to your scenario.

We chose to use the jquery/jquery, apple/swift, and torvalds/linux repositories. These three repositories vary in size and in the number of commits, blobs, and trees.

These repositories were mirrored to a GitHub Enterprise Server instance running version 2.22 on an 8-core cloud machine. We used an internal load testing tool based on Gatling to generate Git requests against the test instance. We ran each test with a specific number of users across 5 different load generators for 30 minutes. All of our load generators use Git version 2.28.0, which by default uses protocol version 1. We would like to note that protocol version 2 only improves ref advertisement, and therefore we don’t expect it to make a difference in our tests.

Once a test is complete, we use a combination of Gatling results, ghe-governor and server health metrics to analyze the test.

Test repository characteristics

The git-sizer tool measures the size of Git repositories along many dimensions. In particular, we care about the total size on disk along with the count of each object type. The table below contains this information for our three test repositories.

Repo            blobs      trees      commits  Size (MB)
jquery/jquery   14,779     22,579     7,873    40
apple/swift     390,116    649,322    132,316  750
torvalds/linux  2,174,798  4,619,864  968,500  4,000

The jquery/jquery repository is a fairly small repository with only 40MB of disk space. apple/swift is a medium-sized repository with around 130 thousand commits and 750MB on disk. The torvalds/linux repository is typically the gold standard for Git performance tests on open source repositories. It uses 4 gigabytes of disk space and has close to a million commits.

Test scenarios

We care about the following clone options:

  1. Full clones.
  2. Shallow clones (--depth=1).
  3. Treeless clones (--filter=tree:0).
  4. Blobless clones (--filter=blob:none).

In addition to these options at clone time, we can also choose to fetch in a shallow way using --depth=1. Since treeless and blobless clones have their own way to reduce the objects downloaded during git fetch, we only test shallow fetches on full and shallow clones.

We organized our test scenarios into the following ten categories, labeled T1 through T10. T1 to T4 simulate four different git clone types, and T5 to T10 simulate various git fetch operations into these cloned repositories.

Test# | git command                  | Description
T1    | git clone                    | Full clone
T2    | git clone --depth 1          | Shallow clone
T3    | git clone --filter=tree:0    | Treeless partial clone
T4    | git clone --filter=blob:none | Blobless partial clone
T5    | T1 + git fetch               | Full fetch in a fully cloned repository
T6    | T1 + git fetch --depth 1     | Shallow fetch in a fully cloned repository
T7    | T2 + git fetch               | Full fetch in a shallow cloned repository
T8    | T2 + git fetch --depth 1     | Shallow fetch in a shallow cloned repository
T9    | T3 + git fetch               | Full fetch in a treeless partially cloned repository
T10   | T4 + git fetch               | Full fetch in a blobless partially cloned repository

In partial clones, the new blobs at the new ref tip are not downloaded until we navigate to that position and populate our working directory with those blob contents. To be a fair comparison with the full and shallow clone cases, we also have our simulation run git reset --hard origin/$branch in all T5 to T10 tests. In T5 to T8 this extra step will not have a huge impact, but in T9 and T10 it will ensure the blob downloads are included in the cost.

In all the scenarios above, a single user was also set to repeatedly change 3 random files in the repository and push them to the same branch that the other users were cloning and fetching. This simulates repository growth so the git fetch commands actually have new data to download.
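
As a concrete example, one client-side iteration of the T10 scenario boils down to a handful of commands. This is only a sketch: the URL is a placeholder and we assume the default branch is named main.

# T10: blobless partial clone, then a full fetch plus a hard reset to the new tip
git clone --filter=blob:none https://github.com/example/repo.git repo
cd repo
git fetch origin
git reset --hard origin/main   # forces download of any blobs needed for the working directory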

Test Results

Let’s dig into the numbers to see what our experiment says.

git clone performance

The full numbers are provided in the tables below.

Unsurprisingly, shallow clone is the fastest clone for the client, followed by treeless and then blobless partial clones, and finally full clones. This performance is directly proportional to the amount of data required to satisfy the clone request. Recall that full clones need all reachable objects, blobless clones need all reachable commits and trees, and treeless clones need only all reachable commits. A shallow clone is the only clone type that does not grow at all along with the history of your repository.

The performance impact of these clone types grows in proportion to the repository size, especially the number of commits. For example, a shallow clone of torvalds/linux is four times faster than a full clone, while a treeless clone is only twice as fast and a blobless clone is only 1.5 times as fast. It is worth noting that the development culture of the Linux project promotes very small blobs that compress extremely well. We expect the performance difference to be greater for most other projects with a higher blob count or size.
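
If you want a rough feel for this on a repository you care about, you can compare how much object data each clone type actually transfers. Here is a minimal sketch using one of our test repositories and git count-objects (the size-pack line is reported in KiB):

# Clone the same repository two ways and compare the downloaded pack sizes
git clone https://github.com/jquery/jquery.git jquery-full
git -C jquery-full count-objects -v

git clone --filter=blob:none https://github.com/jquery/jquery.git jquery-blobless
git -C jquery-blobless count-objects -v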

As for server performance, we see that the Git CPU time per clone is higher for the blobless partial clone (T4). Looking a bit closer with ghe-governor, we observe that the higher Git CPU is mainly due to the larger number of pack-objects operations in the partial clone scenarios (T3 and T4). In the torvalds/linux repository, the Git CPU time spent on pack-objects is four times higher in a treeless partial clone (T3) compared to a full clone (T1). In contrast, in the smaller jquery/jquery repository, a full clone consumes more CPU per clone than the partial clones (T3 and T4). Shallow clones of all three repositories consume the lowest amount of total and Git CPU per clone.

If a full clone sends more data, why does the server spend more CPU on a partial clone? When Git sends all reachable objects to the client, it mostly transfers the data it has on disk without decompressing or converting it. However, partial and shallow clones need to extract that subset of data and repackage it to send to the client. We are investigating ways to reduce this CPU cost in partial clones.

The real fun starts after cloning the repository and users start developing and pushing code back up to the server. In the next section we analyze scenarios T5 through T10, which focus on the git fetch and git reset --hard origin/$branch commands.

jquery/jquery clone performance

Test# | description            | clone avgRT (milliseconds)  | git CPU spent per clone
T1    | full clone             | 2,000ms (Slowest)           | 450ms (Highest)
T2    | shallow clone          | 300ms (6x faster than T1)   | 15ms (30x less than T1)
T3    | treeless partial clone | 900ms (2.5x faster than T1) | 270ms (1.7x less than T1)
T4    | blobless partial clone | 900ms (2.5x faster than T1) | 300ms (1.5x less than T1)

apple/swift clone performance

Test# | description            | clone avgRT (seconds)    | git CPU spent per clone
T1    | full clone             | 50s (Slowest)            | 8s (2x less than T4)
T2    | shallow clone          | 8s (6x faster than T1)   | 3s (6x less than T4)
T3    | treeless partial clone | 16s (3x faster than T1)  | 13s (Similar to T4)
T4    | blobless partial clone | 22s (2x faster than T1)  | 15s (Highest)

torvalds/linux clone performance

Test# | description            | clone avgRT (minutes)      | git CPU spent per clone
T1    | full clone             | 5m (Slowest)               | 60s (2x less than T4)
T2    | shallow clone          | 1.2m (4x faster than T1)   | 40s (3.5x less than T4)
T3    | treeless partial clone | 2.4m (2x faster than T1)   | 120s (Similar to T4)
T4    | blobless partial clone | 3m (1.5x faster than T1)   | 130s (Highest)

git fetch performance

The full fetch performance numbers are provided in the tables below, but let’s first summarize our findings.

The biggest finding is that shallow fetches are the worst possible option, in particular from full clones. The technical reason is that the existence of a “shallow boundary” disables an important performance optimization on the server, forcing it to walk commits and trees to find what is reachable from the client’s perspective. This is more expensive in the full clone case because the client has more commits and trees that the server must check to avoid sending duplicates. Also, as more shallow commits accumulate, the client needs to send more data to the server to describe those shallow boundaries.

The two partial clone options have drastically different behavior in the git fetch and git reset --hard origin/$branch sequence. Fetching from blobless partial clones increases the time of the reset command by a small but measurable amount, not enough to make a noticeable difference from the user perspective. In contrast, fetching from a treeless partial clone adds significantly more time, because the server needs to send all trees and blobs reachable from a commit’s root tree in order to satisfy the git reset --hard origin/$branch command.

Due to these extra costs as a repository grows, we strongly recommend against shallow fetches and fetching from treeless partial clones. The only recommended scenario for a treeless partial clone is for quickly cloning on a build machine that needs access to the commit history, but will delete the repository at the end of the build.

Blobless partial clones do increase the Git CPU costs on the server somewhat, but the network data transfer is much less than a full clone or a full fetch from a shallow clone. The extra CPU cost is likely to become less important if your repository has larger blobs than our test repositories. In addition, you have access to the full commit history, which might be valuable to real users interacting with these repositories.

We also noticed a surprising result during our testing. During the T9 and T10 tests for the Linux repository, our load generators encountered memory issues: under the heavy load we were running, these scenarios triggered automatic garbage collection (GC) much more often. GC in the Linux repository is expensive and involves a full repack of all Git data. Since we were testing on a Linux client, the GC processes were launched in the background to avoid blocking our foreground commands. However, as we kept fetching we ended up with several concurrent background processes; this is not a realistic scenario but an artifact of our synthetic load testing. We ran git config gc.auto false to prevent this from affecting our test results.

It is worth noting that blobless partial clones might trigger automatic garbage collection more often than a full clone. This is a natural byproduct of splitting the data into a larger number of small requests. We have work in progress to make Git’s repository maintenance be more flexible, especially for large repositories where a full repack is too time-consuming. Look forward to more updates about that feature here on the GitHub blog.

jquery/jquery fetch performance

Test# | scenario                                              | git fetch avgRT | git reset --hard | git CPU spent per fetch
T5    | full fetch in a fully cloned repository               | 200ms           | 5ms              | 4ms (Lowest)
T6    | shallow fetch in a fully cloned repository            | 300ms           | 5ms              | 18ms (4x more than T5)
T7    | full fetch in a shallow cloned repository             | 200ms           | 5ms              | 4ms (Similar to T5)
T8    | shallow fetch in a shallow cloned repository          | 250ms           | 5ms              | 14ms (3x more than T5)
T9    | full fetch in a treeless partially cloned repository  | 200ms           | 85ms             | 9ms (2x more than T5)
T10   | full fetch in a blobless partially cloned repository  | 200ms           | 40ms             | 6ms (1.5x more than T5)

apple/swift fetch performance

Test# | scenario                                              | git fetch avgRT | git reset --hard | git CPU spent per fetch
T5    | full fetch in a fully cloned repository               | 350ms           | 80ms             | 20ms (Lowest)
T6    | shallow fetch in a fully cloned repository            | 1,500ms         | 80ms             | 300ms (13x more than T5)
T7    | full fetch in a shallow cloned repository             | 300ms           | 80ms             | 20ms (Similar to T5)
T8    | shallow fetch in a shallow cloned repository          | 350ms           | 80ms             | 45ms (2x more than T5)
T9    | full fetch in a treeless partially cloned repository  | 300ms           | 300ms            | 70ms (3x more than T5)
T10   | full fetch in a blobless partially cloned repository  | 300ms           | 150ms            | 35ms (1.5x more than T5)

torvalds/linux fetch performance

Test# | scenario                                              | git fetch avgRT | git reset --hard | git CPU spent per fetch
T5    | full fetch in a fully cloned repository               | 350ms           | 250ms            | 40ms (Lowest)
T6    | shallow fetch in a fully cloned repository            | 6,000ms         | 250ms            | 1,000ms (25x more than T5)
T7    | full fetch in a shallow cloned repository             | 300ms           | 250ms            | 50ms (1.3x more than T5)
T8    | shallow fetch in a shallow cloned repository          | 400ms           | 250ms            | 80ms (2x more than T5)
T9    | full fetch in a treeless partially cloned repository  | 350ms           | 1,250ms          | 400ms (10x more than T5)
T10   | full fetch in a blobless partially cloned repository  | 300ms           | 500ms            | 140ms (3.5x more than T5)

What does this mean for you?

Our experiment demonstrated some performance changes between these different clone and fetch options. Your mileage may vary! Our experimental load was synthetic, and your repository shape can differ greatly from these repositories.

Here are some common themes we identified that could help you choose the right scenario for your own usage:

If you are a developer focused on a single repository, the best approach is to do a full clone and then always perform a full fetch into that clone. You might deviate to a blobless partial clone if that repository is very large due to many large blobs, as that clone will help you get started more quickly. The trade-off is that some commands such as git checkout or git blame will require downloading new blob data when necessary.

In general, calculating a shallow fetch is computationally more expensive compared to a full fetch. Always use a full fetch instead of a shallow fetch both in fully and shallow cloned repositories.
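
In practice, that means preferring a plain git fetch, and if you inherited a shallow clone that you intend to keep working in, converting it to a full clone once so that later fetches stay cheap. A small sketch:

git fetch origin        # full fetch: lets the server use its optimized paths
git fetch --unshallow   # one-time backfill of the truncated history in a shallow clone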

In workflows such as CI builds where there is a need to do a single clone and delete the repository immediately, shallow clones are a good option. Shallow clones are the fastest way to get a copy of the working directory at the tip commit. If you need the commit history for your build, then a treeless partial clone might work better for you than a full clone. Bear in mind that in larger repositories such as torvalds/linux, a treeless clone saves time on the client but is a bit heavier on your Git server when compared to a full clone.

Blobless partial clones are particularly effective if you are using Git’s sparse-checkout feature to reduce the size of your working directory. The combination greatly reduces the number of blobs you need to do your work. Sparse-checkout does not reduce the required data transfer for shallow clones.
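
As a sketch of that combination (the repository URL and directory name are just examples):

# Blobless clone that starts with a sparse working directory
git clone --filter=blob:none --sparse https://github.com/example/monorepo.git
cd monorepo

# Only materialize the directories you actually work in
git sparse-checkout set src/my-component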

Notably, we did not test repositories that are significantly larger than the torvalds/linux repository. Such repositories are not really available in the open, but are becoming increasingly common for private repositories. If you feel your repository is not represented by our test repositories, then we recommend trying to replicate our experiments yourself.

As can be observed, our test process does not simulate a real-life situation where users have different workflows and work on different branches. Also, the set of Git commands analyzed in this study is small and is not representative of a user’s daily Git usage. We are continuing to study these options to get a holistic view of how they change the user experience.

Acknowledgements

A special thanks to Derrick Stolee and our Professional Services team for their efforts and sponsorship of this study!

Get up to speed with partial clone and shallow clone

Post Syndicated from Derrick Stolee original https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/

As your Git repositories grow, it becomes harder and harder for new developers to clone and start working on them. Git is designed as a distributed version control system. This means that you can work on your machine without needing a connection to a central server that controls how you interact with the repository. This is only fully realizable if you have all reachable data in your local repository.

What if there was a better way? Could you get started working in the repository without downloading every version of every file in the entire Git history? Git’s partial clone and shallow clone features are options that can help here, but they come with their own tradeoffs. Each option breaks at least one expectation from the normal distributed nature of Git, and you might not be willing to make those tradeoffs.

If you are working with an extremely large monorepo, then these tradeoffs are more likely to be worthwhile or even necessary to interact with Git at that scale!

Before digging in on this topic, be sure you are familiar with how Git stores your data, including commits, trees, and blob objects. I presented some of these ideas and other helpful tips at GitHub Universe in my talk, Optimize your monorepo experience.

Quick Summary

There are three ways to reduce clone sizes for repositories hosted by GitHub.

  • git clone --filter=blob:none <url> creates a blobless clone. These clones download all reachable commits and trees while fetching blobs on-demand. These clones are best for developers and build environments that span multiple builds.
  • git clone --filter=tree:0 <url> creates a treeless clone. These clones download all reachable commits while fetching trees and blobs on-demand. These clones are best for build environments where the repository will be deleted after a single build, but you still need access to commit history.
  • git clone --depth=1 <url> creates a shallow clone. These clones truncate the commit history to reduce the clone size. This creates some unexpected behavior issues, limiting which Git commands are possible. These clones also put undue stress on later fetches, so they are strongly discouraged for developer use. They are helpful for some build environments where the repository will be deleted after a single build.

Full clones

As we discuss the different clone types, we will use a common representation of Git objects:

  • Boxes are blobs. These represent file contents.
  • Triangles are trees. These represent directories.
  • Circles are commits. These are snapshots in time.

We use arrows to represent a relationship between objects. Basically, if an OID B appears inside a commit or tree A, then the object A has an arrow to the object B. If we can follow a list of arrows from an object A to another object C, then we say C is reachable from A. The process of following these arrows is sometimes referred to as walking objects.

We can now describe the data downloaded by a git clone command! The client asks the server for the latest commits, then the server provides those objects and every other reachable object. This includes every tree and blob in the entire commit history!

In this diagram, time moves from left to right. The arrows between a commit and its parents therefore go from right to left. Each commit has a single root tree. The root tree at the HEAD commit is fully expanded underneath, while the rest of the trees have arrows pointing towards these objects.

This diagram is purposefully simple, but if your repository is very large you will have many commits, trees, and blobs in your history. Likely, the historical data forms a majority of your data. Do you actually need all of it?

These days, many developers always have a network connection available as they work, so asking the server for a little more data when necessary might be an acceptable trade-off.

This is the critical design change presented by partial clone.

Partial clone

Git’s partial clone feature is enabled by specifying the --filter option in your git clone command. The full list of filter options exist in the git rev-list documentation, since you can use git rev-list --filter=<filter> --all to see which objects in your repository match the filter. There are several filters available, but the server can choose to deny your filter and revert to a full clone.
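
For example, you can estimate how much of the object walk a filter removes by comparing filtered and unfiltered listings. A small sketch:

# Every reachable object in the repository
git rev-list --objects --all | wc -l

# Objects that survive the blob:none filter (roughly what a blobless clone downloads up front)
git rev-list --objects --filter=blob:none --all | wc -l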

On github.com and GitHub Enterprise Server 2.22+, there are two options available:

  1. Blobless clones: git clone --filter=blob:none <url>
  2. Treeless clones: git clone --filter=tree:0 <url>

Let’s investigate each of these options.

Blobless clones

When using the --filter=blob:none option, the initial git clone will download all reachable commits and trees, and only download the blobs for commits when you do a git checkout. This includes the first checkout inside the git clone operation. The resulting object model is shown here:

The important thing to notice is that we have a copy of every blob at HEAD but the blobs in the history are not present. If your repository has a deep history full of large blobs, then this option can significantly reduce your git clone times. The commit and tree data is still present, so any subsequent git checkout only needs to download the missing blobs. The Git client knows how to batch these requests to ask the server only for the missing blobs.
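
If you are unsure whether an existing clone is a partial clone, recent Git versions record the filter in the repository configuration, so a quick check might look like this:

git config --get remote.origin.promisor             # "true" in a partial clone
git config --get remote.origin.partialclonefilter   # e.g. "blob:none"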

Further, when running git fetch in a blobless clone, the server only sends the new commits and trees. The new blobs are downloaded only after a git checkout. Note that git pull runs git fetch and then git merge, so it will download the necessary blobs during the git merge command.

When using a blobless clone, you will trigger a blob download whenever you need the contents of a file, but you will not need one if you only need the OID of a file. This means that git log can detect which commits changed a given path without needing to download extra data.

This means that blobless clones can perform commands like git merge-base, git log, or even git log -- <path> with the same performance as a full clone.

Commands like git diff or git blame <path> require the contents of the paths to compute diffs, so these will trigger blob downloads the first time they are run. However, the good news is that after that you will have those blobs in your repository and do not need to download them a second time. Most developers only need to run git blame on a small number of files, so this tradeoff of a slightly slower git blame command is worth the faster clone and fetch times.
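
If you are curious how many blobs are still missing from a blobless clone at any given moment, git rev-list can report them without downloading anything. A small sketch:

# Missing (not yet downloaded) objects are printed with a leading "?"
git rev-list --objects --all --missing=print | grep -c '^?'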

Blobless clones are the most widely-used partial clone option. I’ve been using them myself for months without issue.

Treeless clones

In some repositories, the tree data might be a significant portion of the history. Using --filter=tree:0, a treeless clone downloads all reachable commits, then downloads trees and blobs on demand. The resulting object model is shown here:

Note that we have all of the data at HEAD, but otherwise only have commit data. This means that the initial clone can be much faster in a treeless clone than in a blobless or full clone. Further, we can run git fetch to download only the latest commits. However, working in a treeless clone is more difficult because downloading a missing tree when needed is more expensive.

For example, a git checkout command changes the HEAD commit, usually to a commit where we do not have the root tree. The Git client then asks the server for that root tree by OID, but also for all reachable trees from that root tree. Currently, this request does not tell the server that the client already has some root trees, so the server might send many trees the client already has locally. After the trees are downloaded, the client can detect which blobs are missing and request those in a batch.

It is possible to work in a treeless clone without triggering too many requests for extra data, but it is much more restrictive than a blobless clone.

For example, history operations such as git merge-base or git log (without extra options) only use commit data. These will not trigger extra downloads.

However, if you run a file history request such as git log -- <path>, then a treeless clone will start downloading root trees for almost every commit in the history!

We strongly recommend that developers do not use treeless clones for their daily work. Treeless clones are really only helpful for automated builds when you want to quickly clone, compile a project, then throw away the repository. In environments like GitHub Actions using public runners, you want to minimize your clone time so you can spend your machine time actually building your software! Treeless clones might be an excellent option for those environments.

⚠ Warning: While writing this article, we were putting treeless clones to the test beyond the typical limits. We noticed that repositories that contain submodules behave very poorly with treeless clones. Specifically, if you run git fetch in a treeless clone, then the logic in Git that looks for changed submodules will trigger a tree request for every new commit! This behavior can be avoided by running git config fetch.recurseSubmodules false in your treeless clones. We are working on a more robust fix in the Git client.

Shallow clones

Partial clones are relatively new to Git, but there is an older feature that does something very similar to a treeless clone: shallow clones. Shallow clones use the --depth=<N> parameter in git clone to truncate the commit history. Typically, --depth=1 signifies that we only care about the most recent commits. Shallow clones are best combined with the --single-branch --branch=<branch> options as well, to ensure we only download the data for the commit we plan to use immediately.
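
Putting those options together, a typical throwaway build clone might look like the following sketch (the URL and branch name are placeholders):

git clone --depth=1 --single-branch --branch=main https://github.com/example/repo.git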

The object model for a shallow clone is shown in this diagram:

Here, the commit at HEAD exists, but its connection to its parents and the rest of the history is severed. The commits whose parents are removed are called shallow commits and together form the shallow boundary. The commit objects themselves have not changed, but there is some metadata in the client repository directing the Git client to ignore those parent connections. All trees and blobs are downloaded for any commit that exists on the client.

Since the commit history is truncated, commands such as git merge-base or git log show different results than they would in a full clone! In general, you cannot count on them to work as expected. Recall that these commands work as expected in partial clones. Even in blobless clones, commands like git blame -- <path> will work correctly, if only a little slower than in full clones. Shallow clones don’t even make that a possibility!

The other major difference is how git fetch behaves in a shallow clone. When fetching new commits, the server must provide every tree and blob that is “new” to these commits, relative to the shallow commits. This computation can be more expensive than a typical fetch, partly because the shallow boundary prevents the server from using performance features such as reachability bitmaps that a well-maintained server would otherwise rely on. Depending on how others are contributing to your remote repository, a git fetch operation in a shallow clone might end up downloading an almost-full commit history!

Here are some descriptions of things that can go wrong with shallow clones and that negate their supposed benefits. For these reasons, we do not recommend shallow clones except for builds that delete the repository immediately afterwards. Fetching from shallow clones can cause more harm than good!

Remember the “shallow boundary” mentioned earlier? The client sends that boundary to the server during a git fetch command, telling the server that it doesn’t have all of the reachable commits behind that boundary. The client then asks for the latest commits and everything reachable from those until hitting a shallow commit in the boundary. If another user starts a topic branch below that boundary and then the shallow client fetches that topic (or worse, the topic is merged into the default branch), then the server needs to walk the full history and serve the client what amounts to almost a full clone! Further, the server needs to calculate that data without the advantage of performance features like reachability bitmaps.
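
You can inspect this shallow boundary yourself in any shallow clone:

git rev-parse --is-shallow-repository   # prints "true" in a shallow clone
cat .git/shallow                        # the boundary commits, one OID per line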

Comparing Clone Options

Let’s recall each of our clone options. Instead of looking at them at the level of individual objects, let’s explore each category of object. The figures below group the data that is downloaded by each clone type. In addition to the data downloaded at clone time, let’s consider the situation where some time passes and then the client runs git fetch and then git checkout to move to a new commit. For each of these options, how much data is downloaded?

Full clones download all reachable objects. Typically, blobs are responsible for most of this data.

In a partial clone, some data is not served immediately and is delayed until the client needs it. Blobless clones skip blobs except those needed at checkout time. Treeless clones skip all trees in the history in favor of downloading a full copy of the trees needed for each checkout.

The remaining figures compare a blobless clone, a treeless clone, and the two shallow clone variants: git clone --depth=1 followed by a full git fetch, and git clone --depth=1 followed by git fetch --depth=1.

What do the numbers say?

Fellow GitHub engineer @solmazabbaspour designed and ran an experiment to compare these different clone options on a variety of open source repositories. She will post a blog post tomorrow giving full details and data for the experiment, but I’ll share the executive summary here. Here are some common themes we identified that could help you choose the right scenario for your own usage:

There are many different types of clones beyond the default full clone. If you truly need to have a distributed workflow and want all of the data in your local repository, then you should continue using full clones. If you are a developer focused on a single repository and your repository is reasonably-sized, the best approach is to do a full clone.

You might switch to a blobless partial clone if your repository is very large due to many large blobs, as that clone will help you get started more quickly. The trade-off is that some commands such as git checkout or git blame will require downloading new blob data when necessary.

In general, calculating a shallow fetch is computationally more expensive compared to a full fetch. Always use a full fetch instead of a shallow fetch both in fully and shallow cloned repositories.

In workflows such as CI builds where there is a need to do a single clone and delete the repository immediately, shallow clones are a good option. Shallow clones are the fastest way to get a copy of the working directory at the tip commit, with the additional cost that fetching from these repositories is much more expensive, so we do not recommend shallow clones for developers. If you need the commit history for your build, then a treeless partial clone might work better for you than a full clone.

In general, your mileage may vary. Now that you are armed with these different options and the object model behind them, you can go and play with these kinds of clones. You should also be aware of some pitfalls of these non-full clone options:

  • Shallow clones skip the commit history. This makes commands such as git log or git merge-base unavailable. Never fetch from a shallow clone!
  • Treeless clones contain commit history, but it is very expensive to download missing trees. Thus, git log (without a path) and git merge-base are available, but commands like git log -- <path> and git blame are extremely slow and not recommended in these clones.
  • Blobless clones contain all reachable commits and trees, so Git downloads blobs when it needs access to file contents. This means that commands like git log -- <path> are available but commands like git blame are a bit slower on their first run. However, this can be a great way to get started on a very large repository with a lot of old, large blobs.
  • Full clones work as expected. The only downside is the time required to download all of that data, plus the extra disk space for all those files.

Be sure to upgrade to the latest Git version so you have all the latest performance improvements!

Visualizing GitHub’s global community

Post Syndicated from Tal Safran original https://github.blog/2020-12-21-visualizing-githubs-global-community/

This is the second post in a series about how we built our new homepage.

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How We Illustrate at GitHub
  5. How we designed the homepage and wrote the narrative

In the first post, my teammate Tobias shared how we made the 3D globe come to life, with lots of nitty gritty details about Three.js, performance optimization, and delightful touches.

But there’s another side to the story—the data! We hope you enjoy the read. ✨

Data goals

When we kicked off the project, we knew that we didn’t want to make just another animated globe. We wanted the data to be interesting and engaging. We wanted it to be real, and most importantly, we wanted it to be live.

Luckily, the data was there.

The challenge then became designing a data service that addressed the following challenges:

  1. How do we query our massive volume of data?
  2. How do we show you the most interesting bits?
  3. How do we geocode user locations in a way that respects privacy?
  4. How do we expose the computed data back to the monolith?
  5. How do we not break GitHub? 😊

Let’s begin, shall we?

Querying GitHub

So, how hard could it be to show you some recent pull requests? It turns out it’s actually very simple:

class GlobeController < ApplicationController
  def data
    pull_requests = PullRequest
      .where(open: true)
      .joins(:repositories)
      .where("repository.is_open_source = true")
      .last(10_000)

    render json: pull_requests
  end
end

Just kidding 😛

Because of the volume of data generated on GitHub every day, the size of our databases, as well as the importance of keeping GitHub fast and reliable, we knew we couldn’t query our production databases directly.

Luckily, we have a data warehouse and a fantastic team that maintains it. Data from production is fetched, sanitized, and packaged nicely into the data warehouse on a regular schedule. The data can then be queried using Presto, a flavor of SQL meant for querying large sets of data.

We also wanted the data to be as fresh as possible. So instead of querying snapshots of our MySQL tables that are only copied over once a day, we were able to query data coming from our Apache Kafka event stream that makes it into the data warehouse much more regularly.

As an example, we have an event that is reported every time a pull request is merged. The event is defined in a format called protobuf, which stands for “protocol buffer.”

Here’s what the protobuf for a merged pull request event might look like:

message PullRequestMerge {
  github.v1.entities.User actor = 1;
  github.v1.entities.Repository repository = 2;
  github.v1.entities.User repository_owner = 3;
  github.v1.entities.PullRequest pull_request = 4;
  github.v1.entities.Issue issue = 5;
}

Each field corresponds to an “entity,” each of which is defined in its own protobuf file. Here’s a snippet from the definition of a pull request entity:

message PullRequest {
  uint64 id = 1;
  string global_relay_id = 2;
  uint64 author_id = 3;

  enum PullRequestState {
    UNKNOWN = 0;
    OPEN = 1;
    CLOSED = 2;
    MERGED = 3;
  }
  PullRequestState pull_request_state = 4;

  google.protobuf.Timestamp created_at = 5;
  google.protobuf.Timestamp updated_at = 6;
}

Including an entity in an event will pass along all of the attributes defined for it. All of that data gets copied into our data warehouse for every pull request that is merged.

This means that a Presto query for pull requests merged in the past day could look like:

SELECT
  pull_request.created_at,
  pull_request.updated_at,
  pull_request.id,
  issue.number,
  repository.id
FROM kafka.github.pull_request_merge
WHERE
  day >= CAST((CURRENT_DATE - INTERVAL '1' DAY) AS VARCHAR)

There are a few other queries we make to pull in all the data we need. But as you can see, this is pretty much standard SQL that pulls in merged pull requests from the last day in the event stream.

Surfacing interesting data

We wanted to make sure that whatever data we showed was interesting, engaging, and appropriate to be spotlighted on the GitHub homepage. If the data was good, visitors would be enticed to explore the vast ecosystem of open source being built on GitHub at that given moment. Maybe they’d even make a contribution!

So how do we find good data?

Luckily our data team came to the rescue yet again. A few years ago, the Data Science Team put together a model to rank the “health” of repositories based on 30-plus features weighted by importance. A healthy repository doesn’t necessarily mean having a lot of stars. It also takes into account how much current activity is happening and how easy it is to contribute to the project, to name a few.

The end result is a numerical health score that we can query against in the data warehouse.

SELECT repository_id
FROM data_science.github.repository_health_scores
WHERE 
  score > 0.75

Combining this query with the above, we can now pull in merged pull requests from repositories with health scores above a certain threshold:

WITH
healthy_repositories AS (
  SELECT repository_id
  FROM data_science.github.repository_health_scores
  WHERE 
    score > 0.75
)

SELECT
  a.pull_request.created_at,
  a.pull_request.updated_at,
  a.pull_request.id,
  a.issue.number,
  a.repository.id
FROM kafka.github.pull_request_merge a
JOIN healthy_repositories b
ON a.repository.id = b.repository_id
WHERE
  day >= CAST((CURRENT_DATE - INTERVAL '1' DAY) AS VARCHAR)

We do some other things to ensure the data is good, like filtering out accounts with spammy behavior. But repository health scores are definitely a key ingredient.

Geocoding user-provided locations

Your GitHub profile has an optional free text field for providing your location. Some people fill it out with their actual location (mine says “San Francisco”), while others use fake or funny locations (42 users have “Middle Earth” listed as theirs). Many others choose to not list a location. In fact, two-thirds of users don’t enter anything and that’s perfectly fine with us.

For users that do enter something, we try to map the text to a real location. This is a little harder to do than using IP addresses as proxies for locations, but it was important to us to only include data that users felt comfortable making public in the first place.

In order to map the free text locations to latitude and longitude pairs, we use Mapbox’s forward geocoding API and their Ruby SDK. Here’s an example of a forward geocoding of “New York City”:

MAPBOX_OPTIONS = {
  limit: 1,
  types: %w(region place country),
  language: "en"
}

Mapbox::Geocoder.geocode_forward("New York City", MAPBOX_OPTIONS)

=> [{
  "type" => "FeatureCollection",
  "query" => ["new", "york", "city"],
  "features" => [{
    "id" => "place.15278078705964500",
    "type" => "Feature",
    "place_type" => ["place"],
    "relevance" => 1,
    "properties" => {
      "wikidata" => "Q60"
    },
    "text_en" => "New York City",
    "language_en" => "en",
    "place_name_en" => "New York City, New York, United States",
    "text" => "New York City",
    "language" => "en",
    "place_name" => "New York City, New York, United States",
    "bbox" => [-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307],
    "center" => [-73.9808, 40.7648],
    "geometry" => {
      "type" => "Point", "coordinates" => [-73.9808, 40.7648]
    },
    "context" => [{
      "id" => "region.17349986251855570",
      "wikidata" => "Q1384",
      "short_code" => "US-NY",
      "text_en" => "New York",
      "language_en" => "en",
      "text" => "New York",
      "language" => "en"
    }, {
      "id" => "country.19678805456372290",
      "wikidata" => "Q30",
      "short_code" => "us",
      "text_en" => "United States",
      "language_en" => "en",
      "text" => "United States",
      "language" => "en"
    }]
  }],
  "attribution" => "NOTICE: (c) 2020 Mapbox and its suppliers. All rights reserved. Use of this data is subject to the Mapbox Terms of Service (https://www.mapbox.com/about/maps/). This response and the information it contains may not be retained. POI(s) provided by Foursquare."
}, {}]

There is a lot of data there, but let’s focus on text, relevance, and center for now. Here are those fields for “New York City”:

result = Mapbox::Geocoder.geocode_forward("New York City", MAPBOX_OPTIONS)
result[0]["features"][0].slice("text", "relevance", "center")

=> {"text"=>"New York City", "relevance"=>1, "center"=>[-73.9808, 40.7648]}

If you use the query string “NYC”, you get the exact same result:

result = Mapbox::Geocoder.geocode_forward("NYC", MAPBOX_OPTIONS)
result[0]["features"][0].slice("text", "relevance", "center")

=> {"text"=>"New York City", "relevance"=>1, "center"=>[-73.9808, 40.7648]}

Notice that the text is still “New York City” in this second example? That is because Mapbox is normalizing the results. We use the normalized text on the globe so viewers get a consistent experience. This also takes care of capitalization and misspellings.

The center field is an array containing the longitude and latitude of the location.

And finally, the relevance score is an indicator of Mapbox’s confidence in the results. A relevance score of 1 is the highest, but sometimes users enter locations that Mapbox is less sure about:

result = Mapbox::Geocoder.geocode_forward("Middle Earth", MAPBOX_OPTIONS)
result[0]["features"][0].slice(text", "relevance", "center")

=> {"text"=>"Earth City", "relevance"=>0.5, "center"=>[-90.4682, 38.7689]}

We discard anything with a score of less than 1, just to get confidence that the location we show feels correct.

Mapbox also provides a batch geocoding endpoint. This allows us to query multiple locations in one request:

MAPBOX_ENDPOINT = "mapbox.places-permanent"

query_string = "{San Francisco};{Berlin};{Dakar};{Tokyo};{Lima}"

Mapbox::Geocoder.geocode_forward(query_string, MAPBOX_OPTIONS, MAPBOX_ENDPOINT)

After we’ve geocoded and normalized all of the results, we create a JSON representation of the pull request and its locations so our globe JavaScript client knows how to parse it.

Here’s a pull request we recently featured that was opened in San Francisco and merged in Tokyo:

{
   "uml":"Tokyo",
   "gm":{
      "lat":35.68,
      "lon":139.77
   },
   "uol":"San Francisco",
   "gop":{
      "lat":37.7648,
      "lon":-122.463
   },
   "l":"JavaScript",
   "nwo":"mdn/browser-compat-data",
   "pr":7937,
   "ma":"2020-12-17 04:00:48.000",
   "oa":"2020-12-16 10:02:31.000"
}

We use short keys to shave off some bytes from the JSON we end up serving so the globe loads faster.

Airflow, HDFS, and Munger

We run our data warehouse queries and geocoding throughout the day to ensure that the data on the homepage is always fresh.

For scheduling this work, we use another system from Apache called Airflow. Airflow lets you run scheduled jobs that contain a sequence of tasks. Airflow calls these workflows Directed Acyclic Graphs (or DAGs for short), a term borrowed from graph theory in computer science. Basically, this means that tasks run one at a time: a task is scheduled, executed, and when it is done, the next task is scheduled and eventually executed. Tasks can pass along information to each other.

At a high level, our DAG executes the following tasks:

  1. Query the data warehouse.
  2. Geocode locations from the results.
  3. Write the results to a file.
  4. Expose the results to the GitHub Rails app.

We covered the first two steps earlier. For writing the file, we use HDFS, which is a distributed file system that’s part of the Apache Hadoop project. The file is then uploaded to Munger, an internal service we use to expose results from the data science pipeline back to the GitHub Rails app that powers github.com.

Here’s what this might look like in the Airflow UI:

Each column in that screenshot represents a full DAG run of all of the tasks. The last column with the light green circle at the top indicates that the DAG is in the middle of a run. It’s completed the build_home_page_globe_table task (represented by a dark green box) and now has the next task write_to_hdfs scheduled (dark blue box).

Our Airflow instance runs more than just this one DAG throughout the day, so we may stay in this state for some time before the scheduler is ready to pick up the write_to_hdfs task. Eventually the remaining tasks should run. If everything ends up running smoothly, we should see all green:

Wrapping up

Hope that gives you a glimpse into how we built this!

Again, thank you to all the teams that made the GitHub homepage and globe possible. This project would not have been possible without years of investment in our data infrastructure and data science capabilities, so a special shout out to Kim, Jeff, Preston, Ike, Scott, Jamison, Rowan, and Omoju.

More importantly, we could not have done it without you, the GitHub community, and your daily contributions and projects that truly bring the globe to life. Stay tuned—we have even more in store for this project coming soon.

In the meantime, I hope to see you on the homepage soon. 😉

How we built the GitHub globe

Post Syndicated from Tobias Ahlin original https://github.blog/2020-12-21-how-we-built-the-github-globe/

GitHub is where the world builds software. More than 56 million developers around the world build and work together on GitHub. With our new homepage, we wanted to show how open source development transcends the borders we’re living in and to tell our product story through the lens of a developer’s journey.

Now that it’s live, we would love to share how we built the homepage, directly from the voices of our designers and developers. In this five-part series, we’ll discuss:

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How our illustrators work with designers and engineers
  5. How we designed the homepage and wrote the narrative

At Satellite in 2019, our CEO Nat showed off a visualization of open source activity on GitHub over a 30-day span. The sheer volume and global reach was astonishing, and we knew we wanted to build on that story.

 

 The main goals we set out to achieve in the design and development of the globe were:

  • An interconnected community. We explored many different options, but ultimately landed on pull requests. It turned out to be a beautiful visualization of pull requests being opened in one part of the world and closed in another.
  • A showcase of real work happening now. We started by simply showing the pull requests’ arcs and spires, but quickly realized that we needed “proof of life.” The arcs could just as easily have been design animations instead of real work. We iterated on ways to provide more detail and found the most resonance with clear hover states that showed the pull request, repo, timestamp, language, and locations. Nat had the idea of making each line clickable, which really upleveled the experience and made it much more immersive. Read more here.
  • Attention to detail and performance. It was extremely important to us that the globe not only looked inspiring and beautiful, but that it performed well on all devices. We went through many, many iterations of refinement, and there’s still more work to be done.

Rendering the globe with WebGL

At the most fundamental level, the globe runs in a WebGL context powered by three.js. We feed it data of recent pull requests that have been created and merged around the world through a JSON file. The scene is made up of five layers: a halo, a globe, the Earth’s regions, blue spikes for open pull requests, and pink arcs for merged pull requests. We don’t use any textures: we point four lights at a sphere, use about 12,000 five-sided circles to render the Earth’s regions, and draw a halo with a simple custom shader on the backside of a sphere.

To draw the Earth’s regions, we start by defining the desired density of circles (this will vary depending on the performance of your machine—more on that later), and loop through longitudes and latitudes in a nested for-loop. We start at the south pole and go upwards, calculate the circumference for each latitude, and distribute circles evenly along that line, wrapping around the sphere:

for (let lat = -90; lat <= 90; lat += 180/rows) {
  const radius = Math.cos(Math.abs(lat) * DEG2RAD) * GLOBE_RADIUS;
  const circumference = radius * Math.PI * 2;
  const dotsForLat = circumference * dotDensity;
  for (let x = 0; x < dotsForLat; x++) {
    const long = -180 + x*360/dotsForLat;
    if (!this.visibilityForCoordinate(long, lat)) continue;

    // Setup and save circle matrix data
  }
}

To determine if a circle should be visible or not (is it water or land?) we load a small PNG containing a map of the world, parse its image data through canvas’s context.getImageData(), and map each circle to a pixel on the map through the visibilityForCoordinate(long, lat) method. If that pixel’s alpha is at least 90 (out of 255), we draw the circle; if not, we skip to the next one.

After collecting all the data we need to visualize the Earth’s regions through these small circles, we create an instance of CircleBufferGeometry and use an InstancedMesh to render all the geometry.

Making sure that you can see your own location

As you enter the new GitHub homepage, we want to make sure that you can see your own location as the globe appears, which means that we need to figure out where on Earth you are. We wanted to achieve this effect without delaying the first render behind an IP look-up, so we set the globe’s starting angle to center over Greenwich, look at your device’s timezone offset, and convert that offset to a rotation around the globe’s own axis (in radians):

const date = new Date();
const timeZoneOffset = date.getTimezoneOffset() || 0;
const timeZoneMaxOffset = 60*12;
rotationOffset.y = ROTATION_OFFSET.y + Math.PI * (timeZoneOffset / timeZoneMaxOffset);

It’s not an exact measurement of your location, but it’s quick, and does the job.

Visualizing pull requests

The main act of the globe is, of course, visualizing all of the pull requests that are being opened and merged around the world. The data engineering that makes this possible is a different topic in and of itself, and we’ll be sharing how we make that happen in an upcoming post. Here we want to give you an overview of how we’re visualizing all your pull requests.

 

Let’s focus on pull requests being merged (the pink arcs), as they are a bit more interesting. Every merged pull request entry comes with two locations: where it was opened, and where it was merged. We map these locations to our globe, and draw a bezier curve between these two locations:

const curve = new CubicBezierCurve3(startLocation, ctrl1, ctrl2, endLocation);

We have three different orbits for these curves, and the farther apart the two points are, the farther out into space we pull the arc. We then use instances of TubeBufferGeometry to generate geometry along these paths, so that we can use setDrawRange() to animate the lines as they appear and disappear.

As each line animates in and reaches its merge location, we generate and animate in one solid circle that stays put while the line is present, and one ring that scales up and immediately fades out. The ease out easings for these animations are created by multiplying a speed (here 0.06) with the difference between the target (1) and the current value (animated.dot.scale.x), and adding that to the existing scale value. In other words, for every frame we step 6% closer to the target, and as we’re coming closer to that target, the animation will naturally slow down.

// The solid circle
const scale = animated.dot.scale.x + (1 - animated.dot.scale.x) * 0.06;
animated.dot.scale.set(scale, scale, 1);

// The landing effect that fades out
const scaleUpFade = animated.dotFade.scale.x + (1 - animated.dotFade.scale.x) * 0.06;
animated.dotFade.scale.set(scaleUpFade, scaleUpFade, 1);
animated.dotFade.material.opacity = 1 - scaleUpFade;

Creative constraints from performance optimizations

The homepage and the globe need to perform well on a variety of devices and platforms, which early on created some creative restrictions for us and made us focus extensively on creating a well-optimized page. Although some modern computers and tablets could render the globe at 60 FPS with antialias turned on, that’s not the case for all devices, and we decided early on to leave antialias turned off and optimize for performance. This left us with a sharp and pixelated line running along the top left edge of the globe, as the globe’s highlighted edge met the darker color of the background:

This encouraged us to explore a halo effect that could hide that pixelated edge. We created one by using a custom shader to draw a gradient on the backside of a sphere that’s slightly larger than the globe, placed it behind the globe, and tilted it slightly on its side to emphasize the effect in the top left corner:

const halo = new Mesh(haloGeometry, haloMaterial);
halo.scale.multiplyScalar(1.15);
halo.rotateX(Math.PI*0.03);
halo.rotateY(Math.PI*0.03);
this.haloContainer.add(halo);

This smoothed out the sharp edge, while being a much more performant operation than turning on antialias. Unfortunately, leaving antialias off also produced a fairly prominent moiré effect as all the circles making up the world came closer and closer to each other as they neared the edges of the globe. We reduced this effect and simulated the look of a thicker atmosphere by using a fragment shader for the circles where each circle’s alpha is a function of its distance from the camera, fading out every individual circle as it moves further away:

if (gl_FragCoord.z > fadeThreshold) {
  gl_FragColor.a = 1.0 + (fadeThreshold - gl_FragCoord.z ) * alphaFallOff;
}

Improving perceived speed

We don’t know how quickly (or slowly) the globe is going to load on a particular device, but we wanted to make sure that the header composition on the homepage is always balanced, and that you got the impression that the globe loads quickly even if there’s a slight delay before we can render the first frame.

We created a bare version of the globe using only gradients in Figma and exported it as an SVG. Embedding this SVG in the HTML document adds little overhead, but makes sure that something is immediately visible as the page loads. As soon as we’re ready to render the first frame of the globe, we transition between the SVG and the canvas element by crossfading between the two while scaling both elements up, using the Web Animations API. Using the Web Animations API enables us to not touch the DOM at all during the transition, ensuring that it’s as stutter-free as possible.

const keyframesIn = [
      { opacity: 0, transform: 'scale(0.8)' },
      { opacity: 1, transform: 'scale(1)' }
    ];
const keyframesOut = [
      { opacity: 1, transform: 'scale(0.8)' },
      { opacity: 0, transform: 'scale(1)' }
    ];
const options = { fill: 'both', duration: 600, easing: 'ease' };

this.renderer.domElement.animate(keyframesIn, options);
const placeHolderAnim = placeholder.animate(keyframesOut, options);
placeHolderAnim.addEventListener('finish', () => {
  placeholder.remove();
});

Graceful degradation with quality tiers

We aim at maintaining 60 FPS while rendering an as beautiful globe as we can, but finding that balance is tricky—there are thousands of devices out there, all performing differently depending on the browser they’re running and their mood. We constantly monitor the achieved FPS, and if we fail to maintain 55.5 FPS over the last 50 frames we start to degrade the quality of the scene.

 

There are four quality tiers, and for every degradation we reduce the amount of expensive calculations. This includes reducing the pixel density, how often we raycast (figure out what your cursor is hovering over in the scene), and the amount of geometry that’s drawn on screen, which brings us back to the circles that make up the Earth’s regions. As we traverse down the quality tiers, we reduce the desired circle density and rebuild the Earth’s regions, here going from the original ~12,000 circles to ~8,000:

// Reduce pixel density to 1.5 (down from 2.0)
this.renderer.setPixelRatio(Math.min(AppProps.pixelRatio, 1.5));
// Reduce the amount of PRs visualized at any given time
this.indexIncrementSpeed = VISIBLE_INCREMENT_SPEED / 3 * 2;
// Raycast less often (wait for 4 additional frames)
this.raycastTrigger = RAYCAST_TRIGGER + 4;
// Draw less geometry for the Earth’s regions
this.worldDotDensity = WORLD_DOT_DENSITY * 0.65;
// Remove the world
this.resetWorldMap();
// Generate world anew from new settings
this.buildWorldGeometry();

A small part of a wide-ranging effort

These are some of the techniques that we use to render the globe, but the creation of the globe and the new homepage is part of a longer story, spanning multiple teams, disciplines, and departments, including design, brand, engineering, product, and communications. We’ll continue the deep-dive in this 5-part series, so come back soon or follow us on Twitter @GitHub for all the latest updates on this project and more.

Next up: how we collect and use the data behind the globe.

In the meantime, don’t miss out on the new GitHub globe wallpapers from the GitHub Illustration Team to enjoy the globe from your desktop or mobile device:


Love the new GitHub homepage or any of the work you see here? Join our team

Commits are snapshots, not diffs

Post Syndicated from Derrick Stolee original https://github.blog/2020-12-17-commits-are-snapshots-not-diffs/

Git has a reputation for being confusing. Users stumble over terminology and phrasing that misguides their expectations. This is most apparent in commands that “rewrite history” such as git cherry-pick or git rebase. In my experience, the root cause of this confusion is an interpretation of commits as diffs that can be shuffled around. However, commits are snapshots, not diffs!

I believe that Git becomes understandable if we peel back the curtain and look at how Git stores your repository data. After we investigate this model, we’ll explore how this new perspective helps us understand commands like git cherry-pick and git rebase.

If you want to go really deep, you should read the Git Internals chapter of the Pro Git book.

I’ll be using the git/git repository checked out at v2.29.2 as an example. Follow along with my command-line examples for extra practice.

Object IDs are hashes

The most important part to know about Git objects is that Git references each by its object ID (OID for short), providing a unique name for the object. We will use the git rev-parse <ref> command to discover these OIDs. Each object is essentially a plain-text file and we can examine its contents using the git cat-file -p <oid> command.

You might also be used to seeing OIDs given as a shorter hex string. This abbreviation only needs to be long enough that exactly one object in the repository has an OID matching it. If we request the type of an object using an abbreviated OID that is too short, then we will see the list of OIDs that match:

$ git cat-file -t e0c03
error: short SHA1 e0c03 is ambiguous
hint: The candidates are:
hint: e0c03f27484 commit 2016-10-26 - contrib/buildsystems: ignore irrelevant files in Generators/
hint: e0c03653e72 tree
hint: e0c03c3eecc blob
fatal: Not a valid object name e0c03

What are these types: blob, tree, and commit? Let’s start at the bottom and work our way up.

Blobs are file contents

At the bottom of the object model, blobs contain file contents. To discover the OID for a file at your current revision, run git rev-parse HEAD:<path>. Then, use git cat-file -p <oid> to find its contents.

$ git rev-parse HEAD:README.md
eb8115e6b04814f0c37146bbe3dbc35f3e8992e0

$ git cat-file -p eb8115e6b04814f0c37146bbe3dbc35f3e8992e0 | head -n 8
[![Build status](https://github.com/git/git/workflows/CI/PR/badge.png)](https://github.com/git/git/actions?query=branch%3Amaster+event%3Apush)

Git - fast, scalable, distributed revision control system
=========================================================

Git is a fast, scalable, distributed revision control system with an
unusually rich command set that provides both high-level operations
and full access to internals.

If I edit the README.md file on my disk, then git status notices that the file has a recent modified time and hashes the contents. If the contents don’t match the current OID at HEAD:README.md, then git status reports the file as “modified on disk.” In this way, we can see if the file contents in the current working directory match the expected contents at HEAD.

Trees are directory listings

Note that blobs contain file contents, but not the file names! The names come from Git’s representation of directories: trees. A tree is an ordered list of path entries, paired with object types, file modes, and the OID for the object at that path. Subdirectories are also represented as trees, so trees can point to other trees!

We will use diagrams to visualize how these objects are related. We use boxes for blobs and triangles for trees.

$ git rev-parse HEAD^{tree}
75130889f941eceb57c6ceb95c6f28dfc83b609c

$ git cat-file -p 75130889f941eceb57c6ceb95c6f28dfc83b609c  | head -n 15
100644 blob c2f5fe385af1bbc161f6c010bdcf0048ab6671ed    .cirrus.yml
100644 blob c592dda681fecfaa6bf64fb3f539eafaf4123ed8    .clang-format
100644 blob f9d819623d832113014dd5d5366e8ee44ac9666a    .editorconfig
100644 blob b08a1416d86012134f823fe51443f498f4911909    .gitattributes
040000 tree fbe854556a4ae3d5897e7b92a3eb8636bb08f031    .github
100644 blob 6232d339247fae5fdaeffed77ae0bbe4176ab2de    .gitignore
100644 blob cbeebdab7a5e2c6afec338c3534930f569c90f63    .gitmodules
100644 blob bde7aba756ea74c3af562874ab5c81a829e43c83    .mailmap
100644 blob 05f3e3f8d79117c1d32bf5e433d0fd49de93125c    .travis.yml
100644 blob 5ba86d68459e61f87dae1332c7f2402860b4280c    .tsan-suppressions
100644 blob fc4645d5c08bd005238fc72cfa709495d8722e6a    CODE_OF_CONDUCT.md
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING
040000 tree a58410edddbdd133cca6b3322bebe4fb37be93fa    Documentation
100755 blob ca6ccb49866c595c80718d167e40cfad1ee7f376    GIT-VERSION-GEN
100644 blob 9ba33e6a141a3906eb707dd11d1af4b0f8191a55    INSTALL

Trees provide names for each sub-item. Trees also include information such as Unix file permissions, object type (blob or tree), and OIDs for each entry. We cut the output to the top 15 entries, but we can use grep to discover that this tree has a README.md entry that points to our earlier blob OID.

$ git cat-file -p 75130889f941eceb57c6ceb95c6f28dfc83b609c | grep README.md
100644 blob eb8115e6b04814f0c37146bbe3dbc35f3e8992e0    README.md

Trees can point to blobs and other trees using these path entries. Keep in mind that those relationships are paired with path names, but we will not always show those names in our diagrams.

The tree itself doesn’t know where it exists within the repository; that is the role of the objects pointing to the tree. The tree referenced by <ref>^{tree} is a special tree: the root tree. This designation is based on a special link from your commits.

Commits are snapshots

A commit is a snapshot in time. Each commit contains a pointer to its root tree, representing the state of the working directory at that time. The commit has a list of parent commits corresponding to the previous snapshots. A commit with no parents is a root commit and a commit with multiple parents is a merge commit. Commits also contain metadata describing the snapshot such as author and committer (including name, email address, and date) and a commit message. The commit message is an opportunity for the commit author to describe the purpose of that commit with respect to the parents.

For example, the commit at v2.29.2 in the Git repository describes that release, and is authored and committed by the Git maintainer.

$ git rev-parse HEAD
898f80736c75878acc02dc55672317fcc0e0a5a6

/c/_git/git ((v2.29.2))
$ git cat-file -p 898f80736c75878acc02dc55672317fcc0e0a5a6
tree 75130889f941eceb57c6ceb95c6f28dfc83b609c
parent a94bce62b99be35f2ee2b4c98f97c222e7dd9d82
author Junio C Hamano <[email protected]> 1604006649 -0700
committer Junio C Hamano <[email protected]> 1604006649 -0700

Git 2.29.2

Signed-off-by: Junio C Hamano <[email protected]>

Looking a little farther in the history with git log, we can see a more descriptive commit message talking about the change between that commit and its parent.

$ git cat-file -p 16b0bb99eac5ebd02a5dcabdff2cfc390e9d92ef
tree d0e42501b1cf65395e91e22e74f75fc5caa0286e
parent 56706dba33f5d4457395c651cf1cd033c6c03c7a
author Jeff King <[email protected]> 1603436979 -0400
committer Junio C Hamano <[email protected]> 1603466719 -0700

am: fix broken email with --committer-date-is-author-date

Commit e8cbe2118a (am: stop exporting GIT_COMMITTER_DATE, 2020-08-17)
rewrote the code for setting the committer date to use fmt_ident(),
rather than setting an environment variable and letting commit_tree()
handle it. But it introduced two bugs:

- we use the author email string instead of the committer email

- when parsing the committer ident, we used the wrong variable to
compute the length of the email, resulting in it always being a
zero-length string

This commit fixes both, which causes our test of this option via the
rebase "apply" backend to now succeed.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

In our diagrams, we will use circles to represent commits. Notice the alliteration? Let’s review:

  • Boxes are blobs. These represent file contents.
  • Triangles are trees. These represent directories.
  • Circles are commits. These are snapshots in time.

Branches are pointers

In Git, we move around the history and make changes without referring to OIDs most of the time. This is because branches provide pointers to the commits we care about. A branch named main is actually a reference in Git called refs/heads/main. These files literally contain hex strings referencing the OID of a commit. As you work, these references change their contents to point to other commits.

This means branches are significantly different from our previous Git objects. Commits, trees, and blobs are immutable, meaning you can’t change their contents. If you change the contents, then you get a different hash and thus a new OID referring to the new object! Branches are named by users to provide meaning, such as trunk or my-special-project. We use branches to track and share work.

The special reference HEAD points to the current branch. When we add a commit to HEAD, it automatically updates that branch to the new commit.

We can create a new branch and update our HEAD using git switch -c:

$ git switch -c my-branch
Switched to a new branch 'my-branch'
$ cat .git/refs/heads/my-branch
1ec19b7757a1acb11332f06e8e812b505490afc6
$ cat .git/HEAD
ref: refs/heads/my-branch

Notice how creating my-branch created a file (.git/refs/heads/my-branch) containing the current commit OID and the .git/HEAD file was updated to point at this branch. Now, if we update HEAD by creating new commits, the branch my-branch will update to point to that new commit!

The big picture

Let’s put all of these new terms into one giant picture. Branches point to commits, commits point to other commits and their root trees, trees point to blobs and other trees, and blobs don’t point to anything. Here is a diagram containing all of our objects all at once:

In this diagram, time moves from left to right. The arrows between a commit and its parents go from right to left. Each commit has a single root tree. HEAD points to the main branch here, and main points to the most-recent commit. The root tree at this commit is fully expanded underneath, while the rest of the trees have arrows pointing towards these objects. The reason for that is that the same objects are reachable from multiple root trees! Since these trees reference those objects by their OID (their content) these snapshots do not need multiple copies of the same data. In this way, Git’s object model forms a Merkle tree.

When we view the object model in this way, we can see why commits are snapshots: they link directly to a full view of the expected working directory for that commit!

Computing diffs

Even though commits are snapshots, we frequently look at a commit in a history view or on GitHub as a diff. In fact, the commit message frequently refers to this diff. The diff is dynamically generated from the snapshot data by comparing the root trees of the commit and its parent. Git can compare any two snapshots in time, not just adjacent commits.

To compare two commits, start by looking at their root trees, which are almost always different. Then, perform a depth-first search on the subtrees by following pairs when paths for the current tree have different OIDs. In the example below, the root trees have different values for the docs tree, so we recurse into those two trees. Those trees have different values for M.md, so those two blobs are compared line-by-line and that diff is shown. Still within docs, N.md is the same, so that is skipped and we pop back to the root tree. The root tree then sees that the things directories have equal OIDs, as do the README.md entries.

In the diagram above, we notice that the things tree is never visited, and so none of its reachable objects are visited. This way, the cost of computing a diff is relative to the number of paths with different content.
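To make that traversal concrete, here is a minimal Ruby sketch of the same idea. It operates on plain nested hashes standing in for trees (an entry maps to an OID string for a blob, or to another hash for a subtree); the helper and the sample data are invented for illustration and are not Git’s actual implementation.

# Toy tree diff: walk two "root trees" and report the paths whose OIDs differ.
def diff_trees(old_tree, new_tree, prefix = "")
  changed = []
  (old_tree.keys | new_tree.keys).each do |name|
    old_entry = old_tree[name]
    new_entry = new_tree[name]
    path = prefix.empty? ? name : "#{prefix}/#{name}"
    next if old_entry == new_entry # equal OIDs (or identical subtrees): skip the whole subtree

    if old_entry.is_a?(Hash) && new_entry.is_a?(Hash)
      changed.concat(diff_trees(old_entry, new_entry, path)) # recurse into differing subtrees
    else
      changed << path # blob changed, added, or deleted: line-by-line diffing happens here
    end
  end
  changed
end

old_root = { "README.md" => "eb8115e", "docs" => { "M.md" => "aaa111", "N.md" => "bbb222" } }
new_root = { "README.md" => "eb8115e", "docs" => { "M.md" => "ccc333", "N.md" => "bbb222" } }
puts diff_trees(old_root, new_root) # prints "docs/M.md" only

Because unchanged subtrees share the same OID, they are skipped without being opened, which is exactly why the cost scales with the number of differing paths.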

Now we have the understanding that commits are snapshots and we can dynamically compute a diff between any two commits. Then why isn’t this common knowledge? Why do new users stumble over this idea that a commit is a diff?

One of my favorite analogies is to think of commits as having a wave/particle duality, where sometimes they are treated like snapshots and other times they are treated like diffs. The crux of the matter really comes down to a different kind of data that’s not actually a Git object: patches.

Wait, what’s a patch?

A patch is a text document that describes how to alter an existing codebase. Patches are how extremely-distributed groups can share code without using Git commits directly. You can see these being shuffled around on the Git mailing list.

A patch contains a description of the change and why it is valuable, followed by a diff. The idea is that someone could use that reasoning as a justification to apply that diff to their copy of the code.

Git can convert a commit into a patch using git format-patch. A patch can then be applied to a Git repository using git apply. This was the dominant way to share code in the early days of open source, but most projects have moved to sharing Git commits directly through pull requests.

The biggest issue with sharing patches is that the patch loses the parent information and the new commit has a parent equal to your existing HEAD. Moreover, you get a different commit even if you use the same parent as before due to the commit time, but also the committer changes! This is the fundamental reason why Git has both “author” and “committer” details in the commit object.

Another problem with using patches is that it is hard to apply a patch when your working directory does not match the sender’s previous commit. Losing the commit history makes it difficult to resolve conflicts.

This idea of “moving patches around” has transferred into several Git commands as “moving commits around.” Instead, what actually happens is that commit diffs are replayed, creating new commits.

If commits aren’t diffs, then what does git cherry-pick do?

The git cherry-pick <oid> command creates a new commit with an identical diff to <oid> whose parent is the current commit. Git is essentially following these steps:

  1. Compute the diff between the commit <oid> and its parent.
  2. Apply that diff to the current HEAD.
  3. Create a new commit whose root tree matches the new working directory and whose parent is the commit at HEAD.
  4. Move the ref at HEAD to that new commit.

After Git creates the new commit, the output of git log -1 -p HEAD should match the output of git log -1 -p <oid>.

It is important to recognize that we didn’t “move” the commit to be on top of our current HEAD, we created a new commit whose diff matches the old commit.

If commits aren’t diffs, then what does git rebase do?

The git rebase command presents itself as a way to move commits to have a new history. In its most basic form it is really just a series of git cherry-pick commands, replaying diffs on top of a different commit.

The most important thing is that git rebase <target> will discover the list of commits that are reachable from HEAD but not reachable from <target>. You can show these yourself using git log --oneline <target>..HEAD.

Then, the rebase command simply navigates to the <target> location and starts performing git cherry-pick commands on this commit range, starting from the oldest commits. At the end, we have a new set of commits with different OIDs but similar diffs to the original commit range.
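As a rough sketch, the basic (non-interactive) flow could be expressed in Ruby as a loop of cherry-picks. This simply shells out to git for illustration, skips details such as conflict handling and moving the original branch ref at the end, and is not how rebase is actually implemented:

# Conceptual sketch of `git rebase <target>` as a series of cherry-picks.
def toy_rebase(target)
  # Commits reachable from HEAD but not from <target>, oldest first.
  commits = `git rev-list --reverse #{target}..HEAD`.split
  system("git", "checkout", "--detach", target) or raise "checkout failed"
  commits.each do |oid|
    # Each cherry-pick replays the old commit's diff as a brand new commit.
    system("git", "cherry-pick", oid) or raise "cherry-pick of #{oid} failed"
  end
end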

For example, consider a sequence of three commits in the current HEAD since branching off of a target branch. When running git rebase target, the common base P is computed to determine the commit list A, B, and C. These are then cherry-picked on top of target in order to construct new commits A’, B’, and C’.

The commits A’, B’, and C’ are brand new commits that share a lot of information with A, B, and C, but are distinct new objects. In fact, the old commits still exist in your repository until garbage collection runs.

We can even inspect how these two commit ranges are different using the git range-diff command! I’ll use some example commits in the Git repository to rebase onto the v2.29.2 tag, then modify the tip commit slightly.

$ git checkout -f 8e86cf65816
$ git rebase v2.29.2
$ echo extra line >>README.md
$ git commit -a --amend -m "replaced commit message"
$ git range-diff v2.29.2 8e86cf65816 HEAD
1:  17e7dbbcbc = 1:  2aa8919906 sideband: avoid reporting incomplete sideband messages
2:  8e86cf6581 ! 2:  e08fff1d8b sideband: report unhandled incomplete sideband messages as bugs
    @@ Metadata
     Author: Johannes Schindelin <[email protected]>
     
      ## Commit message ##
    -    sideband: report unhandled incomplete sideband messages as bugs
    +    replaced commit message
     
    -    It was pretty tricky to verify that incomplete sideband messages are
    -    handled correctly by the `recv_sideband()`/`demultiplex_sideband()`
    -    code: they have to be flushed out at the end of the loop in
    -    `recv_sideband()`, but the actual flushing is done by the
    -    `demultiplex_sideband()` function (which therefore has to know somehow
    -    that the loop will be done after it returns).
    -
    -    To catch future bugs where incomplete sideband messages might not be
    -    shown by mistake, let's catch that condition and report a bug.
    -
    -    Signed-off-by: Johannes Schindelin <[email protected]>
    -    Signed-off-by: Junio C Hamano <[email protected]>
    + ## README.md ##
    +@@ README.md: and the name as (depending on your mood):
    + [Documentation/giteveryday.txt]: Documentation/giteveryday.txt
    + [Documentation/gitcvs-migration.txt]: Documentation/gitcvs-migration.txt
    + [Documentation/SubmittingPatches]: Documentation/SubmittingPatches
    ++extra line
     
      ## pkt-line.c ##
     @@ pkt-line.c: int recv_sideband(const char *me, int in_stream, int out)

Notice that the resulting range-diff claims that commits 17e7dbbcbc and 2aa8919906 are “equal”, which means they would generate the same patch. The second pair of commits are different, showing that the commit message changed and there is an edit to the README.md that was not in the original commit.

If you are following along, you can also see how the commit history still exists for these two commit sets. The new commits have the v2.29.2 tag as the third commit in the history while the old commits have the (earlier) v2.28.0 tag as the third commit.

$ git log --oneline -3 HEAD
e08fff1d8b2 (HEAD) replaced commit message
2aa89199065 sideband: avoid reporting incomplete sideband messages
898f80736c7 (tag: v2.29.2) Git 2.29.2

$ git log --oneline -3 8e86cf65816
8e86cf65816 sideband: report unhandled incomplete sideband messages as bugs
17e7dbbcbce sideband: avoid reporting incomplete sideband messages
47ae905ffb9 (tag: v2.28.0) Git 2.28

Since commits aren’t diffs, how does Git track renames?

If you were looking carefully at the object model, you might have noticed that Git never tracks changes between commits in the stored object data. You might have wondered “how does Git know a rename happened?”

Git doesn’t track renames. There is no data structure inside Git that stores a record that a rename happened between a commit and its parent. Instead, Git tries to detect renames during the dynamic diff calculation. There are two stages to this rename detection: exact renames and edit-renames.

After first computing a diff, Git inspects the internal model of that diff to discover which paths were added or deleted. Naturally, a file that was moved from one location to another would appear as a deletion from the first location and an add in the second. Git attempts to match these adds and deletes to create a set of inferred renames.

The first stage of this matching algorithm looks at the OIDs of the paths that were added and deleted and sees if any are exact matches. Such exact matches are paired together.

The second stage is the expensive part: how can we detect files that were renamed and edited? Git iterates through each added file and compares that file against each deleted file to compute a similarity score as a percentage of lines in common. By default, anything larger than 50% of lines in common counts as a potential edit-rename. The algorithm continues comparing these pairs until finding the maximum match.

Did you notice a problem? This algorithm runs A * D diffs, where A is the number of adds and D is the number of deletes. This is quadratic! To avoid extra-long rename computations, Git will skip this portion of detecting edit-renames if A + D is larger than an internal limit. You can modify this limit using the diff.renameLimit config option. You can also avoid the algorithm altogether by disabling the diff.renames config option.
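To illustrate the shape of that second stage, here is a small Ruby sketch that scores each deleted/added pair by the fraction of lines they share and pairs them when the score clears a 50% threshold. The inputs are plain strings of file content, and the scoring is deliberately naive; this is not Git’s actual algorithm.

# Fraction of lines shared between two file contents (very rough).
def similarity(old_content, new_content)
  old_lines = old_content.lines
  new_lines = new_content.lines
  return 0.0 if old_lines.empty? || new_lines.empty?

  remaining = new_lines.tally # multiset of lines in the added file
  common = old_lines.count do |line|
    next false unless remaining[line].to_i > 0
    remaining[line] -= 1
    true
  end
  common.to_f / [old_lines.size, new_lines.size].max
end

# Pair each deleted path with its best-matching added path, if similar enough.
def detect_edit_renames(deleted, added, threshold = 0.5)
  renames = {}
  deleted.each do |old_path, old_content|          # A * D comparisons: quadratic!
    best_path, best_score = added
      .map { |new_path, new_content| [new_path, similarity(old_content, new_content)] }
      .max_by { |_, score| score }
    renames[old_path] = best_path if best_score && best_score >= threshold
  end
  renames
end

The nested loop makes the quadratic cost obvious: every deleted file is compared against every added file, which is exactly why Git caps the work with diff.renameLimit.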

I’ve used my awareness of the Git rename detection in my own projects. For example, I forked VFS for Git to create the Scalar project and wanted to re-use a lot of the code but also change the file structure significantly. I wanted to be able to follow the history of these files into the versions in the VFS for Git codebase, so I constructed my refactor in two steps:

  1. Rename all of the files without changing the blobs.
  2. Replace strings to modify the blobs without changing filenames.

These two steps ensured that I can quickly use git log --follow -- <path> to see the history of a file across this rename.

$ git log --oneline --follow -- Scalar/CommandLine/ScalarVerb.cs
4183579d console: remove progress spinners from all commands
5910f26c ScalarVerb: extract Git version check
...
9f402b5a Re-insert some important instances of GVFS
90e8c1bd [REPLACE] Replace old name in all files
fb3a2a36 [RENAME] Rename all files
cedeeaa3 Remove dead GVFSLock and GitStatusCache code
a67ca851 Remove more dead hooks code
...

I abbreviated the output, but these last two commits don’t actually have a path corresponding to Scalar/CommandLine/ScalarVerb.cs; instead, Git is tracking the previous path GVFS/GVFS/CommandLine/GVFSVerb.cs because it recognized the exact-content rename from the commit fb3a2a36 [RENAME] Rename all files.

Won’t be fooled again!

You now know that commits are snapshots, not diffs! This understanding will help you navigate your experience working with Git.

Now you are armed with deep knowledge of the Git object model. You can use this knowledge to expand your skills in using Git commands or deciding on workflows for your team. In a future blog post, we will use this knowledge to learn about different Git clone options and how to reduce the data you need to get things done!

Reducing flaky builds by 18x

Post Syndicated from Jordan Raine original https://github.blog/2020-12-16-reducing-flaky-builds-by-18x/

Part of the Building GitHub blog series.

It’s four o’clock in the afternoon as you push the last tweak to your branch. Your teammate already reviewed and approved your pull request and now all that’s left is to wait for CI. But, fifteen minutes later, your commit goes red. Surprised and a bit annoyed because the last five commits were green, you take a closer look only to find a failing test unrelated to your changes. When you run the test locally, it passes.

Half an hour later, after running the test again and again locally without a single failure and retrying a still-red CI, you’re no closer to deploying your code. As the clock ticks past five, you retry CI once more. But this time, something is different: it’s green!

You deploy, merge your pull request, and, an hour later than expected, close your laptop for the day.

This is the cost of a flaky test. They’re puzzling, frustrating, and waste your time. And try as we might to stop them, they’re about as common as a developer who brews pour-over coffee every morning (read: very).

Bonus points if you can guess what caused the test to fail.

How far we’ve come

Earlier this year in our monolith, 1 in 11 commits had at least one red build caused by a flaky test, or about 9 percent of commits. If you were trying to deploy something with a handful of commits, there was a good chance you’d need to retry the build or spend time diagnosing a failure, even if your code was fine. This slowed us down.

Six weeks ago, after introducing a system to manage flaky tests, the percentage of commits with flaky builds dropped to less than half a percent, or 1 in 200 commits.

Chart showing number of commits with flakey builds month over month

This is an 18x improvement and the lowest rate of flaky builds since we began tracking flaky tests in 2016.

So, how does it work?

Text in red: the only person who is bothered by a flaky test is the person who wrote it

Say you just merged a test that would fail once every 1,000 builds. It didn’t fail during development or review but a few hours later, the test fails on your teammate’s branch.

When this failure occurs, the new system inspects it and finds it to be flaky. After ensuring the test passes when run against the same code, it keeps the build green. And it happens quickly: unless your teammate is watching closely, they won’t notice.

But your flaky test is still out there, failing on other branches. Every time it does, the system keeps track of where it happened and how it failed, each time learning more about the failure. If it continues to affect developers, the system identifies it as high impact: it’s time for a human to investigate. Using test failure history and git blame, the system finds the commit most likely to have introduced the problem—your commit—and assigns an issue to you.

Screenshot of GitHub’s internal CI tooling.

From there, you can see information about the failure, including what may have caused it, where it failed, and who else might be responsible. After some poking around, you find the problem and merge a fix.

The system noticed a problem, contained it, and delegated it to the right person. In short, the only person bothered by a flaky test is the person who wrote it.

# This ain't a blocker
def test_fails_one_in_a_thousand
  assert rand(1000).zero?
end

How it works

When we set out to build this new system, our intent wasn’t to fix every flaky test or to stop developers from introducing new flaky tests. Such goals, if not impossible, seemed impractical. Similar to telling a developer to never write another bug, it would be much more costly and only slightly less flaky. Rather, we set out to manage the inevitability of flaky tests.

Focusing on what matters

When inspecting the history of failures in our monolith, we found that about a quarter of our tests had flaky failures across three or more branches in the past two years. But as we filtered by occurrence, we learned that the flakiness was not evenly distributed: most flaky tests failed fewer than ten times, and only 0.4 percent of flaky tests failed 100 times or more.

Bar chart showing tests with flakey failures

This made one thing clear: not every flaky failure should be investigated.

Instead, by focusing on this top 0.4 percent, we could make the most of our time. So which tests were in this group?

Automating flake detection

We define a “flaky” test result as a test that exhibits both a passing and a failing result with the same code. –John Micco, Google

Since 2016, our CI has been able to detect whether a test failure is flaky using two complementary approaches:

  1. Same code, different results. Once a build finishes, CI checks for other builds run against the same code using the root git tree hash. If another build had different results—for example, a test failed on the first build but passed on the second—the test failure was marked as flaky. While this approach was accurate, it only worked if a build was retried. (A rough sketch of this approach follows the list.)
  2. Retry tests that fail. When a test failed, it was retried again later within the same build. This could be used on every build at minimal cost. If the test passed when rerun, it was marked as flaky. However, certain types of flaky tests couldn’t be detected with this approach, such as a time-based flaky test. (If your test failed because it was a leap year, rerunning it two minutes later won’t help.)
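A minimal Ruby sketch of the first approach might group build results by their root tree hash and flag any test with mixed outcomes. The Build struct and its fields here are hypothetical stand-ins, not our real data model:

# Toy "same code, different results" detection.
Build = Struct.new(:tree_hash, :results) # results: { "test name" => :pass or :fail }

def flaky_tests(builds)
  flaky = []
  builds.group_by(&:tree_hash).each_value do |same_code_builds|
    test_names = same_code_builds.flat_map { |b| b.results.keys }.uniq
    test_names.each do |test|
      outcomes = same_code_builds.map { |b| b.results[test] }.compact.uniq
      # A test that both passed and failed against identical code is flaky.
      flaky << test if outcomes.sort == [:fail, :pass]
    end
  end
  flaky.uniq
end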

Unfortunately, these approaches were only able to identify 25 percent of the flaky failures, counting on developers to find the rest. Before delegating flaky test detection to CI, we needed an approach that was as good or better than a person.

Making it better

We decided to iterate on the test retry approach by rerunning the test three times, each in a scenario targeting a common cause of flakiness.

  1. Retry in the same process. This retry attempts to replicate the same conditions in which the test failed: same Ruby VM, same database, same host. If the test passes under the same conditions, it is likely caused by randomness in the code or a race condition.
  2. Retry in the same process, shifted into the future. This retry attempts to replicate the same conditions with one exception: time. If the test passes when run in the future, as simulated by test helpers, it is likely caused by an incorrect assumption about time (e.g., “there are 28 days in February”).
  3. Retry on a different host. This attempts to run the same code in a completely separate environment: different Ruby VM, different database, different host. If the test passes under these conditions but fails in the other two retries, it is likely caused by test order-dependence or some other shared state.

Using this approach, we are able to automatically identify 90 percent of flaky failures.

Further, because we kept a history of how a test would fail, we were also able to estimate the cause of flakiness: chance, time-based, or order-dependent. When it came time to fix a test, this gave the developer a head start.
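Conceptually, that classification boils down to which of the three retries passed. Here is a hedged Ruby sketch; the argument names and labels are invented for illustration and are not our production code:

# Guess the likely cause of flakiness from the three retry outcomes.
# Each keyword argument is true if the corresponding retry passed.
def likely_cause(same_process:, future_time:, different_host:)
  if same_process
    :chance            # randomness or a race condition in the code itself
  elsif future_time
    :time_based        # an incorrect assumption about time
  elsif different_host
    :order_dependent   # shared state or test order-dependence
  else
    :unknown           # still failing everywhere: possibly not flaky at all
  end
end

likely_cause(same_process: false, future_time: true, different_host: false)
# => :time_based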

Measuring impact

Once we could accurately detect flaky failures, we needed a way to quantify impact to automate prioritization.

To do this, we used information tracked with every test failure: build, branch, author, commit, and more. Using this information, a flaky test is given an impact score based on how many times it has failed as well as how many branches, developers, and deploys were affected by it. The higher the score, the more important the flaky test.

Once the score exceeds a certain threshold, an issue is automatically opened and assigned to the people who most recently modified either the test files or associated code prior to the test becoming flaky. To help jog the memory of those assigned, a link to the commit that may have introduced the problem is added to the issue.
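As a simplified illustration, the scoring might weigh each dimension and compare the result against a threshold. The weights, fields, and threshold below are invented for the example and are not the actual formula:

# Hypothetical impact score for a flaky test based on its failure history.
FailureHistory = Struct.new(:failure_count, :branches, :developers, :deploys)

IMPACT_THRESHOLD = 100

def impact_score(history)
  history.failure_count * 1 +
    history.branches.uniq.size * 5 +
    history.developers.uniq.size * 10 +
    history.deploys.uniq.size * 20
end

def needs_human?(history)
  impact_score(history) >= IMPACT_THRESHOLD # above this, open and assign an issue
end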

Teams can also view flaky tests by impact, CODEOWNER, or suite, giving insight into the test suite and giving developers a TODO list for problem areas.

Closing

By reducing the number of flaky builds by 18x, the new system makes CI more trustworthy and red builds more meaningful. If your pull request has a failure, it’s a sign you need to change something, not a sign you should hit Rebuild. When it comes time to deploy, you can be sure that your build won’t go red late in the day because a test doesn’t take into account daylight saving time.

This keeps us moving, even when we make mistakes.

Encapsulating Ruby on Rails views

Post Syndicated from Joel Hawksley original https://github.blog/2020-12-15-encapsulating-ruby-on-rails-views/

With the recent release of version 6.1, Ruby on Rails now supports the rendering of objects that respond to render_in, a change we introduced to the framework. It may be small (the two pull requests were less than a dozen lines), but this change has enabled us to develop a framework for building encapsulated views called ViewComponent.

Why encapsulation matters

Unlike models and controllers, Rails views are not encapsulated. All Rails views in an application exist in a single execution context, meaning they can share state. This makes them hard to reason about and difficult to test, as they cannot be easily isolated.

The need for a new way of building views in our application emerged as the number of templates in the GitHub application grew into the thousands. We depended on a combination of presenters and partials with inline Ruby, tested by expensive integration tests that exercise the routing and controller layers in addition to the view layer.

Inspired by our experience building component-based UI with React, we set off to build a framework to bring these ideas to server-rendered Rails views.

Enter ViewComponent

We created the ViewComponent framework for building reusable, testable & encapsulated view components in Ruby on Rails. A ViewComponent is the combination of a Ruby file and a template file. For example:

test_component.rb

class TestComponent < ViewComponent::Base
  def initialize(title:)
    @title = title
  end
end

test_component.html.erb

<span title="<%= @title %>">
  <%= content %>
</span>

Which is rendered in a view:

<%= render(TestComponent.new(title: "my title")) do %>
  Hello, World!
<% end %>

Returning:

<span title="my title">Hello, World!</span>

Unlike traditional Rails views, ViewComponent templates are executed within the context of the ViewComponent object, encapsulating their state and allowing them to be unit tested in isolation.

For example, to test our TestComponent, we can write:

require "view_component/test_case"

class MyComponentTest < ViewComponent::TestCase
  def test_render_component
    render_inline(TestComponent.new(title: "my title")) { "Hello, World!" }

    assert_selector("span[title='my title']", text: "Hello, World!")
    # or, to just assert against the text:
    assert_text("Hello, World!")
  end
end

These kinds of unit tests enable us to test our view code directly, instead of via controller tests. They are also significantly faster: in the GitHub codebase, component unit tests take around 25 milliseconds each, compared to about six seconds for controller tests.

In practice

Over the past two years, we’ve made significant strides in using ViewComponent: we now have over 400 components used in over 1600 of our 4500+ templates. For example, every Counter and Blankslate on GitHub.com is rendered with a ViewComponent.

We’re seeing several significant benefits from this architecture. Because ViewComponents can be unit tested against their rendered DOM, we’ve been able to reduce duplication of test coverage for shared templates, which we previously covered with controller tests. And since ViewComponent tests are so fast, we’re finding ourselves writing more of them, leading to higher confidence in our view code.

We’ve also seen the positive impact of the consistency that component-driven view architecture can provide. When we implemented a ViewComponent for the status of pull requests, we discovered several locations where we had not updated our previous, copy-pasted implementation to handle the then-recently-shipped draft pull request status. By standardizing on a source of truth for the UI pattern, we now have a single, consistent implementation.

To see some of the ViewComponents we’re using in the GitHub application, check out the Primer ViewComponents library.

Support for 3rd-party component frameworks such as ViewComponent is just one of many contributions GitHub engineers made to Rails 6.1. Other notable additions include support for horizontal sharding, strict loading, and template annotations.

Life of a Netflix Partner Engineer — The case of extra 40 ms

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/life-of-a-netflix-partner-engineer-the-case-of-extra-40-ms-b4c2dd278513

Life of a Netflix Partner Engineer — The case of the extra 40 ms

By: John Blair, Netflix Partner Engineering

The Netflix application runs on hundreds of smart TVs, streaming sticks and pay TV set top boxes. The role of a Partner Engineer at Netflix is to help device manufacturers launch the Netflix application on their devices. In this article we talk about one particularly difficult issue that blocked the launch of a device in Europe.

The mystery begins

Towards the end of 2017, I was on a conference call to discuss an issue with the Netflix application on a new set top box. The box was a new Android TV device with 4k playback, based on Android Open Source Project (AOSP) version 5.0, aka “Lollipop”. I had been at Netflix for a few years, and had shipped multiple devices, but this was my first Android TV device.

All four players involved in the device were on the call: there was the large European pay TV company (the operator) launching the device, the contractor integrating the set-top-box firmware (the integrator), the system-on-a-chip provider (the chip vendor), and myself (Netflix).

The integrator and Netflix had already completed the rigorous Netflix certification process, but during the TV operator’s internal trial an executive at the company reported a serious issue: Netflix playback on his device was “stuttering,” i.e., video would play for a very short time, then pause, then start again, then pause. It didn’t happen all the time, but would reliably start to happen within a few days of powering on the box. They supplied a video and it looked terrible.

The device integrator had found a way to reproduce the problem: repeatedly start Netflix, start playback, then return to the device UI. They supplied a script to automate the process. Sometimes it took as long as five minutes, but the script would always reliably reproduce the bug.

Meanwhile, a field engineer for the chip vendor had diagnosed the root cause: Netflix’s Android TV application, called Ninja, was not delivering audio data quickly enough. The stuttering was caused by buffer starvation in the device audio pipeline. Playback stopped when the decoder waited for Ninja to deliver more of the audio stream, then resumed once more data arrived. The integrator, the chip vendor and the operator all thought the issue was identified and their message to me was clear: Netflix, you have a bug in your application, and you need to fix it. I could hear the stress in the voices from the operator. Their device was late and running over budget and they expected results from me.

The investigation

I was skeptical. The same Ninja application runs on millions of Android TV devices, including smart TVs and other set top boxes. If there was a bug in Ninja, why is it only happening on this device?

I started by reproducing the issue myself using the script provided by the integrator. I contacted my counterpart at the chip vendor and asked if he’d seen anything like this before (he hadn’t). Next I started reading the Ninja source code. I wanted to find the precise code that delivers the audio data. I recognized a lot, but I started to lose the plot in the playback code and I needed help.

I walked upstairs and found the engineer who wrote the audio and video pipeline in Ninja, and he gave me a guided tour of the code. I spent some quality time with the source code myself to understand its working parts, adding my own logging to confirm my understanding. The Netflix application is complex, but at its simplest it streams data from a Netflix server, buffers several seconds worth of video and audio data on the device, then delivers video and audio frames one-at-a-time to the device’s playback hardware.

A diagram showing content downloaded to a device into a streaming buffer, then copied into the device decode buffer.
Figure 1: Device Playback Pipeline (simplified)

Let’s take a moment to talk about the audio/video pipeline in the Netflix application. Everything up until the “decoder buffer” is the same on every set top box and smart TV, but moving the A/V data into the device’s decoder buffer is a device-specific routine running in its own thread. This routine’s job is to keep the decoder buffer full by calling a Netflix provided API which provides the next frame of audio or video data. In Ninja, this job is performed by an Android Thread. There is a simple state machine and some logic to handle different play states, but under normal playback the thread copies one frame of data into the Android playback API, then tells the thread scheduler to wait 15 ms and invoke the handler again. When you create an Android thread, you can request that the thread be run repeatedly, as if in a loop, but it is the Android Thread scheduler that calls the handler, not your own application.

To play a 60fps video, the highest frame rate available in the Netflix catalog, the device must render a new frame every 16.66 ms, so checking for a new sample every 15ms is just fast enough to stay ahead of any video stream Netflix can provide. Because the integrator had identified the audio stream as the problem, I zeroed in on the specific thread handler that was delivering audio samples to the Android audio service.

I wanted to answer this question: where is the extra time? I assumed some function invoked by the handler would be the culprit, so I sprinkled log messages throughout the handler, expecting the guilty code to reveal itself. It soon became apparent that nothing in the handler was misbehaving: the handler was running in a few milliseconds even when playback was stuttering.

Aha, Insight

In the end, I focused on three numbers: the rate of data transfer, the time when the handler was invoked and the time when the handler passed control back to Android. I wrote a script to parse the log output, and made the graph below which gave me the answer.

A graph showing time spent in the thread handler and audio data throughput.
Figure 2: Visualizing Audio Throughput and Thread Handler Timing

The orange line is the rate that data moved from the streaming buffer into the Android audio system, in bytes/millisecond. You can see three distinct behaviors in this chart:

  1. The two tall, spiky parts where the data rate reaches 500 bytes/ms. This phase is buffering, before playback starts. The handler is copying data as fast as it can.
  2. The region in the middle is normal playback. Audio data is moved at about 45 bytes/ms.
  3. The stuttering region is on the right, when audio data is moving at closer to 10 bytes/ms. This is not fast enough to maintain playback.

The unavoidable conclusion: the orange line confirms what the chip vendor’s engineer reported: Ninja is not delivering audio data quickly enough.

To understand why, let’s see what story the yellow and grey lines tell.

The yellow line shows the time spent in the handler routine itself, calculated from timestamps recorded at the top and the bottom of the handler. In both normal and stutter playback regions, the time spent in the handler was the same: about 2 ms. The spikes show instances when the runtime was slower due to time spent on other tasks on the device.

The real root cause

The grey line, the time between calls invoking the handler, tells a different story. In the normal playback case you can see the handler is invoked about every 15 ms. In the stutter case, on the right, the handler is invoked approximately every 55 ms. There are an extra 40 ms between invocations, and there’s no way that can keep up with playback. But why?

I reported my discovery to the integrator and the chip vendor (look, it’s the Android Thread scheduler!), but they continued to push back on the Netflix behavior. Why don’t you just copy more data each time the handler is called? This was a fair criticism, but changing this behavior involved deeper changes than I was prepared to make, and I continued my search for the root cause. I dove into the Android source code, and learned that Android Threads are a userspace construct, and the thread scheduler uses the epoll() system call for timing. I knew epoll() performance isn’t guaranteed, so I suspected something was affecting epoll() in a systematic way.

At this point I was saved by another engineer at the chip supplier, who discovered a bug that had already been fixed in the next version of Android, named Marshmallow. The Android thread scheduler changes the behavior of threads depending whether or not an application is running in the foreground or the background. Threads in the background are assigned an extra 40 ms (40000000 ns) of wait time.

A bug deep in the plumbing of Android itself meant this extra timer value was retained when the thread moved to the foreground. Usually the audio handler thread was created while the application was in the foreground, but sometimes the thread was created a little sooner, while Ninja was still in the background. When this happened, playback would stutter.

Lessons learned

This wasn’t the last bug we fixed on this platform, but it was the hardest to track down. It was outside of the Netflix application, in a part of the system that was outside of the playback pipeline, and all of the initial data pointed to a bug in the Netflix application itself.

This story really exemplifies an aspect of my job I love: I can’t predict all of the issues that our partners will throw at me, and I know that to fix them I have to understand multiple systems, work with great colleagues, and constantly push myself to learn more. What I do has a direct impact on real people and their enjoyment of a great product. I know when people enjoy Netflix in their living room, I’m an essential part of the team that made it happen.


Life of a Netflix Partner Engineer — The case of extra 40 ms was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The evolving role of operations in DevOps

Post Syndicated from Jared Murrell original https://github.blog/2020-12-03-the-evolving-role-of-operations-in-devops/

This is the third blog post in our series of DevOps fundamentals. For a quick intro on what DevOps is, check out part one; for a primer on automation in DevOps, visit part two.

As businesses reorganize for DevOps, the responsibilities of teams throughout the software lifecycle inevitably shift. Operations teams that traditionally measure themselves on uptime and stability—often working in silos separate from business and development teams—become collaborators with new stakeholders throughout the software lifecycle. Development and operations teams begin to work closely together to build and continually improve their delivery and management processes. In this blog post, we’ll share more on what these evolving roles and responsibilities look like for IT teams today, and how operations help drive consistency and success across the entire organization.

The Ops role in DevOps compared to traditional IT operations

To better understand how DevOps changes the responsibilities of operations teams, it will help to recap the traditional, pre-DevOps role of operations. Let’s take a look at a typical organization’s software lifecycle: before DevOps, developers package an application with documentation, and then ship it to a QA team. The QA teams install and test the application, and then hand off to production operations teams. The operations teams are then responsible for deploying and managing the software with little-to-no direct interaction with the development teams.

These dev-to-ops handoffs are typically one-way, often limited to a few scheduled times in an application’s release cycle. Once in production, the operations team is then responsible for managing the service’s stability and uptime, as well as the infrastructure that hosts the code. If there are bugs in the code, the virtual assembly line of dev-to-qa-to-prod is revisited with a patch, with each team waiting on the other for next steps. This model typically requires pre-existing infrastructure that needs to be maintained, and comes with significant overhead. While many businesses continue to remain competitive with this model, the faster, more collaborative way of bridging the gap between development and operations is finding wide adoption in the form of DevOps.

Accelerating through public cloud adoption

Over the past decade, the maturation of the public cloud has added complexity to the responsibilities of operations teams. The ability to rent stable, secure infrastructure by the minute and provide everything as a service to customers has enabled organizations to deploy rapidly and frequently, often several times per day. Smaller, faster delivery cycles give organizations the critical capability of improving their customer experience through rapid feedback and automated deployments. Cloud technologies have made development velocity a fundamental part of delivering a competitive customer experience.

What the cloud, DevOps, and developer velocity mean for operations teams

Cloud technologies have transformed how we deliver and operate software, impacting how we do DevOps today. Developers now focus more on stability and uptime in addition to developer velocity, and operations teams now have a stake in developer velocity along with their traditional role of maintaining uptime. When it comes to the specific role of operations in DevOps, this often means:

  • Enabling self-service for developers. In order to support developer velocity—and minimize risks that stem from “shadow operations”, where developers seek their own solutions—operations teams work more closely with developers to provide on-demand access to secure, compliant tooling and environments.
  • Standardizing tooling and processes across the business. The best way to enable a sustainable self-service model and empower teams to work more efficiently together is by standardizing on the tooling that is in use. Tools and processes that are shared across the business unit enable organizational unity and greater collaboration. In turn, this reduces the friction developers and operations teams experience when sharing responsibilities.
  • Bringing extensible automation to traditional operations tasks. As operations teams focus more on empowering other teams through self-service and collaboration, there is less time to handle other work. Traditional operations tasks like resolving incidents, updating systems, or scaling infrastructure still need to be addressed—only smarter. When development and operations unite under DevOps, operations teams turn to automation for more of the repeatable tasks and drive consistency across the organization. This also enables teams and business units to track and measure the results of their efforts.
  • Working and shipping like developers. As operations teams shift more towards greater automation, ‘X’ as code becomes the new normal. Like application source code, the code controlling operations systems needs to be stored, versioned, secured, and maintained. As a result, the development-operations relationship starts to feel more balanced on both sides: operations specialists become more like the developers and more familiar with their working models, and in some organizations, developers become more like operations, sharing in the responsibility of debugging problems with their own code in production.

Closing the development-operations gap

While it’s well understood that DevOps requires close collaboration between teams, we’re often asked “How are development and operations functions really coordinated in a DevOps model?” At GitHub, we’re fortunate to partner with thousands of businesses every year on improving their DevOps practices. Sometimes these organizations focus on the clearest target, asking developers and delivery teams to go to market faster while paying less attention to the post-deployment operations teams.

However, we find the best results come through improving the practices of all the teams involved in the software lifecycle, together. Operations teams aren’t simply infrastructure and process owners for the organizations, but are also a critical part of the feedback loop for development. Try it out for yourself—a small pilot project that includes developers, release engineering, operations, and even InfoSec can give more teams the momentum they need. It can give them confidence to continue their work, establish best practices, and even train others within your organization along the way.

For a closer look at IT operations in DevOps, tune in to next week’s GitHub Universe session: Continuous delivery with GitHub Actions

Improving the GHES release process: release candidates

Post Syndicated from Maya Ross original https://github.blog/2020-12-03-improving-the-ghes-release-process-release-candidates/

In our ongoing “Building GitHub” series, we talk about some of the projects we’re working on to improve how efficiently we build GitHub, as well as increase GitHub’s availability, stability, and resilience. We know how important the stability of our platform is for developers and enterprises, and it continues to be a priority area of investment across GitHub.

In that spirit, we want to share a change in how we make new feature releases available to our GitHub Enterprise Server customers. This change will take effect with our next release, and we hope this increases our collaboration with our GHES customers and improves our release process.

What are Release Candidates?

Release candidates, or RCs, are builds that allow our GitHub Enterprise Server customers to try the latest release early. These RCs are a way for us to work with our customers on bugs and issues that will be used to improve the quality of every release.

Working in the open like this is the best way for us to collaborate with our customers to improve GitHub Enterprise Server and ensure that we are delivering a product that meets and (hopefully) exceeds expectations.

The Release Candidate Process

What can I expect with this new process?

Customers can start testing an RC as soon as it’s available, and release notes will accompany each RC. We expect each feature release to have one or more RC versions (e.g., 2.22.0.RC1, 2.22.0.RC2), with each new version adding bug fixes for issues found in prior versions. The number of RCs will be driven by customer feedback, and based on quality and that feedback we’ll decide when to publish a final production release and make it generally available.

RCs can be upgraded from any version and can upgrade to any version. They should be deployed on test or staging environments.

Customers that test RCs can raise issues with GitHub Support. Each RC is supported while live, but is not included in long-term support.

What does this mean for other releases?

  • Production releases will continue to be numbered as they are today (2.20, 2.21, etc.)
  • Patch releases will not be released as RCs

With this new RC process, testing and feedback from our customers will be critical. We’re confident this will help us improve GitHub Enterprise Server, together. We’ll have more to share about upcoming RCs at GitHub Universe next week. Make sure you tune in!

How Grab is Blazing Through the Super App Bazel Migration

Post Syndicated from Grab Tech original https://engineering.grab.com/how-grab-is-blazing-through-the-super-app-bazel-migration

Introduction

At Grab, we build a seamless user experience that addresses more and more of the daily lifestyle needs of people across South East Asia. We’re proud of our Grab rides, payments, and delivery services, and want to provide a unified experience across these offerings.

Here are a couple of examples of what Grab does for millions of people across South East Asia every day:

Grab Service Offerings

The Grab Passenger application reached super app status more than a year ago and continues to provide hundreds of life-changing use cases in dozens of areas for millions of users.

With a product of this scale come even bigger technical challenges. Here are a couple of dimensions that can give you a sense of the scale we’re working with.

Engineering and product structure

Technical and product teams work in close collaboration to outserve our customers. These teams are combined into dedicated groups to form Tech Families and focus on similar use cases and areas.

Grab consists of many Tech Families who work on food, payments, transport, and other services, which are supported by hundreds of engineers. The diverse landscape makes the development process complicated and requires the industry’s best practices and approaches.

Codebase scale overview

The Passenger Applications (Android and iOS) contain more than 2.5 million lines of code each and it keeps growing. We have 1000+ modules in the Android App and 700+ targets in the iOS App. Hundreds of commits are merged by all the mobile engineers on a daily basis.

To maintain the health of the codebase and product stability, we run 40K+ unit tests on Android and 30K+ unit tests on iOS, as well as thousands of UI tests and hundreds of end-to-end tests on both platforms.

Build time challenges

The described complexity and scale do not come without challenges. A huge codebase pushes the build process to the extreme, challenging the efficiency of the build systems and hardware used to compile the super app and creating out-of-the-ordinary challenges to be addressed.

Local build time

Local build time (the build on an engineer’s laptop) is one of the most obvious challenges. More code goes into the application binary, hence the build system requires more time to compile it.

Android (ADR) local build time

The Android ecosystem provides a great out-of-the-box tool to build your project called Gradle. It’s flexible and user-friendly, and provides huge capabilities for a reasonable cost. But is this always true? It appears not to be the case, due to multiple reasons. Let’s unpack these reasons below.

Gradle performs well for medium-sized projects with, say, 1 million lines of code. Once the code surpasses that 1 million mark (or so), Gradle starts failing to give engineers a reasonable build time for the flexibility it provides. And that’s exactly what we have observed in our Android application.

At some point in time, the Android local build became ridiculously long. We even encountered cases where engineers’ laptops simply failed to build the project due to hardware resource limits. Clean builds took hours, and incremental builds easily hit dozens of minutes.

iOS local build time

Xcode behaved a bit better compared to Gradle. The Xcode build cache was somewhat bearable for incremental builds and didn’t exceed a couple of minutes. Clean builds still took dozens of minutes though. When Xcode failed to provide a valid cache, engineers had to rerun everything as a clean build, which killed the experience entirely.

CI pipeline time

Each time an engineer submits a Merge Request (MR), our CI kicks in, running a wide variety of jobs to ensure the commit is valid and doesn’t introduce regressions to the master branch. The feedback loop time is critical here as well, and the pipeline time tends to skyrocket alongside codebase growth. We found ourselves in a situation where the feedback loop took hours, which again broke the engineering experience and prevented us from delivering the world’s best features to our customers.

As mentioned, we have a large number of unit tests (30K-40K+) and UI tests (700+) that we run on a pre-merge pipeline. This adds up to hours of execution time before we could actually allow MRs to land on the master branch.

The number of daily commits, which is in the hundreds, adds another stone to the basket of challenges.

All this clearly indicated an area for improvement: we were missing out on engineering productivity.

The extra mile

The biggest question for us to answer was how to put all this scale into a reasonable experience with minimal engineering idle time and fast feedback loop.

Build time critical path optimization

The most reasonable thing to do was to pay attention to the utilization of the hardware resources and make the build process optimal.

This literally boiled down to the simplest approach:

  1. Decouple building blocks
  2. Make building blocks as small as possible

This approach is valid for any build system and applies to both iOS and Android. The first thing we focused on was understanding what our build graph looked like, how dependencies were distributed, and which blocks were bottlenecks.

Given the scale of the apps, it’s practically impossible to manage a dependency tree manually, so we created a tool to help us.

Critical path overview

We introduced the Critical Path concept:

The critical path is the longest (time) chain of sequential dependencies, which must be built one after the other.

Critical Path
Critical Path build

Even with an infinite number of parallel processors/cores, the total build time cannot be less than the critical path time.

We implemented the tool that parsed the dependency trees (for both Android and iOS), aggregated modules/target build time, and calculated the critical path.
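
For illustration, here is a minimal sketch of how such a critical path calculation can work, assuming the per-module build times and dependency edges have already been extracted from the build graph. The module names and timings below are hypothetical, not Grab’s actual data:

package main

import "fmt"

// Per-module build times (seconds) and dependency edges, assumed to be
// parsed from the real build graph beforehand; the values here are made up.
var buildTime = map[string]float64{
    "app": 120, "payments": 300, "transport": 250, "network": 90, "core": 180,
}

var deps = map[string][]string{
    "app":       {"payments", "transport"},
    "payments":  {"network", "core"},
    "transport": {"core"},
    "network":   {},
    "core":      {},
}

// criticalPath returns the longest chain of sequential build times needed
// before (and including) module m is built.
func criticalPath(m string, memo map[string]float64) float64 {
    if v, ok := memo[m]; ok {
        return v
    }
    longestDep := 0.0
    for _, d := range deps[m] {
        if t := criticalPath(d, memo); t > longestDep {
            longestDep = t
        }
    }
    memo[m] = buildTime[m] + longestDep
    return memo[m]
}

func main() {
    fmt.Printf("critical path: %.0fs\n", criticalPath("app", map[string]float64{}))
}

Running this kind of calculation over the real dependency tree shows which chain of modules bounds the total build and, therefore, which modules are worth splitting or decoupling first.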

The concept of the critical path introduced a number of action items, which we prioritized:

  • The critical path must be as short as possible.
  • Any huge module/target on the critical path must be split into smaller modules/targets.
  • Depend on interfaces/bridges rather than implementations to shorten the critical path.
  • The presence of other teams’ implementation modules/targets in the critical path of the given team is a red flag.
Stack representation of the Critical Path build time

Project’s scale factor

To implement the conceptually easy action items, we ran a Grab-wide program. The program has impacted almost every mobile team at Grab and involved 200+ engineers to some degree. The whole implementation took 6 months to complete.

During this period of time, we assigned engineers who were responsible to review the changes, provide support to the engineers across Grab, and monitor the results.

Results

Even though the overall plan seemed good on paper, the results were modest: it merely flattened the build time curve against the upward trend introduced by the growth of the codebase. The estimated impact was almost the same for both platforms and gave us about a 7%-10% cut in CI and local build times.

Open source plan

The critical path tool proved effective at illustrating a project’s bottlenecks in a dependency tree configuration. It is widely used by mobile teams at Grab to analyze their dependencies and cut out or limit unnecessary impact on their respective scopes.

We are considering open-sourcing the tool, as we’d like to hear feedback from external teams and see what can be built on top of it. We’ll provide more details on this in future posts.

Remote build

Another pillar of the build process is the hardware the build runs on. The solution is really straightforward: put more muscle behind your build to make it run faster.

Clearly, our engineers’ laptops could not be considered fast enough. For a fast enough build, we were looking at something with 20+ cores and ~200GB of RAM. No desktop or laptop computer reaches those numbers at a reasonable price. We had hit a bottleneck in hardware. Further parallelization of the build process didn’t give any significant improvement, as all the build tasks were just queueing and waiting for resources to be released. That’s where cloud computing came into the picture, with its huge variety of readily available options.

ADR mainframer

We took advantage of the Mainframer tool. When the build must run, the code diff is pushed to the remote executor, gets compiled, and then the generated artifacts are pushed back to the local machine. An engineer might still benefit from indexing, debugging, and other features available in the IDE.

To make the infrastructure mature enough, we’ve introduced Kubernetes-based autoscaling based on the load. Currently, we have a stable infrastructure that accommodates 100+ Android engineers, scaling up and down as needed to save costs.

This strategy gave us a 40-50% improvement in local build time. In the extreme case, Android builds finished 2x faster.

iOS

Given the success of the Android remote build infrastructure, we have immediately turned our attention to the iOS builds. It was an obvious move for us – we wanted the same infrastructure for iOS builds. The idea looked good on paper and was proven with Android infrastructure, but the reality was a bit different for our iOS builds.

Our very first roadblock was that Xcode is not that flexible, and the process of delegating a build to a remote machine is far more complicated than on Android. We tackled a series of blockers such as running indexing on a remote machine, sending and consuming build artifacts, and even running the remote build itself.

The reality was that a remote build was absolutely possible for iOS. There were minor tradeoffs impacting the engineering experience alongside obvious gains from utilizing cloud computing resources. But the bigger problem is that, legally, iOS builds may only run on Apple machines.

Even with the most powerful hardware available, a Mac Pro, the specs are still not ideal and are unfortunately not optimized for the build process. A 24-core, 194GB RAM Mac Pro could give about a 2x improvement in build time, but when it had to run 3 builds simultaneously for different users, build efficiency immediately dropped back to the baseline value.

Android remote machines with the same specs are capable of running up to 8 simultaneous builds. This allowed us to accommodate up to 30-35 engineers per machine, whereas the iOS infrastructure would require keeping this ratio at 5-6 engineers per machine. This didn’t seem scalable at all, so we abandoned the idea of remote builds for iOS at that point.

Test impact analysis

The other battlefront was the CI pipeline time. Our efforts in dependency tree optimization, complemented by comparably powerful hardware, played a good part in achieving a reasonable build time on CI.

CI validations also include the execution of unit and UI tests and may easily take 50%-60% of the pipeline time. The problem was getting worse as the number of tests was constantly growing, and we were on track to face enormous test execution times in the near future. We could mitigate the problem with brute force, throwing more runners at it and sharding tests, but that wouldn’t make our finance executives happy.

So it was time for smart solutions again. It’s a known fact that the simpler solution is more likely to be correct, and the simplest solution here was to stop running ALL the tests. The idea was to run only those tests that were impacted by the codebase change introduced in the given MR.

Behind this simple idea, we found a huge impact. Once Test Impact Analysis was applied to the pre-merge pipelines, we managed to cut the total number of executed tests by up to 90% without any impact on codebase quality or application stability. As a result, we cut the pipeline time for both platforms by more than 30%.
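
Conceptually, the selection step boils down to walking the reverse dependency graph from the modules touched by a diff and collecting every test target that sits above them. The sketch below illustrates the idea with a hypothetical module layout and naming convention; it is not Grab’s actual implementation:

package main

import (
    "fmt"
    "strings"
)

// reverseDeps maps a module to the modules that depend on it. In a real
// setup this comes from the build graph; the values here are illustrative.
var reverseDeps = map[string][]string{
    "core":      {"payments", "transport"},
    "payments":  {"app"},
    "transport": {"app"},
}

// testTargets maps a module to the test targets that cover it.
var testTargets = map[string][]string{
    "core":      {"core_tests"},
    "payments":  {"payments_tests"},
    "transport": {"transport_tests"},
    "app":       {"app_ui_tests"},
}

// moduleOf derives the owning module from a changed file path
// (assuming a module-per-top-level-directory convention).
func moduleOf(path string) string {
    return strings.SplitN(path, "/", 2)[0]
}

// impactedTests walks the reverse dependency graph from the changed modules
// and collects every test target that could be affected by the diff.
func impactedTests(changedFiles []string) []string {
    queue := []string{}
    for _, f := range changedFiles {
        queue = append(queue, moduleOf(f))
    }
    seen := map[string]bool{}
    tests := map[string]bool{}
    for len(queue) > 0 {
        m := queue[0]
        queue = queue[1:]
        if seen[m] {
            continue
        }
        seen[m] = true
        for _, t := range testTargets[m] {
            tests[t] = true
        }
        queue = append(queue, reverseDeps[m]...)
    }
    out := []string{}
    for t := range tests {
        out = append(out, t)
    }
    return out
}

func main() {
    fmt.Println(impactedTests([]string{"core/cache/lru.go"}))
}

A change deep in a shared module still triggers every dependent test target, while a change in a leaf module runs only its own tests, which is where the bulk of the savings comes from.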

Today, the Test Impact Analysis is coupled with our codebase. We are looking to invest some effort to make it available for open sourcing. We are excited to be on this path.

The end of the Native Build Systems

One might say that our journey was long and we won the battle for the build time.

Today, we have hit the limits of the native build systems’ efficiency and hardware for both Android and iOS. And it’s clear to us that in our current setup, we would not be able to scale up while maintaining a good engineering experience.

Let’s move to Bazel

To introduce another big improvement to the build time, we needed to make some ground-level changes. And this time, we focused on the build system itself.

Native build systems are designed to work well for small and medium-sized projects; however, they have not been as successful for large-scale projects such as the Grab Passenger applications.

With these assumptions, we considered our options and found the Bazel build system to be a strong contender. A deep comparison of build systems showed that Bazel promised better results in almost all key areas:

  • Bazel enables remote builds out of the box
  • Bazel provides sustainable cache capabilities (local and remote), and this cache can be reused across all consumers – local builds and CI builds
  • Bazel was designed with large codebases as a cornerstone requirement
  • The majority of the tooling can be reused across multiple platforms

Ways of adopting

On paper, Bazel looked awesome and shiny. All our playground investigations showed positive results:

  • Cache worked great
  • Incremental builds were incredibly fast

But the effort to shift to this new build system was huge. We made sure we foresaw all possible pitfalls and impediments. It took us about 5 months to estimate the impact and put together a sustainable proof of concept that reflected the majority of our use cases.

Migration limitations

After those 5 months of investigation, we had a seemingly endless list of incompatible features and major blockers to be addressed. Those blockers touched even such obvious things as indexing and the jump-to-definition IDE feature, which we had taken for granted.

But the biggest challenge was the need to keep pace with product releases. Stopping product development, even for a day, was not an option. The way out turned out to be a hybrid build concept: we figured out how to make the native and Bazel build systems live together in harmony. This gave us a chance to start migrating target by target and project by project, moving from the bottom to the top of the dependency graph.

This approach was a valid enabler; however, we still faced the challenge of our app’s scale. A codebase of over 2.5 million LOC cannot be migrated overnight. The initial estimation was based on the idea of manually migrating the whole codebase, which would have required us to invest dozens of person-months.

Team capacity limitations

Multiple teams immediately pushed back on this approach, questioning its priority and raising concerns about the impact on their own product roadmaps.

We were left with little choice. On one hand, we had a pressingly long build time; on the other hand, we were asking teams for a huge effort. We clearly needed buy-in from all of our stakeholders to push things forward.

Getting buy-in

To get all needed buy-ins, all stakeholders were grouped and addressed separately. We defined key factors for each group.

Key factors

C-level stakeholders:

  • Impact. The migration impact must be significant – at least a 40% decrease in build time.
  • Costs. Migration costs must be paid back within a reasonable time, and the positive impact must extend into the future.
  • Engineering experience. The user experience must not be compromised. All tools and features engineers use must remain available during and after the migration.

Engineers:

  • Engineering experience. Similar to the criteria established for the C-level stakeholders.
  • Early adopter engagement. A common core experience must be created across the mobile engineering community to support other engineers in the later stages.
  • Education. Awareness campaigns must be in place. We planned and conducted a series of tech talks and workshops to raise awareness among engineers and flatten the learning curve, and we wrote hundreds of pages of documentation and guidelines.

Product teams:

  • No roadmap impact. Migration must not affect the product roadmap.
  • Minimize the engineering effort. Migration must not increase the efforts from engineering.

Migration automation (separate talks)

The biggest concern for the majority of the stakeholders appeared to be the estimated migration effort, which impacted the cost, the product roadmap, and the engineering experience. It became evident that we needed to streamline the process and reduce the effort for migration.

Fortunately, the actual migration process was routine in nature, so we had opportunities for automation. We investigated ideas on automating the whole migration process.

The tools we’ve created

We found that it’s relatively easy to create a bunch of tools that read the native project structure and create an equivalent Bazel setup. This was a game changer.
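
Our migration tooling itself isn’t public yet, but the core idea is mechanical: parse what a native module already declares (sources and dependencies) and emit the equivalent Bazel targets. A toy sketch of that translation, using hypothetical module metadata and a simplified android_library rule, might look like this:

package main

import (
    "fmt"
    "strings"
)

// NativeModule is a minimal view of what a Gradle module declares.
// In a real tool this would be parsed from the module's build.gradle file.
type NativeModule struct {
    Name string
    Srcs []string
    Deps []string
}

// quoteList renders a Starlark list body, optionally prefixing each item
// (for example, "//" to turn module names into Bazel labels).
func quoteList(items []string, prefix string) string {
    quoted := []string{}
    for _, it := range items {
        quoted = append(quoted, fmt.Sprintf("%q", prefix+it))
    }
    return strings.Join(quoted, ", ")
}

// toBuildFile emits a BUILD rule roughly equivalent to the Gradle module.
func toBuildFile(m NativeModule) string {
    return fmt.Sprintf(`android_library(
    name = %q,
    srcs = glob([%s]),
    deps = [%s],
    visibility = ["//visibility:public"],
)
`, m.Name, quoteList(m.Srcs, ""), quoteList(m.Deps, "//"))
}

func main() {
    m := NativeModule{
        Name: "payments",
        Srcs: []string{"src/main/java/**/*.kt"},
        Deps: []string{"core", "network"},
    }
    fmt.Print(toBuildFile(m))
}

Automating this translation is what allowed a small group to migrate targets in bulk instead of asking every team to hand-write BUILD files.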

Things moved pretty smoothly for both the Android and iOS projects. We managed to roll out tooling to migrate the codebase with a single click/command (well, with some exceptions for now; stay tuned for another blog post on this). With this tooling, combined with the hybrid build concept, we addressed all the key buy-in factors:

  • Migration costs dropped by at least 50%.
  • Fewer engineers are required for the actual migration. There is no need to engage the wider engineering community, as a small group of people can manage the whole process.
  • There is no longer any impact on the product roadmap.

Where do we stand today

When we were in the middle of the actual migration, we decided to take a pragmatic path and migrate our applications in phases to ensure everything was under control and that there were no unforeseen issues.

The hybrid build time is improving in step with our migration progress, with a linear dependency on the amount of migrated code. The figures look positive, and we are confident of achieving our goal of decreasing the build time by at least 40%.

Plans to open source

We plan to open source the automated migration tooling we’ve created. We are a bit further along on the Android side, decoupling it from our applications’ implementation details, and plan to open source it in the near future.

The iOS tooling is a bit behind, and we expect it to be available for open-sourcing by the end of Q1’2021.

Is it worth it all?

Bazel is not a silver bullet for your build time or your project. There are a lot of edge cases you’ll never know about until they punch you straight in the face.

It’s far from an industry standard, and you might find it hard to hire engineers with Bazel experience. It has a steep learning curve as well. It’s absolutely overhead for small to medium-sized projects, but it’s undeniably essential once you start playing in the top league of super apps.

If you were to ask whether we’d take this path again, the answer would come quickly: yes, without any doubt.


Authored by Sergii Grechukha on behalf of the Passenger App team at Grab. Special thanks to Madushan Gamage, Mikhail Zinov, Nguyen Van Minh, Mihai Costiug, Arunkumar Sampathkumar, Maryna Shaposhnikova, Pavlo Stavytskyi, Michael Goletto, Nico Liu, and Omar Gawish for their contributions.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub Availability Report: November 2020

Post Syndicated from Keith Ballinger original https://github.blog/2020-12-02-availability-report-november-2020/

Introduction

In November, we experienced two incidents resulting in significant impact and degraded state of availability for issues, pull requests, and GitHub Actions services.

November 2 12:00 UTC (lasting 32 minutes)

The SSL certificate for *.githubassets.com expired, impacting web requests for GitHub.com UI and services. There was an auto-generated issue indicating the certificate was within 30 days of expiration, but it was not addressed in time. Impact was reported, and the on-call engineer remediated it promptly.

We are using this occurrence to evaluate our current processes, as well as our tooling and automation, within this area to reduce the likelihood of such instances in the future.

November 27 16:04 UTC (lasting one hour and one minute)

Our service monitors detected abnormal levels of replication lag within one of our MySQL clusters affecting the GitHub Actions service.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

In summary

We place great importance on the reliability of our services and on the trust that our users place in us every day. We’ll continue to keep you updated on the progress we’re making to ensure this. To learn more about what we’re working on, visit the GitHub engineering blog.

A Byzantine failure in the real world

Post Syndicated from Tom Lianza original https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/

An analysis of the Cloudflare API availability incident on 2020-11-02

When we review design documents at Cloudflare, we are always on the lookout for Single Points of Failure (SPOFs). Eliminating these is a necessary step in architecting a system you can be confident in. Ironically, when you’re designing a system with built-in redundancy, you spend most of your time thinking about how well it functions when that redundancy is lost.

On November 2, 2020, Cloudflare had an incident that impacted the availability of the API and dashboard for six hours and 33 minutes. During this incident, the success rate for queries to our API periodically dipped as low as 75%, and the dashboard experience was as much as 80 times slower than normal. While Cloudflare’s edge is massively distributed across the world (and kept working without a hitch), Cloudflare’s control plane (API & dashboard) is made up of a large number of microservices that are redundant across two regions. For most services, the databases backing those microservices are only writable in one region at a time.

Each of Cloudflare’s control plane data centers has multiple racks of servers. Each of those racks has two switches that operate as a pair—both are normally active, but either can handle the load if the other fails. Cloudflare survives rack-level failures by spreading the most critical services across racks. Every piece of hardware has two or more power supplies with different power feeds. Every server that stores critical data uses RAID 10 redundant disks or storage systems that replicate data across at least three machines in different racks, or both. Redundancy at each layer is something we review and require. So—how could things go wrong?

In this post we present a timeline of what happened, and how a difficult failure mode known as a Byzantine fault played a role in a cascading series of events.

2020-11-02 14:43 UTC: Partial Switch Failure

At 14:43, a network switch started misbehaving. Alerts began firing about the switch being unreachable to pings. The device was in a partially operating state: network control plane protocols such as LACP and BGP remained operational, while others, such as vPC, were not. The vPC link is used to synchronize ports across multiple switches, so that they appear as one large, aggregated switch to servers connected to them. At the same time, the data plane (or forwarding plane) was not processing and forwarding all the packets received from connected devices.

This failure scenario is completely invisible to the connected nodes, as each server only sees an issue for some of its traffic due to the load-balancing nature of LACP. Had the switch failed fully, all traffic would have failed over to the peer switch, as the connected links would’ve simply gone down, and the ports would’ve dropped out of the forwarding LACP bundles.

Six minutes later, the switch recovered without human intervention. But this odd failure mode led to further problems that lasted long after the switch had returned to normal operation.

2020-11-02 14:44 UTC: etcd Errors begin

The rack with the misbehaving switch included one server in our etcd cluster. We use etcd heavily in our core data centers whenever we need strongly consistent data storage that’s reliable across multiple nodes.

In the event that the cluster leader fails, etcd uses the RAFT protocol to maintain consistency and establish consensus to promote a new leader. In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all. This works fine when a machine crashes, but is not always able to handle situations where different members of the cluster have conflicting information.

In this particular situation:

  • Network traffic between node 1 (in the affected rack) and node 3 (the leader) was being sent through the switch in the degraded state,
  • Network traffic between node 1 and node 2 was going through its working peer switch, and
  • Network traffic between node 2 and node 3 was unaffected.

This caused cluster members to have conflicting views of reality, known in distributed systems theory as a Byzantine fault. As a consequence of this conflicting information, node 1 repeatedly initiated leader elections, voting for itself, while node 2 repeatedly voted for node 3, which it could still connect to. This resulted in ties that did not promote a leader node 1 could reach. RAFT leader elections are disruptive, blocking all writes until they’re resolved, so this made the cluster read-only until the faulty switch recovered and node 1 could once again reach node 3.

2020-11-02 14:45 UTC: Database system promotes a new primary database

Cloudflare’s control plane services use relational databases hosted across multiple clusters within a data center. Each cluster is configured for high availability. The cluster setup includes a primary database, a synchronous replica, and one or more asynchronous replicas. This setup allows redundancy within a data center. For cross-datacenter redundancy, a similar high availability secondary cluster is set up and replicated in a geographically dispersed data center for disaster recovery. The cluster management system leverages etcd for cluster member discovery and coordination.

When etcd became read-only, two clusters were unable to communicate that they had a healthy primary database. This triggered the automatic promotion of a synchronous database replica to become the new primary. This process happened automatically and without error or data loss.

There was a defect in our cluster management system that required a rebuild of all database replicas when a new primary database is promoted. So, although the new primary database was available instantly, the replicas took considerable time to become available, depending on the size of the database. For one of the clusters, service was restored quickly: synchronous and asynchronous database replicas were rebuilt and started replicating successfully from the primary, and the impact was minimal.

For the other cluster, however, performant operation of that database required a replica to be online. Because this database handles authentication for API calls and dashboard activities, it takes a lot of reads, and one replica was heavily utilized to spare the primary the load. When this failover happened and no replicas were available, the primary was overloaded, as it had to take all of the load. This is when the main impact started.

Reduce Load, Leverage Redundancy

At this point we saw that our primary authentication database was overwhelmed and began shedding load from it. We dialed back the rate at which we push SSL certificates to the edge, send emails, and other features, to give it space to handle the additional load. Unfortunately, because of its size, we knew it would take several hours for a replica to be fully rebuilt.

A silver lining here is that every database cluster in our primary data center also has online replicas in our secondary data center. Those replicas are not part of the local failover process, and were online and available throughout the incident. The process of steering read-queries to those replicas was not yet automated, so we manually diverted API traffic that could leverage those read replicas to the secondary data center. This substantially improved our API availability.

The Dashboard

The Cloudflare dashboard, like most web applications, has the notion of a user session. When user sessions are created (each time a user logs in) we perform some database operations and keep data in a Redis cluster for the duration of that user’s session. Unlike our API calls, our user sessions cannot currently be moved across the ocean without disruption. As we took actions to improve the availability of our API calls, we were unfortunately making the user experience on the dashboard worse.

This is an area of the system that is currently designed to be able to fail over across data centers in the event of a disaster, but has not yet been designed to work in both data centers at the same time. After a first period in which users on the dashboard became increasingly frustrated, we failed the authentication calls fully back to our primary data center, and kept working on our primary database to ensure we could provide the best service levels possible in that degraded state.

2020-11-02 21:20 UTC: Database Replica Rebuilt

The instant the first database replica was rebuilt, it put itself back into service, and performance returned to normal levels. We re-ramped all of the services that had been turned down so all asynchronous processing could catch up, and after a period of monitoring we marked the end of the incident.

Redundant Points of Failure

The cascade of failures in this incident was interesting because each system, on its face, had redundancy. Moreover, no system fully failed—each entered a degraded state. That combination meant the chain of events that transpired was considerably harder to model and anticipate. It was frustrating yet reassuring that some of the possible failure modes were already being addressed.

A team was already working on fixing the limitation that requires a database replica rebuild upon promotion. Our user sessions system was inflexible in scenarios where we’d like to steer traffic around, and redesigning that was already in progress.

This incident also led us to revisit the configuration parameters we put in place for things that auto-remediate. In previous years, promoting a database replica to primary took far longer than we liked, so getting that process automated and able to trigger on a minute’s notice was a point of pride. At the same time, for at least one of our databases, the cure may be worse than the disease, and in fact we may not want to invoke the promotion process so quickly. Immediately after this incident we adjusted that configuration accordingly.

Byzantine Fault Tolerance (BFT) is a hot research topic. Solutions have been known since 1982, but have had to choose between a variety of engineering tradeoffs, including security, performance, and algorithmic simplicity. Most general-purpose cluster management systems choose to forgo BFT entirely and use protocols based on PAXOS, or simplifications of PAXOS such as RAFT, that perform better and are easier to understand than BFT consensus protocols. In many cases, a simple protocol that is known to be vulnerable to a rare failure mode is safer than a complex protocol that is difficult to implement correctly or debug.

The first uses of BFT consensus were in safety-critical systems such as aircraft and spacecraft controls. These systems typically have hard real time latency constraints that require tightly coupling consensus with application logic in ways that make these implementations unsuitable for general-purpose services like etcd. Contemporary research on BFT consensus is mostly focused on applications that cross trust boundaries, which need to protect against malicious cluster members as well as malfunctioning cluster members. These designs are more suitable for implementing general-purpose services such as etcd, and we look forward to collaborating with researchers and the open source community to make them suitable for production cluster management.

We are very sorry for the difficulty the outage caused, and are continuing to improve as our systems grow. We’ve since fixed the bug in our cluster management system, and are continuing to tune each of the systems involved in this incident to be more resilient to failures of their dependencies. If you’re interested in helping solve these problems at scale, please visit cloudflare.com/careers.

Democratizing Fare Storage at scale using Event Sourcing

Post Syndicated from Grab Tech original https://engineering.grab.com/democratizing-fare-storage-at-scale-using-event-sourcing

From humble beginnings, Grab has expanded across different markets in the last couple of years. We’ve added a wide range of features to the Grab platform to continue to delight our customers and driver-partners, and we’ve had to continually find ways to improve our existing solutions to better support new features.

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

High-level Flow

To set some context for this blog, let’s define some key terms before proceeding. A Fare is a dollar amount calculated to move someone or something from point A to point B. And, a Fee is a dollar amount added to or subtracted from the original fare amount for any additional service.

Now that you’re acquainted with the key concepts, let’s take a look at the following image. It illustrates that features such as Destination Change Fee, Waiting Fee, Cancellation Fee, Tolls, Promos, Surcharges, and many others store an additional fee breakdown along with the original fare. This set of information is crucial for generating receipts and for debugging. However, our legacy storage system wasn’t designed to host massive quantities of information effectively.

Sample Flow with Fee Breakdown

In our legacy architecture, we stored all the booking and fare-related information in a single relational database table. Adding new fare fields and breakdowns required changes in our critical booking system, making iterations prohibitively expensive and hindering innovation.

The need to store the fare information and metadata for every additional feature along with other booking information resulted in a bloated booking entity. With millions of bookings created every day at Grab, this posed a scaling and stability threat to our booking service storage. Moreover, the legacy storage only tracked the latest value of fare and lacked a holistic view of all the modifications to the fare. So, debugging the fare was also a massive chore for our Engineering and Tech Operations teams.

Drafting a solution

The shortcomings of our legacy system led us to explore options for decoupling the fare and its metadata storage from the booking details. We wanted to build a platform that can store and provide access to both fare and its audit trail.

High-level functional requirements for the new fare store were:

  • Provide a platform to store and retrieve fare and associated breakdowns, with no tight coupling between services.
  • Act as a single source-of-truth for fare and associated fees in the Grab ecosystem.
  • Enable clients to access the metadata of fare change events in real-time, enabling the Product team to innovate freely.
  • Provide smooth access to a fare’s audit trail, improving the response time to our customers’ queries.

Non-functional requirements for the fare store were:

  • High availability for the read and write APIs, with latency of a few milliseconds.
  • Handle concurrent updates to the fare gracefully.
  • Detect duplicate events for a booking for the same transaction.

Storing change sequence with Event Sourcing

Our legacy storage solution used a defined schema and only stored the latest state of the fare. We needed an audit trail-based storage system with fast querying capabilities that can store and retrieve changes in chronological order.

The Event Sourcing pattern stood out as a flexible architectural pattern, as it allowed us to store and query the sequence of changes in the order they occurred. In Martin Fowler’s blog, he describes Event Sourcing as:

“The fundamental idea of Event Sourcing is to ensure that every change to the state of an application is captured in an event object and that these event objects are themselves stored in the sequence they were applied for the same lifetime as the application state itself.”

With the Event Sourcing pattern, we store all the fare changes as events in the order they occurred for a booking. We iterate through these events to retrieve a complete snapshot of the modifications to the fare.

A sample Fare Event looks like this:

message Event {
  // type of the event, ADD, SUB, SET, resilient
  EventType type = 1;
  // value which was added, subtracted or modified
  double value = 2;
  // fare for the booking after applying discount
  double fare = 3;

  ...

  // description bytes generated by SDK
  bytes description = 11;
  //transactionID for the EventType
  string transactionID = 12;
}
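
To make the pattern concrete, here is a simplified sketch of how the current fare can be reconstructed by folding over the stored events in order. The event shape loosely mirrors the proto above, but the replay logic and values are purely illustrative:

package main

import "fmt"

// EventType mirrors a subset of the event types described above.
type EventType int

const (
    ADD EventType = iota
    SUB
    SET
)

// Event is a simplified in-memory view of a stored fare change event.
type Event struct {
    Type  EventType
    Value float64
}

// replay folds the ordered event stream into the current fare value.
func replay(events []Event) float64 {
    fare := 0.0
    for _, e := range events {
        switch e.Type {
        case ADD:
            fare += e.Value
        case SUB:
            fare -= e.Value
        case SET:
            fare = e.Value
        }
    }
    return fare
}

func main() {
    events := []Event{
        {SET, 12.00}, // base fare from pricing
        {ADD, 2.50},  // toll surcharge
        {SUB, 3.00},  // promo discount
    }
    fmt.Printf("current fare: %.2f\n", replay(events))
}

The same event stream doubles as the audit trail: reading it up to any point in time yields the fare as it stood at that moment.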

The Event Sourcing pattern also enables us to use the Command Query Responsibility Segregation (CQRS) pattern to decouple the read responsibility for different use cases.

CQRS Pattern

Clients of the fare life cycle read the current fare and create events to change the fare value as per their logic. Clients can also access fare events when required. This pattern enables clients to modify fares independently and gives them visibility into the event sequence for different business needs.
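
One way to express this separation in code is to hand clients distinct command and query interfaces over the same event store. The interfaces below are an illustrative sketch, not the actual Fare LifeCycle SDK API:

package fare

import "context"

// Event represents a single fare change (fields elided for brevity).
type Event struct {
    Type          string
    Value         float64
    TransactionID string
}

// Commands is the write side: clients append fare change events.
type Commands interface {
    AppendEvent(ctx context.Context, bookingID string, e Event, expectedVersion int64) error
}

// Queries is the read side: clients read the current fare or the full audit trail.
type Queries interface {
    CurrentFare(ctx context.Context, bookingID string) (fare float64, version int64, err error)
    EventHistory(ctx context.Context, bookingID string) ([]Event, error)
}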

The diagram below describes the overall fare life cycle from creation, modification to display using the event store.

Overall Fare Life Cycle

Architecture overview

Fare Cycle Architecture

Clients interact with the Fare LifeCycle service through an SDK. The SDK offers various features such as metadata serialization and deserialization, retries, and timeout configuration, some of which are discussed later.

The Fare LifeCycle Store service uses DynamoDB as Event Store to persist and read the fare change events backed by a cache for eventually consistent reads. For further processing, such as archiving and generation of receipts, the successfully updated events are streamed out to a message queue system.

Ensuring the integrity of the fare sequence

Democratizing the responsibility of fare modification means that multiple services might try to update the fare in parallel without prior synchronization. Concurrent fare updates for the same booking might result in a race condition. Concurrency and consistency problems are perennial challenges in distributed storage systems.

Let’s understand why the ordering of fare updates is important. Business rules for different cities and countries regulate pricing features based on local market conditions and prevailing laws. For example, in some scenarios, Tolls and Waiting Fees may not be eligible for discounts or promotions. The service applying discounts needs to consider this information while applying a discount. Therefore, updates to the fare are not independent of previous fare events.

Fare Integrity

We needed a mechanism to detect race conditions and handle them appropriately to ensure the integrity of the fare. To handle race conditions based on our use case, we explored Pessimistic and Optimistic locking mechanisms.

All the expected fare change events happen based on certain conditions being true or false. For example, less than 1% of bookings have a payment change request initiated by passengers during a ride, and the probability of multiple similar changes happening on the same booking is rather low. Optimistic Locking offers both efficiency and correctness for our requirements, where the chances of race conditions are low and the records are independent of each other.

The logic to calculate the fare/surcharge is coupled with the business logic of the system that calculates the fare component or fees. So, handling data race conditions on the data store layer was not an acceptable option either. It made more sense to let the clients handle it and keep the storage system decoupled from the business logic to compute the fare.

Optimistic Locking

To achieve Optimistic Locking, we store a fare version and increment it on every successful update. Clients must pass the version they read in order to modify the fare. If there is a version mismatch between the update query and the current fare, the update is rejected. On a version mismatch, clients read the updated checksum (version) and retry with the recalculated fare.
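
From a client’s point of view, the version check turns every fare modification into a read-compute-conditional-write loop. The sketch below is illustrative: the store interface is hypothetical, and in practice the conditional write maps to a conditional update on the underlying Event Store:

package fareclient

import (
    "errors"
    "fmt"
)

// ErrVersionMismatch is returned when the stored fare version no longer matches
// the version the client read, i.e. another client updated the fare first.
var ErrVersionMismatch = errors.New("fare version mismatch")

// Store is a hypothetical client-side view of the Fare LifeCycle service.
type Store interface {
    ReadFare(bookingID string) (fare float64, version int64, err error)
    // UpdateFare succeeds only if the stored version still equals expectedVersion.
    UpdateFare(bookingID string, newFare float64, expectedVersion int64) error
}

// ApplyDiscount retries on version conflicts, recalculating the fare each time
// so business rules are always applied on top of the latest fare.
func ApplyDiscount(store Store, bookingID string, discount float64, maxRetries int) error {
    for attempt := 0; attempt < maxRetries; attempt++ {
        fare, version, err := store.ReadFare(bookingID)
        if err != nil {
            return err
        }
        err = store.UpdateFare(bookingID, fare-discount, version)
        if err == nil {
            return nil
        }
        if !errors.Is(err, ErrVersionMismatch) {
            return err
        }
        // Another client changed the fare first; re-read and recalculate.
    }
    return fmt.Errorf("could not apply discount after %d attempts", maxRetries)
}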

Idempotency of event updates

The next challenge we came across was how to handle client retries, ensuring that we do not duplicate the same event for a booking. Clients might encounter sporadic errors as a result of network-related issues even though the update was successful. Under such circumstances, clients retry the same event, resulting in duplicates. Duplicate events not only require extra storage space; they also make it unclear to clients whether an action was applied to the fare multiple times.

As discussed in the previous section, retrying with the same version would fail due to the version mismatch. If the previous attempt successfully modified the fare, it would also update the version.

However, clients might not know if their update modified the version or if another client updated the data. Relying on clients to check for event duplication makes the client side complex and leaves room for error if clients do not handle it correctly.

Solution for Duplicate Events

To handle duplicate events, we associate each event with a unique UUID (transactionID) generated on the client side using a UUID library from the Fare LifeCycle service SDK. Before updating the fare, we check whether the transactionID is already part of the list of successful transaction IDs. If we identify a non-unique transactionID, we return a duplicate event error to the client.

For unique transactionIDs, we append the ID to the list of transactionIDs and save it to the Event Store along with the event.
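
Inside the store, the duplicate check can be as simple as keeping the set of applied transaction IDs next to the booking’s events and rejecting anything already seen. The in-memory structure below is a simplified illustration, not the actual DynamoDB layout:

package farestore

import "errors"

// ErrDuplicateEvent signals that the same transactionID was already applied
// to this booking, so the retried update can be safely ignored by the client.
var ErrDuplicateEvent = errors.New("duplicate transactionID for booking")

// Event is a simplified fare change event (fields elided for brevity).
type Event struct {
    Type  string
    Value float64
}

// bookingRecord is a simplified view of what is persisted per booking.
type bookingRecord struct {
    events         []Event
    transactionIDs map[string]bool // IDs of successfully applied events
}

// appendEvent rejects an event whose transactionID was already applied,
// which makes client retries after transient network errors idempotent.
func (r *bookingRecord) appendEvent(e Event, transactionID string) error {
    if r.transactionIDs == nil {
        r.transactionIDs = map[string]bool{}
    }
    if r.transactionIDs[transactionID] {
        return ErrDuplicateEvent
    }
    r.events = append(r.events, e)
    r.transactionIDs[transactionID] = true
    return nil
}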

Schema-less metadata

Metadata are the breakdowns associated with the fare. We require the metadata for specific fee/fare calculations, for the generation of receipts, and for debugging purposes. However, neither the storage system nor the various clients need to know the metadata definition of every event.

One goal for our data store was to give our clients the flexibility to add new fields to existing metadata or to define new metadata without changing the API. We adopted an SDK-based approach for metadata, where clients interact with the Fare LifeCycle service via SDK. The SDK has the following responsibilities for metadata:

  1. Serialize the metadata into bytes before making an API call to the Fare LifeCycle service.
  2. Deserialize the bytes metadata returned from the Fare LifeCycle service into a Go struct for client access.
Fare LifeCycle SDK

Serializing and deserializing the metadata on the client-side decoupled it from the Fare LifeCycle Store API. This helped teams update the metadata without deploying the storage service each time.

For reading the breakdown, the clients pass the metadata bytes to the SDK along with the Event Type, and then it converts them back into the corresponding proto schema. With this approach, clients can update the metadata without changing the Data Store Service.
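
Because the store treats metadata as opaque bytes, the SDK side can stay generic: it simply marshals and unmarshals whatever proto message the owning team defines. A minimal sketch of such helpers, assuming protobuf-encoded metadata (the helper names are illustrative, not the SDK’s real API):

package faresdk

import "google.golang.org/protobuf/proto"

// MarshalMetadata serializes a feature-specific breakdown message into the
// opaque bytes stored alongside a fare event. The store never inspects them.
func MarshalMetadata(m proto.Message) ([]byte, error) {
    return proto.Marshal(m)
}

// UnmarshalMetadata decodes the stored bytes back into the caller's own proto
// type, so only the owning team needs to know the schema.
func UnmarshalMetadata(data []byte, out proto.Message) error {
    return proto.Unmarshal(data, out)
}

A team that owns, say, a hypothetical WaitingFeeBreakdown message would marshal it before sending the event and unmarshal it when rendering a receipt; adding a field to that message requires no change to the Fare LifeCycle Store API.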

Conclusion

The Fare LifeCycle service enabled us to revolutionize the fare storage at scale for Grab’s ecosystem of services. Further benefits realized with the system are:

  • The feature agnostic platform helped us to reduce the time-to-market for our hyper-local features so that we can further outserve our customers and driver-partners.
  • Decoupling the fare information from the booking information also helped us to achieve a better separation of concerns between services.
  • Improved the overall reliability and scalability of the Grab platform by decoupling fare and booking information, allowing them to scale independently of each other.
  • Reduced unnecessary coupling between services to fetch fare-related information and update the fare.
  • The audit-trail of fare changes in the chronological order reduced the time to debug fare and improved our response to customers for fare-related queries.

We hope this post gave you a closer look at how we used the Event Sourcing pattern to build a data store and how we handled a few caveats and challenges in the process.


Authored by Sourabh Suman on behalf of the Pricing team at Grab. Special thanks to Karthik Gandhi, Kurni Famili, ChandanKumar Agarwal, Adarsh Koyya, Niteesh Mehra, Sebastian Wong, Matthew Saw, Muhammad Muneer, and Vishal Sharma for their contributions.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Nbdev: A literate programming environment that democratizes software engineering best practices

Post Syndicated from Hamel Husain original https://github.blog/2020-11-20-nbdev-a-literate-programming-environment-that-democratizes-software-engineering-best-practices/

At GitHub, we are deeply invested in democratizing software development. Part of this is accomplished by serving as the home of open source and providing tools to educators and students. We are also building features that lower the barrier to entry for software development, such as Codespaces. However, there is much work left to be done in order to make software development more approachable and to make it easier to employ best practices, such as continuous integration, distribution, and documentation of software.

This is why we decided to assist fastai in their development of a new literate programming environment for Python, called nbdev. A discussion of the motivations behind nbdev, as well as a primer on the history of literate programming, can be found in this blog post. For the uninitiated, literate programming, as described by Donald Knuth, is:

…a move away from writing computer programs in the manner and order imposed by the computer, and instead enables programmers to develop programs in the order demanded by the logic and flow of their thoughts.

While a subset of ideas from literate programming have shown up in tools, such as Swift Playgrounds, Jupyter, and Mathematica, there has been a lack of tools that encompass the entire software development life cycle. nbdev builds on top of Jupyter notebooks to fill these gaps and provides the following features, many of which are integrated with GitHub:

  • Automated generation of docs from Jupyter notebooks hosted on GitHub Pages. These docs are searchable and automatically hyperlinked to appropriate documentation pages by introspecting keywords you surround in backticks. An example of this documentation is the official fastai docs.
  • Continuous integration (CI) comes set up for you with GitHub Actions, which will run unit tests for you automatically. Even if you are not familiar with GitHub Actions, this starts working right away without any manual intervention.
  • The nbdev environment, which consists of a web server for previewing a docs site, a Jupyter server for writing code, and a series of CLI tools, is set up to work with GitHub Codespaces, which makes getting started even easier. How Codespaces integrates with nbdev is discussed in detail in this blog post.

As a teaser, this is a preview of this literate programming environment in Codespaces, which includes a notebook, a docs site and an IDE:

In addition to this GitHub integration, nbdev also offers the following features:

  • A robust, two-way sync between notebooks and source code, which allows you to use your IDE for code navigation or quick edits if desired.
  • The ability to write tests directly in notebooks without having to learn special APIs. These tests get executed in parallel with a single CLI command and also with GitHub Actions.
  • Tools for merge/conflict resolution with notebooks in a human readable format.
  • Utilities to automate the publishing of PyPI and conda packages.

nbdev promotes software engineering best practices by allowing developers to write unit tests and documentation in the same context as source code, without having to learn special APIs or worry about web development. Similarly, GitHub Actions run unit tests automatically by default without requiring any prior experience with these tools. We believe removing friction from writing documentation and tests promotes higher quality software and makes software more inclusive.

Aside from using nbdev to create Python software, you can extend nbdev to build new types of tools. For example, we recently used nbdev to build fastpages, an easy to use blogging platform that allows developers to create blog posts directly with Jupyter notebooks. fastpages uses GitHub Actions and GitHub Pages to automate the conversion of notebooks to blog posts and offers a variety of other features to Python developers that democratize the sharing of knowledge. We have also used nbdev and fastpages to create covid19-dashboards, which demonstrates how to create interactive dashboards that automatically update with Jupyter notebooks.

We are excited about the potential of nbdev to make software engineering more inclusive, friendly, and robust. We are also hopeful that tools like nbdev can inspire the next generation of literate programming tools. To learn more about nbdev, please see the following resources:

Finally, if you are building any projects with nbdev or would like to have further discussions, please feel free to reach out on the nbdev forums or on GitHub.