Scaling monorepo maintenance

Post Syndicated from Taylor Blau original https://github.blog/2021-04-29-scaling-monorepo-maintenance/

At GitHub, we serve some of the largest Git repositories on the planet. We also serve some of the fastest-growing repositories. Each day, the largest repositories we host become even larger.

About a year ago, we noticed that the job we use to repack Git repositories began hitting our self-imposed timeouts on larger repositories. Even when we expanded these timeouts, failing maintenance on these repositories was generally the cause of degraded performance that was hard to mitigate.

Today, these problems do not exist. GitHub can repack even the largest repositories we host in a fraction of the time it used to take. In this post, we’ll talk about what problems we were encountering, the solutions we built, how we deployed them safely, and describe some possible future directions.

All of our work here is being contributed to the open source Git project, and will be available in an upcoming release.

The problem

Why is GitHub’s maintenance job so expensive in the first place? It’s because we chose to have maintenance repack the entire contents of each repository into a single packfile. Doing so is expensive, but having just one packfile carries some benefits, too. With only one packfile, looking up an object doesn’t require opening and searching through multiple packs to find it. It also means that all objects can be compressed as deltas relative to all other objects (Git’s packfile format supports cross-pack deltas, but currently Git will never store them on disk). But the most important reason is that reachability bitmaps, a performance-critical data structure, are only compatible with a single pack.

A new feature in Git, multi-pack indexes, solves the former problem by making all object lookups go through a single index, but does not address the latter. So, we set out to fill in the gaps by bringing bitmap support to multi-pack indexes in order to remove the single-pack limitation on reachability bitmaps.

But in order to build multi-pack bitmaps, we had to solve a number of other interesting problems along the way. First, we had to decide how to arrange the objects in a multi-pack index to achieve good bitmap compression. We also had to figure out how to quickly invert that ordering, translating bit positions back to the objects they refer to. Some of these steps also yielded notable performance improvements on single-pack repositories. Finally, we had to figure out a new repacking strategy that scales with the size of recent pushes, rather than with the size of the entire repository.

But before we get into all of that, let’s start from the very beginning.

Objects, packs, and fetching

You might be aware that Git stores the contents of your repositories as a set of objects. Each object represents an individual piece of your repository: a single file, tree, commit, or tag. These objects may be stored individually as “loose” objects, or together in a packfile.

If you’ve ever heard that a Git repository is “nothing more than a directed acyclic graph,” then you know that these objects can refer to one another. A tree refers to a set of blobs (which correspond to files) and other trees (which correspond to sub-directories). A commit refers to a single tree (the repository root), and zero or more other commits which are its parents.

These links help Git figure out which objects it needs to transfer to fulfill a fetch or clone request. When you fetch a repository from GitHub, Git performs a negotiation to figure out which objects to send. The server advertises the objects at the tip of its references (basically the tips of branches and tags). The client does the same, along with the set of references that it wants from the server. Then, the server walks the links between the requested objects and the objects that the client already has in order to figure out what to send.

Above is an object graph. The client advertises its ref tips (indicated by the darker blue commits). The server’s advertised references are colored dark red. The blue shaded area represents the result of walking along the edges to obtain the reachability closure of the objects the client already has. The red shaded area represents the same closure from the server’s perspective, excluding objects that the client already has. Objects in this region are the ones which need to be sent to the client.

During a fetch, Git needs to send not just the commits in between what the client has and wants, but also all of the objects that are reachable from those commits. Because Git doesn’t store the list of every reachable object, this may be expensive, especially in the case of a clone. When cloning, the client doesn’t have any objects, so it asks the server for all of the objects reachable from any reference.

In an earlier blog post, we talked about how we use reachability bitmaps to accelerate this negotiation. In case you haven’t read that post, below is a quick primer.

Reachability bitmaps

How does Git handle this special case where the client has nothing and wants everything? Ultimately, the server needs to determine the reachability closure of all of the reference tips. In other words, it needs a list of all of the objects at the reference tips, and all of the ancestors of those objects in order to assemble a complete copy of the repository at the other end.

Unfortunately for us, the larger the repository is, the longer it takes Git to compute the list of objects to send. This isn’t feasible even for medium-sized repositories. Git could handle our case specially by just sending every object it has, but that might result in many unwanted objects being sent, too. (For example, GitHub stores the outcome of “test merges” in special refs which aren’t ordinarily sent during fetches, but whose objects are stored in the same object directory nonetheless.)

Instead, Git stores a set of reachability bitmaps corresponding to some of the commits in a packfile. The idea is rather simple: arrange the objects in a pack in some order (the particular order used is something we’ll discuss shortly in detail). Then, the ith bit in the bitmap corresponding to commit C is 1 if C can reach the ith object, and 0 otherwise.

Having a one-to-one correspondence between objects and bit positions has a couple of appealing properties: taking the union of reachable objects between commits is as simple as ORing their bitmaps together, and taking the difference is as simple as combining AND and NOT. So, when a bitmap exists, Git can dramatically speed up the object negotiation phase:

  • First, OR all of the bitmaps corresponding to the reference tips that the client wants. Call this new bitmap W.
  • Then, do the same with the bitmaps corresponding to the reference tips that the client advertised as already having. Call this bitmap H.
  • Finally, compute W AND NOT H to produce the set of objects the client needs (in other words, everything it wants but does not already have). Then, send those objects.

Because all of the reachability information is encoded directly into the bitmaps, Git saves time by avoiding the need to open up and parse objects, allowing it to produce the same result in a fraction of the time.
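
As a toy illustration of that negotiation (not Git’s implementation, and ignoring the on-disk EWAH encoding), here is a sketch that uses Python integers as bitsets, where bit i stands for the i-th object in pack order:

def objects_to_send(want_bitmaps, have_bitmaps):
    """Return the bit positions (object indices) the server needs to send."""
    W = 0
    for b in want_bitmaps:          # union of everything the client wants
        W |= b
    H = 0
    for b in have_bitmaps:          # union of everything the client already has
        H |= b
    need = W & ~H                   # wants, minus haves
    return [i for i in range(need.bit_length()) if need >> i & 1]

# A tip that reaches objects {0, 1, 2, 4} is wanted; objects {0, 1} are
# already on the client, so only objects 2 and 4 need to be sent.
print(objects_to_send([0b10111], [0b00011]))  # -> [2, 4]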

This idea has been used since at least the 1970s to speed up queries in relational databases. In Git, reachability bitmaps can provide dramatic speed-ups when walking objects that reside in the same pack: walking all of the objects in the Linux kernel repository took more than 33 seconds without bitmaps, but only 1.57 seconds to perform the same traversal with bitmaps.

The object order

How does Git turn a set of objects into a sequence of bit positions? One way you might imagine doing this is to assign bit positions in lexicographic order of object ID: the first bit corresponds to 000023961a, the second to 0000d6543f, the third to 000182eacf, and so on.

Why not do this? Recall that an object’s ID is determined by a SHA-1 of its contents, which means that in this order, any object ends up near the objects it can reach only by chance. And reachability between nearby objects matters: Git compresses the bitmaps using EWAH compression, which relies on having long runs of identical bits. If the object order makes reachability look essentially random from bit to bit, Git won’t be able to compress the bitmaps efficiently.

Pack order—that is, the physical arrangement of objects in a packfile—produces a sequence of bit positions that tends to place reachable objects next to each other. And this produces exactly the kind of long runs of identical bits that make EWAH compression perform well.
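
To see why long runs matter, here is a toy run-length encoder; it is a stand-in for the idea behind EWAH, not the actual format:

from itertools import groupby

def run_length_encode(bits):
    """Collapse a bit sequence into (bit value, run length) pairs."""
    return [(bit, len(list(group))) for bit, group in groupby(bits)]

# Pack order keeps reachable objects together: two long runs, cheap to store.
print(run_length_encode([1] * 8 + [0] * 8))          # [(1, 8), (0, 8)]
# Lexicographic order makes reachability look random: six runs for eight bits.
print(run_length_encode([1, 0, 1, 1, 0, 0, 1, 0]))   # [(1, 1), (0, 1), (1, 2), (0, 2), (1, 1), (0, 1)]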

The problem

But, all of this creates a problem for us: if the order of bit positions is dictated by a pack, then bitmaps are coupled in implementation and in concept to the existence of a single packfile. So, any objects that accumulate outside of the bitmapped pack won’t benefit from the same speed-ups.

To address this, we periodically repack the repository’s entire contents into a single pack, and then generate a new reachability bitmap. This makes reachability queries in more recent parts of the repository’s history faster.

But generating that new pack takes time; in fact, it’s quadratic. Bigger repositories take longer to repack, but also grow at a faster rate, which means they run maintenance more often. This compounding effect sometimes makes it such that some repositories are constantly undergoing maintenance: by the time one maintenance job has finished, another is already sitting in the queue, waiting to be run.

Since the bottleneck for maintenance is the compression of an entire repository’s contents into a single packfile, what would it take to be able to repack the contents into multiple packfiles instead?

Multi-pack indexes

To order a set of objects spanning multiple packs, we looked to a recent Git feature: multi-pack indexes.

When Git wants to locate an object by name in a single pack, it uses that pack’s index (.idx) file, which provides a binary-searchable list of object locations. Multi-pack indexes work similarly to .idx files, but the location they indicate is a pair containing both the pack containing the object, as well as where in that pack the object can be found.

The figure above gives a flavor for what kind of data is organized in a multi-pack index. Here, there are three packs, each with a handful of objects. The multi-pack index stores the location of each unique object among the set of packs. When multiple copies of an object exist (like the green or red objects in packs xyz and abc), ties are broken in favor of the copy in the pack with the earliest modification time.

The order of objects in the multi-pack index differs from the order in each individual pack. Since a pack is free to store objects in any order it wants, the multi-pack index stores objects in lexicographic order so that an object can be found quickly by name using a binary search.

Ordering objects

Given a multi-pack index, how should we order the objects in the packs it contains? We discussed earlier that ordering objects lexicographically results in poor compression. We also noted that objects in packs are ordered topologically, which for our purposes we can consider to imply that individual objects tend to appear near other objects which they reach in this order.

So, any ordering of the objects in a multi-pack index should capture as much of those two properties as possible. With that in mind, we decided on the following order:

  • Objects are first grouped according to which packfile they appear in, and packs are ordered according to the multi-pack index.
  • Objects within the same pack should be ordered according to their locations within that pack.

This effectively concatenates the pack-order of multiple packs together, according to some other ordering defined on the packs themselves.
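
A small sketch of deriving that order (hypothetical in-memory structures; the on-disk multi-pack index format differs):

def pseudo_pack_order(midx_entries):
    """
    midx_entries: one (object_id, pack_rank, offset) tuple per unique object
    selected by the multi-pack index, where pack_rank is the position of the
    chosen pack in the multi-pack index's pack list.
    Returns object IDs in the order used to assign bitmap bit positions:
    grouped by pack, and by offset within each pack.
    """
    ordered = sorted(midx_entries, key=lambda entry: (entry[1], entry[2]))
    return [object_id for object_id, _, _ in ordered]

# Mirroring the example above: the unique objects of the oldest pack come
# first, in offset order, followed by the unique objects of the next pack,
# and so on.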

To see what this looks like, let’s overlay a portion of a bitmap that covers the objects in our earlier example:

The first three bits correspond to the red, yellow, and green objects, respectively. Each one of those objects comes from the xyz pack, which means that the xyz pack has the oldest modification among the three. Scanning the bitmap from left to right, these objects appear in pack order; that is, the byte offset of the red object is less than the byte offsets of the yellow and green objects that follow it.

The purple and blue objects come next, since they are in the pack that follows. But note that the copies of the red and green objects in the abc pack don’t correspond to any bits highlighted. Why? Because the multi-pack index selected the copy of those objects in the earlier pack.

Finally, the orange and pink objects appear, also in pack order. And, as we expect, the copy of the purple object that appears in pack 123 isn’t included in the bitmap, because the copy in pack abc was.

This ordering gives us great locality, but we still need to address how to map bit positions back to the objects they represent. For example, let’s look at the fifth bit position, which we know refers to the blue object: how could Git discover this same fact?

You could reasonably imagine that knowing how many total objects are in each pack would be good enough to figure out which objects each bit corresponds to. But that isn’t enough information; we don’t know how many unique objects selected by the multi-pack index appear in each pack, and we also don’t know which non-unique objects are missing. So it’s not good enough to count past the three bits corresponding to objects in pack xyz and then count two more bits up to the fifth bit, because that would point at the copy of the green object in pack abc.

Reverse indexes

To solve this problem, we introduced reverse indexes. In the same way that the pack index provides a mapping from object name to object location, the reverse index maps an object’s location back to its name.

The idea is simple: in addition to the pack’s contents (stored in a .pack file) and index (stored in an .idx file), we’ll write an array of numbers (which comprise the reverse index, and are stored in a new .rev file). These numbers provide a mapping between an object’s position in pack order and its position in lexicographic order.

To better understand this, let’s take a look at an example on a single pack.

The .idx file (shown in the lower left) lists objects in lexicographic order: the yellow object comes before the red one, and the red one comes before the green one. But their pack order is different: there, the red object comes before the yellow one instead of the other way around. The reverse index helps us unwind the two: it tells us that the red object is in position 3 in lexicographic order, and the yellow object is in position 1.

The reverse index allows us to quickly map from offsets in the packfile to the object positions they correspond to. That, in turn, lets us quickly determine the size of a packed object. For example, say that you want to figure out how large the red object is. Because Git doesn’t store that information directly, you have one of two options: either scan linearly through the packed data (inflating its contents until you locate the stream end), or locate the adjacent object (in this case, the yellow one) by name and measure the difference between their offsets.

Without the reverse index, there is no way to figure out where that adjacent object starts. But with a reverse index, locating the adjacent object is as simple as reading the adjacent entry in the reverse index to discover its offset. Here, that value is 1, which points at an entry in the .idx file, which in turn points at the location in the pack.
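
As a rough sketch of both ideas on a single pack (hypothetical in-memory lists; the real .rev file is a binary table, and Git does these lookups with binary searches rather than the linear scans used here for clarity):

def build_reverse_index(offsets):
    """
    offsets: the pack offset of each object, indexed by its lexicographic
    (.idx) position. Returns the .idx positions sorted by pack offset,
    which is exactly the mapping a .rev file stores.
    """
    return sorted(range(len(offsets)), key=lambda idx_pos: offsets[idx_pos])

def on_disk_size(offsets, rev, idx_pos, pack_data_size):
    """Size of a packed object: the next object's offset minus its own."""
    pack_pos = rev.index(idx_pos)               # where this object sits in pack order
    start = offsets[idx_pos]
    if pack_pos + 1 < len(rev):
        end = offsets[rev[pack_pos + 1]]        # offset of the next object in the pack
    else:
        end = pack_data_size                    # the last object runs to the end of the pack data
    return end - start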

Before the on-disk reverse index existed, Git computed this table on the fly and stored the results in an array in memory. This had a couple of major downsides. To build a reverse index on the fly, Git has to allocate a pair of pack offset and index position for every object, and then sort that array. This costs memory and runtime, both of which scale with the size of the pack. And even though the sorting is done with a radix sort, it can be noticeably slow when performed once per process.

Some initial testing of these reverse indexes showed that they could enable rather dramatic speed-ups when serving fetches on real-world repositories. To verify our early results, we gradually rolled out reverse indexes on a handful of repositories.

Below is the 50th percentile of CPU time for fetches to Homebrew/homebrew-core on our testing host before and after the change:

In this case, we reduced the amount of time it took to serve any fetch of that repository by around 80%. Here’s another plot from the same time which shows the resident set size of the program used to serve fetches by replicas of that same repository:

Encouraged by our first results in production, we proceeded to cautiously roll out on-disk reverse indexes to all other replicas. Once we completed the roll-out, we let a couple of days pass in order for replicas to have enough time to generate reverse indexes. Then, we sampled the overall CPU time it took to serve fetches across all repositories and observed the drop we were hoping for:

In the graph above, you can see three 24-hour cycles. The first day was before we rolled out .rev files, and the second two peaks are after. The per-day peaks dropped from around 10.8 seconds to 7 seconds, for a collective savings of around 35% on all repositories.

Multi-pack bitmaps

Now that we have a format for .rev files, we can reuse it as the missing piece we need to build multi-pack bitmaps. Instead of writing positions relative to a pack .idx file, a multi-pack reverse index writes positions which are relative to a multi-pack-index file.

This provides exactly the information we need to rediscover the mapping between bit positions and the objects they refer to in the multi-pack index. To see why, let’s take a look at another example:

To figure out that the fifth bit corresponds to the blue object in the above diagram, we read the fifth entry in the multi-pack reverse index and get back that the fifth bit maps to the eleventh object in the multi-pack index. And, sure enough, the eleventh object points back to the blue object that we were looking for in pack abc.

Putting it all together, this gives us a bitmap which can refer to objects in multiple packs. Based on its filename, each bitmap knows whether or not it belongs to a pack (and if so, which one) or to a multi-pack index. And based on that information, it can translate its object lookups to be relative to a packfile, or to the multi-pack index via its reverse index.

Since we chose the ordering carefully, these multi-pack bitmaps compress exactly as well as their single-pack counterparts. And they decouple bitmaps from individual packs. So, a repository can still have at most one bitmap, but that bitmap can now correspond to multiple packs.
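
Putting the pieces together, here is a hedged sketch of how a bit position could be translated back to an object through the multi-pack reverse index (hypothetical structures, not the on-disk format):

def object_for_bit(bit_pos, midx_rev, midx_objects):
    """
    midx_rev: for each bit (pseudo-pack) position, the lexicographic position
    of the corresponding object in the multi-pack index.
    midx_objects: the multi-pack index itself, listing each object's ID along
    with its (pack, offset) location, in lexicographic order.
    """
    midx_pos = midx_rev[bit_pos]                   # e.g. the fifth bit -> the eleventh object
    object_id, pack, offset = midx_objects[midx_pos]
    return object_id, pack, offset                 # enough to load the object from its pack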

Geometric repacking

Now that we can include multiple packs in a single bitmap, what’s the best way to repack a repository during maintenance?

With single-pack bitmaps, the only option was to pack everything together into one enormous pack. But now that this restriction no longer exists, we have to figure out the best way to repack the objects. When deciding on a new repacking strategy, we wanted something that struck a balance between two properties:

  • On average, there is a relatively small number of packs in the repository.
  • On average, we create a pack that collects objects that were pushed since the last maintenance run.

We decided on an invariant to ensure that the packs in a repository form a geometric progression by object count. In particular, if you sort the packs by number of objects, with the first pack having the most and the final pack having the least, then each pack will contain at least twice as many objects as the next one.

We taught git repack a new --geometric= mode, which creates exactly this geometric progression. Soon (at the time of writing, these patches are still being submitted and reviewed), you’ll be able to try this yourself on your own repository by running:

$ packsizes() {
    # count the objects in each pack (via its .idx) and list packs largest-first
    find .git/objects/pack -type f -name '*.pack' |
    while read -r pack; do
      printf "%7d %s\n" \
        "$(git show-index < "${pack%.pack}.idx" | wc -l)" "$pack"
    done | sort -rn
  }
$ packsizes # before
$ git repack --write-midx --write-bitmap-index -d --geometric=2
$ packsizes # after

How does this command work? We select a set of packs and combine their objects into one new pack that replaces the set. But picking this set optimally is NP-hard, so we have to approximate it.

To see how we perform that approximation, let's walk through an example. The first step is to figure out how many packs already form a geometric progression. To do this, imagine ordering packs by how many objects they contain. Then, consider each adjacent pair of packs from largest to smallest. At each step, ask: "is the larger pack at least twice the size of the smaller one?". If so, then every pack from that pair onwards already forms a geometric progression.

In this example, the second and third packs (each containing one object) violate our progression. At that point, we know that the second pack, and any packs below it must be repacked together in order to restore the progression.

But we can't just repack the first two packs together, since the combined pack would still be too big (it would contain two objects, and the third pack only contains one). So we grow the set of packs to combine until the invariant is restored:

Here, we had to combine the first four packs together in order to restore a geometric progression. Those packs together contain 7 objects, which is less than half of the next-largest pack (which contains 32 objects). And we can't combine any more packs, since doing so would violate the remainder of the progression.

At this point, we can write a new pack containing just the objects in the set of packs we combined in the previous step. After discarding the now-redundant packs, the remaining packs again form a geometric progression:

This means that the number of packs grows logarithmically over time, so a repository will never have too many packs at any one point in time. It also has the appealing property that older objects tend to get pushed into the larger packs, meaning they get repacked less frequently as time passes. An important corollary of this is that each repack tends to focus on the most recently pushed objects. In other words, this strategy tends to make repacking take time proportional to the number of new objects, not the number of overall objects.
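
Here is a simplified sketch of that selection step, an approximation of the idea rather than Git’s exact --geometric implementation (the pack counts below are consistent with the walkthrough above):

def geometric_split(pack_sizes, factor=2):
    """
    pack_sizes: objects per pack, sorted from smallest to largest.
    Returns how many of the smallest packs should be combined into one new
    pack so that the remaining packs keep the geometric progression.
    """
    n = len(pack_sizes)
    split = 0
    # Walk pairs from the largest pack downward; the progression is broken at
    # the first pair where the larger pack has fewer than `factor` times the
    # objects of the smaller one.
    for i in range(n - 1, 0, -1):
        if pack_sizes[i] < factor * pack_sizes[i - 1]:
            split = i
            break
    # Grow the set to combine until the combined pack is small enough
    # relative to the next pack outside the set.
    combined = sum(pack_sizes[:split])
    while split < n and combined * factor > pack_sizes[split]:
        combined += pack_sizes[split]
        split += 1
    return split  # combine pack_sizes[:split] into one new pack

print(geometric_split([1, 1, 1, 4, 32]))  # -> 4: combine the first four packs (7 objects)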

In order to keep the repository performing well over many geometric repacks, we intersperse an all-into-one repack once for every eight geometric repacks.

Importantly, this means that even though the classically-slow repacks are still slow, we aren't forced to run them every time we want to repack a repository.

Deploying to production

Now that we have a way to write a bitmap that covers the objects in multiple packs, a way to quickly map between bit positions and the objects they refer to in a multi-pack index, and a repacking strategy which only requires modifying recent additions to the repository, we are ready to put everything together.

Before rolling out a change as large as this one, we performed extensive local testing to ensure that writing bitmaps worked correctly, and that our new repacking strategy wasn't silently corrupting repositories. But our local testing can only go so far: there are endless corner cases when serving fetches and clones, so the real test occurs only after putting real traffic in front of these new paths. Our goal was to design a deployment strategy that simultaneously exercised enough of those cases, while also ensuring that any potential corruption could never occur on a majority of repository replicas.

Our first test cases were internal repositories. We broke our original tests into two phases. In the first phase, we wrote "multi-pack bitmaps" which contained only a single pack. This allowed us to exercise the most basic case of multi-pack bitmaps (having only one pack) without running our new repack code. Once we had built confidence in that approach, we expanded our test to alternate between geometric and full repacks.

After a couple of weeks without any issues, we were confident enough in our change to start testing it on other repositories external to GitHub. We selected first an individual host, and then a whole rack of hosts on which every repack would alternate between geometric and full. At this stage, results were encouraging: the average time to repack had dropped significantly, as had the overall amount of time spent repacking.

By this stage, our roll-out proceeded for several weeks in only one of three data centers. Because we never place a majority of repository replicas in any single datacenter, this configuration made it impossible for our changes to corrupt a majority set of replicas in an unrecoverable fashion, while still putting our changes in the request path for a large amount of traffic.

Finally, after adopting this configuration for a week, we proceeded to enroll percentages of replicas hosted in other datacenters to also use multi-pack bitmaps until all repositories were using multi-pack reachability bitmaps.

After rolling out our new repacking strategy with multi-pack bitmaps, we saved on average 5.67 CPU days every hour compared to the old strategy.

Likewise, the average time spent repacking any single repository also dropped considerably. Below, the plot is broken out per-site, and you can see when we began testing in a single site, as well as when we expanded our deployment to all sites.

There, the average dropped from around 1 minute to repack a repository to just 15 seconds.

Future directions

There are two major open areas we’re considering for the future that could enable further performance improvements:

One open area is the bitmap computation itself. Git's bitmap generation code can reuse existing bitmaps by permuting their bits into a new order, but this operation can still take time proportional to the size of the repository. One way to solve this would be to write bitmaps incrementally, only walking the new objects introduced since the last time a bitmap was written. This problem is tricky because it requires not only that the bitmap file be able to be written incrementally, but also that the object ordering we select is stable: that is, that introducing new bitmaps won't render the existing ones meaningless.

Another open area is the pack structure. Creating packs that form a geometric sequence is a promising step forward that allows us to trade off between full and partial repacks. But some repositories are so large that repacking the whole repository isn't feasible, much less desirable. Designing a strategy which freezes the packs containing the oldest objects in a repository's history will allow us to grow to support even larger repositories in the future.

Thank you

This project would not have been possible without help from the upstream Git community, as well as many engineering teams within GitHub. Each of these changes required extensive review, both to the Git project, as well as to internal GitHub services. Special thanks to Jeff King, Derrick Stolee, Jonathan Tan, Junio Hamano, and others for making this possible.

GitHub Desktop supports hiding whitespace, expanding diffs, and creating repository aliases

Post Syndicated from Billy Griffin original https://github.blog/2021-04-28-github-desktop-hiding-whitespace-expanding-diffs-repo-aliases/

GitHub Desktop 2.8 now includes several new features to make it easier to work with diffs and easier for people who have multiple copies of the same repository.

Expand diffs to get more context around changes

We hear a lot about how people love the way GitHub Desktop displays diffs beautifully, but you’re only able to see a few lines around the changes that you or someone else made. Now you can click to expand the diffs above or below your changes to get a more complete picture of the rest of the file around the changes made. You can also use a context menu on the diff to expand the whole file.

Hide whitespace in diffs

Similar to being able to see more context around your changes, sometimes there are a lot of whitespace changes in a file that don’t allow you to get a clear picture of the substantive changes that happened. Now, in both changes and history, you can optionally hide whitespace changes to allow you to focus just on the more meaningful changes to your code. This feature was built almost entirely by Steven Yeh (@say25), a fantastic and close community contributor to GitHub Desktop. Steven is a long-time open source contributor to GitHub Desktop, and we’re immensely grateful for him continuing to help improve the product.

Create aliases for repositories locally

Many developers keep more than one copy of a repository in GitHub Desktop, and the way repositories are displayed makes it tricky to differentiate between them. In GitHub Desktop 2.8, you can create aliases for your local repositories to easily tell them apart in the list.

We appreciate your input

All of the recent features we’ve shipped have come as a direct result of the great feedback we get from talking with users and hearing from you in our open source repository. With more than one million users actively using GitHub Desktop every month and more than 200 community contributors, we so appreciate all the feedback and contributions that help make GitHub Desktop even better every day. Thank you!

And if you’ve never tried it, or have not tried it in a while, download GitHub Desktop today.

How we ship code faster and safer with feature flags

Post Syndicated from Alberto Gimeno original https://github.blog/2021-04-27-ship-code-faster-safer-feature-flags/

At GitHub, we’re continually improving existing features and shipping new ones, from our launch of GitHub Discussions to the release of manual approvals for GitHub Actions. In order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.

Reducing deployment risk

We deploy changes around the clock, and we need to keep the service running with no disruptions. For these reasons, deploying needs to be an almost risk-free process. Any problem that comes up during a deployment requires a rollback of the code. During this rollback period, users could be impacted and other deployments delayed; the more time that elapses the bigger the impact.

We use feature flags to reduce the risk of deploying to production. Any potentially risky change is put behind a feature flag in the code and then, when the deployment is done, we enable the feature flag to everyone or to a percentage of actors. This way we minimize the impact of the new changes, and if something goes wrong we can disable the feature flag completely in a matter of seconds without interrupting other deployments. This is fundamental to us for the following reasons:

  • It isolates and reduces the deployment risk.
  • We have the ability to disable changes in seconds without rolling back a deployment, which could take minutes.

What kind of things can go wrong during a deployment, where a feature flag can help? These are some examples:

  • Changes in database queries can potentially make existing functionality slower.
  • Changes in the logic on existing features can produce unexpected behavior, edge cases not considered, and more.
  • Changes in how we store or process information, such as saving new columns in the database, can accidentally start inserting invalid values or generate too many writes to the database.
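
For instance, a risky change to a database query (the first example above) might be guarded roughly like this. This is a minimal Python sketch with a hypothetical in-memory flag store, not GitHub’s actual implementation:

# Hedged sketch: the flag store and query functions are hypothetical stand-ins.
ENABLED_FLAGS = {"optimized_repo_query": {"alice"}}   # flag name -> enabled actors

def is_enabled(flag, actor):
    return actor in ENABLED_FLAGS.get(flag, set())

def legacy_query(user):
    return f"SELECT * FROM repositories WHERE owner = '{user}'"

def optimized_query(user):
    return f"SELECT id, name FROM repositories WHERE owner = '{user}' LIMIT 100"

def fetch_repositories(user):
    if is_enabled("optimized_repo_query", user):
        return optimized_query(user)   # new, potentially risky code path
    return legacy_query(user)          # existing behaviour stays untouched

print(fetch_repositories("alice"))     # flagged actor exercises the new path
print(fetch_repositories("bob"))       # everyone else keeps the old behaviour

If the new query misbehaves in production, turning the flag off reverts every request to the legacy path in seconds, without rolling back the deployment.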

Working on new features

Feature flags are not only useful for reducing deployment risk when fixing or improving existing features; we also use them when developing new features!

Using feature flags allows us to work on features incrementally. Only staff members working on the project have the corresponding feature flag enabled, so they can see the new feature that is under development while other users cannot. We don’t use long-lived feature branches. Instead, feature flags allow us to work on small batches, which brings us many benefits:

  • Small batches are easier for other engineers to review in pull requests.
  • The smaller the change, the smaller the chance of getting something wrong during the production deployment.
  • Not using long-lived feature branches avoids lots of potential merge conflicts and clashes with other features under development.

In order to adopt this way of building new features, you need to do a little bit of additional planning upfront to divide the work. For example, a new feature typically needs new data models or changes in existing ones. If you create a pull request only for the data model changes, then you may get blocked until those changes are merged. There are a couple of things we usually do to prevent this:

  • Create a main pull request (a spike) where you put the data model changes, and continue working on other areas that require changes: UI, API, background jobs, etc. Then start extracting those changes into smaller pull requests that your peers can review.
  • Alternatively, you can create an initial pull request, and then create new branches off of that first branch, even if it’s not merged to the main branch yet. If you need to make changes to the first branch, you’ll need to rebase the other branches.

This way of working could introduce some friction due to having to ask for reviews more frequently. Sometimes it can take days until a team reviews your pull request. To prevent this from being a blocker, we follow one or more of these strategies:

  • Keep working even if the changes are not merged into the main branch, either on that “spike” pull request or on a branch off the branch that is waiting for reviews.
  • Have a first responder in each team who focuses on reviewing pull requests quickly to unblock the progress of their own team or other teams.
  • If a critical pull request is ready to be deployed sooner rather than later but has outstanding non-critical feedback waiting to be addressed, we address that feedback in follow-up pull requests.

Testing features

We have to make sure that the new functionality works as expected while also maintaining the existing functionality. In order to do that, we have many mechanisms to enable or disable feature flags at all levels, as follows (a small test-helper sketch follows the list):

  • In our development environments, we can toggle feature flags from the command line.
  • In automated tests, we can enable or disable feature flags in the code.
  • In our CI, we have two different builds: one that runs with all feature flags disabled by default, and another one that runs with all feature flags enabled by default. This drastically reduces the chances of not covering most code paths in automated tests properly.
  • In production, we can enable or disable feature flags in the query string of a request.
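
As a minimal sketch of the second bullet above, assuming a simple in-memory flag store rather than GitHub’s actual tooling, a test helper could look like this:

from contextlib import contextmanager

FLAGS = {}  # hypothetical in-memory flag store consulted by application code

@contextmanager
def feature_enabled(flag, enabled=True):
    """Temporarily force a flag on (or off) for the duration of a test."""
    previous = FLAGS.get(flag)
    FLAGS[flag] = enabled
    try:
        yield
    finally:
        if previous is None:
            FLAGS.pop(flag, None)
        else:
            FLAGS[flag] = previous

def test_both_code_paths():
    with feature_enabled("optimized_repo_query", enabled=True):
        assert FLAGS["optimized_repo_query"] is True    # exercise the new path
    with feature_enabled("optimized_repo_query", enabled=False):
        assert FLAGS["optimized_repo_query"] is False   # exercise the legacy path

A CI setup like the one in the third bullet would then run the whole suite twice, once with every flag defaulted on and once with every flag defaulted off.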

Different shipping strategies

We already mentioned that we can flag particular actors, or we can enable a feature flag for a percentage of actors. Those are the most common options, but others include:

  • Individual actors: We use this mechanism to flag employees working on a feature, or to enable the flag for customers that are experimenting with a feature or that have a problem we are trying to fix.
  • Staff shipping: When a feature is almost ready to be released, we first enable it for all GitHub staff, publish an internal post for awareness, and create an internal issue for gathering feedback or possible bugs.
  • Early access maintainers or other beta groups: When we are about to release an important feature that will impact the workflows of open source software (OSS) maintainers or other types of users, we test the feature with a small group first. After some time, we may interview them and gather their feedback to confirm our hypothesis, and validate the implementation.
  • Percentage of actors: As mentioned earlier, we can specify a percentage of actors for which to enable the feature flag. In this case when an actor is flagged, it remains flagged, unless the feature flag is rolled back. When the percentage of actors is changed, a message is sent to our deployments channel on Slack so other engineers are aware of the change.
  • Dark shipping: This allows us to enable the feature flag for a percentage of calls. This is different from the previous mechanism because an actor can get the feature enabled in one call (one request, for example) but not in the next one. This mechanism is not meant for features that are visible to the users, but for internal changes, such as a performance improvement in a query.

All of these strategies, and the full administration of feature flags, are handled through a web UI. Not all staff members have access to this UI, but most engineers do. The interface allows us to manage the shipping status of a feature flag, create new flags, and delete them. Additionally, we can see a history of changes made to the feature flag: who made the change, when and what changes were made, and other metadata, such as which team owns it.
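
The difference between “percentage of actors” and “dark shipping” comes down to whether the decision is a stable function of the actor or a fresh draw on every call. Here is a sketch of both; hash-bucketing is one common way to make the per-actor decision deterministic (the post suggests GitHub instead records flagged actors in a list), and none of this is GitHub’s actual code:

import hashlib
import random

def enabled_for_actor(flag, actor_id, percentage):
    """Percentage of actors: deterministic, so a flagged actor stays flagged."""
    digest = hashlib.sha256(f"{flag}:{actor_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

def enabled_for_call(percentage):
    """Dark shipping: every call is decided independently."""
    return random.random() * 100 < percentage

# The same actor always gets the same answer at a given percentage...
print(enabled_for_actor("new_query_path", "user:42", 30))
print(enabled_for_actor("new_query_path", "user:42", 30))  # identical result
# ...whereas dark shipping may flip between two consecutive requests.
print(enabled_for_call(30), enabled_for_call(30))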

What to flag

When we say that we can flag an actor, we mean that we can flag not only users, but other entities too. In particular, we can flag users, organizations, teams, enterprises, repositories, or GitHub Apps. When writing the code that checks whether the flag is enabled, we need to specify which actor the check applies to. These are some examples:

  • If the flag is purely visual, we flag the user, and we check the current logged user. Examples of this could be: dark mode, layout or navigation changes, and new components in the UI that don’t depend on data changes.
  • If the flag changes the way we store data, it is usually better to flag the repository for consistency for all users using the repository. For example, if we start storing timestamps for specific actions, we don’t want some users to produce the new timestamp while others do not.
  • For API changes, we also flag GitHub Apps.
  • Sometimes, especially for some risky changes, we create custom, context-specific actor types, where the existing actor types just wouldn’t have worked.

Sometimes we do more sophisticated checks, such as checking multiple actors or checking whether a repository is public. For example, when we staff ship a feature, it may make a new feature available for all repositories of an employee. If the feature must not be visible to other users, we want to enable it only for private repositories, to prevent an employee’s public repositories from leaking the new feature.

The cost of a feature flag

As we have seen, using feature flags extensively changes the way we work, and requires a bit of additional planning and coordination. But there are also additional costs associated with feature flags that we need to keep in mind. First, we have a runtime cost. When we check if a feature flag is enabled for an actor, we need to first load the metadata information of the flag, which is stored in MySQL but cached in memcached. We are now considering keeping the feature metadata and the list of actors always in memory to reduce this overhead. There’s an exception to this: we consider some feature flags “large.” An example of this was GitHub Actions, a feature that we rolled out to tens of thousands of users every week. For these large feature flags, while they are still not fully enabled, we need to do a per-actor query to MySQL to check if the actor is in the list of enabled actors.

Beyond runtime, there’s also a cost in the form of technical debt. Once a feature flag is fully enabled, it leaves a lot of dead code and old tests in the repository that need to be deleted. Usually we do this manually after a feature has been rolled out for a few days or weeks, depending on the case. But we are now automating the process: we have created a script that can either be invoked locally or triggered manually as a workflow.

The script uses regular expressions to find usages of the feature flag using git grep. Then for the matched Ruby files, it modifies the code using rubocop-ast. It is capable of deleting code blocks, such as if/else statements, and modifying boolean expressions, in addition to reindenting the code afterwards. When the file is a test, it deletes the code that enables the feature flag. For tests that check the behavior of the application when the feature flag is disabled, it just leaves a comment and makes the test fail, since in many cases we don’t want to just delete the test, but simply update it. As a result, even when the script requires some intervention from an engineer, it drastically reduces the manual work required in the process.

When the script is run using a workflow dispatch, it also creates a new branch and a pull request. Since we make extensive use of CODEOWNERS, the pull request will have the proper teams and people as reviewers. In the near future, we want to automate this process even more by running the script automatically when we consider a feature flag stale.
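
As a rough illustration of the first step of such a cleanup script (finding the flag’s call sites), here is a Python sketch around git grep; it is not GitHub’s actual tooling, which goes on to rewrite the matched Ruby files with rubocop-ast:

import subprocess

def find_flag_usages(flag_name, repo_path="."):
    """Return (file, line number) pairs that mention the feature flag."""
    result = subprocess.run(
        ["git", "grep", "-n", flag_name],
        cwd=repo_path, capture_output=True, text=True,
    )
    usages = []
    for line in result.stdout.splitlines():
        path, lineno, _content = line.split(":", 2)
        usages.append((path, int(lineno)))
    return usages

# The matched files would then be handed to a language-aware rewriter that
# deletes the dead branches and re-indents the surrounding code.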

Conclusion

Feature flags allow us to ship code faster and with higher confidence. Using them, we reduce deployment risk and make code reviews and the development of new features easier, because we can work in small batches. There are also associated costs, but the benefits far exceed them.

How We Improved Agent Chat Efficiency with Machine Learning

Post Syndicated from Grab Tech original https://engineering.grab.com/how-we-improved-agent-chat-efficiency-with-ml

In previous articles (see Grab’s in-house chat platform, workforce routing), we shared how chat has grown to become one of the primary channels for support in the last few years.

With continuous chat growth and a new in-house tool, helping our agents be more efficient and productive was key to ensure a faster support time for our users and scale chat even further.

Starting from an analysis of the usage of another third-party tool, as well as some shadowing sessions, we realised that building a template-based feature wouldn’t help. We needed to offer personalisation capabilities, as our consumer support specialists care about their writing style and tone, and using templates often feels robotic.

We decided to build a machine learning model, called SmartChat, which offers contextual suggestions by leveraging several sources of internal data, helping our chat specialists type much faster, and hence serving more consumers.

In this article, we are going to explain the process from problem discovery to design iterations, and share how the model was implemented from both a data science and software engineering perspective.

How SmartChat Works

Diving Deeper into the Problem

Agent productivity became a key part in the process of scaling chat as a channel for support.

After splitting chat time into all its components, we noted that agent typing time represented a big portion of the chat support journey, making it the perfect problem to tackle next.

After some analysis on the usage of the third-party chat tool, we found out that even with functionalities such as canned messages, 85% of the messages were still free typed.

Hours of shadowing sessions also confirmed that the consumer support specialists liked to add their own flair. They would often use the template and adjust it to their style, which took more time than just writing it on the spot. With this in mind, it was obvious that templates wouldn’t be too helpful, unless they provided some degree of personalisation.

We needed something that reduces typing time and also:

  • Allows some degree of personalisation, so that answers don’t seem robotic and repeated.
  • Works with multiple languages and nuances, considering Grab operates in 8 markets; even the English-speaking markets have slight differences in commonly used words.
  • Is contextual to the problem and takes into account the user type, the issue reported, and even the time of the day.
  • Ideally doesn’t require any maintenance effort, such as having to keep templates updated whenever there’s a change in policies.

Considering the constraints, this seemed to be the perfect candidate for a machine learning-based functionality, which predicts sentence completion by considering all the context about the user, issue and even the latest messages exchanged.

Usability is Key

To fulfil the hypothesis, there are a few design considerations:

  1. Minimising the learning curve for agents.
  2. Avoiding visual clutter if recommendations are not relevant.

To increase the probability of predicting an agent’s message, one of the design explorations is to allow agents to select the top 3 predictions (Design 1). To onboard agents, we designed a quick tip to activate SmartChat using keyboard shortcuts.

By displaying the top 3 recommendations, we learnt that this slowed agents down, as they started to read all the options even when the recommendations were not helpful. Besides, because this component was triggered upon every recommendable text, it became a distraction, as agents were forced to pause.

In our next design iteration, we decided to leverage and reuse the interaction of SmartChat from a familiar platform that agents already use – Gmail’s Smart Compose. As agents are familiar with Gmail, the learning curve for this feature would be less steep. First-time users see a “Press tab” tooltip, which activates the text recommendation. The tooltip disappears after five uses.

To relearn the shortcut, agents can hover over the recommended text.

How We Track Progress

Knowing that this feature would come in multiple iterations, we had to find ways to track how well we were doing progressively, so we decided to measure the different components of chat time.

We realised that the agent typing time is affected by the following (a sketch of how some of these metrics could be computed follows the list):

  • Percentage of characters saved. This tells us that the model predicted correctly, and also saved time. This metric should increase as the model improves.
  • Model’s effectiveness. The number of characters the agent has to type before getting the right suggestion, which should decrease as the model learns.
  • Acceptance rate. This tells us how many messages were written with the help of the model. It is a good proxy for feature usage and model capabilities.
  • Latency. If the suggestion is not shown within about 100-200ms, the agent will not notice the text and will keep typing.
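
As an illustration, the first and third metrics could be computed from per-message logs roughly as follows (the log fields here are hypothetical, not Grab’s actual schema):

def chat_metrics(messages):
    """
    messages: dicts with 'typed_chars', 'final_chars' and 'used_suggestion'
    fields (hypothetical names). Returns aggregate metrics for the period.
    """
    total_final = sum(m["final_chars"] for m in messages)
    total_typed = sum(m["typed_chars"] for m in messages)
    accepted = sum(1 for m in messages if m["used_suggestion"])
    return {
        "pct_characters_saved": 100.0 * (total_final - total_typed) / total_final,
        "acceptance_rate": 100.0 * accepted / len(messages),
    }

print(chat_metrics([
    {"typed_chars": 40, "final_chars": 80, "used_suggestion": True},
    {"typed_chars": 60, "final_chars": 60, "used_suggestion": False},
]))  # about 28.6% of characters saved, 50% acceptance rate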

Architecture

The architecture involves support specialists initiating the fetch-suggestion request, which is sent for evaluation to the machine learning model through an API gateway. This ensures that only authenticated requests are allowed through and that proper rate limiting is applied.

We have an internal platform called Catwalk, which is a microservice that offers the capability to execute machine learning models as a HTTP service. We used the Presto query engine to calculate and analyse the results from the experiment.

Designing the Machine Learning Model

I am sure all of us can remember an experiment we did in school when we had to catch a falling ruler. For those who have not done this experiment, feel free to try it at home! The purpose of this experiment is to define a ballpark number for typical human reaction time (equations also included in the video link).

Typically, the human reaction time ranges from 100ms to 300ms, with a median of about 250ms (read more here). Hence, we decided to set the upper bound for SmartChat response time to be 200ms while deciding the approach. Otherwise, the experience would be affected as the agents would notice a delay in the suggestions. To achieve this, we had to manage the model’s complexity and ensure that it achieves the optimal time performance.

Taking into consideration network latencies, the machine learning model would need to churn out predictions in less than 100ms, in order for the entire product to achieve a maximum 200ms refresh rate.

As such, a few key components were considered:

  • Model Tokenisation
    • Model input/output tokenisation needs to be implemented along with the model’s core logic so that it is done in one network request.
    • Model tokenisation needs to be lightweight and cheap to compute.
  • Model Architecture
    • This is a typical sequence-to-sequence (seq2seq) task so the model needs to be complex enough to account for the auto-regressive nature of seq2seq tasks.
    • We could not use pure attention-based models, which are usually state of the art for seq2seq tasks, as they are bulky and computationally expensive.
  • Model Service
    • The model serving platform should be executed on a low-level, highly performant framework.

Our proposed solution considers the points listed above. We have chosen to develop in TensorFlow (TF), which is a well-supported framework for building machine learning models and applications.

For Latin-based languages, we used a simple whitespace tokenizer, which is serialisable in the TF graph using the tensorflow-text package.

import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
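
For instance, calling the tokenizer on a raw string produces token tensors directly inside the TF graph (a small illustration; the real model wiring differs):

import tensorflow_text as text

tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["We have contacted the driver"])
print(tokens)  # RaggedTensor of byte tokens: [[b'We', b'have', b'contacted', b'the', b'driver']]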

For the model architecture, we considered a few options but eventually settled on a simple recurrent neural network (RNN) architecture, in an Encoder-Decoder structure (a minimal code sketch follows this list):

  • Encoder
    • Whitespace tokenisation
    • Single layered Bi-Directional RNN
    • Gated-Recurrent Unit (GRU) Cell
  • Decoder
    • Single layered Uni-Directional RNN
    • Gated-Recurrent Unit (GRU) Cell
  • Optimisation
    • Teacher-forcing in training, Greedy decoding in production
    • Trained with a cross-entropy loss function
    • Using ADAM (Kingma and Ba) optimiser
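
A minimal Keras sketch of such an encoder-decoder is shown below. It is illustrative only: the vocabulary size and layer widths are made up, the extra context features described in the next section are omitted, and it is not Grab’s production model.

import tensorflow as tf

VOCAB_SIZE = 8000   # hypothetical vocabulary size
EMBED_DIM = 128
UNITS = 256

# Encoder: embed the context tokens and run a single bidirectional GRU layer.
encoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="context_tokens")
enc_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(encoder_inputs)
_, fwd_state, bwd_state = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(UNITS, return_state=True)
)(enc_emb)
encoder_state = tf.keras.layers.Concatenate()([fwd_state, bwd_state])

# Decoder: a single unidirectional GRU initialised with the encoder state.
# During training it receives the ground-truth previous tokens (teacher
# forcing); in production, decoding is greedy, one token at a time.
decoder_inputs = tf.keras.Input(shape=(None,), dtype="int32", name="typed_prefix")
dec_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_inputs)
dec_out = tf.keras.layers.GRU(2 * UNITS, return_sequences=True)(
    dec_emb, initial_state=encoder_state
)
logits = tf.keras.layers.Dense(VOCAB_SIZE)(dec_out)

model = tf.keras.Model([encoder_inputs, decoder_inputs], logits)
model.compile(
    optimizer="adam",   # ADAM optimiser, cross-entropy loss as listed above
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.summary()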

Features

To provide context for the sentence completion tasks, we provided the following features as model inputs:

  • Past conversations between the chat agent and the user
  • Time of the day
  • User type (Driver-partners, Consumers, etc.)
  • Entrypoint into the chat (e.g. an article on cancelling a food order)

These features give the model the ability to generalise beyond a simple language model, with additional context on the nature of the contact for support. This additional context also makes for a better and more customised user experience.

For example, the model is better aware of the nature of time in addressing “Good {Morning/Afternoon/Evening}” given the time of the day input, as well as being able to interpret meal times in the case of food orders. E.g. “We have contacted the driver, your {breakfast/lunch/dinner} will be arriving shortly”.

Typeahead Solution for the User Interface

With our goal of providing a seamless experience from showing suggestions to accepting them, we decided to implement a typeahead solution in the chat input area. This solution had to be implemented with the ReactJS library, as the internal web app used by our support specialists for handling chats is built in React.

There were a few ways to achieve this:

  1. Modify the Document Object Model (DOM) using Javascript to show suggestions by positioning them over the input HTML tag based on the cursor position.
  2. Use a content editable div and have the suggestion span render conditionally.

After evaluating the complexity in both approaches, the second solution seemed to be the better choice, as it is more aligned with the React way of doing things: avoid DOM manipulations as much as possible.

However, when a suggestion is accepted, we would still need to update the content editable div through DOM manipulation. The input content cannot be kept in React’s state, as doing so creates a laggy experience for the user as they type.

Here is a code snippet for the implementation:

import React, { Component } from 'react';
import liveChatInstance from './live-chat';

export class ChatInput extends Component {
 constructor(props) {
   super(props);
   this.state = {
     suggestion: '',
   };
 }

 getCurrentInput = () => {
   const { roomID } = this.props;
   const inputDiv = document.getElementById(`input_content_${roomID}`);
   const suggestionSpan = document.getElementById(
     `suggestion_content_${roomID}`,
   );

   // put the check for extra safety in case suggestion span is accidentally cleared
   if (suggestionSpan) {
     const range = document.createRange();
     range.setStart(inputDiv, 0);
     range.setEndBefore(suggestionSpan);
     return range.toString(); // content before suggestion span in input div
   }
   return inputDiv.textContent;
 };

 handleKeyDown = async e => {
   const { roomID } = this.props;
   // tab or right arrow for accepting suggestion
   if (this.state.suggestion && (e.keyCode === 9 || e.keyCode === 39)) {
     e.preventDefault();
     e.stopPropagation();
     this.insertContent(this.state.suggestion);
     this.setState({ suggestion: '' });
   }
   const parsedValue = this.getCurrentInput();
   // space
   if (e.keyCode === 32 && !this.state.suggestion && parsedValue) {
     // fetch suggestion
     const prediction = await liveChatInstance.getSmartComposePrediction(
       parsedValue.trim(), roomID);
     this.setState({ suggestion: prediction })
   }
 };

 insertContent = content => {
   // insert content behind cursor
   const { roomID } = this.props;
   const inputDiv = document.getElementById(`input_content_${roomID}`);
   if (inputDiv) {
     inputDiv.focus();
     const sel = window.getSelection();
     // check rangeCount before calling getRangeAt to avoid a runtime error
     if (sel && sel.rangeCount) {
       const range = sel.getRangeAt(0);
       range.insertNode(document.createTextNode(content));
       range.collapse();
     }
   }
 };

 render() {
   const { roomID } = this.props;
   return (
     <div className="message_wrapper">
       <div
         id={`input_content_${roomID}`}
         role={'textbox'}
         contentEditable
         spellCheck
         onKeyDown={this.handleKeyDown}
       >
         {!!this.state.suggestion.length && (
           <span
             contentEditable={false}
             id={`suggestion_content_${roomID}`}
           >
             {this.state.suggestion}
           </span>
         )}
       </div>
     </div>
   );
 }
}

The solution uses the spacebar as the trigger for fetching the suggestion from the ML model and stores it in React state. The ML model prediction is then displayed in a conditionally rendered span.

We used the window.getSelection() and range APIs to:

  • Find the current input value
  • Insert the suggestion
  • Clear the input to type a new message

The implementation has also considered the following:

  • Caching. API calls are made on every space character to fetch the prediction. To reduce the number of API calls, we also cached the prediction until it differs from the user input.
  • Recover placeholder. There are data fields that are specific to the agent and consumer, such as agent name and user phone number, and these data fields are replaced by placeholders for model training. The implementation recovers the placeholders in the prediction before showing it on the UI.
  • Control rollout. Since rollout is by percentage per country, the implementation has to ensure that only certain users can access predictions from their country chat model.
  • Aggregate and send metrics. Metrics are gathered and sent for each chat message.

Results

The initial experiment results suggested that we managed to save 20% of characters, which improved the efficiency of our agents by 12% as they were able to resolve the queries faster. These numbers exceeded our expectations and as a result, we decided to move forward by rolling SmartChat out regionally.

What’s Next?

In the upcoming iteration, we are going to focus on non-Latin language support, caching, and continuous training.

Non-Latin Language Support and Caching

The current model only works with Latin languages, where sentences consist of space-separated words. We are looking to provide support for non-Latin languages such as Thai and Vietnamese. The result would also be cached in the frontend to reduce the number of API calls, providing the prediction faster for the agents.

Continuous Training

The current machine learning model is built with training data derived from historical chat data. In order to teach the model and improve the metrics mentioned in our goals, we will enhance the model by letting it learn from data gathered in day-to-day chat conversations. Along with this, we are going to train the model to give better responses by providing more context about the conversations.

Seeing how effective this solution has been for our chat agents, we would also like to expose this to the end consumers to help them express their concerns faster and improve their overall chat experience.


Special thanks to Kok Keong Matthew Yeow, who helped to build the architecture and implementation in a scalable way.

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How Grab Leveraged Performance Marketing Automation to Improve Conversion Rates by 30%

Post Syndicated from Grab Tech original https://engineering.grab.com/learn-how-grab-leveraged-performance-marketing-automation

Grab, Southeast Asia’s leading superapp, is a hyperlocal three-sided marketplace that operates across hundreds of cities in Southeast Asia. Grab started out as a taxi hailing company back in 2012 and in less than a decade, the business has evolved tremendously and now offers a diverse range of services for consumers’ everyday needs.

To fuel our business growth in newer service offerings such as GrabFood, GrabMart and GrabExpress, user acquisition efforts play a pivotal role in ensuring we create a sustainable Grab ecosystem that balances the marketplace dynamics between our consumers, driver-partners and merchant-partners.

Part of our user growth strategy is centred around our efforts in running direct-response app campaigns to increase trials on our superapp offerings. Executing these campaigns brings about a set of unique challenges against the diverse cultural backdrop present in Southeast Asia, challenging the team to stay hyperlocal in our strategies while driving user volumes at scale. To address these unique challenges, Grab’s performance marketing team is constantly seeking ways to leverage automation and innovate on our operations, improving our marketing efficiency and effectiveness.

Managing Grab’s Ever-expanding Business, Geographical Coverage and New User Acquisition

Grab’s ever-expanding services, extensive geographical coverage and hyperlocal strategies result in an extremely dynamic, yet complex ad account structure. This also means that whenever there is a new business vertical launch or hyperlocal campaign, the team would spend valuable hours rolling out a large volume of new ads across our accounts in the region.

A sample of our Google Ads account structure.

The granular structure of our Google Ads account provided us with flexibility to execute hyperlocal strategies, but this also resulted in thousands of ad groups that had to be individually maintained.

In 2019, Grab’s growth was simply outpacing our team’s resources and we finally hit a bottleneck. This challenged the team to take a step back and make the decision to pursue a fully automated solution built on the following principles for long term sustainability:

  • Building ad-tech solutions in-house instead of acquiring off-the-shelf solutions

    Grab’s unique business model calls for several tailor-made features, none of which the existing ad tech solutions were able to provide.

  • Shifting our mindset to focus on the infinite game

    In order to sustain the exponential volume in the ads we run, we had to seek the path of automation.

For our very first automation project, we decided to look into automating creative refresh and upload for our Google Ads account. With thousands of ad groups running multiple creatives each, this had become a growing problem for the team. Over time, manually monitoring these creatives and refreshing them on a regular basis had become impossible.

The Automation Fundamentals

Grab’s superapp nature means that any automation solution fundamentally needs to be built on these principles:

  • Performance-driven – to maintain and improve conversion efficiency over time
  • Flexibility – to fit needs across business verticals and hyperlocal execution
  • Inclusivity – to account for future service launches and marketing tech (e.g. product feeds and more)
  • Scalability – to account for future geography/campaign coverage

With these principles in mind, we incorporated them into the requirements for the custom creative automation tool we planned to build.

  • Performance-driven – while many advertising platforms, such as Google’s App Campaigns, have built-in algorithms to prevent low-performing creatives from being served, the fundamental bottleneck lies in the speed at which these low-performing creatives can be replaced with new assets to improve performance. Thus, solving this bottleneck would become the primary goal of our tool.

  • Flexibility – to accommodate our broad range of services, geographies and marketing objectives, a mapping logic would be required to make sure the right creatives are added into the right campaigns.

    To solve this, we relied on a standardised creative naming convention, using key attributes in the file name to map an asset to a specific campaign and ad group (a minimal parsing sketch follows this list) based on:

    • Market
    • City
    • Service type
    • Language
    • Creative theme
    • Asset type
    • Campaign optimisation goal
  • Inclusivity – to address coverage of future service offerings and interoperability with existing ad-tech vendors, we designed and built our tool conforming to many industry API and platform standards.

  • Scalability – to ensure full coverage of future geographies/campaigns, the in-house solution’s frontend and backend had to be robust enough to handle volume. Working hand in glove with Google, the solution was built by leveraging multiple APIs, including Google Ads and YouTube, to host and replace low-performing assets across our ad groups. The solution was then deployed on AWS’ serverless compute engine.
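
To make the mapping concrete, here is a minimal sketch in Go of how a creative’s file name could be parsed into the attributes used to find its campaign and ad group. The delimiter, attribute order, and type names are assumptions for illustration only, not the team’s actual convention.

package main

import (
  "fmt"
  "path/filepath"
  "strings"
)

// creativeAttributes captures the key attributes encoded in a creative's file name.
type creativeAttributes struct {
  Market, City, Service, Language, Theme, AssetType, Goal string
}

// parseCreativeName splits a file name such as
// "SG_Singapore_Food_EN_Promo_Video_Installs.mp4" into its attributes.
func parseCreativeName(fileName string) (creativeAttributes, error) {
  base := strings.TrimSuffix(fileName, filepath.Ext(fileName))
  parts := strings.Split(base, "_")
  if len(parts) != 7 {
    return creativeAttributes{}, fmt.Errorf("unexpected file name format: %s", fileName)
  }
  return creativeAttributes{
    Market: parts[0], City: parts[1], Service: parts[2], Language: parts[3],
    Theme: parts[4], AssetType: parts[5], Goal: parts[6],
  }, nil
}

func main() {
  attrs, err := parseCreativeName("SG_Singapore_Food_EN_Promo_Video_Installs.mp4")
  if err != nil {
    fmt.Println(err)
    return
  }
  // The parsed attributes can now be matched against campaign and ad group metadata.
  fmt.Printf("%+v\n", attrs)
}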

Enter CARA

CARA is an automation tool that scans for any low-performing creatives and replaces them with new assets from our creative library:

A sneak peek of how CARA works

In a controlled experimental launch, we saw nearly 2,000 underperforming assets automatically replaced across more than 8,000 active ad groups, translating to an 18-30% increase in clickthrough and conversion rates.

A subset of results from CARA’s experimental launch

Through automation, Grab’s performance marketing team has been able to significantly improve clickthrough and conversion rates while saving valuable man-hours. We have also established a scalable foundation for future growth. The best part? We are just getting started.


Authored on behalf of the performance marketing team @ Grab. Special thanks to the CRM data analytics team, particularly Milhad Miah and Vaibhav Vij for making this a reality.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: February 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-03-03-github-availability-report-february-2021/

Introduction

In February, we experienced no incidents resulting in service downtime to our core services.

This month’s GitHub Availability Report will provide initial details around an incident from March 1 that caused significant impact and a degraded state of availability for the GitHub Actions service.

March 1 09:59 UTC (lasting one hour and 42 minutes)

Our service monitors detected a high error rate on creating check suites for workflow runs. This incident resulted in the failure or delay of some queued jobs for a period of time. Additionally, some customers using the search/filter functionality to find workflows may have experienced incomplete search results.

In February, we experienced a few smaller and unrelated incidents affecting different parts of the GitHub Actions service. While these were localized to a subset of our customers and did not have broad impact, we take reliability very seriously and are conducting a thorough investigation into the contributing factors.

Due to the recency of the March 1 incident, we are still investigating the contributing factors and will provide a more detailed update in the March Availability Report, which will be published the first Wednesday of April. We will also share more about our efforts to minimize the impact of future incidents and increase the performance of the Actions service.

In summary

GitHub Actions has grown tremendously, and we know that the availability and performance of the service is critical to its continued success. We are committed to providing excellent service, reliability, and transparency to our users, and will continue to keep you updated on the progress we’re making to ensure this.

For more information, you can check out our status page for real-time updates on the availability of our systems and the GitHub Engineering blog for deep-dives around how we’re improving our engineering systems and infrastructure.

One small step closer to containerising service binaries

Post Syndicated from Grab Tech original https://engineering.grab.com/reducing-your-go-binary-size

Grab’s engineering teams currently own and manage more than 250 microservices. Depending on the business problems that each team tackles, our development ecosystem ranges from Golang to Java and everything in between.

Although there are centralised systems that help automate most of the build and deployment tasks, some teams working on different technologies still manage their own build, test and deployment systems at different maturity levels. Managing such a varied build and deploy ecosystem brings its own challenges.

Build challenges

  • Broken external dependencies.
  • Non-reproducible builds due to changes in AMI, configuration keys and other build parameters.
  • Missing security permissions between different repositories.

Deployment challenges

  • Varied deployment environments necessitating a bigger learning curve.
  • Managing the underlying infrastructure as code.
  • Higher downtime when bringing the systems up after a scale down event.

Grab’s appetite for customer obsession and quality drives the engineering teams to innovate and deliver value rapidly. The time that teams spend fixing build issues or handling deployment-related tasks directly reduces the time they can spend on delivering business value.

Introduction to containerisation

Using a container architecture helps the team deploy and run multiple applications, isolated from each other, on the same virtual machine or server, with much less overhead.

At Grab, both the platform and the core engineering teams wanted to move to the containerisation architecture to achieve the following goals:

  • Support to build and push container images during the CI process.
  • Create a standard virtual machine image capable of running container workloads. The AMI is maintained by a central team and comes with Grab infrastructure components such as DataDog, Filebeat, and Vault.
  • A deployment experience which allows existing services to migrate to container workload safely by initially running both types of workloads concurrently.

The core engineering teams wanted to adopt container workloads to achieve the following benefits:

  • Provide a containerised version of the service that can be run locally and on different cloud providers without any dependency on Grab’s internal (runtime) tooling.
  • Allow reuse of common Grab tools in different projects by running the zero dependency version of the tools on demand whenever needed.
  • Allow a more flexible staging/dev/shadow deployment of new features.

Adoption of containerisation

Engineering teams at Grab use the containerisation model to build and deploy services at scale. Our containerisation effort helps the development teams move faster by:

  • Providing a consistent environment across development, testing and production
  • Deploying software efficiently
  • Reducing infrastructure cost
  • Abstracting OS dependency
  • Increasing scalability between cloud vendors

When we started using containers, we realised that smaller containers have some benefits over bigger ones. For example, smaller containers:

  • Include only the needed libraries and therefore are more secure.
  • Build and deploy faster as they can be pulled to the running container cluster quickly.
  • Utilise disk space and memory efficiently.

During the course of containerising our applications, we noted that some service binaries appeared to be bigger (~110 MB) than they should be. For a statically-linked Golang binary, that’s pretty big! So how do we figure out what’s bloating the size of our binary?

Go binary size visualisation tool

In the course of poking around for tools that would help us analyse the symbols in a Golang binary, we found go-binsize-viz, based on this article. We particularly liked this tool because it utilises the existing Golang toolchain (specifically, go tool nm) to analyse imports, and provides a straightforward mechanism for traversing the symbols present via a treemap. We will briefly outline the steps that we took to analyse a Golang binary here.

  1. First, build your service using the following command (important for consistency between builds):

    $ go build -a -o service_name ./path/to/main.go
    
  2. Next, copy the binary over to the cloned directory of the go-binsize-viz repository.
  3. Run the following script that covers the steps in the go-binsize-viz README.

    #!/usr/bin/env bash
    #
    # This script needs more input parsing, but it serves the needs for now.
    #
    mkdir -p dist
    # step 1: dump the symbol table (with sizes) and demangle C++ symbols
    go tool nm -size $1 | c++filt > dist/$1.symtab
    # step 2: convert the symbol table into a Python dictionary
    python3 tab2pydic.py dist/$1.symtab > dist/$1-map.py
    # step 3: produce the data consumed by treemap.html
    # the treemap page expects the file to be named data.js
    python3 simplify.py dist/$1-map.py > dist/$1-data.js
    rm -f data.js
    ln -s dist/$1-data.js data.js
    

    Running this script creates a dist folder where each intermediate step is deposited, and a data.js symlink in the top-level directory that points to the .js file consumed by treemap.html.

    # top-level directory
    $ ll
    -rw-r--r--   1 stan.halka  staff   1.1K Aug 20 09:57 README.md
    -rw-r--r--   1 stan.halka  staff   6.7K Aug 20 09:57 app3.js
    -rw-r--r--   1 stan.halka  staff   1.6K Aug 20 09:57 cockroach_sizes.html
    lrwxr-xr-x   1 stan.halka  staff        65B Aug 25 16:49 data.js -> dist/v2.0.709356.segments-paxgroups-macos-master-go1.13-data.js
    drwxr-xr-x   8 stan.halka  staff   256B Aug 25 16:49 dist
    ...
    # dist folder
    $ ll dist
    total 71728
    drwxr-xr-x   8 stan.halka  staff   256B Aug 25 16:49 .
    drwxr-xr-x  21 stan.halka  staff   672B Aug 25 16:49 ..
    -rw-r--r--   1 stan.halka  staff   4.2M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13-data.js
    -rw-r--r--   1 stan.halka  staff   3.4M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13-map.py
    -rw-r--r--   1 stan.halka  staff    11M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13.symtab
    

    As you can probably tell from the file names, these steps were explored on the segments-paxgroups service, which is a microservice used for segment information at Grab. You can ignore the versioning metadata, branch name, and Golang information embedded in the name.

  4. Finally, run a local python3 server to visualise the binary components.

    $ python3 -m http.server
    Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
    

    So now that we have a methodology to consistently generate a service binary, and a way to explore the symbols present, let’s dive in!

  5. Open your browser and visit http://localhost:8000/treemap_v3.html:

    Of the 103MB binary produced, 81MB are recognisable, with 66MB recognised as Golang symbols. (Some symbols are categorised as UNKNOWN, and parsing produced a fair number of warnings; we haven’t spent enough time with the tool to understand why we aren’t able to recognise and index all the symbols present.)

    Treemap

    The next step is to figure out where the symbols are coming from. There’s a bunch of Grab-internal stuff that for the sake of this blog isn’t necessary to go into, and it was reasonably easy to come to the right answer based on the intuitiveness of the go-binsize-viz tool.

    This visualisation shows us how 11 MB of symbols sneak into the segments-paxgroups binary.

    Visualisation

    Every message format for any service that reads from, or writes to, streams at Grab is included in every service binary! Not cloud native!

How did this happen?

The short answer is that Golang doesn’t import only the symbols that it requires, but rather all the symbols defined within an imported directory (package), along with their transitive dependencies. So, when we think we’re importing just one directory, if our code structure doesn’t follow principles of encapsulation or isolation, we end up importing 11 MB of symbols that we don’t need! In our case, this occurred because a generic Message interface was defined in the same directory as all the auto-generated code you see in the pretty picture above.
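
To illustrate the failure mode and the fix, here is a hypothetical package layout sketch in Go (names and structure are assumptions, not Grab’s actual code): keeping the small, shared interface in its own package means that importing it no longer drags every generated message type into the binary.

// Before (hypothetical layout): the shared interface lives alongside all of
// the generated message types, so importing the package just for the
// interface links every generated type into the binary.
//
//   streams/
//     message.go          // type Message interface { ... }
//     gen_order_event.go  // generated
//     gen_driver_event.go // generated
//     ...                 // hundreds more generated files
//
// After (hypothetical layout): the interface is isolated in a tiny package,
// and each service imports only the generated packages it actually uses.
//
//   streams/message.go          // type Message interface { ... }
//   streams/gen/orderevent/     // generated, imported only where needed
//   streams/gen/driverevent/    // generated, imported only where needed
package streams

// Message is the kind of minimal, dependency-free interface that can safely
// live in its own package.
type Message interface {
  Topic() string
  Payload() []byte
}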

The Streams team did an awesome job of restructuring the code, which, when built again, led to this outcome:

$ ll | grep paxgroups
-rwxr-xr-x   1 stan.halka  staff   110M Aug 21 14:53 v2.0.709356.segments-paxgroups-macos-master-go1.12
-rwxr-xr-x   1 stan.halka  staff   103M Aug 25 16:34 v2.0.709356.segments-paxgroups-macos-master-go1.13
-rwxr-xr-x   1 stan.halka  staff        80M Aug 21 14:53 v2.0.709356.segments-paxgroups-macos-tinkered-go1.12
-rwxr-xr-x   1 stan.halka  staff        78M Aug 25 16:34 v2.0.709356.segments-paxgroups-macos-tinkered-go1.13

Not a bad reduction in service binary size!

Lessons learnt

The go-binsize-viz utility offers a treemap representation for imported symbols, and is very useful in determining what symbols are contributing to the overall size.

Code architecture matters: Keep binaries as small as possible!

To reduce your binary size, follow these best practices:

  • Structure your code so that the interfaces and common classes/utilities are imported from different locations than auto-generated classes.
  • Avoid huge, flat directory structures.
  • If it’s a platform offering and has too many interwoven dependencies, try to decouple the actual platform offering from the company specific instantiations. This fosters creating isolated, minimalistic code.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

New global ID format coming to GraphQL

Post Syndicated from Wissam Abirached original https://github.blog/2021-02-10-new-global-id-format-coming-to-graphql/

The GitHub GraphQL API has been publicly available for over 4 years now. Its usage has grown immensely over time, and we’ve learned a lot from running one of the largest public GraphQL APIs in the world. Today, we are introducing a new format for object identifiers and a roll out plan that brings it to our GraphQL API this year.

What is driving the change?

As GitHub grows and reaches new scale milestones, we realized that the current format of Global IDs in our GraphQL API will not support our projected growth over the coming years. The new format gives us more flexibility and scalability in handling your requests even faster.

What exactly is changing?

We are changing the Global ID format in our GraphQL API. As a result, all object identifiers in GraphQL will change and some identifiers will become longer than they are now. Since you can get an object’s Global ID via the REST API, these changes will also affect an object’s node_id returned via the REST API. Object identifiers will continue to be opaque strings and should not be decoded.

How will this be rolled out?

We understand that our APIs are a critical part of your engineering workflows, and our goal is to minimize the impact as much as possible. In order to give you time to migrate your implementations, caches, and data records to the new Global IDs, we will go through a gradual rollout and thorough deprecation process that includes three phases.

  1. Introduce new format: This phase will introduce the new Global IDs into the wild on a type by type basis, for newly created objects only. Existing objects will continue to have the same ID. We will start by migrating the least requested object types, working our way towards the most popular types. Note that the new Global IDs may be longer and, in case you were storing the ID, you should ensure you can store the longer IDs. During this phase, as long as you can handle the potentially longer IDs, no action is required by you. The expected duration of this phase is 3 months.
  2. Migrate: In this phase you should look to update your caches and data records. We will introduce migration tools allowing you to toggle between the two formats making it easy for you to update your caches to the new IDs. The migration tools will be detailed in a separate blog post, closer to launch date. You will be able to use the old or new IDs to refer to an object throughout this phase. The expected duration of this phase is 3 months.
  3. Deprecate: In this phase, all REST API requests and GraphQL queries will return the new IDs. Requests made with the old IDs will continue to work, but the response will include only the new ID, along with a deprecation warning. The expected duration of this phase is 3 months.

Once the three migration phases are complete, we will sunset the old IDs. All requests made using the old IDs will result in an error. Overall, the whole process should take 9 months, with the goal of giving you plenty of time to adjust and migrate to the new format.

Tell us what you think

If you have any concerns about the rollout of this change impacting your app, please contact us and include information such as your app name so that we can better assist you.

Customer Support workforce routing

Post Syndicated from Grab Tech original https://engineering.grab.com/customer-support-workforce-routing

Introduction

With Grab’s wide range of services, we get large volumes of queries a day. Our Customer Support teams address concerns ranging from safety issues to general FAQs. The teams delight our customers through quick resolutions, resulting from a world-class support framework and an efficient workforce routing system.

Our workforce routing system ensures that available resources are efficiently assigned to a request based on the right skillset and deciding factors such as department, country, and request priority. The ability to work across support channels (e.g. voice, chat, or digital) is another factor considered when routing a request to a particular support specialist.

Sample flow of how it works today

Having an efficient workforce routing system ensures that requests are directed to relevant support specialists who are most suited to handle a certain type of issue, resulting in quicker resolution, happier and satisfied customers, and reduced cost spent on support.

We initially implemented a third-party solution; however, there were a few limitations, such as prioritisation, that motivated us to build our very own routing solution, one that provides better routing configuration controls and reduces licensing costs.

This article describes how we built our in-house workforce routing system at Grab and focuses on Livechat, one of the domains of customer support.

Problem

Let’s run through the issues with our previous routing solution in the next sections.

Priority management

The third-party solution didn’t allow us to prioritise a group of requests over others. This was particularly important for safety issues, which should not be held up behind low-priority requests like general enquiries. So our goal for the in-house solution was to ensure that we could configure the priority of the request queues.

Bespoke product customisation

With the third-party solution being a generic service provider, customisations often required long lead times as not all product requests from Grab were well received by the mass market. Building this in-house meant Grab had full controls over the design and configuration over routing. Here are a few sample use cases that were addressed by customisation:

  • Bulk configuration changes – Previously, it was challenging to assign the same configuration to multiple agents. So, we introduced another layer of grouping for agents that share the same configuration. For example, which queues the agents receive chats from and what the proficiency and max concurrency should be.
  • Resource Constraints – To avoid overwhelming resources with unlimited chats and maintaining reasonable wait times for our customers, we introduced a dynamic queue limit on the number of chat requests enqueued. This limit was based on factors like the number of incoming chats and the agent performance over the last hour.
  • Remote Work Challenges – With the pandemic situation and more of our agents working remotely, network issues were common. So we released an enhancement on the routing system to reroute chats handled by unavailable agents (due to disconnection for an extended period) to another available agent. The seamless experience helped increase customer satisfaction.

Reporting and analytics

Similar to the previous point, a solution built for generic use cases doesn’t allow you to add customisations to monitoring at will. With the custom implementation, we were able to add more granular metrics, which are very useful for assessing agent productivity and performance and for planning resources ahead of time. This is why reporting and analytics were so valuable for workforce planning. A few of the customisations we added were:

  • Agent Time Utilisation – While basic agent tracking was available in the out-of-the-box solution, it limited users to three states (online, away, and invisible). With the custom routing solution, we were able to create customised statuses that reflect the time an agent spends in a particular state, for example due to chat connection issues and failures, and surface this on dashboards for immediate attention.
  • Chat Transfers – The number of chat transfers could only be tabulated manually. We then automated this process with a custom implementation.

Solution

Now that we’ve covered the issues we’re solving, let’s cover the solutions.

Prioritising high-priority requests

During routing, the constraint is on the number of resources available. The incoming requests cannot simply be assigned to the first available agent. The issue with this approach is that we would eventually run out of agents to serve the high-priority requests.

One of the ways to prevent this is to have a separate group of agents solely handle high-priority requests. However, that alone does not solve the problem if high-priority and low-priority requests share the same queue and are de-queued in First-In, First-Out (FIFO) order: requests are simply processed in arrival order, so earlier low-priority requests are served while newer high-priority requests wait behind them. Because of this queuing issue, prioritisation of requests is critical.

The need to prioritise

High-priority requests, such as safety issues, must not be in the queue for a long duration and should be handled as fast as possible even when the system is filled with low-priority requests.

There are two different kinds of queues: one to handle requests at the priority level, and the other to handle individual issue types; the latter are the business queues on which the queue limit constraints apply.

To illustrate further, here are two different scenarios of enqueuing/de-queuing:

Different issues with different priorities

In this scenario, the priority is set to dequeue safety issues, which are in the high-priority queue, before picking up the enquiry issues from the low-priority queue.

Different issues with different priorities

Identical issues with different priorities

In this scenario where identical issues have different priorities, the reallocated enquiry issue in the high-priority queue is dequeued first before picking up a low-priority enquiry issue. Reallocations happen when a chat is transferred to another agent or when it was not accepted by the allocated agent. When reallocated, it goes back to the queue with a higher priority.

Identical issues with different priorities

Approach

To implement different levels of priorities, we decided to use separate queues for each of the priorities and denoted the request queues by groups, which could logically exist in any of the priority queues.

For de-queueing, time slices of varied lengths were assigned to each of the queues to make sure the de-queueing worker spends more time on a higher priority queue.

The architecture uses multiple de-queueing workers running in parallel, with each worker looping over the queues and waiting for a message in a queue for a certain amount of time, and then allocating it to an agent.

for i := startIndex; i < len(consumer.priorityQueue); i++ {
  queue := consumer.priorityQueue[i]
  duration := queue.config.ProcessingDurationInMilliseconds
  for now := time.Now(); time.Since(now) < time.Duration(duration)*time.Millisecond; {
    consumer.processMessage(queue.client, queue.config)
    // cool down
    time.Sleep(time.Millisecond * 100)
  }
}

The above code snippet iterates over the individual priority queues, waits for a message for the configured duration, and processes the message upon receipt. There is also a cool-down period of 100ms before it moves on to receive a message from a different priority queue.

The caveat with the above approach is that the worker may end up spending more time than expected when it receives a message at the end of the waiting duration. We addressed this by having multiple workers running concurrently.

Request starvation

Now when priority queues are used, there is a possibility that some of the low-priority requests remain unprocessed for long periods of time. To ensure that this doesn’t happen, the workers are forced to run out of sync by tweaking the order in which priority queues are processed, such that when worker1 is processing a high-priority queue request, worker2 is waiting for a request in the medium-priority queue instead of the high-priority queue.
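
Here is a minimal, self-contained sketch of that idea in Go (type and field names are assumptions, not the actual implementation): each worker starts its loop at a different offset into the priority queue slice, so the workers naturally spread across the queues.

package main

import "time"

// priorityQueue is a stand-in for one queue and its configured time slice.
type priorityQueue struct {
  name      string
  timeSlice time.Duration
}

// runWorker loops over the queues forever, starting at startIndex so that
// concurrently running workers are offset from one another.
func runWorker(queues []priorityQueue, startIndex int) {
  for {
    for i := 0; i < len(queues); i++ {
      q := queues[(startIndex+i)%len(queues)]
      deadline := time.Now().Add(q.timeSlice)
      for time.Now().Before(deadline) {
        // receive and process one message from q here
        time.Sleep(100 * time.Millisecond) // cool down, as in the snippet above
      }
    }
  }
}

func main() {
  queues := []priorityQueue{
    {name: "high", timeSlice: 400 * time.Millisecond},
    {name: "medium", timeSlice: 200 * time.Millisecond},
    {name: "low", timeSlice: 100 * time.Millisecond},
  }
  // Start worker 0 on the high-priority queue, worker 1 on medium, and so on,
  // so the workers are not all waiting on the same queue at the same time.
  for w := 0; w < 3; w++ {
    go runWorker(queues, w%len(queues))
  }
  time.Sleep(time.Second) // let the workers run briefly in this sketch
}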

Customising to our needs

We wanted to make sure that agents with adequate skills are assigned to the right queues to handle the requests. On top of that, we wanted to ensure that there is a limit on the number of requests that a queue can accept at a time, guaranteeing that the system isn’t flooded with too many requests, which can lead to longer waiting times for request allocation.

Approach

The queues are configured with a dynamic queue limit, which is the upper limit on the number of requests that a queue can accept. Additionally, attributes such as country, department, and skills are defined on the queue.

The dynamic queue limit takes into account the utilisation factor of the queue and the number of agents available at the given time, which ensures an appropriate waiting time at the queue level.
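
As a rough illustration, the limit can be derived from the number of online agents, their average maximum concurrency, and how utilised the queue currently is. The formula and headroom factor below are assumptions for the sketch, not the production logic.

package routing

// dynamicQueueLimit sketches how a queue's upper limit could be derived from
// the agents currently available and the queue's utilisation (0.0 to 1.0).
// The 1.5 headroom factor is an arbitrary illustrative choice.
func dynamicQueueLimit(onlineAgents int, avgMaxConcurrency, utilisation float64) int {
  capacity := float64(onlineAgents) * avgMaxConcurrency
  limit := capacity * (1.5 - utilisation) // shrink the allowed backlog as utilisation rises
  if limit < 0 {
    return 0
  }
  return int(limit)
}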

A simple approach to assign which queues the agents can receive the requests from is to directly assign the queues to the agents. But this leads to another problem to solve, which is to control the number of concurrent chats an agent can handle and define how proficient an agent is at solving a request. Keeping this in mind, it made sense to have another grouping layer between the queue and agent assignment and to define attributes, such as concurrency, to make sure these groups can be reused.

There are three entities in agent assignment:

  • Queue
  • Agent Group
  • Agent

When the request is de-queued, the agent list mapped to the queue is found and then some additional business rules (e.g. checking for proficiency) are applied to calculate the eligibility score of each mapped agent to decide which agent is the best suited to cater to the request.

The factors impacting the eligibility score are proficiency, whether the agent is online/offline, the current concurrency, max concurrency and the last allocation time.
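
A simplified sketch of such a scoring function is shown below. The weights and field names are illustrative assumptions rather than the production logic.

package routing

import "time"

// agent holds the attributes that feed into the eligibility score.
type agent struct {
  online             bool
  proficiency        float64 // e.g. 0.0 to 1.0 for the request's issue type
  currentConcurrency int
  maxConcurrency     int
  lastAllocatedAt    time.Time
}

// eligibilityScore returns a score for ranking agents; higher is better.
// Agents that are offline or already at max concurrency are not eligible.
func eligibilityScore(a agent, now time.Time) (float64, bool) {
  if !a.online || a.currentConcurrency >= a.maxConcurrency {
    return 0, false
  }
  spareCapacity := float64(a.maxConcurrency-a.currentConcurrency) / float64(a.maxConcurrency)
  idleMinutes := now.Sub(a.lastAllocatedAt).Minutes()
  // Weight proficiency highest, then spare capacity, then how long the agent
  // has been idle (to spread chats evenly). The weights are illustrative only.
  return 3*a.proficiency + 2*spareCapacity + 0.1*idleMinutes, true
}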

Ensuring the concurrency is not breached

To make sure that an agent doesn’t receive more chats than their defined concurrency, a locking mechanism is used at the per-agent level. During agent allocation, the worker acquires a lock on the agent record with an expiry, preventing other workers from allocating a chat to this agent. Only once the allocation process is complete (whether it failed or succeeded) is the concurrency updated and the lock released, allowing other workers to assign more chats to the agent depending on their remaining bandwidth.
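
The sketch below shows the shape of such a lock using an in-memory map with expiry. In production this would typically be a distributed lock (for example, in Redis) so that all workers see it; the names and TTL handling here are assumptions for illustration.

package routing

import (
  "sync"
  "time"
)

// agentLocks guards allocation so that only one worker can assign a chat to a
// given agent at a time. Each lock carries an expiry so that a crashed worker
// cannot hold an agent forever.
type agentLocks struct {
  mu    sync.Mutex
  until map[string]time.Time // agentID -> lock expiry
}

func newAgentLocks() *agentLocks {
  return &agentLocks{until: make(map[string]time.Time)}
}

// tryLock returns true if the caller acquired the lock for agentID.
func (l *agentLocks) tryLock(agentID string, ttl time.Duration) bool {
  l.mu.Lock()
  defer l.mu.Unlock()
  if expiry, held := l.until[agentID]; held && time.Now().Before(expiry) {
    return false // another worker is currently allocating a chat to this agent
  }
  l.until[agentID] = time.Now().Add(ttl)
  return true
}

// unlock releases the lock once the allocation attempt has completed
// (successfully or not) and the agent's concurrency has been updated.
func (l *agentLocks) unlock(agentID string) {
  l.mu.Lock()
  defer l.mu.Unlock()
  delete(l.until, agentID)
}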

A similar approach is used to ensure that the number of requests enqueued doesn’t exceed the queue limit.

Reallocation and transfers

With the routing configuration set up, the reallocation of agents follows the same steps as agent allocation.

To transfer a chat to another queue, the request goes back to the queue with a higher priority so that it is assigned faster.

Unaccepted chats

If the agent fails to accept the request in a given period of time, then the request is put back into the queue, but this time with a higher priority. This is the reason why there’s a corresponding re-allocation queue with a higher priority than the normal queue to make sure that those unaccepted requests don’t have to wait in the queue again.

Informing the frontend about allocation

When an agent is allocated, the routing system needs to inform the frontend by sending it messages over a websocket. This is done with our super reliable messaging system called Hermes, which operates at scale, supporting 12k concurrent connections, and establishes real-time communication between agents and customers.

Finding the online agents

The routing system should only send the allocation message to the frontend when the agent is online and accepting requests. The frontend uses the same websocket connection that delivers allocation messages to inform the routing system about agent availability. This means that if the websocket connection is broken, for example due to internet connection issues, the agent stops receiving any new chat requests.

Enriched reporting and analytics

The routing system pushes monitoring metrics, such as the number of online agents and the number of chat requests assigned to each agent. The fine-grained control that comes with building this system in-house gives us the ability to push more custom metrics.

The system offers two levels of monitoring: real-time monitoring, and non-real-time monitoring that can be used for analytics, such as calculating agent productivity and the time spent on each chat.

We achieved this with the help of StatsD for real-time monitoring; for analytical purposes, we send the data to Presto tables, which back our Tableau visualisations and reporting.
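
Emitting a real-time metric to StatsD is lightweight. The sketch below pushes a gauge over UDP using the plain-text StatsD line format; the metric name and address are assumptions for illustration.

package metrics

import (
  "fmt"
  "net"
)

// emitGauge sends a single gauge value to a StatsD agent over UDP using the
// StatsD line format: "<name>:<value>|g".
func emitGauge(statsdAddr, name string, value int) error {
  conn, err := net.Dial("udp", statsdAddr)
  if err != nil {
    return err
  }
  defer conn.Close()
  _, err = fmt.Fprintf(conn, "%s:%d|g", name, value)
  return err
}

// Example: emitGauge("127.0.0.1:8125", "routing.agents.online", 42)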

Given that the bottleneck for this system is the number of resources (i.e. the number of agents), real-time monitoring helps identify which configuration needs to be adjusted when there is a spike in the number of requests. Moreover, the persisted analytical data allows us to predict traffic and plan workforce management so that requests are handled efficiently.

Scalability

Ensuring that the system behaves appropriately when rolled out to multiple regions is a critical consideration. To make sure there are enough workers to handle the requests, instances can be scaled horizontally when CPU utilisation increases.

To understand the system’s limitations and behaviour before releasing to multiple regions, we ran load tests with 10x the expected traffic. This showed us which monitors and alerts we should add to make sure the system functions efficiently, and it reduces our recovery time if something goes wrong.

Next steps

The enhancements lined up after building this routing solution focus on reducing customers’ waiting time and reducing the time agents spend on unresponsive customers who have waited too long in the queue. Aside from chats, we would like to extend this solution to handle digital issues (social media and emails) and voice requests (calls).


Special thanks to Andrea Carlevato and Karen Kue for making sure that the blogpost is interesting and represents the problem we solved accurately.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub reduces Marketplace transaction fees, revamps Technology Partner Program

Post Syndicated from Ryan J. Salva original https://github.blog/2021-02-04-github-reduces-marketplace-transaction-fees-revamps-technology-partner-program/

At GitHub, our community is at the heart of everything we do. We want to make it easier to build the things you love, with the tools you prefer to use—which is why we’re committed to maintaining an open platform for developers. Launched in 2017 and now home to the world’s largest DevOps ecosystem, GitHub Marketplace is the single destination for developers to find, sell, and share tools and solutions that help simplify and improve the process of building software.

Whether buying or selling, our goal is to provide the best possible Marketplace experience for developers. Today, we’re announcing some changes worth celebrating 🎉; changes to increase your revenue, simplify the application verification process, and make it easier for everyone to build with GitHub.

Supporting our Marketplace partners

In the spirit of helping developers both thrive and profit, we’re increasing developers’ take-home pay for apps sold in the Marketplace from 75% to 95%. GitHub will keep only a 5% transaction fee. This change puts more revenue in the pockets of the developers who are doing the work of building tools that support the GitHub community.

Learn more

Simplifying app verification process on the Marketplace

We know our partners are excited to get on Marketplace, and we’ve made changes to make this as easy as possible. Previously, a deep review of app security and functionality was required before an app could be added to Marketplace. Moving forward, we’ll verify your organization’s identity and common-sense security precautions by:

  1. Validating your domain with a simple DNS TXT record
  2. Validating the email address on record
  3. Requiring two-factor authentication for your GitHub organization

You can track your app submission’s progress from your organization’s profile settings to fix issues faster. Now developers can get their solutions added to the Marketplace faster and the community can moderate app quality.

Screenshot of app publisher verification process in Marketplace

Soon, we’ll move all "verified apps" to the validated publisher model, updating the green "verified" checkmark badge to indicate that publishers, not apps, are scrutinized. Learn more

GitHub Technology Partner Program updates

We’ve also made some updates to our Technology Partner Program. If you’re interested in the GitHub Marketplace but unsure how to build integrations with the GitHub platform, want to co-market with us, or want to learn about partner events and opportunities, our technology partner program can help you get started. You can also check out the partner-centric resources section or reach out to us at [email protected].

Screenshot of Technology Partner Program Resource page

You’re now one step away from the technical and go-to-market resources you need to integrate with GitHub and help improve the lives of all software developers. Looking forward to seeing you on the Marketplace.

Happy coding. 👾

Deployment reliability at GitHub

Post Syndicated from Raffaele Di Fazio original https://github.blog/2021-02-03-deployment-reliability-at-github/

Welcome to another deep dive of the Building GitHub blog series, providing a look at how teams across the GitHub engineering organization identify and address opportunities to improve our internal development tooling and infrastructure.

In a previous post of this series, we described how we improved the deployment experience for github.com. When we describe deployments at GitHub, the deployment experience is an important part of what it takes to ship applications to production, especially at GitHub’s scale, but there is more to it: the actual deployment mechanics need to be fast and reliable.

Deploying GitHub

GitHub is deployed to two types of “targets”: multiple Kubernetes clusters and directly to bare metal hosts. Those two targets have different needs and characteristics, such as different numbers of replicas, different runtimes, and so on.

The deployment process of GitHub is designed to be an invisible event for users—we deploy GitHub tens of times a day (yes, even on a Friday) without impact on our users.

When implementing deployments for a monolithic application, we must also take into account the impact that the deployment process has on the internal users of the tool. Hundreds of GitHub engineers work at the same time on new features and bug fixes in the same codebase, and it’s critical that they can reliably deploy to production. If deployments take too long or are prone to fail (even if there is no impact on users), developers at GitHub will spend more time getting those features out to users.

For these reasons, we asked ourselves the following questions:

  • How long does it take to get code successfully running in production?
  • How often do we roll back changes?
  • How often do deployments require any kind of manual intervention?

One thing was sure from the beginning, we needed data to answer those questions.

Measure all the things

We instrumented our tooling to send metrics on several important key aspects, including, but not limited to:

  • Duration of CI builds.
  • Duration of individual steps of the deployment pipeline.
  • Total duration of a deployment pipeline.
  • Final state of a deployment pipeline.
  • Number of deployments that are rolled back.
  • Occurrences of deployment retries in one of the steps of the pipeline.

As well as more general metrics related to the overall delivery:

  • How many pull requests we deploy/merge every week.
  • How long it takes to get a pull request from “ready to be deployed” to “merged”.

We used these metrics to implement several improvements to our deployment tooling: analyzing them helped us make deployments more reliable in general, and we also introduced changes that let us tolerate some classes of intermittent deployment failures by automatically retrying in case of problems.
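
The retry idea is conceptually simple. The sketch below, in Go, is purely illustrative and not GitHub’s tooling: it retries an idempotent deployment step a few times with exponential backoff before giving up.

package deploy

import (
  "fmt"
  "time"
)

// retryStep runs an idempotent deployment step, retrying transient failures
// with exponential backoff. The attempt count and base delay are illustrative.
func retryStep(name string, attempts int, baseDelay time.Duration, step func() error) error {
  var err error
  for i := 0; i < attempts; i++ {
    if err = step(); err == nil {
      return nil
    }
    // Back off before the next attempt: baseDelay, 2*baseDelay, 4*baseDelay, ...
    time.Sleep(baseDelay * time.Duration(1<<i))
  }
  return fmt.Errorf("step %q failed after %d attempts: %w", name, attempts, err)
}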

Additionally, instrumenting our deployment tooling allowed us to identify problems sooner when they happen so that we can react in a timely fashion.

Better visibility into the deployment process

As we mentioned, GitHub is a monolithic Rails app that is deployed to Kubernetes and bare metal servers, with the customer-facing part of GitHub being 100% deployed to Kubernetes. When we deploy a new version of GitHub, we need to start hundreds of pods in multiple Kubernetes clusters.

A few months ago, our deploy tooling did not print much information on what was going on behind the scenes with our Kubernetes deployments. This meant that whenever a deployment failed, for example due to an issue that we didn’t previously detect in stages before canary, we would have to dig into what happened by directly asking Kubernetes.

At GitHub, we don’t require engineers deploying to understand the internals of Kubernetes. We abstract Kubernetes in a way that is easier to deal with and we have tooling in place to debug GitHub without directly accessing specific Kubernetes clusters.

We analyzed internal support requests to our infrastructure teams and found an opportunity to reduce toil by making it easier to figure out what went wrong when deploying to Kubernetes.

For those reasons, we introduced changes to our tooling to provide better information on a deployment while it is being rolled out and to proactively surface specific lower-level information in case of failures, including a view of the Kubernetes events, without the need to directly access Kubernetes itself.

This change gave us more detailed information on the progress of a deployment and increased visibility into errors in case of failures, which reduces the time to detect a problem and ultimately the overall time needed to deploy to production.

An SLO based approach to deployment reliability

When deployments fail, there is no impact for GitHub customers: deploys are automatically halted before there can be customer-facing issues and to do so we heavily rely on Kubernetes, for example by using readiness probes.

However, deploying GitHub tens of times a day means that the longer a deployment takes, the less things we can ship!

To make sure that we can successfully keep shipping new features to our customers, we defined a few service level objectives (SLOs) to keep track of how fast and reliable deploying GitHub is.

SLOs are usually defined for things like the success rate or latency of a web application, but they can be used for pretty much anything else. In our case, we started using SLOs to set reliability objectives and keep track of how much time it takes to deploy PRs to production which allows us to understand when we need to shift our focus from new features to improvements to the overall shipping flow of GitHub.

At GitHub we have a dedicated team that is responsible for the continuous deployment of applications, which means that we develop tools and best practices but ultimately also help Hubbers ship their applications. These SLOs are now an integral part of the team dynamics and influence the priorities of the team, to ensure that we can keep shipping hundreds of pull requests every week.

Conclusion

In this post we discussed how we make sure that we keep Hubbers shipping new features and improvements over time. Since we started taking a look at the problem, we significantly improved our deployment process, but more importantly we introduced SLOs that can guide investments to further improve our tools and processes so that GitHub users can keep getting fresh new features all year round.

GitHub Availability Report: January 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-02-02-github-availability-report-january-2021/

Introduction

In January, we experienced one incident resulting in significant impact and degraded state of availability for the GitHub Actions service.

January 28 04:21 UTC (lasting 3 hours 53 minutes)

Our service monitors detected abnormal levels of errors affecting the Actions service. This incident resulted in the failure or delay of some queued jobs for a period of time. Jobs that were queued during the incident were run successfully after the issue was resolved.

We identified the issue as caused by an infrastructure error in our SQL database layer. The database failure impacted one of the core microservices that facilitates authentication and communication between the Actions microservices, which affected queued jobs across the service. In normal circumstances, automated processes would detect that the database was unhealthy and failover with minimal or no customer impact. In this case, the failure pattern was not recognized by the automated processes, and telemetry did not show issues with the database, resulting in a longer time to determine the root cause and complete mitigation efforts.

To help avoid this class of failure in the future, we are updating the automation processes in our SQL database layer to improve error detection and failovers. Furthermore, we are continuing to invest in localizing failures to minimize the scope of impact resulting from infrastructure errors.

In summary

We’ll continue to keep you updated on the progress we’re making on ensuring reliability of our services. To learn more about how teams across GitHub identify and address opportunities to improve our engineering systems, check out the GitHub Engineering blog.

Making GitHub’s new homepage fast and performant

Post Syndicated from Tobias Ahlin original https://github.blog/2021-01-29-making-githubs-new-homepage-fast-and-performant/

This post is the third installment of our five-part series on building GitHub’s new homepage:

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How we illustrate at GitHub
  5. How we designed the homepage and wrote the narrative

Creating a page full of product shots, animations, and videos that still loads fast and performs well can be tricky. Throughout the process of building GitHub’s new homepage, we’ve used the Core Web Vitals as one of our North Stars and measuring sticks. There are many different ways of optimizing for these metrics, and we’ve already written about how we optimized our WebGL globe. We’re going to take a deep-dive here into two of the strategies that produced the overall biggest performance impact for us: crafting high performance animations and serving the perfect image.

High performance animation and interactivity

As you scroll down the GitHub homepage, we animate in certain elements to bring your attention to them:

Traditionally, a typical way of building this relied on listening to the scroll event, calculating the visibility of all elements that you’re tracking, and triggering animations depending on the elements’ position in the viewport:

// Old-school scroll event listening (avoid)
window.addEventListener('scroll', () => checkForVisibility())
window.addEventListener('resize', () => checkForVisibility())

function checkForVisibility() {
  animatedElements.forEach(element => {
    const distTop = element.getBoundingClientRect().top
    const distBottom = element.getBoundingClientRect().bottom
    const distPercentTop = Math.round((distTop / window.innerHeight) * 100)
    const distPercentBottom = Math.round((distBottom / window.innerHeight) * 100)
    // Based on this position, animate the element accordingly
  })
}

There’s at least one big problem with an approach like this: calls to getBoundingClientRect() will trigger reflows, and utilizing this technique might quickly create a performance bottleneck.

Luckily, IntersectionObservers are supported in all modern browsers, and they can be set up to notify you of an element’s position in the viewport without ever listening to scroll events and without calling getBoundingClientRect. An IntersectionObserver can be set up in just a few lines of code to track whether an element is shown in the viewport and trigger animations depending on its state, using each entry’s isIntersecting property:

// Create an intersection observer with default options, that 
// triggers a class on/off depending on an element’s visibility 
// in the viewport
const animationObserver = new IntersectionObserver((entries, observer) => {
  for (const entry of entries) {
    entry.target.classList.toggle('build-in-animate', entry.isIntersecting)
  }
});

// Use that IntersectionObserver to observe the visibility
// of some elements
for (const element of document.querySelectorAll('.js-build-in')) {
  animationObserver.observe(element);
}

Avoiding animation pollution

As we moved over to IntersectionObservers for our animations, we also went through all of our animations and doubled down on one of the core tenets of optimizing animations: only animate the transform and opacity properties, since these properties are easier for browsers to animate (generally computationally less expensive). We thought we did a fairly good job of following this principle already, but we discovered that in some circumstances we did not, because unexpected properties were bleeding into our transitions and polluting them as elements changed state.

One might think a reasonable implementation of the “only animate transform and opacity” principle might be to define a transition in CSS like so:

// Don’t do this
.animated {
  opacity: 0;
  transform: translateY(10px);
  transition: all 0.6s ease;
}

.animated:hover {
  opacity: 1;
  transform: translateY(0);
}

In other words, we’re only explicitly changing opacity and transform, but we’re defining the transition to animate all changed properties. These transitions can lead to poor performance since other property changes can pollute the transition (you may have a global style that changes the text color on hover, for example), which can cause unnecessary style and layout calculations. To avoid this kind of animation pollution, we moved to always explicitly defining only opacity and transform as animatable:

// Be explicit about what can animate (and not)
.animated {
  opacity: 0;
  transform: translateY(10px);
  transition: opacity 0.6s ease, transform 0.6s ease;
}

.animated:hover {
  opacity: 1;
  transform: translateY(0);
}

As we rebuilt all of our animations to be triggered through IntersectionObservers and to explicitly specify only opacity and transform as animatable, we saw a drastic decrease in CPU usage and style recalculations, helping to improve our Cumulative Layout Shift score:

Lazy-loading videos with IntersectionObservers

If you’re powering any animations through video elements, you likely want to do two things: only play the video while it’s visible in the viewport, and lazy-load the video when it’s needed. Sadly, the lazy load attribute doesn’t work on videos, but if we use IntersectionObservers to play videos as they appear in the viewport, we can get both of these features in one go:

<!-- HTML: A video that plays inline, muted, w/o autoplay & preload -->
<video loop muted playsinline preload="none" class="js-viewport-aware-video" poster="video-first-frame.jpg">
  <source type="video/mp4" src="video.h264.mp4">
</video>

// JS: Play videos while they are visible in the viewport
const videoObserver = new IntersectionObserver((entries, observer) => {
  for (const entry of entries) entry.isIntersecting ? entry.target.play() : entry.target.pause();
});

for (const element of document.querySelectorAll('.js-viewport-aware-video')) {
  videoObserver.observe(element);
}

Together with setting preload to none, this simple observer setup saves us several megabytes on each page load.

Serving the perfect image

We visit web pages with a myriad of different devices, screens and browsers, and something simple as displaying an image is becoming increasingly complex if you want to cover all bases. Our particular illustration style also happens to fall between all of the classic JPG, PNG or SVG formats. Take this illustration, for example, that we use to transition from the main narrative to the footer:

To render this illustration, we would ideally need the transparency from PNGs but combine it with the compression from JPGs, as saving an illustration like this as a PNG would weigh in at several megabytes. Luckily, WebP is, as of iOS 14 and macOS Big Sur, supported in Safari on both desktops and phones, which brings browser support up to a solid +90%. WebP does in fact give us the best of both worlds: we can create compressed, lossy images with transparency. What about support for older browsers? Even a new Mac running the latest version of Safari on macOS Catalina can’t render WebP images, so we have to do something.

This challenge eventually led us to a somewhat obscure solution: two JPGs wrapped in an SVG (one for the image data and one for the mask), encoded as base64 data, essentially creating a transparent JPG that loads with a single HTTP request. Take a look at this image. Download it, open it up, and inspect it. Yes, it’s a JPG with transparency, encoded in base64, wrapped in an SVG.

Part of the SVG specification is the mask element. With it, you can mask out parts of an SVG. If we embed an SVG in a document, we can use the mask element in tandem with the image element to render an image with transparency:

<svg viewBox="0 0 300 300">
  <defs>
    <mask id="mask">
      <image width="300" height="300" href="mask.jpg"></image>
    </mask>
  </defs>
  <image mask="url(#mask)" width="300" height="300" href="image.jpg"></image>
</svg>

This is great, but it won’t yet work as a fallback for WebP. Because the images are referenced through external paths (see the href attributes in the example above), this only works when the SVG is embedded directly in the document. If we instead save the SVG in a file and set it as the src of a regular img, the referenced images won’t be loaded, and we’ll see nothing.

We can work around this limitation by embedding the image data inside the SVG as base64. There are services online where you can convert an image to base64, but if you’re on a Mac, base64 is available by default in your Terminal, and you can use it like so:

base64 -i <in-file> -o <outfile>

Here, the in-file is your image of choice, and the outfile is a text file where the base64 data will be saved. With this technique, we can embed the images inside the SVG and use the SVG as the src of a regular image.

These are the two images that we’re using to construct the footer illustration—one for the image data and one for the mask (black is completely transparent and white is fully opaque):

We convert the mask and the image to base64 using the Terminal command and then paste the data into the SVG:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2900 1494">
  <defs>
    <mask id="mask">
      <image width="300" height="300" href="data:image/png;base64,/* your image in base64 */"></image>
    </mask>
  </defs>
  <image mask="url(#mask)" width="300" height="300" href="data:image/jpeg;base64,/* your image in base64 */"></image>
</svg>

You can save that SVG and use it like any regular image. We can then safely use WebP with lazy loading and a solid fallback that works in all browsers:

<picture>
  <source srcset="compressed-transparent-image.webp" type="image/webp">
  <img src="compressed-transparent-image.svg" loading="lazy">
</picture>

This somewhat obscure SVG hack saves us hundreds of kilobytes on each page load, and it enables us to utilize the latest technologies for the browsers and operating systems that support them.

Towards a faster web

We’re working throughout the company to create a faster and more reliable GitHub, and these are some of the techniques that we’re utilizing. We still have a long way to go, and if you’d like to be part of that journey, check out our careers page.

 

Serving driver-partners data at scale using mirror cache

Post Syndicated from Grab Tech original https://engineering.grab.com/mirror-cache-blog

From the very beginning, driver-partners have been at the center of the wide range of services and features provided by the Grab platform. Over time, many backend microservices were developed to support our driver-partners, such as earnings, ratings, and insurance. All of these microservices require certain information, such as name, phone number, email, and active car types, to curate the services provided to driver-partners.

We built the Drivers Data service to provide driver-partners data to other microservices. The service attracts a high QPS, handling on the order of 10K requests per second during peak hours. Over the years, we have tried different strategies to serve driver-partners data in a resilient and cost-effective manner while keeping response times low. In this blog post, we talk about mirror cache, an in-memory local caching solution built to serve driver-partners data efficiently.

What we started with

Figure 1. Drivers Data service architecture

Our Drivers Data service previously used a MySQL DB as persistent storage and two caching layers: a standalone local cache (the RAM of each EC2 instance) as the primary cache and Redis as a secondary cache for eventually consistent reads. With this setup, the local cache hit ratio was very low.

Figure 2. Request flow chart

We opted for a cache-aside strategy, so when a client request comes in, the Drivers Data service responds in the following manner (a simplified sketch of this read path follows the list):

  • If data is present in the in-memory cache (local cache), then the service directly sends back the response.
  • If data is not present in the in-memory cache and found in Redis, then the service sends back the response and updates the local cache asynchronously with data from Redis.
  • If data is not present either in the in-memory cache or Redis, then the service responds back with the data fetched from the MySQL DB and updates both Redis and local cache asynchronously.
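
Here is a simplified sketch of that read path in Go. This is an illustration only; the DriverData type, the Cache and Store interfaces, and the method names are hypothetical, not Grab’s actual code.

package driversdata

import "context"

// DriverData is a hypothetical projection of the driver-partner profile.
type DriverData struct {
    ID, Name, Phone string
}

// Cache and Store are hypothetical abstractions over the local cache,
// Redis, and MySQL layers described above.
type Cache interface {
    Get(ctx context.Context, id string) (*DriverData, bool)
    Set(ctx context.Context, id string, d *DriverData)
}

type Store interface {
    Get(ctx context.Context, id string) (*DriverData, error)
}

type Service struct {
    local Cache // in-memory cache on this instance
    redis Cache // shared secondary cache
    db    Store // MySQL, the source of truth
}

// GetDriver implements the cache-aside read path described above.
func (s *Service) GetDriver(ctx context.Context, id string) (*DriverData, error) {
    if d, ok := s.local.Get(ctx, id); ok {
        return d, nil // served from the in-memory cache
    }
    if d, ok := s.redis.Get(ctx, id); ok {
        go s.local.Set(context.Background(), id, d) // backfill the local cache asynchronously
        return d, nil
    }
    d, err := s.db.Get(ctx, id)
    if err != nil {
        return nil, err
    }
    go func() { // backfill both caches asynchronously
        s.redis.Set(context.Background(), id, d)
        s.local.Set(context.Background(), id, d)
    }()
    return d, nil
}

The asynchronous backfills are what keep the hot path fast: a request never waits on a cache write.
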
Figure 3. Percentage of response from different sources

The measurement of the response source revealed that during peak hours ~25% of the requests were being served via standalone local cache, ~20% by MySQL DB, and ~55% via Redis.

The low cache hit rate is caused by the way driver-partners data is accessed: low frequency per driver over long periods, but high frequency in short bursts. When a driver-partner is a candidate for a job or is involved in an ongoing job, different services make multiple requests to the Drivers Data service to fetch that specific driver-partner’s information. The frequency of calls for a specific driver-partner drops when he/she is not involved in the job allocation process or is not doing any job at the moment.

While the low frequency per driver over time limits the Redis cache hit rate, the high frequency in short bursts mostly benefits the in-memory cache hit rate. In our investigations, we found that the local caches of different nodes in the Drivers Data service cluster were making redundant calls to Redis and the DB to fetch data that was already present in another node’s local cache.

By making actively used data available in the in-memory cache of every instance, we could greatly increase the in-memory cache hit rate, and that’s what we did.

Mirror cache design goals

We set the following design goals:

  • Support a local least recently used (LRU) cache use-case.
  • Support active cache invalidation.
  • Support best effort replication between local cache instances (EC2 instances). If any instance successfully fetches the latest data from the database, then it should try to replicate or mirror this latest data across all the other nodes in the cluster. If replication fails and the item is expired or not found, then the nodes should fetch it from the database.
  • Support async data replication across nodes, and ensure an update for a key is applied only if it carries more recent data; any older update is ignored in favor of the data already in the cache. Because replication is asynchronous, the ordering of cache updates is not guaranteed.
  • Ability to handle auto-scaling.

The building blocks

Figure 4. Mirror cache

The mirror cache library runs alongside the Drivers Data service inside each of the EC2 instances of the cluster. The two main components are in-memory cache and replicator.

In-memory cache

The in-memory cache is used to store multiple key/value pairs in RAM, with a TTL associated with each pair. We wanted a cache that provides a high hit ratio, bounded memory usage, high throughput, and good concurrency. After evaluating several options, we went with Dgraph’s open-source concurrent caching library Ristretto as our in-memory local cache. We were particularly impressed by its use of the TinyLFU admission policy to ensure a high hit ratio.

Replicator

The replicator is responsible for mirroring/replicating each key/value entry among all the live instances of the Drivers Data service. The replicator has three main components: Membership Store, Notifier, and gRPC Server.

Membership Store

The Membership Store registers callbacks with our service discovery service to notify mirror cache in case any nodes are added or removed from the Drivers Data service cluster.

It maintains two maps: the nodes in the same AZ (AWS Availability Zone) as the current node (the node of the Drivers Data service in which this mirror cache instance is running), and the nodes in the other AZs.

Notifier

Each Drivers Data service node runs a single instance of mirror cache, so effectively each node has one notifier. The notifier’s job is to (a simplified sketch follows below):

  • Combine several key/value pair updates to form a batch.
  • Propagate the batch updates among all the nodes in the same AZ as itself.
  • Send the batch updates to exactly one notifier (node) in each of the other AZs, which in turn is responsible for updating all the nodes in its own AZ with the latest batch of data. This communication pattern helps reduce cross-AZ data transfer overheads.

In the case of auto-scaling, there is a warm-up period during which the notifier doesn’t notify the other nodes in the cluster. This is done to minimize duplicate data propagation. The warm-up period is configurable.
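
Below is a simplified sketch of the notifier’s batching and fan-out, with hypothetical interfaces standing in for the membership store and the gRPC client (locking, flush timers, and retries are omitted).

package mirrorcache

import "context"

// Update is a hypothetical representation of one key/value cache update.
type Update struct {
    Key       string
    Value     []byte
    UpdatedAt int64
}

// Peers is a hypothetical view of the Membership Store described earlier.
type Peers interface {
    SameAZ() []string        // addresses of the nodes in this node's AZ
    OnePerOtherAZ() []string // one node address per remote AZ
}

// Sender abstracts the gRPC client used to push a batch to another node.
// propagate tells the receiver whether to fan out within its own AZ.
type Sender interface {
    Send(ctx context.Context, addr string, batch []Update, propagate bool) error
}

// Notifier batches updates and mirrors them across the cluster.
type Notifier struct {
    peers  Peers
    sender Sender
    buf    []Update
    max    int // flush when the batch reaches this size
}

func (n *Notifier) Enqueue(u Update) {
    n.buf = append(n.buf, u)
    if len(n.buf) >= n.max {
        n.Flush(context.Background())
    }
}

// Flush sends the batch to every node in the same AZ (no further propagation)
// and to exactly one node in each other AZ, which then fans out locally.
func (n *Notifier) Flush(ctx context.Context) {
    batch := n.buf
    n.buf = nil
    for _, addr := range n.peers.SameAZ() {
        go n.sender.Send(ctx, addr, batch, false) // replicationType: Nothing downstream
    }
    for _, addr := range n.peers.OnePerOtherAZ() {
        go n.sender.Send(ctx, addr, batch, true) // remote node re-sends within its AZ (SameRZ)
    }
}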

gRPC Server

A dedicated gRPC server runs for mirror cache. Each node of the Drivers Data service uses this server to receive new cache updates from the other nodes in the cluster.

Here’s the structure of each cache update entity:

message Entity {
    string key = 1; // Key for cache entry.
    bytes value = 2; // Value associated with the key.
    Metadata metadata = 3; // Metadata related to the entity.
    replicationType replicate = 4; // Further actions to be undertaken by the mirror cache after updating its own in-memory cache.
    int64 TTL = 5; // TTL associated with the data.
    bool delete = 6; // If delete is set to true, then mirror cache needs to delete the key from its local cache.
}

enum replicationType {
    Nothing = 0; // Stop propagation of the request.
    SameRZ = 1; // Notify the nodes in the same Region and AZ.
}

message Metadata {
    int64 updatedAt = 1; // Same as updatedAt time of DB.
}

The server first checks if the local cache should update this new value or not. It tries to fetch the existing value for the key. If the value is not found, then the new key/value pair is added. If there is an existing value, then it compares the updatedAt time to ensure that stale data is not updated in the cache.

If the replicationType is Nothing, then the mirror cache stops further replication. If the replicationType is SameRZ, then the mirror cache propagates this cache update to all the nodes in the same AZ as itself.
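
A sketch of how the receiving side might apply an update is shown below, with the proto message flattened into a hypothetical Go struct and the cache and notifier behind small, made-up interfaces.

package mirrorcache

// These types mirror the protobuf message above; in practice they would be
// generated by protoc. Everything here is an illustrative sketch.
type ReplicationType int32

const (
    ReplicationNothing ReplicationType = 0
    ReplicationSameRZ  ReplicationType = 1
)

type Entity struct {
    Key       string
    Value     []byte
    UpdatedAt int64 // from Metadata.updatedAt
    Replicate ReplicationType
    TTLSecs   int64
    Delete    bool
}

// LocalCache and Fanout are hypothetical seams over Ristretto and the notifier.
type LocalCache interface {
    UpdatedAt(key string) (int64, bool)
    SetWithTTL(key string, value []byte, ttlSecs int64)
    Del(key string)
}

type Fanout interface {
    SendSameAZ(e Entity)
}

type Server struct {
    cache    LocalCache
    notifier Fanout
}

// Apply handles one incoming cache update from a peer node.
func (s *Server) Apply(e Entity) {
    if e.Delete {
        s.cache.Del(e.Key)
    } else {
        ts, ok := s.cache.UpdatedAt(e.Key)
        if !ok || e.UpdatedAt > ts {
            // Only newer data overwrites what is already cached.
            s.cache.SetWithTTL(e.Key, e.Value, e.TTLSecs)
        }
        // Otherwise the incoming update is stale and is ignored.
    }

    if e.Replicate == ReplicationSameRZ {
        s.notifier.SendSameAZ(e) // fan out to the other nodes in this AZ
    }
    // ReplicationNothing: stop propagating here.
}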

Run at scale

Figure 5. Drivers Data Service new architecture

The behavior of the service hasn’t changed and the requests are being served in the same manner as before. The only difference here is the replacement of the standalone local cache in each of the nodes with mirror cache. It is the responsibility of mirror cache to replicate any cache updates to the other nodes in the cluster.

After mirror cache was fully rolled out to production, we rechecked our metrics related to the response source and saw a huge improvement. The graph showed that during peak hours ~75% of the responses were served from the in-memory local cache, about 15% by the MySQL DB, and a further 10% via Redis.

The local cache hit ratio reached 0.75, a jump of 0.5 from before, and there was a 5% drop in the number of DB calls too.

Figure 6. New percentage of response from different sources

Limitations and future improvements

Mirror cache is eventually consistent, so it is not a good choice for systems that need strong consistency.

Mirror cache stores all the data in volatile memory (RAM), and it is wiped out during deployments, resulting in a temporary load increase on Redis and the DB.

Also, many new driver-partners are added every day to the Grab system, and we might need to increase the cache size to maintain a high hit ratio. To address these issues, we plan to use SSDs in the future to store a portion of the data and use RAM only for hot data.

Conclusion

Mirror cache really helped us scale the Drivers Data service better and serve driver-partners data to the different microservices at low latencies. It also helped us achieve our original goal of an increase in the local cache hit ratio.

We also extended mirror cache to some other services and saw similarly promising results.


A huge shout out to Haoqiang Zhang and Roman Atachiants for their inputs into the final design. Special thanks to the Driver Backend team at Grab for their contribution.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Improving how we deploy GitHub

Post Syndicated from Julian Nadeau original https://github.blog/2021-01-25-improving-how-we-deploy-github/

Over the last year, GitHub has doubled the number of developers contributing to the main GitHub.com application. While this seems like a solely positive thing on the surface, the 2x increase in folks contributing to the core software exposed some problems with our tooling: tooling that worked for us a year ago no longer functioned at the same capacity. While GitHub itself has been a fantastic vehicle for driving change at GitHub, not every internal system has enjoyed the same level of success. One of those areas was our deployments.

GitHub.com is deployed primarily through chatops, using a branch deploy model (we deploy branches before merging into the main branch). Developers can add changes to a queue, check the status of the queue, and organize groups of pull requests to be deployed and merged. This all functioned through chatops in a Slack room called #dotcom-ops. While this is a very simple system, it started to cause confusion when monitoring a deploy, as a single chat room managed the queue, the deploys, and many other tasks for many people at the same time. All of a sudden, this channel that was once a central part of a crucial information system was overwhelmed by the amount of information being pushed through it. Developers could no longer track their change through the system, which resulted in reduced capacity for those developers and an increased risk profile for GitHub.

Each deploy consists of about a dozen steps like this, spread across hundreds of messages, so it is hard to keep track of and validate the state of a deploy.

The Build

At the beginning of this summer, we set out to completely revamp the way we monitor deploy changes for GitHub.com. With the problem in mind, namely the multi-step, information-dense deploy process via chatops, we set out with a few main goals:

Simplify the Deploy for Developers

One of the main issues that we experienced with the previous system was that deploys were tracked across a number of different messages within the Slack channel. This made it very difficult to piece together the different messages that made up a single deploy. Sometimes there could be as many as a few hundred messages between subsequent messages from the deploy system.

Second Canary Stage

The second main issue was that we had a canary stage, but it would only deploy to up to 2% of GitHub.com traffic. This meant that there was a whole slew of problems that would never get caught in the canary stage before we rolled out to 100% of production, and we would instead have to open an incident and roll back. Availability and uptime are of the utmost concern to GitHub, so this risk became crucial to address as GitHub continued to grow. With this in mind, we set out to introduce a second canary stage at a higher percentage so that we could catch more issues earlier and reduce the impact of future incidents.

At the end of this project, we had two canary stages. The first is at 2% and aims to capture the majority of issues. This low percentage keeps the risk profile at a tolerable level, since only a small amount of traffic would actually be impacted by an issue. A second canary stage was introduced at 20%, which lets us expose a much larger amount of traffic to the change while still in a canary stage. This has a higher risk profile, but it is mitigated by the initial 2% canary stage and allows us to transition to the 100% production stage with less risk.

Automate

The last issue was that developers had to sit with the deploy and poke it along by running independent chatops commands to queue up, deploy canary, and deploy production, making judgement calls every step of the way. This meant that developers were fumbling with chatops commands multiple times in every deployment, and often getting them wrong.

We asked ourselves, “what if there was one command and everything else just happened?” So we did just that: we automated the entire process.

The solution

We already had internal deploy software capable of tracking a single part of a deploy. This system had a proven track record and solid foundations, so we aimed to add the ability to automate the entire deploy sequence from a single chatops command, rather than running multiple commands per deploy, and to link these records together within the existing deployment infrastructure. Moreover, this new solution would provide an easy-to-use interface for a quick overview of any given deployment, rather than piecing together multiple chatops commands.

We came up with two basic concepts:

  1. A deploy is made up of many stages (canary, production, etc.)
  2. There are gates between different stages that perform some sort of check to validate we can progress to the next stage

These concepts allowed us to model the entire system in a state-machine-like fashion:

This diagram shows an automatic progression between a 2% canary, a 20% canary, a production deploy, and a ready-to-merge stage, separated by automated 5-minute timers. A pointer automatically progresses across the data model as the timer gates complete and the stages are deployed.
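
To illustrate the stage-and-gate model, here is a rough sketch of the data model as a small state machine. This is a generic illustration only, not GitHub’s actual implementation; the stage names, the Gate interface, and the Run loop are all made up.

package deploys

import "time"

// Gate blocks progression between two stages until some check passes,
// e.g. a timer, a health check, or a manual approval.
type Gate interface {
    Wait(deployID string) error
}

// TimerGate is the simplest gate: wait a fixed amount of time.
type TimerGate struct{ D time.Duration }

func (g TimerGate) Wait(string) error { time.Sleep(g.D); return nil }

// Stage is one step of the pipeline: canary 2%, canary 20%, production, etc.
type Stage struct {
    Name   string
    Deploy func(deployID string) error // roll the build out to this stage
    Gate   Gate                        // must pass before moving to the next stage
}

// Run walks the pipeline like a state machine, advancing a pointer across
// the stages as each deploy and gate completes.
func Run(deployID string, stages []Stage) error {
    for _, st := range stages {
        if err := st.Deploy(deployID); err != nil {
            return err // the caller can trigger a rollback here
        }
        if st.Gate != nil {
            if err := st.Gate.Wait(deployID); err != nil {
                return err
            }
        }
    }
    return nil
}

A pipeline like the one in the diagram would then be assembled from a canary 2%, canary 20%, production, and ready-to-merge stage, with 5-minute timer gates between them.
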

What resulted was a state-machine-backed deploy system with a first-party UI component. This was combined with the traditional chatops with which our developers were already familiar. Below, you can see an overview of deploys that have recently gone out to GitHub.com. Rather than tracking down various messages in a noisy Slack channel, you can go to a consolidated UI. You can see the state machine progression mentioned above in the overview on the right.

Drilling down into a specific deployment, you can see everything that has happened during that deployment. Each stage of this deployment has an automatic 5-minute timer between it and the next, along with the ability to pause the deployment to give a developer more time to test. We also made sure that, in case something goes wrong, there is a quick way to roll back or revert the changes with the dropdown in the top right corner.

Finally, the entire system could be monitored and started from Slack – just like before. We found that this is how developers typically want to start their deploys, followed by monitoring in the UI component:

The results

These changes have revolutionized the way we deploy GitHub.com. Where confusion and frustration had once set in, we now have joy and contentment in the use of an automated system. We have received overwhelmingly positive feedback internally, and even attracted some attention from our very own Actions team. Our learnings, in this case, helped inform and influence decisions in the recent GitHub Actions CD offering announced at GitHub Universe. Our focus on our own developers means that we can apply our learnings to continue creating the best possible system for the 56M+ developers around the world.

Our work this past summer could not have been possible without the dedication and work from many people and teams. We’d like a special shoutout to the GitHub internal deploy team: their existing system, advice, and ongoing help was crucial in making sure our deploys were successful.

Part of the Building GitHub blog series.

The best of Changelog • 2020 Edition

Post Syndicated from Michelle Mannering original https://github.blog/2021-01-21-changelog-2020-edition/

If you haven’t seen it, the GitHub Changelog helps you keep up-to-date with all the latest features and updates to GitHub. We shipped a tonne of changes last year, and it’s impossible to blog about every feature. In fact, we merged over 90,000 pull requests into the GitHub codebase in the past 12 months!

Here’s a quick recap of the top changes made to GitHub in 2020. We hope these changes are helping you build cooler things better and faster. Let us know what your favourite feature of the past year has been.

GitHub wherever you are

While we haven’t exactly been travelling a lot recently, one of the things we love is the flexibility to work wherever we want, however we want. Whether you want to work on your couch, in the terminal, or check your notifications on the go, we’ve shipped some updates for you.

GitHub CLI

Do you like to work in the command line? In September, we brought GitHub to your terminal. Having GitHub available in the command line reduces the need to switch between applications or various windows and helps simplify a bunch of automation scenarios.

The GitHub CLI allows you to run your entire GitHub workflow directly from the terminal. You can clone a repo, create, view, and review PRs, open issues, assign tasks, and so much more. The CLI is available on Windows, macOS, and Linux. Best of all, the GitHub CLI is open source. Download the CLI today, check out the repo, and view the Docs for a full list of the CLI commands.

GitHub for Mobile

It doesn’t stop there. Now you can also have GitHub in your pocket with GitHub for Mobile!

This new native app makes it easy to create, view, and comment on issues, check your notifications, merge a pull request, explore, organise your tasks, and more. One of the most used features of GitHub for Mobile is push notification support. Mobile alerts mean you’ll never miss a mention or review again and can help keep your team unblocked.

GitHub for Mobile is available on iOS and Android. Download it today if you’re not already carrying the world’s development platform in your pocket.

Oh and did you know, GitHub for Mobile isn’t just in English? It’s also available in Brazilian Portuguese, Japanese, Simplified Chinese, and Spanish.

 

GitHub Enterprise Server

With the release of GitHub Enterprise Server 2.21 in 2020, there was a host of amazing new features. There are new features for PRs, a new notification experience, and changes to issues. These are all designed to make it easier to connect, communicate, and collaborate within your organisation.

And now we’ve made Enterprise Server even better with GitHub Enterprise Server 3.0 RC. That means GitHub Actions, Packages, Code Scanning, Mobile Support, and Secret Scanning are now available in your Enterprise Server. This is the biggest release we’ve done of GitHub Enterprise Server in years, and you can install it now with full support.

Working better with automation

GitHub Actions was launched at the end of 2019 and is already the most popular CI/CD service on GitHub. Our team has continued adding features and improving ways for you to automate common tasks in your repository. GitHub Actions is so much more than simply CI/CD. Our community has really stepped up to help you automate all the things with over 6,500 open source Actions available in the GitHub Marketplace.

Some of the enhancements to GitHub Actions in 2020 include:

Workflow visualisation

We made it easy for you to see what’s happening with your Actions automation. With Workflow visualisation, you can now see a visual graph of your workflow.

This workflow visualisation allows you to easily view and understand your workflows no matter how complex they are. You can also track the progress of your workflow in real time and easily monitor what’s happening so you can access deployment targets.

On top of workflow visualisation, you can also create workflow templates. This makes it easier to promote best practices and consistency across your organisation. It also cuts down time when using the same or similar workflows. You can even define rules for these templates that work across your repo.

Self-hosted runners

Right at the end of 2019, we announced that GitHub Actions supports self-hosted runners. This offered developers maximum flexibility and control over their workflows. Last year, we made updates to self-hosted runners, making them shareable across some or all of your GitHub organisations.

In addition, you can separate your runners into groups, and add custom labels to the runners in your groups. Read more about these Enterprise self-hosted runners and groups over on our GitHub Docs.

Environments & Environment Secrets

Last year we added environment protection rules and environment secrets across our CD capabilities in GitHub Actions. This new update ensures there is separation between the concerns of deployment and concerns surrounding development to meet compliance and security requirements.

Manual Approvals

With Environments, we also added the ability to pause a job that’s trying to deploy to the protected environment and request manual approval before that job continues. This unleashes a whole new raft of continuous deployment workflows, and we are very excited to see how you make use of these new features.

Other Actions Changes

Yes, there are all the big updates, but we’re committed to making small improvements too. Alongside other changes, we now have better support for whatever default branch name you choose. We updated all our starter workflows to use a new $default-branch macro.

We also added the ability to re-run all jobs after a successful run, as well as change the retention days for artifacts and logs. Speaking of logs, we updated how the logs are displayed. They are now much easier to read, have better searching, auto-scrolling, clickable URLs, support for more colours, and full screen mode. You can now disable or delete workflow runs in the Actions tab as well as manually trigger Actions runs with the workflow_dispatch trigger.

While having access to all 6,500+ actions in the marketplace helps integrate with different tools, some enterprises want to limit which actions you can invoke to a limited trusted sub-set. You can now fine-tune access to your external actions by limiting control to GitHub-verified authors, and even limit access to specific Actions.

There were so many amazing changes and updates to GitHub Actions that we couldn’t possibly include them all here. Check out the Changelog for all our GitHub Actions updates.

Working better with Security

Keeping your code safe and secure is one of the most important things for us at GitHub. That’s why we made a number of improvements to GitHub Advanced Security for 2020.

You can read all about these improvements in the special Security Highlights from 2020. There are new features such as code scanning, secret scanning, Dependabot updates, Dependency review, and NPM advisory information.

If you missed the talk at GitHub Universe on the state of security in the software industry, don’t forget to check it out. Justin Hutchings, the Staff Product Manager for Security, walks through the latest trends in security and all things DevSecOps. It’s definitely worth carving out some time over the weekend to watch this:

Working better with your communities

GitHub is about building code together. That’s why we’re always making improvements to the way you work with your team and your community.

Issues improvements

Issues are important for keeping track of your project, so we have been busy making issues work better and faster on GitHub.

You can now also link issues and PRs via the sidebar, and issues now have list autocompletion. When you’re looking for an issue to reference, you can use multiple words to search for that issue inline.

Sometimes when creating an issue, you might like to add a GIF or short video to demo a bug or new feature. Now you can do it natively by adding an *.mp4 or *.mov into your issue.

GitHub Discussions

Issues are a great place to talk about feature updates and bug fixes, but what about when you want to have an open-ended conversation or have your community help answer common questions?

GitHub Discussions is a place for you and your community to come together and collaborate, chat, or discuss something in a separate space, away from your issues. Discussions allows you to have threaded conversations. You can even convert Issues to Discussions, mark questions as answered, categorise your topics, and pin your Discussions. These features help you provide a welcoming space to new people as well as quick access to the most common discussion points.

If you are an admin or maintainer of a public repo you can enable Discussions via repo settings today. Check out our Docs for more info.

Speaking of Docs, did you know we recently published all our documentation as an open source project? Check it out and get involved today.

GitHub Sponsors

We launched GitHub Sponsors in 2019, and people have been loving the program. It’s a great way to contribute to open source projects. In 2020, we made GitHub Sponsors available in even more countries: Mexico, the Czech Republic, Malta, and Cyprus.

We also added some other fancy features to GitHub Sponsors. This includes the ability to export a list of your sponsors. You can also set up webhooks for events in your sponsored account and easily keep track of everything that’s happening via your activity feed.

At GitHub Universe, we also announced Sponsors for Companies. This means organisations can now invest in open source projects via their billing arrangement with GitHub. Now is a great time to consider supporting your company’s most critical open source dependencies.

Working better with code

We’re always finding ways to help developers. As Nat said in his GitHub Universe keynote, the thing we care about the most is helping developers build amazing things. That’s why we’re always trying to make it quicker and easier to collaborate on code.

Convert pull requests to drafts

Draft pull requests are a great way to let your team know you are working on a feature. It helps start the conversation about how it should be built without worrying about someone thinking it’s ready to merge into main. We recently made it easy to convert an existing PR into a draft anytime.

Multi-line code suggestions

Not only can you do multi-line comments, you can now suggest a specific change to multiple lines of code when you’re reviewing a pull request. Simply click and drag and then edit text within the suggestion block.

Default branch naming

Alongside the entire Git community, we’ve been trying to make it easier for teams wanting to use more inclusive naming for their default branch. This also gives teams much more flexibility around branch naming. We’ve added first-tier support for renaming branches in the GitHub UI.

This helps take care of retargeting pull requests and updating branch protection rules. Furthermore, it provides instructions to people who have forked or cloned your repo to make it easier for them to update to your new branch names.

Re-directing to the new default branch

We provided redirects so that links to the old branch name now point to the new default branch. In addition, we updated GitHub Pages to allow publishing from any branch. We also added a preference so you can set the default branch name for your organization. If you need to stay with ‘master’ for compatibility with your existing tooling and automation, or if you prefer to use a different default branch, such as ‘development,’ you can now set this in a single place.

For organizations new to GitHub, we also updated the default branch name to ‘main’ to reflect the new consensus in the Git community. Existing repos are not affected by any of these changes. Hopefully we’ve made it easier for the people who do want to move away from the old ‘master’ terminology in Git.

Design updates for repos and GitHub UI

In mid-2020, we launched a fresh new look for the GitHub UI. The way repos are shown on the homepage and the overall look and feel of GitHub is super sleek. There’s a responsive layout, improved UX in the mobile web experience, and more. We also made lots of small improvements. For example, the way commits are shown in the pull request timeline has changed: commits used to be ordered by author date, and now they show up in their chronological order in the head branch.

If you’ve been following a lot of our socials, you’ll know we’ve also got a brand new look and feel to GitHub.com. Check out these changes, and we hope it gives you fresh vibes for the future.

Go to the Dark Side

Speaking of fresh vibes, you’ve asked for it, and now it’s here! No longer will you be blinded by the light. Now you can go to the dark side with dark mode for the web.

Changelog 2020

These are just some of the highlights for 2020. We’re all looking forward to bringing you more great updates in 2021.

Keep an eye on the Changelog to stay informed and ensure you don’t miss out on any cool updates. You can also follow our changes with @GHChangelog on Twitter and see what’s coming soon by checking out the GitHub Roadmap. Tweet us your favourite changes for 2020, and tell us what you’re most excited to see in 2021.

Trident – Real-time event processing at scale

Post Syndicated from Grab Tech original https://engineering.grab.com/trident-real-time-event-processing-at-scale

Ever wondered what goes on behind the scenes when you receive advisory messages on a confirmed booking? Or how you are awarded rewards or points after completing a GrabPay payment transaction? At Grab, thousands of such campaigns targeting millions of users are operated daily by a backbone service called Trident. In this post, we share how Trident supports Grab’s daily business, the engineering challenges behind it, and how we solved them.

60-minute GrabMart delivery guarantee campaign operated via Trident

What is Trident?

Trident is essentially Grab’s in-house real-time if this, then that (IFTTT) engine, which automates various types of business workflows. The nature of these workflows could either be to create awareness or to incentivize users to use other Grab services.

If you are an active Grab user, you might have noticed new rewards or messages that appear in your Grab account. Most likely, these originate from a Trident campaign. Here are a few examples of types of campaigns that Trident could support:

  • After a user makes a GrabExpress booking, Trident sends the user a message that says something like “Try out GrabMart too”.
  • After a user makes multiple ride bookings in a week, Trident sends the user a food reward as a GrabFood incentive.
  • After a user is dropped off at his office in the morning, Trident awards the user a ride reward to use on the way back home on the same evening.
  • If a GrabMart order takes over an hour to deliver, Trident awards the user a free-delivery reward as compensation.
  • If the driver cancels a booking, then Trident awards points to the user as compensation.
  • With the current COVID pandemic, when a user makes a ride booking, Trident sends a message to both the passenger and driver reminding about COVID protocols.

Trident processes events based on campaigns, which are basically a logic configuration on what event should trigger what actions under what conditions. To illustrate this better, let’s take a sample campaign as shown in the image below. This mock campaign setup is taken from the Trident Internal Management portal.

Trident process flow

This sample setup basically translates to: for each user, count his/her number of completed GrabMart orders. Once he/she reaches 2 orders, send him/her a message saying “Make one more order to earn a reward”. And if the user reaches 3 orders, award him/her the reward and send a congratulatory message. 😁

Other than the basic event, condition, and action, Trident also allows more fine-grained configuration, such as setting an overall budget for a campaign, adding limits to avoid over-awarding, running A/B experiments, delaying actions, and so on.

An IFTTT engine is nothing new or fancy, but building a high-throughput real-time IFTTT system poses a challenge due to the scale that Grab operates at. We need to handle billions of events and run thousands of campaigns on an average day. The amount of actions triggered by Trident is also massive.

In the month of October 2020, more than 2,000 events were processed every single second during peak hours. Across the entire month, we awarded nearly half a billion rewards, and sent over 2.5 billion communications to our end-users.

Now that we’ve covered the importance of Trident to the business, let’s drill down into how we designed the Trident system to handle events at a massive scale and how we overcame the performance hurdles through optimization.

Architecture design

We designed the Trident architecture with the following goals in mind:

  • Independence: It must run independently of other services, and must not bring performance impacts to other services.
  • Robustness: All events must be processed exactly once (i.e. no event is missed and no event is processed twice).
  • Scalability: It must be able to scale up processing power when the event volume surges, and withstand the load when popular campaigns run.

The following diagram depicts how the overall system architecture looks like.

Trident architecture

Trident consumes events from multiple Kafka streams published by various backend services across Grab (e.g. GrabFood orders, Transport rides, GrabPay payment processing, GrabAds events). Given the nature of Kafka streams, Trident is completely decoupled from all other upstream services.

Each processed event is given a unique event key and stored in Redis for 24 hours. For any event that triggers an action, its key is persisted in MySQL as well. Before storing records in both Redis and MySQL, we make sure any duplicate event is filtered out. Together with the at-least-once delivery guaranteed by Kafka, we achieve exactly-once event processing.
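
A minimal sketch of the Redis-based deduplication check, assuming the go-redis client; the key format here is hypothetical.

package trident

import (
    "context"
    "time"

    "github.com/go-redis/redis/v8"
)

// seenBefore reports whether an event with this key was already processed.
// SETNX only succeeds for the first writer, and the key expires after 24 hours.
func seenBefore(ctx context.Context, rdb *redis.Client, eventKey string) (bool, error) {
    firstTime, err := rdb.SetNX(ctx, "trident:event:"+eventKey, 1, 24*time.Hour).Result()
    if err != nil {
        return false, err
    }
    return !firstTime, nil
}

Skipping events that were seen before, combined with Kafka’s at-least-once delivery, is what gives the effectively exactly-once processing described above; keys of action-triggering events are additionally persisted in MySQL.
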

Scalability is a key challenge for Trident. To achieve high performance under massive event volume, we needed to scale on both the server level and data store level. The following mind map shows an outline of our strategies.

Outline of Trident’s scale strategy

Scale servers

Our sources of events are Kafka streams. There are mainly three factors that could affect the load on our system:

  1. Number of events produced in the streams (more rides, food orders, etc. results in more events for us to process).
  2. Number of campaigns running.
  3. Nature of campaigns running. The campaigns that trigger actions for more users cause higher load on our system.

There are naturally two types of approaches to scale up server capacity:

  • Distribute workload among server instances.
  • Reduce load (i.e. reduce the amount of work required to process each event).

Distribute load

Distributing workload seems trivial with the load balancing and automatic horizontal scaling based on CPU usage that cloud providers offer. However, with Kafka, an additional server simply sits idle unless it can consume from a Kafka partition.

Each Kafka partition can only be consumed by one consumer within the same consumer group (our auto-scaling server group in this case). Therefore, any scaling in or out requires matching the Kafka partition configuration with the server auto-scaling configuration.

Here’s an example of a bad case of load distribution:

Kafka partitions config mismatches server auto-scaling config

And here’s an example of a good load distribution where the configurations for the Kafka partitions and the server auto-scaling match:

Kafka partitions config matches server auto-scaling config

Within each server instance, we also tried to increase processing throughput while keeping the resource utilization rate in check. Each Kafka partition consumer has multiple goroutines processing events, and the number of active goroutines is dynamically adjusted according to the event volume from the partition and time of the day (peak/off-peak).
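
A sketch of such a per-partition processing pool is shown below; the Event type and the way the worker count n is chosen are hypothetical, and resizing logic is omitted.

package trident

import "context"

// Event is a placeholder for a decoded Kafka message.
type Event struct {
    Type    string
    Payload []byte
}

// partitionWorkers starts n goroutines that process events from one partition's
// channel concurrently. In production, n would be tuned per partition and per
// time of day (peak/off-peak).
func partitionWorkers(ctx context.Context, events <-chan Event, n int, process func(Event)) {
    for i := 0; i < n; i++ {
        go func() {
            for {
                select {
                case <-ctx.Done():
                    return
                case e, ok := <-events:
                    if !ok {
                        return
                    }
                    process(e)
                }
            }
        }()
    }
}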

Reduce load

You may ask how we reduced the amount of processing work for each event. First, we needed to see where we spent most of the processing time. After performing some profiling, we identified that the rule evaluation logic was the major time consumer.

What is rule evaluation?

Recall that Trident needs to operate thousands of campaigns daily. Each campaign has a set of rules defined. When Trident receives an event, it needs to check through the rules for all the campaigns to see whether there is any match. This checking process is called rule evaluation.

More specifically, a rule consists of one or more conditions combined by AND/OR Boolean operators. A condition consists of an operator with a left-hand side (LHS) and a right-hand side (RHS). The left-hand side is the name of a variable, and the right-hand side is a value. Here is a sample rule in JSON:

Country is Singapore and taxi type is either JustGrab or GrabCar:

  {
    "operator": "and",
    "conditions": [
      {
        "operator": "eq",
        "lhs": "var.country",
        "rhs": "sg"
      },
      {
        "operator": "or",
        "conditions": [
          {
            "operator": "eq",
            "lhs": "var.taxi",
            "rhs": <taxi-type-id-for-justgrab>
          },
          {
            "operator": "eq",
            "lhs": "var.taxi",
            "rhs": <taxi-type-id-for-grabcar>
          }
        ]
      }
    ]
  }

When evaluating a rule, our system loads the value of each LHS variable, evaluates it against the RHS value, and returns a Boolean result (true/false) indicating whether the rule passed or not.
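
To make this concrete, here is an illustrative recursive evaluator matching the JSON structure above, with short-circuiting. It is a sketch, not Trident’s actual code; the Condition struct and the VarLoader abstraction are hypothetical.

package trident

// Condition mirrors the JSON rule format above: either a leaf comparison
// (operator/lhs/rhs) or a Boolean combination of child conditions.
type Condition struct {
    Operator   string      `json:"operator"` // "and", "or", "eq", ...
    LHS        string      `json:"lhs,omitempty"`
    RHS        interface{} `json:"rhs,omitempty"`
    Conditions []Condition `json:"conditions,omitempty"`
}

// VarLoader resolves an LHS variable name ("var.country", "var.taxi", ...) to
// its current value; behind it may be memory, a DB, or another service.
type VarLoader func(name string) (interface{}, error)

// Evaluate walks the rule tree and returns whether it passes.
func Evaluate(c Condition, load VarLoader) (bool, error) {
    switch c.Operator {
    case "and":
        for _, child := range c.Conditions {
            ok, err := Evaluate(child, load)
            if err != nil || !ok {
                return false, err // short-circuit on the first false
            }
        }
        return true, nil
    case "or":
        for _, child := range c.Conditions {
            ok, err := Evaluate(child, load)
            if err != nil {
                return false, err
            }
            if ok {
                return true, nil // short-circuit on the first true
            }
        }
        return false, nil
    case "eq":
        v, err := load(c.LHS)
        if err != nil {
            return false, err
        }
        return v == c.RHS, nil
    default:
        return false, nil // unknown operators fail closed in this sketch
    }
}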

To reduce the resources spent on rule evaluation, there are two types of strategies:

  • Avoid unnecessary rule evaluation
  • Evaluate “cheap” rules first

We implemented these two strategies with event prefiltering and weighted rule evaluation.

Event prefiltering

Just like a DB index helps speed up data look-ups, a pre-built map helped us narrow down the range of campaigns to evaluate. We loaded active campaigns from the DB every few minutes and organized them into an in-memory hash map, with the event type as the key and the list of corresponding campaigns as the value. We picked the event type as the key because it is very fast to determine (most of the time just a type assertion), and it distributes events in a reasonably even way.

When processing events, we just looked up the map, and only ran rule evaluation on the campaigns in the matching hash bucket. This saved us at least 90% of the processing time.
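
A sketch of such a prefilter is shown below; the Campaign type is a hypothetical minimal view of a campaign configuration.

package trident

import "sync"

// Campaign is a hypothetical minimal view of a campaign configuration.
type Campaign struct {
    ID               string
    TriggerEventType string
}

// prefilter buckets active campaigns by the event type that triggers them.
// Reload is called every few minutes with the campaigns loaded from the DB.
type prefilter struct {
    mu          sync.RWMutex
    byEventType map[string][]Campaign
}

func (p *prefilter) Reload(active []Campaign) {
    m := make(map[string][]Campaign, len(active))
    for _, c := range active {
        m[c.TriggerEventType] = append(m[c.TriggerEventType], c)
    }
    p.mu.Lock()
    p.byEventType = m
    p.mu.Unlock()
}

// CandidatesFor returns only the campaigns whose rules need evaluating
// for this event type.
func (p *prefilter) CandidatesFor(eventType string) []Campaign {
    p.mu.RLock()
    defer p.mu.RUnlock()
    return p.byEventType[eventType]
}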

Event prefiltering
Weighted rule evaluation

Evaluating different rules comes with different costs. This is because different variables (i.e. LHS) in the rule can have different sources of values:

  1. The value is already available in memory (already consumed from the event stream).
  2. The value is the result of a database query.
  3. The value is the result of a call to an external service.

These three sources are ranked by cost:

In-memory < database < external service

We aimed to avoid evaluating expensive rules (i.e. those that require calling an external service or querying a DB) as much as possible, while ensuring the correctness of the evaluation results.

First optimization – Lazy loading

Lazy loading is a common performance optimization technique, which literally means “don’t do it until it’s necessary”.

Take the following rule as an example:

A & B

If we load the variable values for both A and B before evaluating, then we load B unnecessarily whenever A is false. Since rule evaluation usually fails early (for example, the transaction amount is less than the given minimum amount), there is no point in loading all the data beforehand. So we do lazy loading, i.e. load data only when evaluating that part of the rule.

Second optimization – Add weight

Let’s take the same example as above, but in a different order.

B & A
The source of data for A is memory, and for B it is an external service.

Now even with lazy loading, we always load the external data for B, even though the evaluation may fail at the next condition, whose data is already in memory.

Since most of our campaigns are targeted, a popular condition is to check whether a user is in a certain segment, and this is usually the first condition that a campaign creator sets. This data resides in another service, so it is quite expensive to evaluate this condition first even though the next condition’s data may already be in memory (e.g. whether the taxi type is JustGrab).

So, we did the next phase of optimization here, by sorting the conditions based on the weight of their data source (low weight if the data is in memory, higher if it’s in our database, and highest if it’s in an external system). If AND were the only logical operator we supported, this would have been quite simple, but the presence of OR made it complex. We came up with an algorithm that orders the evaluation by weight while respecting the AND/OR structure. Here’s what the flowchart looks like:

Event flowchart

An example:

Conditions: A & ( B | C ) & ( D | E )

Actual result: true & ( false | false ) & ( true | true ) --> false

Weight: B < D < E < C < A

Expected check order: B, D, C

First, we validate B, which is false. We cannot skip its sibling conditions here, since B and C are connected by |. Next, we check D. D is true and its only sibling E is connected by |, so we can mark E as “skip”. When we reach E, it has been marked “skip”, so we skip it. We still cannot determine the final result, so we continue by validating C, which is false. Now we know (B | C) is false, so the whole condition is false too, and we can stop.
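
Reusing the Condition, VarLoader, and Evaluate sketch from earlier, a simplified version of weighted evaluation can look like the following. Note that this is a simplification: it sorts whole sub-trees by their cheapest leaf and short-circuits, whereas the production algorithm additionally interleaves checks across sibling groups and marks siblings to skip, as in the walkthrough above.

package trident

import "sort"

// Weight ranks how expensive it is to load a condition's data:
// in-memory < database < external service.
type Weight int

const (
    WeightMemory   Weight = 1
    WeightDatabase Weight = 2
    WeightExternal Weight = 3
)

// weightOf returns the cost of a condition: for leaves, the cost of the LHS
// variable's data source; for and/or groups, the cheapest child.
func weightOf(c Condition, sourceWeight func(lhs string) Weight) Weight {
    if len(c.Conditions) == 0 {
        return sourceWeight(c.LHS)
    }
    w := WeightExternal
    for _, child := range c.Conditions {
        if cw := weightOf(child, sourceWeight); cw < w {
            w = cw
        }
    }
    return w
}

// EvaluateWeighted checks the cheapest conditions first and evaluates lazily
// with short-circuiting, so expensive variables are often never loaded at all.
func EvaluateWeighted(c Condition, load VarLoader, sourceWeight func(lhs string) Weight) (bool, error) {
    if len(c.Conditions) == 0 {
        return Evaluate(c, load) // leaf: defer to the basic evaluator
    }
    children := append([]Condition(nil), c.Conditions...)
    sort.Slice(children, func(i, j int) bool {
        return weightOf(children[i], sourceWeight) < weightOf(children[j], sourceWeight)
    })
    for _, child := range children {
        ok, err := EvaluateWeighted(child, load, sourceWeight)
        if err != nil {
            return false, err
        }
        if c.Operator == "and" && !ok {
            return false, nil // a cheap false ends the AND early
        }
        if c.Operator == "or" && ok {
            return true, nil // a cheap true ends the OR early
        }
    }
    return c.Operator == "and", nil
}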

Sub-streams

After investigation, we learned that we were consuming a particular stream that produced terabytes of data per hour, which caused our CPU usage to shoot up by 30%. We found that we process only a handful of event types from that stream, so we introduced a sub-stream in between that contains only the event types we want to support. This sub-stream is populated from the main stream by another server, thereby reducing the load on Trident.

Protect downstream

While we scaled up our servers aggressively, we needed to keep in mind that many downstream services would receive more traffic as a result. For example, we call the GrabRewards service for awarding rewards, or the LocaleService for checking a user’s locale. It is crucial for us to have control over our outbound traffic to avoid causing any stability issues in Grab.

Therefore, we implemented rate limiting. A total rate limit is configured for calling each downstream service, and the limit varies across different time ranges (e.g. a tighter limit for calling a critical service during peak hours).
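
A minimal sketch of per-downstream rate limiting using Go’s golang.org/x/time/rate token-bucket package is shown below; the service names and limits are made up, and in production the configured limit would additionally vary by time of day.

package trident

import (
    "context"

    "golang.org/x/time/rate"
)

// downstreamLimiters holds one token-bucket limiter per downstream service.
var downstreamLimiters = map[string]*rate.Limiter{
    "grabrewards":   rate.NewLimiter(rate.Limit(500), 50), // 500 req/s, burst of 50
    "localeservice": rate.NewLimiter(rate.Limit(200), 20),
}

// callDownstream blocks until the target service's limiter allows the call.
func callDownstream(ctx context.Context, service string, do func(context.Context) error) error {
    if l, ok := downstreamLimiters[service]; ok {
        if err := l.Wait(ctx); err != nil {
            return err // context cancelled or deadline exceeded while waiting
        }
    }
    return do(ctx)
}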

Scale data store

We have two types of storage in Trident: cache storage (Redis) and persistent storage (MySQL and others).

Scaling cache storage is straightforward, since Redis Cluster already offers everything we need:

  • High performance: Known to be fast and efficient.
  • Scaling capability: New shards can be added at any time to spread out the load.
  • Fault tolerance: Data replication makes sure that data does not get lost when any single Redis instance fails, and the automatic election mechanism ensures the cluster can restore itself after a single-instance failure.

All we needed to ensure was that our cache keys hash evenly across the different shards.

As for scaling persistent data storage, we tackled it in two ways just like we did for servers:

  • Distribute load
  • Reduce load (both overall and per query)

Distribute load

There are two levels of load distribution for persistent storage: infra level and DB level. On the infra level, we split data with different access patterns into different types of storage. Then on the DB level, we further distributed read/write load onto different DB instances.

Infra level

Just like any typical online service, Trident has two types of data in terms of access pattern:

  • Online data: Frequent access. Requires quick access. Medium size.
  • Offline data: Infrequent access. Tolerates slow access. Large size.

For online data, we need a high-performance database, while for offline data, we can just use cheap storage. The following table shows Trident’s online/offline data and the corresponding storage.

Trident’s online/offline data and storage

Writing of offline data is done asynchronously to minimize performance impact as shown below.

Online/offline data split

For the APIs that retrieve this offline data for users, we configure higher timeouts.

DB level

We further distributed load on the MySQL DB level, mainly by introducing replicas, and redirecting all read queries that can tolerate slightly outdated data to the replicas. This relieved more than 30% of the load from the master instance.

Going forward, we plan to segregate the single MySQL database into multiple databases, based on table usage, to further distribute load if necessary.

Reduce load

To reduce the load on the DB, we reduced the overall number of queries and removed unnecessary ones. We also optimized the schema and queries so that each query completes faster.

Query reduction

We needed to track the usage of each campaign. The tracking is just incrementing a value against a unique key in the MySQL database. For a popular campaign, it’s possible that many increment queries (writes) are made to the database for the same key, which can cause an IOPS burst. So we came up with the following algorithm to reduce the number of queries (a simplified sketch follows the list):

  • Have a fixed number of threads per instance that are allowed to make such queries to the DB.
  • Increment requests are queued onto these threads.
  • If a thread is idle (not busy querying the database), it writes the increment to the database immediately.
  • If the thread is busy, the increment is accumulated in memory.
  • When the thread becomes free, it applies the accumulated sum to the database in a single query.
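
A rough sketch of this coalescing writer, with hypothetical names and error handling omitted:

package trident

import "context"

// UsageDB abstracts the MySQL usage-counter write; hypothetical.
type UsageDB interface {
    IncrementBy(ctx context.Context, key string, delta int64) error
}

type increment struct {
    key   string
    delta int64
}

// usageWriter queues increments to a fixed set of workers; whatever accumulates
// while a worker is busy writing is coalesced into one query per key.
type usageWriter struct {
    db    UsageDB
    queue chan increment
}

func newUsageWriter(db UsageDB, workers, buffer int) *usageWriter {
    w := &usageWriter{db: db, queue: make(chan increment, buffer)}
    for i := 0; i < workers; i++ {
        go w.loop()
    }
    return w
}

// Track records one usage increment; it never writes to the DB directly.
func (w *usageWriter) Track(key string, delta int64) {
    w.queue <- increment{key: key, delta: delta}
}

func (w *usageWriter) loop() {
    for inc := range w.queue {
        // Coalesce everything that queued up while we were busy.
        pending := map[string]int64{inc.key: inc.delta}
        for drained := false; !drained; {
            select {
            case more := <-w.queue:
                pending[more.key] += more.delta
            default:
                drained = true
            }
        }
        for key, delta := range pending {
            _ = w.db.IncrementBy(context.Background(), key, delta) // error handling omitted in this sketch
        }
    }
}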

To prevent accidental over-awarding of benefits (rewards, points, etc.), we require campaign creators to set limits. However, some campaigns don’t need a limit, so the campaign creators just specify a very large number. Such popular campaigns can cause very high QPS to our database. We had a simple trick to address this: we just don’t track usage when the limit is that high. Do you think people really want to limit usage when they set the per-user limit to 100,000? 😉

Query optimization

One of our requirements was to track the usage of a campaign, both overall and per user (plus variants such as daily overall and daily per user). We used the following query for this purpose:

INSERT INTO … ON DUPLICATE KEY UPDATE value = value + inc

The table had a unique key index (combining multiple columns) along with a usual auto-increment integer primary key. We encountered performance issues arising from MySQL gap locks when high write QPS hit this table (i.e. when popular campaigns ran). After testing out a few approaches, we ended up making the following changes to solve the problem:

  1. Removed the auto-increment integer primary key.
  2. Converted the secondary unique key to the primary key.

Conclusion

Trident is Grab’s in-house real-time IFTTT engine, which processes events and operates business campaigns at a massive scale. In this article, we discussed the strategies we implemented to achieve large-scale, high-performance event processing. The overall ideas of distributing and reducing load may be straightforward, but there were a lot of detailed decisions and learnings behind them, which we have shared here. If you have any comments or questions about Trident, feel free to leave a comment below.

All the examples of campaigns given in this article are for demonstration purposes only; they are not real live campaigns.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub Availability Report: December 2020

Post Syndicated from Keith Ballinger original https://github.blog/2021-01-06-github-availability-report-december-2020/

Introduction

In December, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will provide a summary and follow-up details on how we addressed an incident mentioned in November’s report.

Follow-up to November 27 16:04 UTC (lasting one hour and one minute)

Upon further investigation around one of the incidents mentioned in November’s Availability Report, we discovered an edge case that triggered a large number of GitHub App token requests. This caused abnormal levels of replication lag within one of our MySQL clusters, specifically affecting the GitHub Actions service. This particular scenario resulted in amplified queries and increased the database lag, which impacted the database nodes that process GitHub App token requests.

When a GitHub Action is invoked, the Action is passed a GitHub App token to perform tasks on GitHub. In this case, the database lag resulted in the failure of some of those token requests because the database replicas did not have up to date information.

To help avoid this class of failure, we are updating the queries to prevent large quantities of token requests from overloading the database servers in the future.

In summary

Whether we’re introducing a system to manage flaky tests or improving our CI workflow, we’ve continued to invest in our engineering systems and overall reliability. To learn more about what we’re working on, visit GitHub’s engineering blog.

Building On-Call Culture at GitHub

Post Syndicated from Mary Moore-Simmons original https://github.blog/2021-01-06-building-on-call-culture-at-github/

As GitHub grows in size and our product offerings grow in number and complexity, we need to constantly evolve our on-call strategy so we can continue to be the trusted home for all developers. Expanding upon our Building GitHub blog series, this post gives you a window into one of the major steps along our continuous journey for operational excellence at GitHub.

Monolithic On-Call

Most of the GitHub products you interact with are in a large Ruby on Rails monolith. Monolithic codebases are common for many high-growth startups, and it’s a difficult situation to detangle yourself from. One of the pain points we had was problems with the on-call system for our monolith.

The biggest pain points with our monolithic on-call structure were:

  • The main GitHub monolith spans a huge number of products and features. Most engineers (understandably) weren’t familiar with enough of the codebase to feel confident in their ability to respond to incidents when on-call. Often, the engineer who got paged would escalate to another team, which made them feel more like a switchboard operator than an engineer.
  • The on-call rotation was large and engineers were on-call for 24 hours at a time. As a result, engineers were only on-call ~4 times per year for one day, and most never gained the context they needed to feel confident while on-call.
  • Since the monitoring and alerting for this on-call rotation was spread across most engineering teams at GitHub, and people only had to experience on-call for 24 hours at a time, the monitoring and documentation were not well maintained. As a result, we had noisy alerts and poor runbooks.
  • Because most engineers didn’t feel confident with the monolithic on-call shift, the same 5-10 people who knew the platform best were involved in every production incident, which caused an imbalance in on-call responsibilities.

New On-Call Culture

To address these pain points, we made large changes to our on-call structure. We split up the monolith on-call rotation so every engineering team was on-call for the code they maintain. We had a number of logistical difficulties to overcome, but most of the hurdles we faced were cultural and educational.

Logistical Hurdles

File Ownership

The GitHub monolith contains over 16,000 files, and file ownership was often unclear or undocumented. To solve this, we rolled out a new system that associates files to services, and then assigns services to teams, which made it much easier to change ownership as teams changed. Here’s a snippet of the file to service mappings code we use in the monolith:

## Apps
### API
app/api/github_apps.rb     :apps
app/api/grants.rb          :apps
### Components             :apps
app/component/apps*        :apps
test/app/component/apps*   :apps

## Authzd
**/*authzd*                :authzd
app/models/permissions.rb  :authzd

You’ll see that each file in the monolith maps to a service, such as “apps” or “authzd”. We then have another file in the monolith that lists every service and maps them to teams. Here’s a snippet of that code for the apps service:

apps:
  name: GitHub Apps
  description: Allows users to build 'GitHub Apps' on top of GitHub.
  team: ecosystem-apps

This information is automatically pulled into a service we developed in-house, called the Service Catalog, which then displays the information to GitHub employees. You can search for services in the Service Catalog, which allows GitHub engineers, support, and product to track down which engineering team owns which services. Here’s an example of what it looks like for the apps service:

We asked all teams to move to this new service ownership model, implemented a linter that doesn’t allow files in the monolith to be added or updated unless their ownership information is complete, and worked with engineering and leadership to assign ownership for major services that were still unowned.
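To make that enforcement concrete, here is a minimal sketch in Python of the kind of check such a linter can run. It assumes an ownership file in the “<pattern>  :<service>” format shown above; the file name, command-line interface, and helper names are hypothetical and not GitHub’s actual tooling.

#!/usr/bin/env python3
"""Hypothetical ownership linter sketch: every changed file must map to a service.

Assumes an ownership file with lines of the form "<glob pattern>  :<service>",
matching the snippet above. Not GitHub's actual tooling.
"""
import fnmatch
import sys


def load_ownership(path):
    """Parse (pattern, service) pairs, skipping comments and blank lines."""
    rules = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            pattern, _, service = line.rpartition(":")
            rules.append((pattern.strip(), service.strip()))
    return rules


def service_for(filename, rules):
    """Return the owning service for `filename`, or None if no pattern matches."""
    for pattern, service in rules:
        if filename == pattern or fnmatch.fnmatch(filename, pattern):
            return service
    return None


if __name__ == "__main__":
    # Usage: ownership_lint.py SERVICE_MAPPINGS changed_file [changed_file ...]
    mapping_file, *changed = sys.argv[1:]
    rules = load_ownership(mapping_file)
    unowned = [f for f in changed if service_for(f, rules) is None]
    if unowned:
        print("The following files have no service owner; add them to the mapping:")
        print("\n".join(f"  {f}" for f in unowned))
        sys.exit(1)

A CI job could run a check like this against the list of files touched by a pull request and fail the build when anything is left unowned.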

Monitoring and Alerting

Monitoring and alerting was set up for the monolith as a whole, so we asked teams to create monitoring specific to their areas of responsibility. Once this was mostly complete, a group of more senior engineers reviewed the remaining monolith-wide alerts, split them up, assigned them to teams, and decommissioned those that were no longer necessary.

Large Scope of Teams Involved

There were over 50 engineering teams that needed to make changes to their processes in order to complete this effort. To coordinate this, we opened a GitHub issue for every team with clear checklists for the work needed. From there we did regular check-ins with teams and provided help and information so we didn’t leave anyone behind.

Cultural and Educational Hurdles

  • Adjusting to the pandemic: This was a huge change for our engineers, and seven months into the project a global pandemic hit. This caused significant compounding anxiety that negatively impacted people’s ability to think critically. As a result, the project required a more high-touch, empathy-first approach than we originally expected.
  • Training touchpoints: Many of our engineers had never been on-call before and didn’t have experience with operational best practices. We designed and delivered three rounds of training, set up office hours with on-call experts, created significant tooling and documentation for engineering teams, and opened Slack channels where people could ask questions and get help.
  • Instilling work-life balance while on-call: Several engineers were anxious about the impact on-call would have on their lives. Once you have been on-call for months or years, going to the grocery store or going for a bike ride while on-call is a normal activity that you can plan for. For those who haven’t been on-call before, however, it can be daunting to design personal strategies that let you respond to a page within minutes while still doing everyday tasks like showering or grocery shopping. We worked with teams to understand their anxieties, documented tips and tricks from people with on-call experience, and worked 1:1 with people who had further concerns. We also reinforced that team members are there to support each other: by taking on-call for a couple of hours if someone wants to go for a run or handle childcare, by being a back-up if someone misses a page, or by designing a follow-the-sun rotation if the team spans time zones. This is yet another situation where GitHub’s global remote team is a major asset: we can lean on our team members in other parts of the world to take on-call when we’re not working. Many people will not become comfortable balancing on-call with having a life until they spend months or even years being on-call regularly, so some of this comfort is a matter of time rather than training.
  • Cultivating a blameless culture: Many engineers were anxious about performing well while on-call. They were concerned they would miss a page or make a mistake and let down their team. We worked with the organization to reinforce the message that mistakes are okay and that outages happen no matter how well you perform your on-call duties, but we still have a lot of work to do in this area. We are undergoing a long-term effort to continue fostering a blameless and supportive on-call culture at GitHub, which includes nurturing safe spaces to learn about on-call and publicly celebrating people who bravely worked on something they weren’t familiar with while on-call. It will be a long journey to create a sense of safety about making mistakes on the engineering teams at GitHub, but we can improve over time.
  • Meeting the criticality needs for each team: Different GitHub products are at different levels of criticality, which means some engineering teams have to respond within 5 minutes if they are paged, and some don’t have to respond until the next business day if their service breaks. Some engineers are concerned that this causes an unfair imbalance between engineering teams. However, different engineers have different interests. Some would prefer to work on more business-critical, technically complex systems with stricter uptime requirements, while others value work/life balance more than the technical complexity needed for products with strict operational requirements. Over time, engineers will select teams that have a level of operational rigor they identify most with, and this will correct itself.
  • Dedicating time to focus on long-term success: As we rolled out the change, several teams expressed concerns that they were not able to spend enough time making their on-call experience better. We provided clear documentation reinforcing that the person on-call should be focused on improving the on-call experience when they are not responding to pages. This includes updating runbooks, tuning noisy alerts, scripting or automating on-call tasks to eliminate or streamline them for sleep-deprived engineers, and fixing underlying technical debt that makes the on-call experience worse. We communicated that teams might spend ~20% of their time on technical debt and ~20% of their time on improving the on-call experience if needed for the stability of the product, customer experience, and engineering experience. These guidelines require long-term focus from leadership. We need to continuously spread the message from Senior VPs all the way to line managers that a sustainable engineering on-call experience is critical to the success of GitHub.

Continuing the Journey

GitHub’s incident resolution time improved after the bulk of this initiative was completed, but our journey is never done. Organizations need to constantly improve their operational best practices or they’ll fall behind.

There are several long-term cultural changes discussed above that we must continue to promote for years at GitHub. We are conducting a retrospective in January to learn how we can improve rolling out large changes like these in the future, and how we can continue to improve the on-call experience for our engineers and GitHub’s stability for our customers. In addition, we are sending out regular surveys to engineers about their on-call experience and we continue to monitor GitHub’s uptime. We will continue to meet with engineering teams to discuss their on-call pain points and how to improve so that we can encourage a growth mindset in our engineering teams and help each other learn operational best practices. Everyone at GitHub is in this journey together, and we need to support each other in our drive for excellence so we can continue to be the trusted home for all developers.

Git clone: a data-driven study on cloning behaviors

Post Syndicated from Solmaz Abbaspoursani original https://github.blog/2020-12-22-git-clone-a-data-driven-study-on-cloning-behaviors/

@derrickstolee recently discussed several different git clone options, but how do those options actually affect your Git performance? Which option is fastest for your client experience? Which option is fastest for your build machines? How can these options impact server performance? If you are a GitHub Enterprise Server administrator it’s important that you understand how the server responds to these options under the load of multiple simultaneous requests.

Here at GitHub, we use a data-driven approach to answer these questions. We ran an experiment to compare these different clone options and measured the client and server behavior. It is not enough to just compare git clone times, because that is only the start of your interaction with a Git repository. In particular, we wanted to determine how these clone options change the behavior of future Git operations such as git fetch.

In this experiment, we aimed to answer the following questions:

  1. How fast are the various git clone commands?
  2. Once we have cloned a repository, what kind of impact do future git fetch commands have on the server and client?
  3. What impact do full, shallow, and partial clones have on a Git server? This is mostly relevant to GitHub Enterprise Server administrators.
  4. Will the repository shape and size make any difference in the overall performance?

It is worth special emphasis that these results come from simulations that we performed in our controlled environments and do not simulate the complex workflows that many Git users rely on. Depending on your workflows and repository characteristics, these results may change. Perhaps this experiment provides a framework that you could follow to measure how your workflows are affected by these options. If you would like help analyzing your workflows, feel free to engage with GitHub’s Professional Services team.

For a summary of our findings, feel free to jump to our conclusions and recommendations.

Experiment design

To maximize the repeatability of our experiment, we use open source repositories for our sample data. This way, you can compare your repository shape to the tested repositories to see which is most applicable to your scenario.

We chose to use the jquery/jquery, apple/swift, and torvalds/linux repositories. These three repositories vary in size and in the number of commits, blobs, and trees.

These repositories were mirrored to a GitHub Enterprise Server instance running version 2.22 on an 8-core cloud machine. We use an internal load testing tool based on Gatling to generate Git requests against the test instance. We ran each test with a specific number of users across 5 different load generators for 30 minutes. All of our load generators use Git version 2.28.0, which uses protocol version 1 by default. Note that protocol version 2 only improves ref advertisement, so we don’t expect it to make a difference in our tests.

Once a test is complete, we use a combination of Gatling results, ghe-governor and server health metrics to analyze the test.

Test repository characteristics

The git-sizer tool measures the size of Git repositories along many dimensions. In particular, we care about the total size on disk along with the count of each object type. The table below contains this information for our three test repositories.

Repo           | Blobs     | Trees     | Commits | Size (MB)
jquery/jquery  | 14,779    | 22,579    | 7,873   | 40
apple/swift    | 390,116   | 649,322   | 132,316 | 750
torvalds/linux | 2,174,798 | 4,619,864 | 968,500 | 4,000

The jquery/jquery repository is a fairly small repository with only 40MB of disk space. apple/swift is a medium-sized repository with around 130 thousand commits and 750MB on disk. The torvalds/linux repository is typically the gold standard for Git performance tests on open source repositories. It uses 4 gigabytes of disk space and has close to a million commits.
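If you want a rough version of these counts for your own repository without installing git-sizer, you can tally object types directly with git cat-file. The sketch below is only an approximation: it counts every object in the local object database, including any unreachable ones, so its numbers can differ slightly from git-sizer’s output, and it assumes a full local clone.

#!/usr/bin/env python3
"""Rough object-type tally for a local clone; an approximation of the table above."""
import subprocess
from collections import Counter


def object_counts(repo_path="."):
    """Count commits, trees, blobs, and tags in the repository's object database."""
    out = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-check=%(objecttype)", "--batch-all-objects"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(out.split())


if __name__ == "__main__":
    counts = object_counts()
    for kind in ("commit", "tree", "blob", "tag"):
        print(f"{kind:>6}: {counts.get(kind, 0):,}")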

Test scenarios

We care about the following clone options:

  1. Full clones.
  2. Shallow clones (--depth=1).
  3. Treeless clones (--filter=tree:0).
  4. Blobless clones (--filter=blob:none).

In addition to these options at clone time, we can also choose to fetch in a shallow way using --depth=1. Since treeless and blobless clones have their own way to reduce the objects downloaded during git fetch, we only test shallow fetches on full and shallow clones.

We organized our test scenarios into the following ten categories, labeled T1 through T10. T1 to T4 simulate four different git clone types. T5 to T10 simulate various git fetch operations into these cloned repositories.

Test# | git command                  | Description
T1    | git clone                    | Full clone
T2    | git clone --depth 1          | Shallow clone
T3    | git clone --filter=tree:0    | Treeless partial clone
T4    | git clone --filter=blob:none | Blobless partial clone
T5    | T1 + git fetch               | Full fetch in a fully cloned repository
T6    | T1 + git fetch --depth 1     | Shallow fetch in a fully cloned repository
T7    | T2 + git fetch               | Full fetch in a shallow cloned repository
T8    | T2 + git fetch --depth 1     | Shallow fetch in a shallow cloned repository
T9    | T3 + git fetch               | Full fetch in a treeless partially cloned repository
T10   | T4 + git fetch               | Full fetch in a blobless partially cloned repository

In partial clones, the new blobs at the new ref tip are not downloaded until we navigate to that position and populate our working directory with those blob contents. To make a fair comparison with the full and shallow clone cases, we also have our simulation run git reset --hard origin/$branch in all T5 to T10 tests. In T5 to T8 this extra step will not have a huge impact, but in T9 and T10 it ensures the blob downloads are included in the cost.

In all the scenarios above, a single user was also set to repeatedly change 3 random files in the repository and push them to the same branch that the other users were cloning and fetching. This simulates repository growth so the git fetch commands actually have new data to download.
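If you want to repeat a scaled-down version of this experiment against your own repository, a single-client harness is enough to get ballpark client-side numbers. The sketch below times the four clone variants plus the follow-up fetch and reset; the repository URL and branch name are placeholders, and it says nothing about server-side CPU, which is the other half of our measurements.

#!/usr/bin/env python3
"""Minimal single-client timing harness for the clone and fetch scenarios above.

The repository URL and branch name are placeholders; this measures wall-clock
time on one client only and does not reproduce our Gatling-based load tests.
"""
import shutil
import subprocess
import tempfile
import time

REPO = "https://github.com/jquery/jquery.git"  # substitute your own repository
BRANCH = "main"                                # assumed default branch name

CLONE_VARIANTS = {
    "T1 full":     [],
    "T2 shallow":  ["--depth=1"],
    "T3 treeless": ["--filter=tree:0"],
    "T4 blobless": ["--filter=blob:none"],
}


def timed(cmd, cwd=None):
    """Run a command, discarding output, and return its duration in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, cwd=cwd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start


for label, extra in CLONE_VARIANTS.items():
    workdir = tempfile.mkdtemp(prefix="clone-test-")
    target = f"{workdir}/repo"
    clone_s = timed(["git", "clone", *extra, REPO, target])
    # Mirror the T5-T10 follow-up: fetch, then reset so partial clones also pay
    # their deferred blob downloads.
    fetch_s = timed(["git", "fetch", "origin"], cwd=target)
    reset_s = timed(["git", "reset", "--hard", f"origin/{BRANCH}"], cwd=target)
    print(f"{label:12s} clone {clone_s:7.1f}s  fetch {fetch_s:5.1f}s  reset {reset_s:5.1f}s")
    shutil.rmtree(workdir)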

Test Results

Let’s dig into the numbers to see what our experiment says.

git clone performance

The full numbers are provided in the tables below.

Unsurprisingly, shallow clone is the fastest clone for the client, followed by treeless and then blobless partial clones, and finally full clones. This performance is directly proportional to the amount of data required to satisfy the clone request. Recall that full clones need all reachable objects, blobless clones need all reachable commits and trees, and treeless clones need all reachable commits. A shallow clone is the only clone type that does not grow at all along with the history of your repository.

The performance impact of these clone types grows in proportion to the repository size, especially the number of commits. For example, a shallow clone of torvalds/linux is four times faster than a full clone, while a treeless clone is only twice as fast and a blobless clone is only 1.5 times as fast. It is worth noting that the development culture of the Linux project promotes very small blobs that compress extremely well. We expect the performance difference to be greater for most other projects with a higher blob count or size.

As for server performance, we see that the Git CPU time per clone is higher for the blobless partial clone (T4). Looking a bit closer with ghe-governor, we observe that the higher Git CPU is mainly due to the additional pack-objects work in the partial clone scenarios (T3 and T4). In the torvalds/linux repository, the Git CPU time spent on pack-objects is four times higher in a treeless partial clone (T3) than in a full clone (T1). In contrast, in the smaller jquery/jquery repository, a full clone consumes more CPU per clone than the partial clones (T3 and T4). Shallow clones of all three repositories consume the lowest amount of total and Git CPU per clone.

If the server sends more data during a full clone, why does it spend more CPU on a partial clone? When Git sends all reachable objects to the client, it mostly transfers the data it has on disk without decompressing or converting it. However, partial and shallow clones need to extract that subset of data and repackage it to send to the client. We are investigating ways to reduce this CPU cost in partial clones.

The real fun starts after cloning the repository and users start developing and pushing code back up to the server. In the next section we analyze scenarios T5 through T10, which focus on the git fetch and git reset --hard origin/$branch commands.

jquery/jquery clone performance

Test# | Description            | Clone avgRT (milliseconds)  | Git CPU spent per clone
T1    | full clone             | 2,000ms (Slowest)           | 450ms (Highest)
T2    | shallow clone          | 300ms (6x faster than T1)   | 15ms (30x less than T1)
T3    | treeless partial clone | 900ms (2.5x faster than T1) | 270ms (1.7x less than T1)
T4    | blobless partial clone | 900ms (2.5x faster than T1) | 300ms (1.5x less than T1)

apple/swift clone performance

Test# | Description            | Clone avgRT (seconds)   | Git CPU spent per clone
T1    | full clone             | 50s (Slowest)           | 8s (2x less than T4)
T2    | shallow clone          | 8s (6x faster than T1)  | 3s (6x less than T4)
T3    | treeless partial clone | 16s (3x faster than T1) | 13s (Similar to T4)
T4    | blobless partial clone | 22s (2x faster than T1) | 15s (Highest)

torvalds/linux clone performance

Test# | Description            | Clone avgRT (minutes)     | Git CPU spent per clone
T1    | full clone             | 5m (Slowest)              | 60s (2x less than T4)
T2    | shallow clone          | 1.2m (4x faster than T1)  | 40s (3.5x less than T4)
T3    | treeless partial clone | 2.4m (2x faster than T1)  | 120s (Similar to T4)
T4    | blobless partial clone | 3m (1.5x faster than T1)  | 130s (Highest)

git fetch performance

The full fetch performance numbers are provided in the tables below, but let’s first summarize our findings.

The biggest finding is that shallow fetches are the worst possible option, particularly from full clones. The technical reason is that the existence of a “shallow boundary” disables an important performance optimization on the server. This forces the server to walk commits and trees to find what’s reachable from the client’s perspective. This is more expensive in the full clone case because there are more commits and trees on the client that the server must check against to avoid sending duplicates. Also, as more shallow commits accumulate, the client needs to send more data to the server to describe those shallow boundaries.

The two partial clone options have drastically different behavior in the git fetch and git reset --hard origin/$branch sequence. Fetching into a blobless partial clone slows the reset command by a small but measurable amount, not enough to make a noticeable difference from the user’s perspective. In contrast, fetching into a treeless partial clone takes significantly more time because the server needs to send all trees and blobs reachable from a commit’s root tree in order to satisfy the git reset --hard origin/$branch command.

Due to these extra costs as a repository grows, we strongly recommend against shallow fetches and fetching from treeless partial clones. The only recommended scenario for a treeless partial clone is for quickly cloning on a build machine that needs access to the commit history, but will delete the repository at the end of the build.

Blobless partial clones do increase the Git CPU costs on the server somewhat, but the network data transfer is much less than a full clone or a full fetch from a shallow clone. The extra CPU cost is likely to become less important if your repository has larger blobs than our test repositories. In addition, you have access to the full commit history, which might be valuable to real users interacting with these repositories.

We also noticed a surprising result during our testing. During the T9 and T10 tests for the Linux repository, our load generators ran into memory issues: under the heavy load we were generating, these scenarios triggered automatic garbage collection (GC) more often. GC in the Linux repository is expensive and involves a full repack of all Git data. Since we were testing on a Linux client, the GC processes were launched in the background to avoid blocking our foreground commands. However, as we kept fetching, we ended up with several concurrent background GC processes; this is not a realistic scenario but an artifact of our synthetic load testing. We ran git config gc.auto false to prevent this from affecting our test results.

It is worth noting that blobless partial clones might trigger automatic garbage collection more often than a full clone. This is a natural byproduct of splitting the data into a larger number of small requests. We have work in progress to make Git’s repository maintenance be more flexible, especially for large repositories where a full repack is too time-consuming. Look forward to more updates about that feature here on the GitHub blog.

jquery/jquery fetch performance

Test# | Scenario                                              | git fetch avgRT | git reset --hard | Git CPU spent per fetch
T5    | full fetch in a fully cloned repository               | 200ms           | 5ms              | 4ms (Lowest)
T6    | shallow fetch in a fully cloned repository            | 300ms           | 5ms              | 18ms (4x more than T5)
T7    | full fetch in a shallow cloned repository             | 200ms           | 5ms              | 4ms (Similar to T5)
T8    | shallow fetch in a shallow cloned repository          | 250ms           | 5ms              | 14ms (3x more than T5)
T9    | full fetch in a treeless partially cloned repository  | 200ms           | 85ms             | 9ms (2x more than T5)
T10   | full fetch in a blobless partially cloned repository  | 200ms           | 40ms             | 6ms (1.5x more than T5)

apple/swift fetch performance

Test# | Scenario                                              | git fetch avgRT | git reset --hard | Git CPU spent per fetch
T5    | full fetch in a fully cloned repository               | 350ms           | 80ms             | 20ms (Lowest)
T6    | shallow fetch in a fully cloned repository            | 1,500ms         | 80ms             | 300ms (13x more than T5)
T7    | full fetch in a shallow cloned repository             | 300ms           | 80ms             | 20ms (Similar to T5)
T8    | shallow fetch in a shallow cloned repository          | 350ms           | 80ms             | 45ms (2x more than T5)
T9    | full fetch in a treeless partially cloned repository  | 300ms           | 300ms            | 70ms (3x more than T5)
T10   | full fetch in a blobless partially cloned repository  | 300ms           | 150ms            | 35ms (1.5x more than T5)

torvalds/linux fetch performance

Test# | Scenario                                              | git fetch avgRT | git reset --hard | Git CPU spent per fetch
T5    | full fetch in a fully cloned repository               | 350ms           | 250ms            | 40ms (Lowest)
T6    | shallow fetch in a fully cloned repository            | 6,000ms         | 250ms            | 1,000ms (25x more than T5)
T7    | full fetch in a shallow cloned repository             | 300ms           | 250ms            | 50ms (1.3x more than T5)
T8    | shallow fetch in a shallow cloned repository          | 400ms           | 250ms            | 80ms (2x more than T5)
T9    | full fetch in a treeless partially cloned repository  | 350ms           | 1,250ms          | 400ms (10x more than T5)
T10   | full fetch in a blobless partially cloned repository  | 300ms           | 500ms            | 140ms (3.5x more than T5)

What does this mean for you?

Our experiment demonstrated some performance changes between these different clone and fetch options. Your mileage may vary! Our experimental load was synthetic, and your repository shape can differ greatly from these repositories.

Here are some common themes we identified that could help you choose the right scenario for your own usage:

If you are a developer focused on a single repository, the best approach is to do a full clone and then always perform full fetches into that clone. You might opt for a blobless partial clone instead if the repository is very large due to many large blobs, as that clone type will help you get started more quickly. The trade-off is that some commands, such as git checkout or git blame, will need to download new blob data when necessary.

In general, calculating a shallow fetch is computationally more expensive than a full fetch. Always use a full fetch instead of a shallow fetch, in both fully and shallow cloned repositories.

In workflows such as CI builds, where you clone once and delete the repository immediately afterward, shallow clones are a good option. Shallow clones are the fastest way to get a copy of the working directory at the tip commit. If you need the commit history for your build, then a treeless partial clone might work better for you than a full clone. Bear in mind that for larger repositories such as torvalds/linux, it will save time on the client but is somewhat heavier on your Git server than a full clone.

Blobless partial clones are particularly effective if you are using Git’s sparse-checkout feature to reduce the size of your working directory. The combination greatly reduces the number of blobs you need to do your work. Sparse-checkout does not reduce the required data transfer for shallow clones.
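As a concrete illustration of that combination, the sketch below performs a blobless partial clone with a sparse checkout limited to a couple of directories. The repository URL and directory list are placeholders; adapt them to your own project.

#!/usr/bin/env python3
"""Sketch: blobless partial clone combined with sparse-checkout (placeholders below)."""
import subprocess

REPO = "https://github.com/apple/swift.git"  # placeholder: use your own repository
DIRS = ["docs", "utils"]                     # placeholder: directories you work in

# Clone without blobs and start with a sparse (top-level only) working directory.
subprocess.run(["git", "clone", "--filter=blob:none", "--sparse", REPO, "repo"],
               check=True)
# Restrict the checkout to the directories you need; blobs for these paths are
# downloaded on demand as the working directory is populated.
subprocess.run(["git", "-C", "repo", "sparse-checkout", "set", *DIRS], check=True)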

Notably, we did not test repositories that are significantly larger than the torvalds/linux repository. Such repositories are not really available in the open, but are becoming increasingly common for private repositories. If you feel your repository is not represented by our test repositories, then we recommend trying to replicate our experiments yourself.

Our test process does not simulate a real-life situation in which users have different workflows and work on different branches. The set of Git commands analyzed in this study is also small and not representative of a user’s daily Git usage. We are continuing to study these options to get a holistic view of how they change the user experience.

Acknowledgements

A special thanks to Derrick Stolee and our Professional Services team for their efforts and sponsorship of this study!