Tag Archives: Engineering

Atom editor is now faster

Post Syndicated from Rafael Oleza original https://github.blog/2019-06-12-atom-editor-is-now-faster/

The Atom Team improved a few of Atom’s most common features—they’re now dramatically faster and ready to help you be even more productive.

Fuzzy Finder

The fuzzy finder is one of the most popular features in Atom, and it’s used by nearly every user to quickly open files by name. We wanted to make it even better by speeding up the process. Atom version 1.38 introduces an experimental fast mode that brings drastic speed improvements to any project. With this mode, indexing a medium or large project is roughly six times faster than when using the standard mode.

Medium to large projects are crawled six times quicker using fast mode.

This change also made the process to display filtered results in Atom about 11 times faster. You’ll notice the difference in speed as soon as you start searching for a query.

Medium to large projects are filtered 11 times quicker using fast mode.

What’s next

Our goal is to make the new experimental mode the default option on Atom version 1.39 when it’s released next month. Until then, switch to this mode by clicking the try experimental fast mode option in the fuzzy finder.

The next Atom version 1.39 will also introduce performance improvements to the find and replace package. We’re adding a new mode to make searching for files between 10 and 20 times faster than the current mode.

Thank you

We’d like to thank the open source community since it has allowed us to build and improve Atom over the years. More specifically, we’ve made these performance improvements by building on the shoulders of two big giants:

Try out Atom.io

The post Atom editor is now faster appeared first on The GitHub Blog.

Direct instruction marking in Ruby 2.6

Post Syndicated from Aaron Patterson original https://github.blog/2019-06-04-direct-instruction-marking-in-ruby-2-6/

We recently upgraded GitHub to use the latest version of Ruby 2.6. Ruby 2.6 contains an optimization for reducing memory usage. We’ve found it to reduce the “post-boot live heap” by about 3 percent. The “post-boot live heap” are the objects still referenced and not garbage collected after booting our Rails application, but before accepting any requests.

Ruby’s virtual machine

MRI (Matz’s Ruby Implementation) uses a stack-based virtual machine.  Each instruction manipulates a stack, and that stack can be thought of as a sort of a “scratch space”. For example, the program 3 + 5 could be represented with the instruction sequences:

push 3
push 5
add

As the virtual machine executes, each instruction manipulates the stack:

Ruby VM Stack Manipulation

Abstract Syntax Trees in Ruby

Before the virtual machine has something to execute, the code must go through a few different processing phases. The code is tokenized, parsed, turned in to an AST (or Abstract Syntax Tree), and finally the AST is converted to byte code. The byte code is what the virtual machine will eventually execute.

Let’s look at an example Ruby program:

"hello" + "world"

This code is turned into an AST, which is a tree data structure. Each node in the tree is represented internally by an object called a T_NODE object. The tree for this code will look like this:

Abstract Synatax Tree of hello + world

Some of the T_NODE objects reference literals. In this case some T_NODE objects reference the string literals “hello” and “world”. Those string literals are allocated via Ruby’s Garbage Collector. They are just like any other string in Ruby, except that they happen to be allocated as the code is being parsed.

Translating the AST to instructions

Instructions and their operands are represented as integers and can only be represented as integers. Instruction sequences for a program are just a list of integers, and it’s up to the virtual machine to interpret the meaning of those integers.

The compilation process produces the list of integers that the virtual machine will interpret. To compile a program, we simply walk the tree translating nodes and their operands to integers.  The program "hello" + "world" will result in three instructions: two push operations and one add operation. The push instructions have "hello" and "world" as operands.

There are a fixed number of instructions, and we can represent each instruction with an integer. In this case, let’s use the number 7 for push and the number 9 for add. But how can we convert the strings “hello” and “world” to integers?

Half translation of AST to instructions

In order to represent these strings in the instruction sequences, the compiler will add the address of the Ruby object that represents each string. The virtual machine knows that the operand to the push instruction is actually the address of a Ruby object, and will act appropriately. This means that the final instructions will look like this:

Full translation of AST to instructions

After the AST is fully processed, it is thrown away and only the instruction sequences remain:

Instructions after AST is gone

 

Literal liveness in Ruby 2.5

Instruction Sequences are Ruby objects and are managed via Ruby’s garbage collector. As mentioned earlier, string literals are also Ruby objects and are managed by the garbage collector. If the string literals are not marked, they could be collected, and the instruction sequences would point to an invalid address.

To prevent these literal objects from being collected, Ruby 2.5 would maintain a “mark array”. The mark array is simply a Ruby array that contains references to all literals referenced for that set of instruction sequences:

InstructionSequences with Mark Array

Both the mark array and the instructions contain references to the string literals found in the code. But the instruction sequences depend on the mark array to keep the literals from being collected.

Literal liveness in Ruby 2.6+

Ruby 2.6 introduced a patch that eliminates this mark array. When instruction sequences are marked, rather than marking an array, it disassembles the instructions and marks instruction operands that were allocated via the garbage collector. This disassembly process means that the mark array can be completely eliminated:

Instruction Sequences without Mark Array

We found that this reduced the number of live objects in our heap by three percent after the application starts.

Performance

Of course, disassembling instructions is more expensive than iterating an array. We found that only 30 percent of instruction sequences actually contain references to objects that need marking. In order to prevent needless disassembly, instructions that contain objects allocated from Ruby’s garbage collector are flagged at compile time, and only those instruction sequences are disassembled during mark time. On top of this, instruction sequence objects typically become “old”. This means that thanks to Ruby’s generational garbage collector, they are examined very infrequently. As a result, we observed memory reduction with zero cost to throughput.

Have a good day!

The post Direct instruction marking in Ruby 2.6 appeared first on The GitHub Blog.

React Native in GrabPay

Post Syndicated from Grab Tech original https://engineering.grab.com/react-native-in-grabpay

Overview

It wasn’t too long ago that Grab formed a new team, GrabPay, to improve the cashless experience in Southeast Asia and to venture into the promising mobile payments arena. To support the work, Grab also decided to open a new R&D center in Bangalore.

It was an exciting journey for the team from the very beginning, as it gave us the opportunity to experiment with new cutting edge technologies. Our first release was the GrabPay Merchant App, the first all React Native Grab App. Its success gave us the confidence to use React Native to optimize the Grab PAX app.

React Native is an open source mobile application framework. It lets developers use React (a JavaScript library for building user interfaces) with native platform capabilities. Its two big advantages are:

  • We could make cross-platform mobile apps and components completely in JavaScript.
  • Its hot reloading feature significantly reduced development time.

This post describes our work on developing React Native components for Grab apps, the challenges faced during implementation, what we learned from other internal React Native projects, and our future roadmap.

Before embarking on our work with React Native, these were the goals we set out. We wanted to:

  • Have a reusable code between Android and iOS as well as across various Grab apps (Driver app, Merchant app, etc.).
  • Have a single codebase to minimize the effort needed to modify and maintain our code long term.
  • Match the performance and standards of existing Grab apps.
  • Use as few Engineering resources as possible.

Challenges

Many Grab teams located across Southeast Asia and in the United States support the App platform. It was hard to convince all of them to add React Native as a project dependency and write new feature code with React Native. In particular, having React Native dependency significantly increases a project’s binary’s size,

But the initial cost was worth it. We now have only a few modules, all written in React Native:

  • Express
  • Transaction History
  • Postpaid Billpay

As there is only one codebase instead of two, the modules take half the maintenance resources. Debugging is faster with React Native’s hot reloading. And it’s much easier and faster to implement one of our modules in another app, such as DAX.

Another challenge was creating a universally acceptable format for a bridging library to communicate between existing code and React Native modules. We had to define fixed guidelines to create new bridges and define communication protocols between React Native modules and existing code.

Invoking a module written in React Native from a Native Module (written in a standard computer language such as Swift or Kotlin) should follow certain guidelines. Once all Grab’s tech families reached consensus on solutions to these problems, we started making our bridges and doing the groundwork to use React Native.

Foundation

On the native side, we used the Grablet architecture to add our React Native modules. Grablet gave us a wonderful opportunity to scale our Grab platform so it could be used by any tech family to plug and play their module. And the module could be in any of  Native, React Native, Flutter, or Web.

We also created a framework encapsulating all the project’s React Native Binaries. This simplified the React Native Upgrade process. Dependencies for the framework are react, react-native, and react-native-event-bridge.

We had some internal proof of concept projects for determining React Native’s performance on different devices, as discussed here. Many teams helped us make an extensive set of JS bridges for React Native in Android and iOS. Oleksandr Prokofiev wrote this bridge creation example:

publicfinalclassDeviceKitModule: NSObject, RCTBridgeModule {
 privateletdeviceKit: DeviceKitService

 publicinit(deviceKit: DeviceKitService) {
   self.deviceKit = deviceKit
   super.init()
 }
 publicstaticfuncmoduleName() -> String {
   return"DeviceKitModule"
 }
 publicfuncmethodsToExport() -> [RCTBridgeMethod] {
   let methods: [RCTBridgeMethod?] = [
     buildGetDeviceID()
     ]
   return methods.compactMap { $0 }
 }

 privatefuncbuildGetDeviceID() -> BridgeMethodWrapper? {
   returnBridgeMethodWrapper("getDeviceID", { [weakself] (_: [Any], _, resolve) in
     letvalue = self?.deviceKit.getDeviceID()
     resolve(value)
   })
 }
}

GrabPay Components and React Native

The GrabPay Merchant App gave us a good foundation for React Native in terms of

  • Component libraries
  • Networking layer and api middleware
  • Real world data for internal assessment of performance and stability

We used this knowledge to build theTransaction History and GrabPay Digital Marketplace components inside the Grab Pax App with React Native.

Component Library

We selected particularly useful components from the Merchant App codebase such as GPText, GPTextInput, GPErrorView, and GPActivityIndicator. We expanded that selection to a common (internal) component library of approximately 20 stateless and stateful components.

API Calls

We used to make api calls using axios (now deprecated). We now make calls from the Native side using bridges that return a promise and make api calls using an existing framework. This helped us remove the dependency for getting an access token from  Native-Android or Native-iOS to make the calls. Also it helped us optimize the api requests, as suggested by Parashuram from Facebook’s React Native team.

Locale

We use React Localize Redux for all our translations and moment for our date time conversion as per the device’s current Locale. We currently support translation in five languages; English, Chinese Simplified, Bahasa Indonesia, Malay, and Vietnamese. This Swift code shows how we get the device’s current Locale from the native-react Native Bridge.

public func methodsToExport() -> [RCTBridgeMethod] {
   let methods: [RCTBridgeMethod?] =  [
     BridgeMethodWrapper("getLocaleIdentifier", { (_, _, resolver) in
     letlocaleIdentifier = self.locale.getLocaleIdentifier()
     resolver(localeIdentifier)
   })]
   return methods.compactMap { $0 }
 }

Redux

Redux is an extremely lightweight predictable state container that behaves consistently in every environment. We use Redux with React Native to manage its state.

For in-app navigation we use react-navigation. It is very flexible in adapting to both the Android and iOS navigation and gesture recognition styles.

End Product

After setting up our foundation bridges and porting skeleton boilerplate code from the GrabPay Merchant App, we wrote two payments modules using GrabPay Digital Marketplace (also known as BillPay), React Native, and Transaction History.

Grab app - Selecting a company

The ios Version is on the left and the Android version is on the right. Not only do their UIs look identical, but also their code is identical. A single codebase lets us debug faster, deliver quicker, and maintain smaller (codebase; apologies but it was necessary for the joke).

Grab app - Company selected

We launched BillPay first in Indonesia, then in Vietnam and Malaysia. So far, it’s been a very stable product with little to no downtime.

Transaction History started in Singapore and is now rolling out in other countries.

Flow For BillPay

BillPay Flow

The above shows BillPay’s flow.

  1. We start with the first screen, called Biller List. It shows all the postpaid billers available for the current region. For now, we show Billers based on which country the user is in. The user selects a biller.
  2. We then asks for your customerID (or prefills that value if you have paid your bill before). The amount is either fetched from the backend or filled in by the user, depending on the region and biller type.
  3. Next, the user confirms all the entered details before they pay the dues.
  4. Finally, the user sees their bill payment receipt. It comes directly from the biller, and so it’s a valid proof of payment.

Our React Native version has kept the same experience as our Native developed App and help users pay their bills seamlessly and hassle free.

Future

We are moving code to Typescript to reduce compile-time bugs and clean up our code. In addition to reducing native dependencies, we will refactor modules as needed. We will also have 100% unit test code coverage. But most importantly, we plan to open source our component library as soon as we feel it is stable.

Preventing Pipeline Calls from Crashing Redis Clusters

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-pipeline-calls-from-crashing-redis-clusters

Introduction

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement. During this period, roughly only 1 in 21 calls to Apollo, a primary transport booking service, succeeded. This brought Grab rides down significantly for the one minute it took the Redis Cluster to self-recover. This behavior was totally unexpected and completely breached our intention of having multiple replicas.

This blog post describes Grab’s outage post-mortem findings.

Understanding the infrastructure

With Grab’s continuous growth, our services must handle large amounts of data traffic involving high processing power for reading and writing operations. To address this significant growth, reduce handler latency, and improve overall performance, many of our services use Redis – a common in-memory data structure storage – as a cache, database, or message broker. Furthermore, we use a Redis Cluster, a distributed implementation of Redis, for shorter latency and higher availability.

Apollo is our driver-side state machine. It is on almost all requests’ critical path and is a primary component for booking transport and providing great service for customer bookings. It stores individual driver availability in an AWS ElastiCache Redis Cluster, letting our booking service efficiently assign jobs to drivers. It’s critical to keep Apollo running and available 24/7.

Apollo's infrastructure

Because of Apollo’s significance, its Redis Cluster has 3 shards each with 2 slaves. It hashes all keys and, according to the hash value, divides them into three partitions. Each partition has two replications to increase reliability.

We use the Go-Redis client, a popular Redis library, to direct all written queries to the master nodes (which then write to their slaves) to ensure consistency with the database.

Master and slave nodes in the Redis Cluster

For reading related queries, engineers usually turn on the ReadOnly flag and turn off the RouteByLatency flag. These effectively turn on ReadOnlyFromSlaves in the Grab gredis3 library, so the client directs all reading queries to the slave nodes instead of the master nodes. This load distribution frees up master node CPU usage.

Client reading and writing from/to the Redis Cluster

When designing a system, we consider potential hardware outages and network issues. We also think of ways to ensure our Redis Cluster is highly efficient and available; setting the above-mentioned flags help us achieve these goals.

Ideally, this Redis Cluster configuration would not cause issues even if a master or slave node breaks. Apollo should still function smoothly. So, why did that February Apollo outage happen? Why did a single down slave node cause a 95+% call failure rate to the Redis Cluster during the dim-out time?

Let’s start by discussing how to construct a local Redis Cluster step by step, then try and replicate the outage. We’ll look at the reasons behind the outage and provide suggestions on how to use a Redis Cluster client in Go.

How to set up a local Redis Cluster

1. Download and install Redis from here.

2. Set up configuration files for each node. For example, in Apollo, we have 9 nodes, so we need to create 9 files like this with different port numbers(x).

// file_name: node_x.conf (do not include this line in file)

port 600x

cluster-enabled yes

cluster-config-file cluster-node-x.conf

cluster-node-timeout 5000

appendonly yes

appendfilename node-x.aof

dbfilename dump-x.rdb

3. Initiate each node in an individual terminal tab with:

$PATH/redis-4.0.9/src/redis-server node_1.conf

4. Use this Ruby script to create a Redis Cluster. (Each master has two slaves.)

$PATH/redis-4.0.9/src/redis-trib.rb create --replicas 2127.0.0.1:6001..... 127.0.0.1:6009

>>> Performing Cluster Check (using node 127.0.0.1:6001)

M: 7b4a5d9a421d45714e533618e4a2b3becc5f8913 127.0.0.1:6001

   slots:0-5460 (5461 slots) master

   2 additional replica(s)

S: 07272db642467a07d515367c677e3e3428b7b998 127.0.0.1:6007

   slots: (0 slots) slave

   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8

S: 65a9b839cd18dcae9b5c4f310b05af7627f2185b 127.0.0.1:6004

   slots: (0 slots) slave

   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913

M: 05363c0ad70a2993db893434b9f61983a6fc0bf8 127.0.0.1:6003

   slots:10923-16383 (5461 slots) master

   2 additional replica(s)

S: a78586a7343be88393fe40498609734b787d3b01 127.0.0.1:6006

   slots: (0 slots) slave

   replicates 72306f44d3ffa773810c810cfdd53c856cfda893

S: e94c150d910997e90ea6f1100034af7e8b3e0cdf 127.0.0.1:6005

   slots: (0 slots) slave

   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8

M: 72306f44d3ffa773810c810cfdd53c856cfda893 127.0.0.1:6002

   slots:5461-10922 (5462 slots) master

   2 additional replica(s)

S: ac6ffbf25f48b1726fe8d5c4ac7597d07987bcd7 127.0.0.1:6009

   slots: (0 slots) slave

   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913

S: bc56b2960018032d0707307725766ec81e7d43d9 127.0.0.1:6008

   slots: (0 slots) slave

   replicates 72306f44d3ffa773810c810cfdd53c856cfda893

[OK] All nodes agree about slots configuration.

5. Finally, we try to send queries to our Redis Cluster, e.g.

$PATH/redis-4.0.9/src/redis-cli -c -p 6001 hset driverID 100 state available updated_at 11111

What happens when nodes become unreachable?

Redis Cluster Server

As long as the majority of a Redis Cluster’s masters and at least one slave node for each unreachable master are reachable, the cluster is accessible. It can survive even if a few nodes fail.

Let’s say we have N masters, each with K slaves, and random T nodes become unreachable. This algorithm calculates the Redis Cluster failure rate percentage:

if T <= K:
        availability = 100%
else:
        availability = 100% - (1/(N*K - T))

If you successfully built your own Redis Cluster locally, try to kill any node with a simple command-c. The Redis Cluster broadcasts to all nodes that the killed node is now unreachable, so other nodes no longer direct traffic to that port.

If you bring this node back up, all nodes know it’s reachable again. If you kill a master node, the Redis Cluster promotes a slave node to a temp master for writing queries.

$PATH/redis-4.0.9/src/redis-server node_x.conf

With this information, we can’t answer the big question of why a single slave node failure caused an over 95% failure rate in the Apollo outage. Per the above theory, the Redis Cluster should still be 100% available. So, the Redis Cluster server could properly handle an outage, and we concluded it wasn’t the failure rate’s cause. So we looked at the client side and Apollo’s queries.

Golang Redis Cluster Client & Apollo Queries

Apollo’s client side is based on the Go-Redis Library.

During the Apollo outage, we found some code returned many errors during certain pipeline GET calls. When Apollo tried to send a pipeline of HMGET calls to its Redis Cluster, the pipeline returned errors.

First, we looked at the pipeline implementation code in the Go-Redis library. In the function defaultProcessPipeline, the code assigns each command to a Redis node in this line err:=c.mapCmdsByNode(cmds, cmdsMap).

func (c *ClusterClient) mapCmdsByNode(cmds []Cmder, cmdsMap *cmdsMap) error {
state, err := c.state.Get()
        if err != nil {
                setCmdsErr(cmds, err)
                returnerr
        }

        cmdsAreReadOnly := c.cmdsAreReadOnly(cmds)
        for_, cmd := range cmds {
                var node *clusterNode
                var err error
                if cmdsAreReadOnly {
                        _, node, err = c.cmdSlotAndNode(cmd)
                } else {
                        slot := c.cmdSlot(cmd)
                        node, err = state.slotMasterNode(slot)
                }
                if err != nil {
                        returnerr
                }
                cmdsMap.mu.Lock()
                cmdsMap.m[node] = append(cmdsMap.m[node], cmd)
                cmdsMap.mu.Unlock()
        }
        return nil
}

Next, since the readOnly flag is on, we look at the cmdSlotAndNode function. As mentioned earlier, you can get better performance by setting readOnlyFromSlaves to true, which sets RouteByLatency to false. By doing this, RouteByLatency will not take priority and the master does not receive the read commands.

func (c *ClusterClient) cmdSlotAndNode(cmd Cmder) (int, *clusterNode, error) {
        state, err := c.state.Get()
        if err != nil {
                return 0, nil, err
        }

        cmdInfo := c.cmdInfo(cmd.Name())
        slot := cmdSlot(cmd, cmdFirstKeyPos(cmd, cmdInfo))

        if c.opt.ReadOnly && cmdInfo != nil && cmdInfo.ReadOnly {
                if c.opt.RouteByLatency {
                        node, err:= state.slotClosestNode(slot)
                        return slot, node, err
                }

                if c.opt.RouteRandomly {
                        node:= state.slotRandomNode(slot)
                        return slot, node, nil
                }

                node, err:= state.slotSlaveNode(slot)
                return slot, node, err
        }

        node, err:= state.slotMasterNode(slot)
        return slot, node, err
}

Now, let’s try and better understand the outage.

  1. When a slave becomes unreachable, all commands assigned to that slave node fail.
  2. We found in Grab’s Redis library code that a single error in all cmds could cause the entire pipeline to fail.
  3. In addition, engineers return a failure in their code if err != nil. This explains the high failure rate during the outage.
func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

Our next question was, “Why did it take almost one minute for Apollo to recover?”.  The Redis Cluster broadcasts instantly to its other nodes when one node is unreachable. So we looked at how the client assigns jobs.

When the Redis Cluster client loads the node states, it only refreshes the state once a minute. So there’s a maximum one minute delay of state changes between the client and server. Within that minute, the Redis client kept sending queries to that unreachable slave node.

func (c *clusterStateHolder) Get() (*clusterState, error) {
        v := c.state.Load()
        if v != nil {
                state := v.(*clusterState)
                if time.Since(state.createdAt) > time.Minute {
                        c.LazyReload()
                }
                return state, nil
        }
        return c.Reload()
}

What happened to the write queries? Did we lose new data during that one min gap? That’s a very good question! The answer is no since all write queries only went to the master nodes and the Redis Cluster client with a watcher for the master nodes. So, whenever any master node becomes unreachable, the client is not oblivious to the change in state and is well aware of the current state. See the Watcher code.

How to use Go Redis safely?

Redis Cluster Client

One way to avoid a potential outage like our Apollo outage is to create another Redis Cluster client for pipelining only and with a true RouteByLatency value. The Redis Cluster determines the latency according to ping calls to its server.

In this case, all pipelining queries would read through the master nodesif the latency is less than 1ms (code), and as long as the majority side of partitions are alive, the client will get the expected results. More load would go to master with this setting, so be careful about CPU usage in the master nodes when you make the change.

Pipeline Usage

In some cases, the master nodes might not handle so much traffic. Another way to mitigate the impact of an outage is to check for  errors on individual queries when errors happen in a pipeline call.

In Grab’s Redis Cluster library, the function Pipeline(PipelineReadOnly) returns a response with an error for individual reply.

func (c *clientImpl) Pipeline(ctx context.Context, argsList [][]interface{}) ([]gredisapi.ReplyPair, error) {
        defer c.stats.Duration(statsPkgName, metricElapsed, time.Now(), c.getTags(tagFunctionPipeline)...)
        pipe := c.wrappedClient.Pipeline()
        cmds := make([]goredis.Cmder, len(argsList))
        for i, args := range argsList {
                cmd := goredis.NewCmd(args...)
                cmds[i] = cmd
                _ = pipe.Process(cmd)
        }
        _, _ = pipe.Exec()
        return c.wrappedClient.getResultFromCommands(cmds)
}

func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

type ReplyPair struct {
        Value interface{}
        Err   error
}

Instead of returning nil or an error message when err != nil, we could check for errors for each result so successful queries are not affected. This might have minimized the outage’s business impact.

Go Redis Cluster Library

One way to fix the Redis Cluster library is to reload nodes’ status when an error happens.In the go-redis library, defaultProcessor has this logic, which can be applied to defaultProcessPipeline.

In Conclusion

We’ve shown how to build a local Redis Cluster server, explained how Redis Clusters work, and identified its potential risks and solutions. Redis Cluster is a great tool to optimize service performance, but there are potential risks when using it. Please carefully consider our points about how to best use it. If you have any questions, please ask them in the comments section.

Loki, a dynamic mock server for HTTP/TCP testing

Post Syndicated from Grab Tech original https://engineering.grab.com/loki-dynamic-mock-server-http-tcp-testing

Background

In a previous article we introduced Mockers – an innovative tool for local box testing at Grab. Mockers used a Shift Left testing strategy, making testing more effective and cheaper for development teams. Mockers’ popularity and success motivated us to create Loki – a one-stop dynamic mock server for local box testing of mobile apps.

There are some unique challenges in mobile apps testing at Grab. End-to-end testing of an app is difficult due to high dependency on backend services and other apps. Staging environment, which hosts a plethora of backend services, is tough to manage and maintain. Issues such as staging downtime, configuration mismatches, and data corruption can affect staging adding to the testing woes. Moreover, our apps are fairly complex, utilizing multiple transport protocols such as HTTP, HTTPS, TCP for various business flows.

The business flows are also complex, requiring exhaustive set up such as credit card payments set up, location spoofing, etc resulting in high maintenance costs for automated testing. Loki simulates these flows and developers can easily test use cases that take longer to set up in a real backend staging.

Loki is our attempt to address challenges in mobile app testing by turning every developer local box into a full fledged pseudo backend environment where all mobile workflows can be tested without any external dependencies. It mocks backend services on developer local boxes, decoupling the mobile apps from real backend services, which provides several advantages such as:

No need to deploy frequently to staging

Testing is blocked if the app receives a bad response from staging. In these cases, code changes have to be deployed on staging to fix issues before resuming tests. In contrast, using Loki lets developers continue testing without any immediate need to deploy code changes to staging.

Allows parallel frontend and backend development

Loki acts as a mock backend service when the real backend is still evolving. It lets the frontend development run in parallel with backend development.

Overcome time limitations

In a one week regression-and-release scenario, testing time is limited. However, the application UI rendering and functionality still needs reasonable testing. Loki lets developers concentrate on testing in the available time instead of fixing dependencies on backend services.

Loki – Grab’s solution to simplify mobile apps testing

At Grab, we have multiple mobile apps that are dependent on each other. For example, our Passenger and Driver apps are two sides of a coin; the driver gets a job card only when a passenger requests a booking. These apps are developed by different teams, each with its own release cycle. This can make it tricky to confidently and repeatedly test the whole business flow across apps. Apps also depend on multiple backend services to execute a booking or food order and communicate over different protocols.

Here’s a look at how our mobile apps interact with backend services over different protocols:

Mobile app interaction with backend services

Loki is a dynamic mock server, written in Golang, running in a Docker container on the local box or in CI. It is easy to set up and run through standard Docker commands. In the context of mobile app testing, it plays the role of backend services, so you no longer need to set up an extensive staging environment.

The Loki architecture looks like this:

Loki architecture

The technical challenges we had to overcome

We wanted a comprehensive mocking solution so that teams don’t need to integrate multiple tools to achieve independent testing. It turned out that mocking TCP was most challenging because:

  • It is a long running client-server connection, and it doesn’t follow an HTTP-like request/response pattern.
  • Messages can be sent to the app without an incoming request as well, hence we had to expose a way via Loki to set a mock expectation which can send messages to the app without any request triggering it.
  • As TCP is a long running connection, we needed a way to delimit incoming requests so we know when we can truncate and deserialize the incoming request into JSON.

We engineered the Loki backend to support both HTTP and TCP protocols on different ports. Yet, the mock expectations are set up using RESTful APIs over HTTP for both protocols. A single point of entry for setting expectations made it more intuitive for our developers.

An in-memory cron implementation pushes scheduled messages to the app over a TCP connection. This enabled testing of complex use cases such as drivers getting new job cards, driver and passenger chat workflows, etc. The delimiter for TCP protocol is configurable at start up, so each team can decide when to truncate the request.

To enable Loki on our CI, we had to reduce its memory footprint. Hence, we built Loki with pluggable storages. MySQL is used when running on local and on CI we switch seamlessly to in-memory cache or Redis.

For testing apps locally, developers must validate complex use cases such as:

  • Payment related flows, which require the response to include the same payment ID as sent in the request. This is a case of simple mapping of request fields in the response JSON.

  • Flows requiring runtime logic execution. For example, a job card sent to a driver must have a valid timestamp, requiring runtime computation on Loki.

To support these cases and many more, we added JavaScript injection capability to Loki. So, when we set an expectation for an HTTP request/response pair or for TCP events, we can specify JavaScript for computing the dynamic response. This is executed in a sandbox by an in-house JS execution library.

Grab follows a transactional workflow for bookings. Over the life of a ride, bookings go through different statuses. So, Loki had to address multiple HTTP requests to the same endpoint returning different responses. This feature is required for successfully mocking a whole ride end-to-end.

Loki uses  an HTTP API “httpTimesAndOrder” for this feature. For example, using “httpTimesAndOrder”, you can configure the same status endpoint (/ride/status) to return different ride statuses such as “PICKING” for the first five requests, “IN_RIDE” for the next three requests, and so on.

Now, let’s look at how to use Loki to mock HTTP requests and TCP events.

Mocking HTTP requests

To mock HTTP requests, developers first point their app to send requests to the Loki mock server. Then, they set up expectations for all requests sent to the Loki mock server.

Loki mock server

For example, the Passenger app calls an HTTP dependency GET /closeby/drivers/ to get nearby drivers. To mock it with Loki, you set an expected response on the Loki mock server. When the GET /closeby/drivers/ request is actually made from the Passenger app, Loki returns the set response.

This snippet shows how to set an expected response for the GET /closeby/drivers/request:

Loki API: POST `/api/v1/expectations`

Request Body :

{
  "uriToMock": "/closeby/drivers",
  "method": "GET",
  "response": {
    "drivers": [
      1001,
      1002,
      1010
    ]
  }
}

Workflow for setting expectations and receiving responses

Workflow for setting expectations and receiving responses

Mocking TCP events

Developers point their app to Loki over a TCP connection and set up the TCP expectations. Loki then generates scheduled events such as sending push messages (job cards, notifications, etc) to the apps pointing at Loki.

For example, if the Driver app, after it starts, wants to get a job card, you can set an expectation in Loki to push a job card over the TCP connection to the Driver app after a scheduled time interval.

This snippet shows how to set the TCP expectation and schedule a push message:

Loki API: POST `/api/v1/tcp/expectations/pushmessage`

Request Body :

{
  "name": "samplePushMsg",
  "msgSequence": [
    {
      "messages": {
        "body": {
          "jobCardID": 1001
        }
      }
    },
    {
      "messages": {
        "body": {
          "jobCardID": 1002
        }
      }
    }
  ],
  "schedule": "@every 1m"
}

Workflow for scheduling a push message over TCP

Workflow for scheduling a push message over TCP

Some example use cases

Now that you know about Loki, let’s look at some example use cases.

Generating a custom response at runtime

Our first example is customizing a runtime response for both HTTP and TCP requests. This is helpful when developers need dynamic responses to requests. For example, you can add parameters from the request URL or request body to the runtime response.

It’s simple to implement this with a JavaScript function. Assume you want to embed a message parameter in the request URL to the response. To do this, you first use a POST method to set up the expectation (in JSON format) for the request on Loki:

Loki API: POST `/api/v1/feature/expectations`

Request Body :

{
  "expectations": [{
    "name": "Sample call",
    "desc": "v1/test/{name}",
    "tags": "v1/test/{name}",
    "resource": "/v1/test?name=user1",
    "verb": "POST",
    "response": {
      "body": "{ \"msg\": \"Hi \"}",
      "status": 200
    },
    "clientOptions": {
"javascript": "function main(req, resp) { var url = req.RequestURI; var captured = /name=([^&]+)/.exec(url)[1]; resp.msg =  captured ? resp.msg + captured : resp.msg + 'myDefaultValue'; return resp }"
    },
    "isActive": 1
  }]
}

When Loki receives the request, the JavaScript function used in the clientOptionskey, adds name to the response at runtime. For example, this is the request’s fixed response:

{
    "msg": "Hi "
}

But, after using the JavaScript function to add the URL parameter, the dynamic response is:

{
    "msg": "Hi user1"
}

Similarly, you can use JavaScript to add other dynamic responses such as modifying the response’s JSON array, adding parameters to push messages, etc.

Defining a response sequence for mocked API endpoints

Here’s another interesting example – defining the response sequence for API endpoints.

A response sequence is useful when you need different responses from the same API endpoint. For example, a status endpoint should return different ride statuses such as ‘allocating’, ‘allocated’, ‘picking’, etc. depending on the stage of a ride.

To do this, developers set up their HTTP expectations on Loki. Then, they easily define the response sequence for an API endpoint using a Loki POST method.

In this example:

  • times – specifies the number of times the same response is returned.
  • after – specifies one or more expectations that must match before a specified expectation is matched.

Here, the expectations are matched in this sequence when a request is made to an endpoint – Allocating > Allocated > Pickuser > Completed. Further, Completed is set to two times, so Loki returns this response two times.

Loki API: POST `/api/v1/feature/sequence`

Request Body :
  "httpTimesAndOrder": [
      {
          "name": "Allocating",
          "times": 1
      },
      {
          "name": "Allocated",
          "times": 1,
          "after": ["Allocating"]
      },
      {
          "name": "Pickuser",
          "times": 1,
          "after": ["Allocated"]
      },
      {
          "name": "Completed",
          "times": 2,
          "after": ["Pickuser"]
      }
  ]
}

In conclusion

Since Loki’s inception, we have set up a full range CI with proper end-to-end app UI tests and, to a great extent, decoupled our app releases from the staging backend. This improved delivery cycles, and we did faster bug catching and more exhaustive testing. Moreover, both developers and QAs can easily play with apps to perform exploratory testing as well as manual functional validations. Teams are also using Loki to run automated scripts (Espresso and XCUItests) for validating the mobile app pages.

Loki’s adoption is growing steadily at Grab. With our frequent release of new mobile app features, Loki helps teams meet our high quality bar and achieve huge productivity gains.

If you have any feedback or questions on Loki, please leave a comment.

Designing resilient systems beyond retries (Part 3): Architecture Patterns and Chaos Engineering

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-3

This post is the third of a three-part series on going beyond retries and circuit breakers to improve system resiliency. This whole series covers techniques and architectures that can be used as part of a strategy to improve resiliency. In this article, we will focus on architecture patterns and chaos engineering to reduce, prevent, and test resiliency.

Reducing failure through architecture patterns

Resiliency is all about preparing for and handling failure. So the most effective way to improve resiliency is undoubtedly to reduce the possible ways in which failure can occur, and several architectural patterns have emerged with this aim in mind. Unfortunately these are easier to apply when designing new systems and less relevant to existing ones, but if resiliency is still an issue and no other techniques are helping, then refactoring the system is a good approach to consider.

Idempotency

One popular pattern for improving resiliency is the concept of idempotency. Strictly speaking, an idempotent endpoint is one which always returns the same result given the same parameters, no matter how many times it is called. However, the definition is usually extended to mean it returns the results and has no side-effects, or any side-effects are only executed once. The main benefit of making endpoints idempotent is that they are always safe to retry, so it complements the retry technique to make it more effective. It also means there is less chance of the system getting into an inconsistent or worse state after experiencing failure.

If an operation has side-effects but cannot distinguish unique calls with its current parameters, it can be made to be idempotent by adding an idempotency key parameter. The classic example is money: a ‘transfer money to X’ operation may legitimately occur multiple times with the same parameters, but making the same call twice would be a mistake, so it is not idempotent. A client would not be able to retry a call that timed out, because it does not know whether or not the server processed the request. However, if the client generates and sends a unique ID as an idempotency key parameter, then it can safely retry. The server can then use this information to determine whether to process the request (if it sees the request for the first time) or return the result of the previous operation.

Using idempotency keys can guarantee idempotency for endpoints with side-effects
Using idempotency keys can guarantee idempotency for endpoints with side-effects

 

Asynchronous responses

A second pattern is making use of asynchronous responses. Rather than relying on a successful call to a dependency which may fail, a service may complete its own work and return a successful or partial response to the client. The client would then have to receive the response in an alternate way, either by polling (‘pull’) until the result is ready or the response being ‘pushed’ from the server when it completes.

From a resiliency perspective, this guarantees that the downstream errors do not affect the endpoint. Furthermore, the risk of the dependency causing latency or consuming resources goes away, and it can be retried in the background until it succeeds. The disadvantage is that this works against the ‘fail fast’ principle, since the call might be retried indefinitely without ever failing. It might not be clear to the client what to do in this case.

Not all endpoints have to be made asynchronous, and the decision to be synchronous or not could be made by the endpoint dynamically, depending on the service health. Work that can be made asynchronous is known as deferrable work, and utilizing this information can save resources and allow the more critical endpoints to complete. For example, a fraud system may decide whether or not a newly registered user should be allowed to use the application, but such decisions are often complex and costly. Rather than slow down the registration process for every user and create a poor first impression, the decision can be made asynchronously. When the fraud-decision system is available, it picks up the task and processes it. If the user is then found to be fraudulent, their account can be deactivated at that point.

Preventing disaster through chaos engineering

It is famously understood that disaster recovery is worthless unless it’s tested regularly. There are dozens of stories of employees diligently performing backups every day only to find that when they actually needed to restore from it, the backups were empty. The same thing applies to resiliency, albeit with less spectacular consequences.

The emerging best practice for testing resiliency is chaos engineering. This practice, made famous by Netflix’s Chaos Monkey, is the idea of deliberately causing parts of a system to fail in order to test (and subsequently improve) its resiliency. There are many different kinds of chaos engineering that vary in scope, from simulating an outage in an entire AWS region to injecting latency into a single endpoint. A chaos engineering strategy may include multiple types of failure, to build confidence in the ability of various parts of the system to withstand failure.

Chaos engineering has evolved since its inception, ironically becoming less ‘chaotic’, despite the name. Shutting off parts of a system without a clear plan is unlikely to provide much value, but is practically guaranteed to frustrate your customers – and upper management! Since it is recommended to experiment on production, minimizing the blast radius of chaos experiments, at least at the beginning, is crucial to avoid unnecessary impact to the system.

Chaos experiment process

The basic process for conducting a chaos experiment is as follows:

  1. Define how to measure a ‘steady state’, in order to confirm that the system is currently working as expected.
  2. Decide on a ‘control group’ (which does not change) and an ‘experiment group’ from the pool of backend servers.
  3. Hypothesize that the steady state will not change during the experiment.
  4. Introduce a failure in one component or aspect of the system in the control group, such as the network connection to the database.
  5. Attempt to disprove the hypothesis by analyzing the difference in metrics between the control and experiment groups.

If the hypothesis is disproved, then the parts of the system which failed are candidates for improvement. After making changes, the experiments are run again, and gradually confidence in the system should improve.

Chaos experiments should ideally mimic real-world scenarios that could actually happen, such as a server shutting down or a network connection being disconnected. These events do not necessarily have to be directly related to failure – ordinary events such as auto-scaling or a change in server hardware or VM type can be experimented with, as they could still potentially affect the steady state.

Finally, it is important to automate as much of the chaos experiment process as possible. From setting up the control group to starting the experiment and measuring the results, to automatically disabling the experiment if the impact to production has exceeded the blast radius, the investment in automating them will save valuable engineering time and allow for experiments to eventually be run continuously.

Conclusion

Retries are a useful and important part of building resilient software systems. However, they only solve one part of the resiliency problem, namely recovery. Recovery via retries is only possible under certain conditions and could potentially exacerbate a system failure if other safeguards aren’t also in place. Some of these safeguards and other resiliency patterns have been discussed in this article.

The excellent Hystrix library combines multiple resiliency techniques, such as circuit-breaking, timeouts and bulkheading, in a single place. But even Hystrix cannot claim to solve all resiliency issues, and it would not be wise to rely on a single library completely. However, just as it can’t be recommended to only use Hystrix, suddenly introducing all of the above patterns isn’t advisable either. There is a point of diminishing returns with adding more; more techniques means more complexity, and more possible things that could go wrong.

Rather than implement all of the resiliency patterns described above, it is recommended to selectively apply patterns that complement each other and cover existing gaps that have previously been identified. For example, an existing retry strategy can be enhanced by gradually switching to idempotent endpoints, improving the coverage of API calls that can be retried.

A microservice architecture is a good foundation for building a resilient system, but it requires careful planning and implementation to achieve. By identifying the possible ways in which a system can fail, then evaluating and applying the tried-and-tested patterns to withstand them, a reliable system can become one that is truly resilient.

I hope you found this series useful. Comments are always welcome.

Designing resilient systems beyond retries (Part 2): Bulkheading, Load Balancing, and Fallbacks

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-2

This post is the second of a three-part series on going beyond retries to improve system resiliency. We’ve previously discussed about rate-limiting as a strategy to improve resiliency. In this article, we will cover these techniques: bulkheading, load balancing, and fallbacks.

Introducing Bulkheading (Isolation)

Bulkheading is a fundamental pattern which underpins many other resiliency techniques, especially where microservices are concerned, so it’s worth introducing first. The term actually comes from an ancient technique in ship building, where a ship’s hull would be partitioned into several watertight compartments. If one of the compartments has a leak, then the water fills just that compartment and is contained, rather than flooding the entire ship. We can apply this principle to software applications and microservices: by isolating failures to individual components, we can prevent a single failure from cascading and bringing down the entire system.

Bulkheads also help to prevent single points of failure, by reducing the impact of any failures so services can maintain some level of service.

Level of bulkheads

It is important to note that bulkheads can be applied at multiple levels in software architecture. The two highest levels of bulkheads are at the infrastructure level, and the first is hardware isolation. In a cloud environment, this usually means isolating regions or availability zones. The second is isolating the operating system, which has become a widespread technique with the popularity of virtual machines and now containerization. Previously, it was common for multiple applications to run on a single (very powerful) dedicated server. Unfortunately, this meant that a rogue application could wreak havoc on the entire system in a number of ways, from filling the disk with logs to consuming memory or other resources.

Isolation can be achieved by applying bulkheading at multiple levels
Isolation can be achieved by applying bulkheading at multiple levels

 

This article focuses on resiliency from the application perspective, so below the system level is process-level isolation. In practical terms, this isolation prevents an application crash from affecting multiple system components. By moving those components into separate processes (or microservices), certain classes of application-level failures are prevented from causing cascading failure.

At the lowest level, and perhaps the most common form of bulkheading to software engineers, are the concepts of connection pooling and thread pools. While these techniques are commonly employed for performance reasons (reusing resources is cheaper than acquiring new ones), they also help to put a finite limit on the number of connections or concurrent threads that an operation is allowed to consume. This ensures that if the load of a particular operation suddenly increases unexpectedly (such as due to external load or downstream latency), the impact is contained to only a partial failure.

Bulkheading support in the Hystrix library

The Hystrix library for Go supports a form of bulkheading through its MaxConcurrentRequests parameter. This is conveniently tied to the circuit name, meaning that different levels of isolation can be achieved by choosing an appropriate circuit name. A good rule of thumb is to use a different circuit name for each operation or API call. This ensures that if just one particular endpoint of a remote service is failing, the other circuits are still free to be used for the remaining healthy endpoints, achieving failure isolation.

Load balancing

Global rate-limiting with a central server
Global rate-limiting with a central server

 

Load balancing is where network traffic from a client may be served by one of many backend servers. You can think of load balancers as traffic cops who distribute traffic on the road to prevent congestion and overload. Assuming the traffic is distributed evenly on the network, this effectively increases the computing power of the backend. Adding capacity like this is a common way to handle an increase in load from the clients, such as when a website becomes more popular.

Almost always, load balancers provide high availability for the application. When there is just a single backend server, this server is a ‘single point of failure’, because if it is ever unavailable, there are no servers remaining to serve the clients. However, if there is a pool of backend servers behind a load balancer, the impact is reduced. If there are 4 backend servers and only 1 is unavailable, evenly distributed requests would only fail 25% of the time instead of 100%. This is already an improvement, but modern load balancers are more sophisticated.

Usually, load balancers will include some form of a health check. This is a mechanism that monitors whether servers in the pool are ‘healthy’, ie. able to serve requests. The implementations for the health check vary, but this can be an active check such as sending ‘pings’, or passive monitoring of responses and removing the failing backend server instances.

As with rate-limiting, there are many strategies for load balancing to consider.

There are four main types of load balancer to choose from, each with their own pros and cons:

  • Proxy. This is perhaps the most well-known form of load-balancer, and is the method used by Amazon’s Elastic Load Balancer. The proxy sits on the boundary between the backend servers and the public clients, and therefore also doubles as a security layer: the clients do not know about or have direct access to the backend servers. The proxy will handle all the logic for load balancing and health checking. It is a very convenient and popular approach because it requires no special integration with the client or server code. They also typically perform ‘SSL termination’, decrypting incoming HTTPS traffic and using HTTP to communicate with the backend servers.
  • Client-side. This is where the client performs all of the load-balancing itself, often using a dedicated library built for the purpose. Compared with the proxy, it is more performant because it avoids an extra network ‘hop.’ However, there is a significant cost in developing and maintaining the code, which is necessarily complex and any bugs have serious consequences.
  • Lookaside. This is a hybrid approach where the majority of the load-balancing logic is handled by a dedicated service, but it does not proxy; the client still makes direct connections to the backend. This reduces the burden of the client-side library but maintains high performance, however the load-balancing service becomes another potential point of failure.
  • Service mesh with sidecar. A service mesh is an all-in-one solution for service communication, with many popular open-source products available. They usually include a sidecar, which is a proxy that sits on the same server as the application to route network traffic. Like the traditional proxy load balancer, this handles many concerns of load-balancing for free. However, there is still an extra network hop, and there can be a significant development cost to integrate with existing systems for logging, reporting and so on, so this must be weighed against building a client-side solution in-house.
Comparison of load-balancer architectures
Comparison of load-balancer architectures

 

Grab’s load-balancing implementation

At Grab, we have built our own internal client-side solution called CSDP, which uses the distributed key-value store etcd as its backend store.

Fallbacks

There are scenarios when simply retrying a failed API call doesn’t work. If the remote server is completely down or only returning errors, no amount of retries are going to help; the failure is unrecoverable. When recovery isn’t an option, mitigation is an alternative. This is related to the concept of graceful degradation: sometimes it is preferable to return a less optimal response than fail completely, especially for user-facing applications where user experience is important.

One such mitigation strategy is fallbacks. This is a broad topic with many different sub-strategies, but here are a few of the most common:

Fail silently

Starting with the easiest to implement, one basic fallback strategy is fail silently. This means returning an empty or null response when an error is encountered, as if the call had succeeded. If the data being requested is not critical functionality then this can be considered: missing part of a UI is less noticeable than an error page! For example, UI bubbles showing unread notifications are a common feature. But if the service providing the notifications is failing and the bubble shows 0 instead of N notifications, the user’s experience is unlikely to be significantly affected.

Local computation

A second fallback strategy when a downstream dependency is failing could be to compute the value locally instead. This could mean either returning a default (static) value, or using a simple formula to compute the response. For example, a marketplace application might have a service to calculate shipping costs. If it is unavailable, then using a default price might be acceptable. Or even $0 – users are unlikely to complain about errors that benefit them, and it’s better than losing business!

Cached values

Similarly, cached values are often used as fallbacks. If the service isn’t available to calculate the most up to date value, returning a stale response might be better than returning nothing. If an application is already caching the value with a short expiration to optimize performance, it can be reused as a fallback cache by setting two expiration times: one for normal circumstances, and another when the service providing the response has failed.

Backup service

Finally, if the response is too complex to compute locally or if major functionality of the application is required to have a fallback, then an entirely new service can act as a fallback; a backup service. Such a service is a big investment, so to make it worthwhile some trade-offs must be accepted. The backup service should be considerably simpler than the service it is intended to replace; if it is too complex then it will require constant testing and maintenance, not to mention documentation and training to make sure it is well understood within the engineering team. Also, a complex system is more likely to fail when activated. Usually such systems will have very few or no dependencies, and certainly should not depend on any parts of the original system, since they could have failed, rendering the backup system useless.

Grab’s fallback implementation

At Grab, we make use of various fallback strategies in our services. For example, our microservice framework Grab-Kit has built-in support for returning cached values when a downstream service is unresponsive. We’ve even built a backup service to replicate our core functionality, so we can continue to serve customers despite severe technical difficulties!

Up next, Architecture Patterns and Chaos Engineering…

We’ve covered various techniques in designing reliable and resilient systems in the previous articles. I hope you found them useful. Comments are always welcome.

In our next post, we will look at ways to prevent and reduce failures through architecture patterns and testing.

Please stay tuned!

Designing resilient systems beyond retries (Part 1): Rate-Limiting

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-1

This post is the first of a three-part series on going beyond retries to improve system resiliency. In this series, we will discuss other techniques and architectures that can be used as part of a strategy to improve resiliency. To start off the series, we will cover rate-limiting.

Software engineers aim for reliability. Systems that have predictable and consistent behaviour in terms of performance and availability. In the electricity industry, reliability may equate to being able to keep the lights on. But just because a system has remained reliable up until a certain point, does not mean that it will continue to be. This is where resiliency comes in: the ability to withstand or recover from problematic conditions or failure. Going back to our electricity analogy – resiliency is the ability to turn the lights back on quickly when say, a natural disaster hits the power grid.

Why we value resiliency

Being resilient to many different failures is the best way to ensure a system is reliable and – more importantly – stays that way. At Grab, our architecture features hundreds of microservices, which is constantly stressed in an increasing number of different ways at higher and higher volumes. Failures that would be rare or unusual become more likely as our scale increases. For that reason, we proactively focus on – and require our services to think about – resiliency, even if they have historically been very reliable.

As software systems evolve and become more complex, the number of potential failure modes that software engineers have to account for grows. Fortunately, so too have the techniques for dealing with them. The circuit-breaker pattern and retries are two such techniques commonly employed to improve resiliency specifically in the context of distributed systems. In pursuit of reliability, this is a fine start, but it would be wrong to assume that this will keep the service reliable forever. This article will discuss how you can use rate-limiting as part of a strategy to improve resilience, beyond retries.

Challenges with retries and circuit breakers

A common risk when introducing retries in a resiliency strategy is ‘retry storms’. Retries by definition increase the number of requests from the client, especially when the system is experiencing some kind of failure. If the server is not prepared to handle this increase in traffic, and is possibly already struggling to handle the load, it can quickly become overwhelmed. This is counter-productive to introducing retries in the first place!

When using a circuit-breaker in combination with retries, the application has some form of safety net: too many failures and the circuit will open, preventing the retry storms. However, this can be dangerous to rely on. For one thing, it assumes that all clients have the correct circuit-breaker configurations. Knowing how to configure the circuit-breaker correctly is difficult because it requires knowledge of the downstream service’s configurations too.

Introducing rate-limiting

In a large organization such as Grab with hundreds of microservices, it becomes increasingly difficult to coordinate and maintain the correct circuit-breaker configurations as the number of services increases.

Secondly, it is never a good idea for the server to depend on its clients for resiliency. The circuit-breaker could fail or simply be bypassed, and the server would have to deal with all requests the client makes.

It is therefore desirable to have some form of rate-limiting/throttling as another line of defense. There are many strategies for rate-limiting to consider.

Types of thresholds for rate-limiting

The traditional approach to rate-limiting is to implement a server-side check which monitors the rate of incoming requests and if it exceeds a certain threshold, an error will be returned instead of processing the request. There are many algorithms such as ‘leaky bucket’, fixed/sliding window and so on. A key decision is where to set the thresholds: usually by client, endpoint, or a combination of both.

Rate-limiting by client or user account is the approach taken by many public APIs: Each client is allowed to make a certain number of requests over a period, say 1000 requests per hour, and once that number is exceeded then their requests will be rejected until the time window resets. In this approach, the server must ensure that it has enough capacity (or can scale adequately) to handle the maximum allowed number of requests for each client. If new clients are added frequently, the overhead of maintaining and adjusting the limits may be significant. However, it can be a good way to guarantee a service-level agreement (SLA) with your clients.

An alternative to per-client thresholds is to use per-endpoint thresholds. This limit is applied across all clients and can be set according to the server’s true capacity using benchmarks. Compared with per-client limits this is easier to configure and more reliable in preventing the server from becoming overloaded. However, one misbehaving client may be able to consume the entire quota, blocking other clients of the service.

A rate-limiting strategy may use different levels of thresholds, and this is the best approach to get the benefits of both per-client and per-endpoint thresholds. For example, the following rules might be applied (in order):

  • Per-client, per-endpoint: For example, client A accessing the sendEmail endpoint. It is not necessary to configure thresholds at this granularity, but may be useful for critical endpoints.
  • Per-client: In addition to any per-client per-endpoint settings, client A could have a global threshold of 1000 requests/hour to any API.
  • Per-endpoint: This is the server’s catch-all guard to guarantee that none of its endpoints become overloaded. If client limits are properly configured, this limit should never be reached.
  • Server-wide: Finally, a limit on the number of requests a server can handle in total. This is important because even if endpoints can meet their limits individually, they are never completely isolated: the server will have some overhead and limited resources for processing any kind of request, opening and closing network connections etc.

Local vs global rate-limiting

Another consideration is local vs global rate-limiting. As we saw in the previous section, backend servers are usually pooled together for resiliency. A naive rate-limiting solution might be implemented at the individual server instance level. This sounds intuitive because the thresholds can be calculated exactly according to the instance’s computing power, and it scales automatically as the number of instances increases. However, in a microservice architecture, this is rarely correct as the bottlenecks are unlikely to be so closely tied to individual instance hardware.

More often, the capacity is reached when a downstream resource is exhausted, such as a database, a third-party service or another microservice. If the rate-limiting is only enforced at the instance level, when the service scales, the pressure on these resources will increase and quickly overload them. Local rate-limiting’s effectiveness is limited.

Global rate-limiting on the other hand monitors thresholds and enforces limits across the entire backend server pool. This is usually achieved through the use of a centralized rate-limiting service to make the decisions about whether or not requests should be allowed to go through. While this is much more desirable, implementing such a service is not without challenges.

Considerations when implementing rate-limiting

Care must be taken to ensure the rate-limiting service does not become a single point of failure. The system should still function when the rate-limiter itself is experiencing problems (perhaps by falling back to a local limiter). Since the rate-limiter must be in the request path, it should not add significant latency because any latency would be multiplied across every endpoint being monitored. Grab’s own Quotas service is an example of a global rate-limiter which addresses these concerns.

Global rate-limiting with a central server
Global rate-limiting with a central server. The servers send information about the request volumes, and the rate-limiting service responds with the rate-limiting decisions. This is done asynchronously to avoid introducing a point of failure.

 

Generally, it is more important to implement rate-limiting at the server side. This is because, once again, assuming that clients have correct implementation and configurations is risky. However, there is a case to be made for rate-limiting on the client as well, especially if the clients can be trusted or share a common SDK.

With server-side limiting, the server still has to accept the initial connection, process the rate-limiting logic and return an appropriate error response. With sufficient load, this overhead can be enough to render the system unresponsive; an unintentional denial-of-service (DoS) effect.

Client-side limiting can be implemented by using a central service as described above or, more commonly, utilizing response headers from the server. In this approach, the server response may include information about the client’s remaining quota and/or a timestamp at which the quota is reset. If the client implements logic for these headers, it can avoid sending requests at all if it knows they will be rate-limited. The disadvantage of this is that the client-side logic becomes more complex and another possible source of bugs, so this cost has to be considered against the simpler server-only method.

Up next, Bulkheading, Load Balancing, and Fallbacks…

So we’ve taken a look at rate-limiting as a strategy for having resilient systems. I hope you found this article useful. Comments are always welcome.

In our next post, we will look at the other resiliency techniques such as bulkheading (isolation), load balancing, and fallbacks.

Please stay tuned!

Context Deadlines and How to Set Them

Post Syndicated from Grab Tech original https://engineering.grab.com/context-deadlines-and-how-to-set-them

At Grab, our microservice architecture involves a huge amount of network traffic and inevitably, network issues will sometimes occur, causing API calls to fail or take longer than expected. We strive to make such incidents a non-event, by designing with the expectation of such incidents in mind. With the aid of Go’s context package, we have improved upon basic timeouts by passing timeout information along the request path. However, this introduces extra complexity, and care must be taken to ensure timeouts are configured in a way that is efficient and does not worsen problems. This article explains from the ground up a strategy for configuring timeouts and using context deadlines correctly, drawing from our experience developing microservices in a large scale and often turbulent network environment.

Timeouts

Timeouts are a fundamental concept in computer networking. Almost every kind of network communication will have some kind of timeout associated with it, often configurable with a parameter. The idea is to place a time limit on some event happening, often a network response; after the limit has passed, the operation is aborted rather than waiting indefinitely. Examples of useful places to put timeouts include connecting to a database, making a HTTP request or on idle connections in a pool.

Figure 1.1: How timeouts prevent long API calls
Figure 1.1: How timeouts prevent long API calls

 

Timeouts allow a program to continue where it otherwise might hang, providing a better experience to the end user. Often the default way for programs to handle timeouts is to return an error, but this doesn’t have to be the case: there are several better alternatives for handling timeouts which we’ll cover later.

While they may sound like a panacea, timeouts must be configured carefully to be effective: too short a timeout will result in increased errors from a resource which could still be working normally, and too long a timeout will risk consuming excess resources and a poor user experience. Furthermore, timeouts have evolved over time with new concepts such as Go’s context package, and the trend towards distributed systems has raised the stakes: timeouts are more important, and can cause more damage if misused!

Why timeouts are useful

In the context of microservices, timeouts are useful as a defensive measure against misbehaving or faulty dependencies. It is a guarantee that no matter how badly the dependency is failing, your call will never take longer than the timeout setting (for example 1 second). With so many other things to worry about, that’s a really nice thing to have! So there’s an instant benefit to your service’s resiliency, even if you do nothing more than set the timeout.

However, a service can choose what to do when it encounters a timeout, which can make them even more useful. Generally there are three options:

  1. Return an error. This is the simplest, but unless you know there is error handling upstream, this can actually deliver the worst user experience.
  2. Return a fallback value. We can return a default value, a cached value, or fall back to a simpler computed value. Depending on the circumstances, this can offer a better user experience.
  3. Retry. In the best case, a retry will succeed and deliver the intended response to the caller, albeit with the added timeout delay. However, there are other complexities to consider for retries to be effective. For a full discussion on this topic, see Circuit Breaker vs Retries Part 1and Circuit Breaker vs Retries Part 2.

At Grab, our services tend towards using retries wherever possible, to make minor errors as transparent as possible.

The main advantage of timeouts is that they give your service time to do something else, and this should be kept in mind when considering a good timeout value: not only do you want to allow the remote call time to complete (or not), but you need to allow enough time to handle the potential timeout as well.

Different types of timeouts

Not all timeouts are the same. There are different types of timeouts with crucial differences in semantics, and you should check the behaviour of the timeout settings in the library or resource you’re using before configuring them for production use.

In Go, there are three common classes of timeouts:

  • Network timeouts: These come from the net package and apply to the underlying network connection. These are the best to use when available, because you can be sure that the network call has been cancelled when the call returns to your function.
  • Context timeouts: Context is discussed later in this article, but for now just note that these timeouts are propagated to the server. Since the server is aware of the timeout, it can avoid wasted effort by abandoning computation after the timeout is reached.
  • Asynchronous timeouts: These occur when a goroutine is executed and abandoned after some time. This does not automatically cancel the goroutine (you can’t really cancel goroutines without extra handling), so it risks leaking the goroutine and other resources. This approach should be avoided in production unless combined with some other measures to provide cancellation or avoid leaking resources.

Dangers of poor timeout configuration for microservice calls

The benefits of using timeouts are enticing, but there’s no free lunch: relying on timeouts too heavily can lead to disastrous cascading failure scenarios. Worse, the effects of a poor timeout configuration often don’t become evident until it’s too late: it’s peak hour, traffic just reached an all-time high and… all your services froze up at the same time. Not good.

To demonstrate this effect, imagine a simple 3-service architecture where each service naively uses a default timeout of 1 second:

Figure 1.2: Example of how incorrect timeout configuration causes cascading failure
Figure 1.2: Example of how incorrect timeout configuration causes cascading failure

 

Service A’s timeout does not account for the fact that Service B calls C. If B itself is experiencing problems and takes 800ms to complete its work, then C effectively only has 200ms to complete before service A gives up. But since B’s timeout to C is also 1s, that means that C could be wasting up to 800ms of computational effort that ‘leaks’ – it has no chance of being used. Both B and C are blissfully unaware at first that anything is wrong – they happily return successful responses that A never receives!

This resource leak can soon be catastrophic, though: since the calls from B to A are timing out, A (or A’s clients) are likely to retry, causing the load on B to increase. This in turn causes the load on C to increase, and eventually all services will stop responding.

The same thing happens if B is healthy but C is experiencing problems: B’s calls to C will build up and cause B to become overloaded and fail too. This is a common cause of cascading failure.

How to set a good timeout

Given the importance of correctly configuring timeout values, the question remains as to how to decide upon a ‘correct’ timeout value. If the timeout is for an API call to another service, a good place to start would be that service’s service-level agreements (SLAs). Often SLAs are based on latency percentiles, which is a value below which a given percentage of latencies fall. For example, a system might have a 99th percentile (also known as P99) latency of 300ms; this would mean that 99% of latencies are below 300ms. A high-order percentile such as P99 or even P99.9 can be used as a ballpark worst-case value.

Let’s say a service (B)’s endpoint has a 99th percentile latency of 600ms. Setting the timeout for this call at 600ms would guarantee that no calls take longer than 600ms, while returning errors for the rest and accepting an error rate of at most 1% (assuming the service is keeping to their SLA). This is an example of how the timeout can be combined with information about latencies to give predictable behaviour.

This idea can be taken further by considering retries too. If the median latency for this service is 50ms, then you could introduce a retry of 50ms for an overall timeout of 50ms + 600ms = 650ms:

Service B

Service B P99 latency SLA = 600ms

Service B median latency = 50ms

Service A

Request timeout = 600ms

Number of retries = 1

Retry request timeout = 50ms

Overall timeout = 50ms+600ms = 650ms

Chance of timeout after retry = 1% * 50% = 0.5%

Figure 1.3: Example timeout configuration settings based on latency data

 

This would still cut off the top 1% of latencies, while optimistically making another attempt for the median latency. This way, even for the 1% of calls that encounter a timeout, our service would still expect to return a successful response within 650ms more than half the time, for an overall success rate of 99.5%.

Context propagation

Go officially introduced the concept of context in Go 1.7, as a way of passing request-scoped information across server boundaries. This includes deadlines, cancellation signals and arbitrary values. Let’s ignore the last part for now and focus on deadlines and cancellations. Often, when setting a regular timeout on a remote call, the server side is unaware of the timeout. Even if the server is notified indirectly when the client closes the connection, it’s still not necessarily clear whether the client timed out or encountered another issue. This can lead to wasted resources, because without knowing the client timed out, the server often carries on regardless. Context aims to solve this problem by propagating the timeout and context information across API boundaries.

Figure 1.4: Context propagation cancels work on B and C
Figure 1.4: Context propagation cancels work on B and C

 

Server A sets a context timeout of 1 second. Since this information spans the entire request and gets propagated to C, C is always aware of the remaining time it has to do useful work – work that won’t get discarded. The remaining time can be defined as (1 – b), where b is the amount of time that server B spent processing before calling C. When the deadline is exceeded, the context is immediately cancelled, along with any child contexts that were created from the parent.

The context timeout can be a relative time (eg. 3 seconds from now) or an absolute time (eg. 7pm). In practice they are equivalent, and the absolute deadline can be queried from a timeout created with a relative time and vice-versa.

Another useful feature of contexts is cancellation. The client has the ability to cancel the request for any reason, which will immediately signal the server to stop working. When a context is cancelled manually, this is very similar to a context being cancelled when it exceeds the deadline. The main difference is the error message will be ‘context cancelled’ instead of ‘context deadline exceeded’. This is a common cause of confusion, but context cancelled is always caused by an upstream client, while deadline exceeded could be a deadline set upstream or locally.

The server must still listen for the ‘context done’ signal and implement cancellation logic, but at least it has the option of doing so, unlike with ordinary timeouts. The most common reason for cancelling a request is because the client encountered an error and no longer needs the response that the server is processing. However, this technique can also be used in request hedging, where concurrent duplicate requests are sent to the server to decrease the impact of an individual call experiencing latency. When the first response returns, the other requests are cancelled because they are no longer needed.

Context can be seen as ‘distributed timeouts’ – an improvement to the concept of timeouts by propagating them. But while they achieve the same goal, they introduce other issues that must be considered.

Context propagation and timeout configuration

When propagating timeout information via context, there is no longer a static ‘timeout’ setting per call. This can complicate debugging: even if the client has correctly configured their own timeout as above, a context timeout could mean that either the remote downstream server is slow, or that an upstream client was slow and there was insufficient time remaining in the propagated context!

Let’s revisit the scenario from earlier, and assume that service A has set a context timeout of 1 second. If B is still taking 800ms, then the call to C will time out after 200ms. This changes things completely: although there is no longer the resource leak (because both B and C will terminate the call once the context timeout is exceeded), B will have an increase in errors whereas previously it would not (at least until it became overloaded). This may be worse than completing the request after A has given up, depending on the circumstances. There is also a dangerous interaction with circuit breakers which we will discuss in the next section.

If allowing the request to complete is preferable than cancelling it even in the event of a client timeout, the request should be made with a new context decoupled from the parent (ie. context.Background()). This will ensure that the timeout is not propagated to the remote service. When doing this, it is still a good idea to set a timeout, to avoid waiting indefinitely for it to complete.

Context and circuit-breakers

A circuit-breaker is a software library or function which monitors calls to external resources with the aim of preventing calls which are likely to fail, ‘short-circuiting’ them (hence the name). It is a good practice to use a circuit-breaker for all outgoing calls to dependencies, especially potentially unreliable ones. But when combined with context propagation, that raises an important question: should context timeouts or cancellation cause the circuit to open?

Let’s consider the options. If ‘yes’, this means the client will avoid wasting calls to the server if it’s repeatedly hitting the context timeout. This might seem desirable at first, but there are drawbacks too.

Pros:

  • Consistent behaviour with other server errors
  • Avoids making calls that are unlikely to succeed
  • It is obvious when things are going wrong
  • Client has more time to fall back to other behaviour
  • More lenient on misconfigured timeouts because circuit-breaking ensures that subsequent calls will fail fast, thus avoiding cascading failure

Cons:

  • Unpredictable
  • A misconfigured upstream client can cause the circuit to open for all other clients
  • Can be misinterpreted as a server error

It is generally better not to open the circuit when the context deadline set upstream is exceeded. The only timeout allowed to trigger the circuit-breaker should be the request timeout of the specific call for that circuit.

Pros:

  • More predictable
  • Circuit depends mostly on server health, not client
  • Clients are isolated

Cons:

  • May be confusing for clients who expect the circuit to open
  • Misconfigured timeouts are more likely to waste resources

Note that the above only applies to propagated contexts. If the context only spans a single individual call, then it is equivalent to a static request timeout, and such errors should cause circuits to open.

How to set context deadlines

Let’s recap some of the concepts covered in this article so far:

  • Timeouts are a time limit on an event taking place, such as a microservice completing an API call to another service.
  • Request timeouts refer to the timeout of a single individual request. When accounting for retries, an API call may include several request timeouts before completing successfully.
  • Context timeouts are introduced in Go to propagate timeouts across API boundaries.
  • A context deadline is an absolute timestamp at which the context is considered to be ‘done’, and work covered by this context should be cancelled when the deadline is exceeded.

Fortunately, there is a simple rule for correctly configuring context timeouts:

The upstream timeout must always be longer than the total downstream timeouts including retries.

The upstream timeout should be set at the ‘edge’ server and cascade throughout.

In our scenario, A is the edge server. Let’s say that B’s timeout to C is 1s, and it may retry at most once, after a delay of 500ms. The appropriate context timeout (CT) set from A can be calculated as follows:

CT(A) = (timeout to C * number of attempts) + (retry delay * number of retries)

CT(A) = (1s * 2) + (500ms * 1) = 2,500ms

Figure 1.5: Formula for calculating context timeouts
Figure 1.5: Formula for calculating context timeouts

 

Extra time can be allocated for B’s processing time and to allow B to return a fallback response if appropriate.

Note that if A configures its timeout according to this rule, then many of the above issues disappear. There are no wasted resources, because B and C are given the maximum time to complete their requests successfully. There is no chance for B’s circuit-breaker to open unexpectedly, and cascading failure is mostly avoided: a failure in C will be handled and be returned by B, instead of A timing out as well.

A possible alternative would be to rely on context cancellation: allow A to set a shorter timeout, which cancels B and C if the timeout is exceeded. This is an acceptable approach to avoiding cascading failure (and cancellation should be implemented in any case), but it is less optimal than configuring timeouts according to the above formula. One reason is that there is no guarantee of the downstream services handling the timeout gracefully; as mentioned previously, the service must explicitly check for ctx.Done() and this is rarely followed in practice. It is also impractical to place checks at every point in the code, so there could be a considerable delay between the client cancellation and the server abandoning the processing.

A second reason not to set shorter timeouts is that it could lead to unexpected errors on the downstream services. Even if B and C are healthy, a shorter context timeout could lead to errors if A has timed out. Besides the problem of having to handle the cancelled requests, the errors could create noise in the logs, and more importantly could have been avoided. If the downstream services are healthy and responding within their SLA, there is no point in timing out earlier. An exception might be for the edge server (A) to allow for only 1 attempt or fewer retries than the downstream service actually performs. But this is tricky to configure and weakens the resiliency. If it is desirable to shorten the timeouts to decrease latency, it is better to start adjusting the timeouts of the downstream resources first, starting from the innermost service outwards.

A model implementation for using context timeouts in calls between microservices

We’ve touched on several useful concepts for improving resiliency in distributed systems: timeouts, context, circuit-breakers and retries. It is desirable to use all of them together in a good resiliency strategy. However, the actual implementation is far from trivial; finding the right order and configuration to use them effectively can seem like searching for the holy grail, and many teams go through a long process of trial and error, continuously improving their implementation. Let’s try to formally put together an ideal implementation, step by step.

Note that the code below is not a final or production-ready implementation. At Grab we have developed independent circuit-breaker and retry libraries, with many settings that can be configured for fine-tuning. However, it should serve as a guide for writing resilient client libraries.

Step 1: Context propagation

Context propagation code

The skeleton function signature includes a context object as the first parameter, which is the best practice intended by Google. We check whether the context is already done before proceeding, in which case we ‘fail fast’ without wasting any further effort.

Step 2: Create child context with request timeout

Child context with request timeout code

Our service has no control over the parent context. Indeed, it could have no deadline at all! Therefore it’s important to create a new context and timeout for our own outgoing request as well, using WithTimeout. It is mandatory to call the returned cancel function to ensure the context is properly cancelled and avoid a goroutine leak.

Step 3: Introduce circuit-breaker logic

Introduce circuit-breaker logic code

Next, we wrap our call to the external service in a circuit-breaker. The actual circuit-breaker implementation has been omitted for brevity, but there are two important points to consider:

  • It should only consider opening the circuit-breaker when requestTimeout is reached, not on ctx.Done().
  • The circuit name should ideally be unique for this specific endpoint
Introduce circuit-breaker logic code - 2

Step 4: Introduce retries

The last step is to add retries to our request in the case of error. This can be implemented as a simple for loop, but there are some key things to include in a complete retry implementation:

  • ctx.Done() should be checked after each retry attempt to avoid wasting a call if the client has given up.
  • The request context should be cancelled before the next retry to avoid duplicate concurrent calls and goroutine leaks.
  • Not all kinds of requests should be retried.
  • A delay should be added before the next retry, using exponential backoff.
  • See Circuit Breaker vs Retries Part 2 for a thorough guide to implementing retries.

Step 5: The complete implementation

Complete implementation

And here we have arrived at our ‘ideal’ implementation of an external call including context handling and propagation, two levels of timeout (parent and request), circuit-breaking and retries. This should be sufficient for a good level of resiliency, avoiding wasted effort on both the client and server.

As a future enhancement, we could consider introducing a ‘minimum time per request’, which the retry loop should use to check for remaining time as well as ctx.Done() (but not instead – we need to account for client cancellation too). Of course metrics, logging and error handling should also be added as necessary.

Important Takeaways

To summarise, here are a few of the best practices for working with context timeouts:

Use SLAs and latency data to set effective timeouts

Having a default timeout value for everything doesn’t scale well. Use available information on SLAs and historic latency to set timeouts that give predictable results.

Understand the common error messages

The context canceled (context.Canceled) error occurs when the context is manually cancelled. This automatically cancels any child contexts attached to the parent. It is rare for this error to surface on the same service that triggered the cancellation; if cancel is called, it is usually because another error has been detected (such as a timeout) which would be returned instead. Therefore, context canceled is usually caused by an upstream error: either the client timed out and cancelled the request, or cancelled the request because it was no longer needed, or closed the connection (this typically results in a cancelled context from Go libraries).

The context deadline exceeded error occurs only when the time limit was reached. This could have been set locally (by the server processing the request) or by an upstream client. Unfortunately, it’s often difficult to distinguish between them, although they should generally be handled in the same way. If a more granular error is required, it is recommended to use child contexts and explicitly check them for ctx.Done(), as shown in our model implementation.

Check for ctx.Done() before starting any significant work

Don’t enter an expensive block of code without checking the context; if the client has already given up, the work will be wasted.

Don’t open circuits for context errors

This leads to unpredictable behaviour, because there could be a number of reasons why the context might have been cancelled. Only context errors due to request timeouts originating from the local service should lead to circuit-breaker errors.

Set context timeouts at the edge service, using a cascading timeout budget

The upstream timeout must always be longer than the total downstream timeouts. Following this formula will help to avoid wasted effort and cascading failure.

In Conclusion

Go’s context package provides two extremely valuable tools that complement timeouts: deadline propagation and cancellation. This article has shown the benefits of using context timeouts and how to correctly configure them in a multi-server request path. Finally, we have discussed the relationship between context timeouts and circuit-breakers, proposing a model implementation for integrating them together in a common library.

If you have a Go server, chances are it’s already making heavy use of context. If you’re new to Go or had been confused by how context works, hopefully this article has helped to clarify misunderstandings. Otherwise, perhaps some of the topics covered will be useful in reviewing and improving your current context handling or circuit-breaker implementation.

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Post Syndicated from Grab Tech original https://engineering.grab.com/peak-shift-demand-travel-trends

Credits: Photo by rawpixel on Unsplash

 

Stuck in traffic in a Grab ride? Pass the time by opening your Grab app and checking out the Feed – just scroll down! You’ll find widgets for games, polls, videos, news and even food recommendations!

Beyond serving your everyday needs, we want to provide our users with information that is interesting, useful and relevant. That’s why we’re always coming up with new widgets.

Building each widget takes close collaboration across multiple different teams – from Product Management to Design, Engineering, Behavioral Science, and Data Science and Analytics. Sounds like a lot of people, doesn’t it? But you’ll be surprised to hear that this behind-the-scenes collaboration works rapidly, usually in the span of one month! Which means we’re often moving from ideation phase to product release in just a few weeks.

Travel Trends Widget

This fast-and-furious process is anchored on one word – “customer-centric”. And that’s how it all began with our  “Travel Trends Widget” – a widget that provides passengers with an overview of historical supply and demand trends for their current location and nearby time periods.

Because we had so much fun developing this widget, we wanted to write a blog post to share with you what we did and how we did it!

Inspiration: Where it all started

Transport demand can be rather lumpy. Owing to organic patterns (e.g. office hours), a lot of passengers tend to request for cars around the same time. In periods like this, the increase in demand could outpace the arrival of driver supply, increasing the waiting time for passengers.

Our goal at Grab is to make sure people get a ride when they want it and at the price they want, so we got to thinking about how we can ease this friction by leveraging our treasure trove – Big Data! – to help our passengers better plan their trips.

As we were looking at the data, we noticed that there is a seasonality to demand and supply: at certain times and days, imbalances appear, peak and disappear, and the process repeats itself. Studies say that humans in general, unless shown a compelling reason or benefit for change, are habitual beings subject to inertia. So we set out to achieve exactly that: To create a widget to surface information to our passengers that may help them alter their decisions on when they choose to book a ride, thereby redistributing some of the present peak demands to periods just before and after peak – also known as “peak shifting the demand”!

While this widget is the first-of-its-kind in the ride-hailing industry, “peak-shifting” was actually coined and introduced long ago!

London Transport Museum Trends

As you can see from this post from the London Transport Museum (Source: Transport for London), London tube tried peak-shifting long before anyone else: Original Ad from 1928 displayed on the left, and Ad from 2015 displayed on the right, comparing the trends to 1928.

Trends from a hotel in Beijing

You may also have seen something similar at the last hotel you stayed at. Notice here a poster in an elevator at a Beijing hotel, announcing the best times to eat breakfast in comfort and avoid the crowd. (Photo credits to Prashant, our Product Manager, who saw this on holiday.)

To apply “peak-shifting” and help our users better plan their trips, we decided to dig in and leverage our data. It was way more complex than we had initially thought, as market conditions could be different on different days. This meant that  generic statements like “5PM-8PM are peak hours and prices will be hight” would not hold true. Contrary to general perception, we observed that even during peak hours, there are buckets of time when there is no surge or low surge.

For instance, plot 1 and plot 2 below shows how a typical Monday and Tuesday surge looks like in a given month respectively. One of the key insights is that the surge trends during peak hour is different on Monday from Tuesday. It reinforces our initial hypothesis that every day is unique.

So we used machine learning techniques to build a forecasting widget which can help our users and give them the power to plan their trips beforehand. This widget is able to provide the pricing trends for the next 2 hours. So with a bit of flexibility, riders can ride the tide!

Grab trends

So how exactly does this widget work?!

Historical trends for Monday

It pulls together historically observed imbalances between supply and demand, for the consumer’s current location and nearby time periods. Aggregated data is displayed to consumers in easily interpreted visualisations, so that they can plan to leave at times when there are more supply, and with potentially more savings for fares.

How did we build the widget? Loop, agile working process, POC & workstream

Widget-building is an agile, collaborative, and simultaneous process. First, we started the process with analysis from Product Analytics team, pulling out data on traffic trends, surge patterns, and behavioral insights of both passengers and drivers in Singapore.

When we noticed the existence of seasonality for each day of the week, we came up with more precise analytical and business questions to dig deeper into the data. Upon verification of hypotheses, we decided that we will build a widget.

Then joined the Behavioural Science, UX (User Experience) Design and the Product Management teams, who started giving shape to the problem we are solving. Our Behavioural Scientists shared their expertise on how information, suggestions and choices should be presented to enable easy assimilation and beneficial action. Daily whiteboarding breakouts, endless back-and forth conversations, and a healthy amount of challenge-and-accept culture ensured that we distilled the idea down to its core. We then presented the relevant information with just the right level of detail, and with the right amount of messaging, to allow users to take the intended action i.e. shift his/her demand outside of peak periods if possible.

Our amazing regional Copywriting team then swung in to put our intent into words in 7 different languages for our users across South-East Asia. Simultaneously, our UX designers and Full-stack Engineers started exploring the best visual components to communicate data on time trends to users. More on this later, but suffice to say that plenty of ideas were explored and discarded in a collaborative process, which aimed to create something that’s intuitive and engaging while being robust and scalable to work across all types of devices.

While these designs made their way up to engineering, the Data Science team worked on finding the most rigorous method to deduce the historical trend of surge across all our cities and areas, and time periods within them. There were discussions on how to best store and update this data reliably so that the widget itself can access it with great performance.

Soon after, we went into the development process, and voila! We had the first iteration of the widget ready on our staging (internal testing) servers in just 2 weeks! This prototype was opened up to the core team for influx of feedback.

And just two weeks later, the widget made its way to our Singapore and Jakarta Feeds, accessible to the world at large! Feedback from our users started pouring in almost immediately (thanks to the rich feedback functionality that comes with each widget), ranging from great to sometimes not-so-great, and we listened to all of it with a keen ear! And thus began a new cycle of iterations and continuous improvement, more of which we will share in a subsequent post.

In the trenches with the creators: How multiple teams got together to make this come true

Various disciplines within our cross functional team came together to whip out this widget by quipping their expertise to the end product.

Using Behavioural Science to simplify choices and design good outcomes

Behavioural Science helped to explore many facets of consumer behaviour in order to plan and design the widget: understanding how consumers think and conceptualizing a widget that can be easily understood and used by the consumers.

While fares are governed entirely by market conditions, it’s important for us to explain the economics to customers. As a customer-centric company, we aim to make the consumers feel like they own their decisions, which they can take based on full information. And this is the role of Behavioral Scientists at Grab!

In guiding the customers through the information, Behavioural Science team had the following three objectives in mind while building this Travel Trends widget:

  1. Offer transparency on the fares: By exposing our historic surge levels for a 4 hour period, we wanted to ensure that the passenger is aware of the surge levels and does not treat the fare as a nasty shock.
  2. Give information that helps them plan: By showing them surge levels for the future 2 hours, we wanted to help customers who have the flexibility, plan for a better time, hence, giving them the power to decide based on transparent information.
  3. Provide helpful tips: Every bar gives users tips on the conditions at that time and the immediate future. For instance, a low surge bar, followed by a high surge bar gives the tip “Psst… Leave now, It might get busy later!”, helping people understand the graph better and nudging them to take an action. If you are interested in saving fares, may we suggest tapping around all the bars to reveal the secret pro-tips?

Designing interfaces that lead to consumer success by abstracting complexity

Design team is the one behind the colors and shapes that make up the widget that you see and interact with! The team took inspiration from Google’s Popular Times.

Source/Credits: Google Live Popular Times
Source/Credits: Google Live Popular Times

 

Right from the offset, our content and product teams were keen to surface additional information and actions with each bar to keep the widget interactive and useful. One of the early challenges was to arrive at the right gesture that invites the user to interact and intuitively navigate the bars on the widget but also does not conflict with other gestures (eg scrolling and scrubbing) that the user was pre-trained to perform on the feed. We found out that tapping was simultaneously an unused and yet intuitive gesturethat we could use for interaction with the bars.

We then went into rounds of iteration on the visual design of the widget. In this process, multiple stakeholders were involved ranging from Product to Content to Engineering. We had to overcome a number of constraints i.e. the limited canvas of a widget and the context of a user when she is exploring the feed. By re-using existing libraries and components, we managed to keep the development light and ship something fast.

GrabCar trends near you

Dozens of revisions and four iterations later, we landed with a design that we felt equipped the feature for its user-facing goal, and did so in a manner which was aesthetically appealing!

And finally we managed to deliver on the feature’s goal, by surfacing just the right detail of information in a manner that is intuitive yet effective to peak-shift demand.  

Bringing all of this to fruition through high performance engineering

Our Development Engineering team was in charge of developing the widget and making it available to our users in just a few weeks’ time – materialising the work of the other teams.

One of their challenges was to find the best way to process the vast amount of data (millions of database entries) so it can be visualized simply as bar charts. Grab’s engineers had to achieve this while making sure performance is as resilient as possible.

There were two options in doing this:

a) Fetch the data directly from the DB for each API call; or

b) Store the data in an in-memory data structure on a timely basis, so when a user calls the API will no longer have to hit the DB.

After considering that this feature will likely expect a lot of traffic thus high QPS, we decided that the former option would be too costly. Ultimately, we chose the latter option since it is more performant and more scalable.

At the frontend, the challenge was to cater to the intricate request from our designers. We use chart libraries to increase our development speed, and not all of the requirements were readily supported by these libraries.

For instance, let’s say this library makes visualising charts easy, but not so much for customising them. If designers wanted to have an average line in a dotted form, the library did not support this so easily. Also, the moving arrow pointers as you move between bar chart, changing colors of the bars changes when clicked – all required countless CSS tweaks.

CSS tweak on trends widget
CSS tweak on trends widget

Closing the product loop with user feedback and data driven insights

One of the most crucial parts of launching any product is to ensure that customers are engaging with the widget and finding it useful.

To understand what customers think about the widget, whether they find it useful and whether it is helping them to plan better,  we delved into the huge mine of clickstream data.

User feedback on the trends widget

We found that 1 in 3 users who make a booking everyday interact with the widget. And of these people, more than 70% users have given positive rating for the widget. This validates our initial hypothesis that if given an option, our customers will love the freedom to plan their trips and inculcate more transparent ecosystem.

These users also indicate the things they like most about the widget. 61% of users gave positive rating for usefulness, 20% were impressed by the design (Kudos to our fantastic designer Ajmal!!) and 13% for usability.

Tweet about the widget

Beyond internal data, our widget made some rounds on social media channels. For Example, here is screenshot of what our users have to say on Twitter.

We closely track these metrics on user engagement and feedback to ensure that we keep improving and coming up with new iterations which helps us to serve our customers in a better way.

Conclusion

We hope you enjoyed reading about how we went from ideation, through iterations to a finished widget in the hands of the user, all in 1 month! Many hands helped along the way. If you are interested in joining this hyper-proactive problem-solving team, please check out Grab’s career site!

And if you have feedback for us, we are here to listen! While we cannot be happier to see some positive reaction from the public, we are also thrilled to hear your suggestions and advice. Please leave us a memo using the Widget’s comment function!

Epilogue

We just released an upgrade to this widget which allows users to set reminders and be notified about availability of good fares in a time period of their choosing. We will keep a watch and come knocking! Go ahead, find the widget on your Grab feed, set a reminder and save on fares on your next ride!

Vulcanizer: a library for operating Elasticsearch

Post Syndicated from GitHub Engineering original https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/

At GitHub, we use Elasticsearch as the main technology backing our search services. In order to administer our clusters, we use ChatOps via Hubot. As of 2017, those commands were a collection of Bash and Ruby-based scripts.

Although this served our needs for a time, it was becoming increasingly apparent that these scripts lacked composability and reusability. It was also difficult to contribute back to the community by open sourcing any of these scripts due to the fact they are specific to bespoke GitHub infrastructure.

Why build something new?

There are plenty of excellent Elasticsearch libraries, both official and community driven. For Ruby, GitHub has already released the Elastomer library and for Go we make use of the Elastic library by user olivere. However, these libraries focus primarily on indexing and querying data. This is exactly what an application needs to use Elasticsearch, but it’s not the same set of tools that operators of an Elasticsearch cluster need. We wanted a high-level API that corresponded to the common operations we took on a cluster, such as disabling allocation or draining the shards from a node. Our goal was a library that focused on these administrative operations and that our existing tooling could easily use.

Full speed ahead with Go…

We started looking into Go and were inspired by GitHub’s success with freno and orchestrator.

Go’s structure encourages the construction of composable (self-contained, stateless, components that can be selected and assembled) software, and we saw it as a good fit for this application.

… Into a wall

We initially scoped the project out to be a packaged chat app and planned to open source only what we were using internally. During implementation, however, we ran into a few problems:

  • GitHub uses a simple protocol based on JSON-RPC over HTTPS called ChatOps RPC. However, ChatOps RPC is not widely adopted outside of GitHub. This would make integration of our application into ChatOps infrastructure difficult for most parties.
  • The internal REST library our ChatOps commands relied on was not open sourced. Some of the dependencies of this REST library would also need to be open sourced. We’ve started the process of open sourcing this library and its dependencies, but it will take some time.
  • We relied on Consul for service discovery, which not everyone uses.

Based on these factors we decided to break out the core of our library into a separate package that we could open source. This would decouple the package from our internal libraries, Consul, and ChatOps RPC.

The package would only have a few goals:

  • Access the REST endpoints on a single host.
  • Perform an action.
  • Provide results of the action.

This module could then be open sourced without being tied to our internal infrastructure, so that anyone could use it with the ChatOps infrastructure, service discovery, or tooling they choose.

To that end, we wrote vulcanizer.

Vulcanizer

Vulcanizer is a Go library for interacting with an Elasticsearch cluster. It is not meant to be a full-fledged Elasticsearch client. Its goal is to provide a high-level API to help with common tasks that are associated with operating an Elasticsearch cluster such as querying health status of the cluster, migrating data off of nodes, updating cluster settings, and more.

Examples of the Go API

Elasticsearch is great in that almost all things you’d want to accomplish can be done via its HTTP interface, but you don’t want to write JSON by hand, especially during an incident. Below are a few examples of how we use Vulcanizer for common tasks and the equivalent curl commands. The Go examples are simplified and don’t show error handling.

Getting nodes of a cluster

You’ll often want to list the nodes in your cluster to pick out a specific node or to see how many nodes of each type you have in the cluster.

$ curl localhost:9200/_cat/nodes?h=master,role,name,ip,id,jdk
- mdi vulcanizer-node-123 172.0.0.1 xGIs 1.8.0_191
* mdi vulcanizer-node-456 172.0.0.2 RCVG 1.8.0_191

Vulcanizer exposes typed structs for these types of objects.

v := vulcanizer.NewClient("localhost", 9200)

nodes, err := v.GetNodes()

fmt.Printf("Node information: %#v\n", nodes[0])
// Node information: vulcanizer.Node{Name:"vulcanizer-node-123", Ip:"172.0.0.1", Id:"xGIs", Role:"mdi", Master:"-", Jdk:"1.8.0_191"}

Update the max recovery cluster setting

The index recovery speed is a common setting to update when you want balance time to recovery and I/O pressure across your cluster. The curl version has a lot of JSON to write.

$ curl -XPUT localhost:9200/_cluster/settings -d '{ "transient": { "indices.recovery.max_bytes_per_sec": "1000mb" } }'
{
"acknowledged": true,
"persistent": {},
"transient": {
"indices": {
"recovery": {
"max_bytes_per_sec": "1000mb"
}
}
}
}

The Vulcanizer API is fairly simple and will also retrieve and return any existing setting for that key so that you can record the previous value.

v := vulcanizer.NewClient("localhost", 9200)
oldSetting, newSetting, err := v.SetSetting("indices.recovery.max_bytes_per_sec", "1000mb")
// "50mb", "1000mb", nil

Move shards on to and off of a node

To safely update a node, you can set allocation rules so that data is migrated off a specific node. In the Elasticsearch settings, this is a comma-separated list of node names, so you’ll need to be careful not to overwrite an existing value when updating it.

$ curl -XPUT localhost:9200/_cluster/settings -d '
{
"transient" : {
"cluster.routing.allocation.exclude._name" : "vulcanizer-node-123,vulcanizer-node-456"
}
}'

The Vulcanizer API will safely add or remove nodes from the exclude settings so that shards won’t allocate on to a node unexpectedly.

v := vulcanizer.NewClient("localhost", 9200)

// Existing exclusion settings:
// vulcanizer-node-123,vulcanizer-node-456

exclusionSetttings1, err := v.DrainServer("vulcanizer-node-789")
// vulcanizer-node-123,vulcanizer-node-456,vulcanizer-node-789

exclusionSetttings2, err := v.FillOneServer("vulcanizer-node-456")
// vulcanizer-node-123,vulcanizer-node-789

Command-line application

Included is a small CLI application that leverages the library:

$ vulcanizer -h
Usage:
  vulcanizer [command]

Available Commands:
  allocation  Set shard allocation on the cluster.
  drain       Drain a server or see what servers are draining.
  fill        Fill servers with data, removing shard allocation exclusion rules.
  health      Display the health of the cluster.
  help        Help about any command
  indices     Display the indices of the cluster.
  nodes       Display the nodes of the cluster.
  setting     Interact with cluster settings.
  settings    Display all the settings of the cluster.
  snapshot    Interact with a specific snapshot.
  snapshots   Display the snapshots of the cluster.

Flags:
  -c, --cluster string      Cluster to connect to defined in config file
  -f, --configFile string   Configuration file to read in (default to "~/.vulcanizer.yaml")
  -h, --help                help for vulcanizer
      --host string         Host to connect to (default "localhost")
  -p, --port int            Port to connect to (default 9200)

Use "vulcanizer [command] --help" for more information about a command.

ChatOps

ChatOps is important for GitHub and our geographically distributed workforce. Vulcanizer enables us to build ChatOps tooling around Elasticsearch quickly and easily for common tasks:

jessbreckenridge
jessbreckenridge

.vulcan-go health testbed
hubot
Hubot

The cluster is green: All primary and replica shards are allocated. This does NOT mean the cluster is otherwise healthy.
+--------+------------+------+------------+----------+
| STATUS | RELOCATING | INIT | UNASSIGNED | ACTIVE % |
+--------+------------+------+------------+----------+
| green  | 0          | 0    | 0          | 100      |
+--------+------------+------+------------+----------+
jessbreckenridge
jessbreckenridge

.vulcan-go nodes testbed
hubot
Hubot

testbed nodes
+--------+------+------------------------+-----------+-----------+
| MASTER | ROLE |          NAME          |    JDK    |      IP   |
+--------+------+------------------------+-----------+-----------+
| *      | mi   | search-arbiter-aaaaaaa | 1.8.0_111 | 172.0.0.1 |
| -      | mi   | search-arbiter-bbbbbbb | 1.8.0_111 | 172.0.0.2 |
| -      | mi   | search-arbiter-ccccccc | 1.8.0_111 | 172.0.0.3 |
| -      | di   | search-storage-ddddddd | 1.8.0_111 | 172.0.0.4 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.5 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.6 |
+--------+------+------------------------+-----------+-----------+
nickcanz
nickcanz

.vulcan-go drain status testbed
hubot
Hubot

IP's filtered: None
Name Excluded: None
Host Excluded: None
nickcanz
nickcanz

.vulcan-go drain server search-storage-zzzzzzz testbed
hubot
Hubot

Server "search-storage-zzzzzzz" added to list of servers to drain.

Complete list servers draining:
    Ips: None
    Names: search-storage-zzzzzzz
    Hosts: None
nickcanz
nickcanz

.vulcan-go settings testbed
hubot
Hubot

testbed settings
No persistent settings are set.
+------------------------------------------+------------------------+
|            TRANSIENT SETTINGS            |         VALUE          |
+------------------------------------------+------------------------+
| cluster.routing.allocation.exclude._name | search-storage-zzzzzzz |
+------------------------------------------+------------------------+

Closing

We stumbled a bit when we first started down this path, but the end result is best for everyone:

  • Since we had to regroup about what exact functionality we wanted to open source, we made sure we were providing value to ourselves and the community instead of just shipping something.
  • Internal tooling doesn’t always follow engineering best practices like proper release management, so developing Vulcanizer in the open provides an external pressure to make sure we follow all of the best practices.
  • Having all of the Elasticsearch functionality in its own library allows our internal applications to be very slim and isolated. Our different internal applications have a clear dependency on Vulcanizer instead of having different internal applications depend on each other or worse, trying to get ChatOps to talk to other ChatOps.

Visit the Vulcanizer repository to clone or contribute to the project. We have ideas for future development in the Vulcanizer roadmap.

Authors


The post Vulcanizer: a library for operating Elasticsearch appeared first on The GitHub Blog.

Structured Logging: The Best Friend You’ll Want When Things Go Wrong

Post Syndicated from Grab Tech original https://engineering.grab.com/structured-logging

Introduction

Everyday millions of people around Southeast Asia count on Grab to get themselves or what they need from point A to B in a safe, comfortable and reliable manner. In fact, just very recently we crossed our 3 billion transport rides milestone, gaining the last billion in just a mere 6 months!

We take this responsibility very seriously, and as we continue to grow and expand, it’s important for us to maintain a sophisticated backend system that is capable of sustaining the kind of scale needed to support all our customers in Southeast Asia. This backend system is comprised of multiple services that interact with each other in many different ways. As Grab evolves, maintaining them becomes a significantly larger and harder task as developers continuously develop new features.

To maintain these systems well, it’s important to have better observability; data that helps us better understand what is happening in the system by having good monitoring (metrics), event logs, and tracing for request scope data. Out of these, logs provide the most complete picture of what happened within the system – and is typically the first and most engaged point of contact. With good logs, the backend becomes much easier to understand, maintain, and debug. Without logs or with bad logs – we have a recipe for disaster; making it nearly impossible to understand what’s happening.

In this article, we focus on a form of logging called structured logging. We discuss what it is, why is it better, and how we built a framework that integrates well with our current Elastic stack-based logging backend, allowing us to do logging better and more efficiently.

Structured Logging is a part of a larger endeavour which will enable us to reduce the Mean Time To Resolve (MTTR), helping developers to mitigate issues faster when outages happen.

What are Logs?

Logs are lines of texts containing some information about some event that occurred in our system, and they serve a crucial function of helping us understand what’s happening in the backend. Logs are usually placed at points in the code where a significant event has happened (for example, some database operation succeeded or a passenger got assigned to a driver) or at any other place in the code that we are interested in observing.

The first thing that a developer would normally do when an error is reported is check the logs – sort of like walking through the history of the system and finding out what happened. Therefore, logs can be a developer’s best friend in times of service outages, errors, and failed builds.

Logs in today’s world have varying formats and features.

  • Log Format: These range from simple key-value based (like syslog) to quite structured and detailed (like JSON). Since logs are mostly meant for developer eyes, how detailed or structured a log is dictates how fast the developer can query the logs, as well as read them. The more structured the data is – the larger the size is per log line, although it’s more queryable and contains richer information.
  • Levelled Logging (or Log Levels): Logs with different severities can be logged at different levels. The visibility can be limited to a single level, limiting all logs only with a certain severity or above (for example, only logs WARN and above). Usually log levels are static in production environments, and finding DEBUG logs usually requires redeploying.
  • Log Aggregation Backend: Logs can have different log aggregation backends, which means different backends (i.e. Splunk, Kibana, etc.) decide what your logs might look like or what you might be able to do with them. Some might cost a lot more than others.
  • Causal Ordering: Logs might or might not preserve the exact time in which they are written. This is important, as how exact the time is dictates how accurately we can predict the sequence of events via logs.
  • Log Correlation: We serve countless requests from our backend services. Being able to see all the logs relevant to a particular request or a particular event helps us drill down to relevant  information for a specific request (e.g. for a specific passenger trying to book a ride).

Combine this with the plethora of logging libraries available and you easily have a developer who is holding his head in confusion, unable to decide what to use. Also, each library has their own set of advantages and disadvantages, so the discussion might quickly become subjective and polarized – therefore it is crucial that you choose the appropriate library and backend pair for your applications.

We at Grab use different types of logging libraries. However, as requirements changed  – we also found ourselves re-evaluating our logging strategy.

The State of Logging at Grab

The number of Golang services at Grab has continuously grown. Most services used syslog-style key-value format logs, recognized as the most common format of logs for server-side applications due to its simplicity and ease for reading and writing. All these logs were made possible by a handful of common libraries, which were directly imported and used by different services.

We used a cloud-based SaaS vendor as a frontend for these logs, where application-emitted logs were routed to files and sent to our logging vendor, making it possible to view and query them in real time. Things were pretty great and frictionless for a long time.

However, as time went by, our logging bills started mounting to unprecedented levels and we found ourselves revisiting and re-evaluating how we did logging. A few issues surfaced:

  • Logging volume reduction efforts were successful to some extent – but were arduous and painful. Part of the reason was that almost all the logs were at a single log level – INFO.
Figure 1: Log Level Usage
Figure 1: Log Level Usage

 

This issue was not limited to a single service, but pervasive across services. For mitigation, some services added sampling to logs, some removed logs altogether. The latter is only a recipe for disaster, so it was known that we had to improve levelled logging.

  • The vendor was expensive for us at the time and also had a few concerns – primarily with limitations around DSL (query language). There were many good open source alternatives available – Elastic stack to name one. Our engineers felt confident that we could probably manage our logging infrastructure and manage the costs better – which led to the proposal and building of Elastic stack logging cluster. Elasticsearch is vastly more powerful and rich than our vendor at the time and our current libraries weren’t enough to fully leverage its capabilities, so we needed a library which can leverage structure in logs better and easily integrate with Elastic stack.
  • There were some minor issues in our logging libraries namely:
    • Singleton initialisation pattern that made unit-testing harder
    • Single logger interface that reduced the possibility of extending the core logging functionality as almost all the services imported the logger interface directly
    • No out-of-the-box support for multiple writers
  • If we were to write a library, we had to fix these issues – and also encourage usage of best practices.

  • Grab’s critical path (number of services traversed by a single booking flow request) has grown in size. On average, a single booking request touches multiple microservices – each of which does something different. At the large scale at which we operate, it’s necessary therefore to easily view logs from all the services for a single request – however this was not something which was done automatically by the library. Hence, we also wanted to make log correlation easier and better.
  • Logs are events which happened at some point of time. The order in which these events occurred gives us a complete history of what happened in the system. However, the core logging library which formed the base of the logging across our Golang services didn’t preserve the log generation time (it instead used write time). This led to jumbling of logs which are generated in a span of a few microseconds – which not only makes the lives of our developers harder, but makes it near impossible to get an exact history of the system. This is why we wanted to also improve and enable causal ordering of logs – one of the key steps in understanding what’s happening in the system.

Why Change?

As mentioned, we knew there were issues with how we were logging. To best approach the problem and be able to solve it as much as possible without affecting existing infrastructure and services, it was decided to bootstrap a new library from the ground up. This library would solve known issues, as well as contain features which would not have been possible by modifying existing libraries. For a recap, here’s what we wanted to solve:

  • Improve levelled logging
  • Leverate structure in logs better
  • Easily integrate with Elastic stack
  • Encourage usage of best practices
  • Make log correlation easier and better
  • Improve and enable causal ordering of logs for a better understanding of service distribution

Enter Structured Logging. Structured Logging has been quite popular around the world, finding widespread adoption. It was easily integrable with our Elastic stack backend and would also solve most of our pain points.

Structured Logging

Keeping our previous problems and requirements in mind, we bootstrapped a library in Golang, which has the following features:

Dynamic Log Levels

This allows us to change our initialized log levels at runtime from a configuration management system – something which was not possible and encouraged before.

This makes the log levels actually more meaningful now –  developers can now deploy with the usual WARN or INFO log levels, and when things go wrong, just with a configuration change they can update the log level to DEBUG and make their services output more logs when debugging. This also helps us keep our logging costs in check. We made support for integrating this with our configuration management system easy and straightforward.

Consistent Structure in Logs

Logs are inherently unstructured unlike database schema, which is rigid, or a freeform text, which has no structure. Our Elastic stack backend is primarily based on indices (sort of like tables) with mapping (sort of like a loose schema). For this, we needed to output logs in JSON with a consistent structure (for example, we cannot output integer and string under the same JSON field because that will cause an indexing failure in Elasticsearch). Also, we were aware that one of our primary goals was keeping our logging costs in check, and since it didn’t make sense to structure and index almost every field – adding only the structure which is useful to us made sense.

For addressing this, we built a utility that allows us to add structure to our logs deterministically. This is built on top of a schema in which we can add key-value pairs with a specific key name and type, generate code based on that – and use the generated code to make sure that things are consistently formatted and don’t break. We called this schema (a collection of key name and type pairs) the Common Grab Log Schema (CGLS). We only add structure to CGLS which is important – everything included in CGLS gets formatted in the different field and everything else gets formatted in a single field in the generated JSON. This helps keeps our structure consistent and easily usable with Elastic stack.

Figure 2: Overview of Common Grab Log Schema for Golang backend services
Figure 2: Overview of Common Grab Log Schema for Golang backend services

Plug and Play support with Grab-Kit

We made the initialization and use easy and out-of-the-box with our in-house support for Grab-Kit, so developers can just use it without making any drastic changes. Also, as part of this integration, we added automatic log correlation based on request IDs present in traces, which ensured that all the logs generated for a particular request already have that trace ID.

Configurable Log Format

Our primary requirement was building a logger expressive and consistent enough to integrate with the Elastic stack backend well – without going through fancy log parsing in the downstream. Therefore, the library is expressive and configurable enough to allow any log format (we can write different log formats for different future use cases. For example, readable format in development settings and JSON output in production settings), with a default option of JSON output. This ensures that we can produce log output which is compatible with Elastic stack, but still be configurable enough for different use cases.

Support for Multiple Writes with Different Formats

As part of extending the library’s functionality, we needed enough configurability to be able to send different logs to different places at different settings. For example, sending FATAL logs to Slack asynchronously in some readable format, while sending all the usual logs to our Elastic stack backend. This library includes support for chaining such “cores” to any arbitrary degree possible – making sure that this logger can be used in such highly specialized cases as well.

Production-like Logging Environment in Development

Developers have been seeing console logs since the dawn of time, however having structured JSON logs which are only meant for production logs and are more searchable provides more power. To leverage this power in development better and allow developers to directly see their logs in Kibana, we provide a dockerized version of Kibana which can be spun up locally to accept structured logs. This allows developers to directly use the structured logs and see their logs in Kibana – just like production!

Having this library enabled us to do logging in a much better way. The most noticeable impact was that our simple access logs can now be queried better – with more filters and conditions.

Figure 3: Production-like Logging Environment in Development
Figure 3: Production-like Logging Environment in Development

Causal Ordering

Having an exact history of events makes debugging issues in production systems easier – as one can just look at the history and quickly hypothesize what’s wrong and fix it. To this end, the structured logging library adds the exact write timestamp in nanoseconds in the logger. This combined with the structured JSON-like format makes it possible to sort all the logs by this field – so we can see logs in the exact order as they happened – achieving causal ordering in logs. This is an underplayed but highly powerful feature that makes debugging easier.

Figure 4: Causal ordering of logs with Y'ALL
Figure 4: Causal ordering of logs with Y’ALL

But Why Structured Logging?

Now that you know about the history and the reasons behind our logging strategy, let’s discuss the benefits that you reap from it.

On the outset, having logs well-defined and structured (like JSON) has multiple benefits, including but not limited to:

  • Better root cause analysis: With structured logs, we can ingest and perform more powerful queries which won’t be possible with simple unstructured logs. Developers can do more informative queries on finding the logs which are relevant to the situation. Not only this, log correlation and causal ordering make it possible to gain a better understanding of the distributed logs. Unlike unstructured data, where we are only limited to full-text or a handful of log types, structured logs take the possibility to a whole new level.
  • More transparency or better observability: With structured logs, you increase the visibility of what is happening with your system – since now you can log information in a better, more expressive way. This enables you to have a more transparent view of what is happening in the system and makes your systems easier to maintain and debug over longer periods of time.
  • Better consistency: With structured logs, you increase the structure present in your logs – and in turn, make your logs more consistent as the systems evolve. This allows us to index our logs in a system like Elastic stack more easily as we can be sure that we are sticking to some structure. Also with the adoption of a common schema, we can be rest assured that we are all using the same structure.
  • Better standardization: Having a single, well-defined, structured way to do logging allows us to standardize logging – which reduces cognitive overhead of figuring out what happened in systems via logs and allows easier adoption. Instead of going through 100 different types of logs, you instead would only have a single format. This is also one of the goals of the library – standardizing the usage of the library across Golang backend services.

We get some additional benefits as well:

  • Dynamic Log Levels: This allows us to have meaningful log levels in our code – where we can deploy with baseline warning settings and switch to lower levels (debug logs) only when we need them. This helps keep our logging costs low, as well as reduces the noise that developers usually need to go through when debugging.
  • Future-proof Consistency in Logs: With the adoption of a common schema, we make sure that we stick with the same structure, even if say tomorrow our logging infrastructure changes – making us future-ready. Instead of manually specifying what to log, we can simply expose a function in our loggers.
  • Production-Like Logging Environment in Development: The dockerized Kibana allows developers to enjoy the same benefits as the production Kibana. This also encourages developers to use Elastic stack more and explore its features such as building dashboards based on the log data, having better watchers, and so on.

I hope you have enjoyed this article and found it useful. Comments and corrections are always welcome.

Happy Logging!

How we simplified our Data Ingestion & Transformation Process

Post Syndicated from Grab Tech original https://engineering.grab.com/data-ingestion-transformation-product-insights

Introduction

As Grab grew from a small startup to an organisation serving millions of customers and driver partners, making day-to-day data-driven decisions became paramount. We needed a system to efficiently ingest data from mobile apps and backend systems and then make it available for analytics and engineering teams.

Thanks to modern data processing frameworks, ingesting data isn’t a big issue. However, at Grab scale it is a non-trivial task. We had to prepare for two key scenarios:

  • Business growth, including organic growth over time and expected seasonality effects.
  • Any unexpected peaks due to unforeseen circumstances. Our systems have to be horizontally scalable.

We could ingest data in batches, in real time, or a combination of the two. When you ingest data in batches, you can import it at regularly scheduled intervals or when it reaches a certain size. This is very useful when processes run on a schedule, such as reports that run daily at a specific time. Typically, batched data is useful for offline analytics and data science.

On the other hand, real-time ingestion has significant business value, such as with reactive systems. For example, when a customer provides feedback for a Grab superapp widget, we re-rank widgets based on that customer’s likes or dislikes. Note when information is very time-sensitive, you must continuously monitor its data.

This blog post describes how Grab built a scalable data ingestion system and how we went from prototyping with Spark Streaming to running a production-grade data processing cluster written in Golang.

Building the system without reinventing the wheel

The data ingestion system:

  1. Collects raw data as app events.
  2. Transforms the data into a structured format.
  3. Stores the data for analysis and monitoring.

In a previous blog post, we discussed dealing with batched data ETL with Spark. This post focuses on real-time ingestion.

We separated the data ingestion system into 3 layers: collection, transformation, and storage. This table and diagram highlights the tools used in each layer in our system’s first design.

LayerTools
CollectionGateway, Kafka
TransformationGo processing service, Spark Streaming
StorageTalariaDB

Our first design might seem complex, but we used battle-tested and common tools such as Apache Kafka and Spark Streaming. This let us get an end-to-end solution up and running quickly.

Collection layer

Our collection layer had two sub-layers:

  1. Our custom built API Gateway received HTTP requests from the mobile app. It simply decoded and authenticated HTTP requests, streaming the data to the Kafka queue.
  2. The Kafka queue decoupled the transformation layer (shown in the above figure as the processing service and Spark streaming) from the collection layer (shown above as the Gateway service). We needed to retain raw data in the Kafka queue for fault tolerance of the entire system. Imagine an error where a data pipeline pollutes the data with flawed transformation code or just simply crashes. The Kafka queue saves us from data loss by data backfilling.

Since it’s robust and battle-tested, we chose Kafka as our queueing solution. It perfectly met our requirements, such as high throughput and low latency. Although Kafka takes some operational effort such as self-hosting and monitoring, Grab has a proficient and dedicated team managing our Kafka cluster.

Transformation layer

There are many options for real-time data processing, including Spark Streaming, Flink, and Storm. Since we use Spark for all our batch processing, we decided to use Spark Streaming.

We deployed a Golang processing service between Kafka and Spark Streaming. This service converts the data from Protobuf to Avro. Instead of pointing Spark Streaming directly to Kafka, we used this processing service as an intermediary. This was because our Spark Streaming job was written in Python and Spark doesn’t natively support protobuf decoding.  We used Avro format, since Grab historically used it for archiving streaming data. Each raw event was enriched and batched together with other events. Batches were then uploaded to S3.

Storage layer

TalariaDB is a Grab-built time-series database. It ingests events as columnar ORC files, indexing them by event name and time. We use the same ORC format files for batch processing. TalariaDB also implements the Presto Thrift connector interface, so our users could query certain event types by time range. They did this by connecting a Presto to a TalariaDB hosting distributed cluster.

Problems

Building and deploying our data pipeline’s MVP provided great value to our data analysts, engineers, and QA team. For example, our mobile app team could monitor any abnormal change in the real-time metrics, such as the screen load time for the latest released app version. The QA team could perform app side actions (book a ride, make payment, etc.) and check which events were triggered and received by the backend. The latency between the ingestion and the serving layer was only 4 minutes instead of the batch processing system’s 60 minutes. The streaming processing’s data showed good business value.

This prompted us to develop more features on top of our platform-collected real-time data. Very soon our QA engineers and the product analytics team used more and more of the real-time data processing system. They started instrumenting various mobile applications so more data started flowing in. However, as our ingested data increased, so did our problems. These were mostly related to operational complexity and the increased latency.

Operational complexity

Only a few team members could operate Spark Streaming and EMR. With more data and variable rates, our streaming jobs had scaling issues and failed occasionally. This was due to checkpoint issues when the cluster was under heavy load. Increasing the cluster size helped, but adding more nodes also increased the likelihood of losing more cluster nodes. When we lost nodes,our latency went up and added more work for our already busy on-call engineers.

Supporting native Protobuf

To simplify the architecture, we initially planned to bypass our Golang-written processing service for the real-time data pipeline. Our plan was to let Spark directly talk to the Kafka queue and send the output to S3. This required packaging the decoders for our protobuf messages for Python Spark jobs, which was cumbersome. We thought about rewriting our job in Scala, but we didn’t have enough experience with it.

Also, we’d soon hit some streaming limits from S3. Our Spark streaming job was consuming objects from S3, but the process was not continuous due to S3’s  eventual consistency. To avoid long pagination queries in the S3 API, we had to prefix the data with the hour in which it was ingested. This resulted in some data loss after processing by the Spark streaming. The loss happened because the new data would appear in S3 while Spark Streaming had already moved on to the next hour. We tried various tweaks, but it was just a bad design. As our data grew to over one terabyte per hour, our data loss grew with it.

Processing lag

On average, the time from our system ingesting an event to when it was available on the Presto was 4 to 6 minutes. We call that processing lag, as it happened due to our data processing. It was substantially worse under heavy loads, increasing to 8 to 13 minutes. While that wasn’t bad at this scale (a few TBs of data), it made some use cases impossible, such as monitoring. We needed to do better.

Simplifying the architecture and rewriting in Golang

After completing the MVP phase development, we noticed the Spark Streaming functionality we actually used was relatively trivial. In the Spark Streaming job, we only:

  • Partitioned the batch of events by event name.
  • Encoded the data in ORC format.
  • And uploaded to an S3 bucket.

To mitigate the problems mentioned above, we tried re-implementing the features in our existing Golang processing service. Besides consuming the data and publishing to an S3 bucket, the transformation service also needed to deal with event partitioning and ORC encoding.

One key problem we addressed was implementing a robust event partitioner with a large write throughput and low read latency. Fortunately, Golang has a nice concurrent map package. To further reduce the lock contention, we added sharding.

We made the changes, deployed the service to production,and discovered our service was now memory-bound as we buffered data for 1 minute. We did thorough benchmarking and profiling on heap allocation to improve memory utilization. By iteratively reducing inefficiencies and contributing to a lower CPU consumption, we made our data transformation more efficient.

Performance

After revamping the system, the elapsed time for a single event to travel from the gateway to our dashboard is about 1 minute. We also fixed the data loss issue. Finally, we significantly reduced our on-call workload by removing Spark Streaming.

Validation

At this point, we had both our old and new pipelines running in parallel. After drastically improving our performance, we needed to confirm we still got the same end results. This was done by running a query against each of the pipelines and comparing the results. Both systems were registered to the same Presto cluster.

We ran two SQL “excerpts” between the two pipelines in different order. Both queries returned the same events, validating our new pipeline’s correctness.

select count(1) from ((
 select uuid, time from grab_x.realtime_new
 where event = 'app.metric1' and time between 1541734140 and 1541734200
) except (
 select uuid, time from grab_x.realtime_old
 where event = 'app.metric1' and time between 1541734140 and 1541734200
))

/* output: 0 */

Conclusions

Scaling a data ingestion system to handle hundreds of thousands of events per second was a non-trivial task. However, by iterating and constantly simplifying our overall architecture, we were able to efficiently ingest the data and drive down its lag to around one minute.

Spark Streaming was a great tool and gave us time to understand the problem. But, understanding what we actually needed to build and iteratively optimise the entire data pipeline led us to:

  • Replacing Spark Streaming with our new Golang-implemented pipeline.
  • Removing Avro encoding.
  • Removing an intermediary S3 step.

Differences between the old and new pipelines are:

Old PipelineNew Pipeline
LanguagesPython, GoGo
Stages4 services3 services
ConversionsProtobuf → Avro → ORCProtobuf → ORC
Lag4-13 min1 min

Systems usually become more and more complex over time, leading to tech debt and decreased performance. In our case, starting with more steps in the data pipeline was actually the simple solution, since we could re-use existing tools. But as we reduced processing stages, we’ve also seen fewer failures. By simplifying the problem, we improved performance and decreased operational complexity. At the end of the day, our data pipeline solves exactly our problem and does nothing else, keeping things fast.

Highlights from Git 2.21

Post Syndicated from Taylor Blau original https://github.blog/2019-02-24-highlights-from-git-2-21/

The open source Git project just released Git 2.21 with features and bug fixes from over 60 contributors. We last caught up with you on the latest Git releases when 2.19 was released. Here’s a look at some of the most interesting features and changes introduced since then.

Human-readable dates with --date=human

As part of its output, git log displays the date each commit was authored. Without making an alternate selection, timestamps will display in Git’s “default” format (for example, “Tue Feb 12 09:00:33 2019 -0800”).

That’s very precise, but a lot of those details are things that you might already know, or don’t care about. For example, if the commit happened today, you already know what year it is. Likewise, if a commit happened seven years ago, you don’t care which second it was authored. So what do you do? You could use --date=relative, which would give you output like 6 days ago, but often, you want to relate “six days ago” to a specific event, like “my meeting last Wednesday.” So, was six days ago Tuesday, or was it Wednesday that you were interested in?

Git 2.21 introduces a new date format that tells you exactly when something occurred with just the right amount of detail: --date=human. Here’s how git log looks with the new format:

git log --date=human example

That’s both more accurate than --date=relative and easier to consume than the full weight of --date=default.

But what about when you’re scripting? Here, you might want to frequently switch between the human and machine-readable formats while putting together a pipeline. Git 2.21 has an option suitable for this setting, too: --date=auto:human. When printing output to a pager, if Git is given this option it will act as if it had been passed --date=human. When otherwise printing output to a non-pager, Git will act as if no format had been given at all. If human isn’t quite your speed, you can combine auto with any other format of your choosing, like --date=auto:relative.

git log --date=auto:human example[source]

Detecting case-insensitive path collisions

One commonly asked Git question is, “After cloning a repository, why does git status report some of the files as modified?” Quite often, the answer is that the repository contains a tree which cannot be represented on your file system. For instance, if it contains both file as well as FILE and your file system is case-insensitive, Git can only checkout one of those files. Worse, Git doesn’t actually detect this case during the clone; it simply writes out each path, unaware that the file system considers them to be the same file. You only find out something has gone wrong when you see a mystery modification.

The exact rules for when this occurs will vary from system to system. In addition to “folding” what we normally consider upper and lowercase characters in English, you may also see this from language-specific conversions, non-printing characters, or Unicode normalization.

In Git 2.20, git clone now detects and reports colliding groups during the initial checkout, which should remove some of the confusion. Unfortunately, Git can’t actually fix the problem for you. What the original committer put in the repository can’t be checked out as-is on your file system. So if you’re thinking about putting files into a multi-platform project that differ only in case, the best advice is still: don’t.

git
[source]

Performance improvements and other bits

Behind the scenes, a lot has changed over the last couple of Git releases, too. We’re dedicating this section to overview a few of these changes. Not all of them will impact your Git usage day-to-day, but some will, and all of the changes are especially important for server administrators.

Multi-pack indexes

Git stores objects (e.g., representations of the files, directories, and more that make up your Git repository) in both the “loose” and “packed” formats. A “loose” object is a compressed encoding of an object, stored in a file. A “packed” object is stored in a packfile, which is a collection of objects, written in terms of deltas of one another.

Because it can be costly to rewrite these packs every time a new object is added to the repository, repositories tend to accumulate many loose objects or individual packs over time. Eventually, these are reconciled during a “repack” operation. However, this reconciliation is not possible for larger repositories, like the Windows repository.

Instead of repacking, Git can now create a multi-pack index file, which is a listing of objects residing in multiple packs, removing the need to perform expensive repacks (in many cases).

[source]

Delta islands

An important optimization for Git servers is that the format for transmitted objects is the same as the heavily-compressed on-disk packfiles. That means that in many cases, Git can serve repositories to clients by simply copying bytes off disk without having to inflate individual objects.

But sometimes this assumption breaks down. Objects on disk may be stored as “deltas” against one another. When two versions of a file have similar content, we might store the full contents of one version (the “base”), but only the differences against the base for the other version. This creates a complication when serving a fetch. If object A is stored as a delta against object B, we can only send the client our on-disk version of A if we are also sending them B (or if we know they already have B). Otherwise, we have to reconstruct the full contents of A and re-compress it.

This happens rarely in many repositories where clients clone all of the objects stored by the server. But it can be quite common when multiple distinct but overlapping sets of objects are stored in the same packfile (for example, due to repository forks or unmerged pull requests). Git may store a delta between objects found only in two different forks. When someone clones one of the forks, they want only one of the objects, and we have to discard the delta.

Git 2.20 solves this by introducing the concept of “delta islands. Repository administrators can partition the ref namespace into distinct “islands”, and Git will avoid making deltas between islands. The end result is a repository which is slightly larger on disk but is still able to serve client fetches much more cheaply.

[source 1, source 2]

Delta reuse with bitmaps

We already discussed the importance of reusing on-disk deltas when serving fetches, but how do we know when the other side has the base object they’ need to use the delta we send them? If we’re sending them the base, too, then the answer is easy. But if we’re not, how do we know if they have it?

That answer is deceptively simple: the client will have already told us which commits it has (so that we don’t bother sending them again). If they claim to have a commit which contains the base object, then we can re-use the delta. But there’s one hitch: we not only need to know about the commit they mentioned, but also the entire object graph. The base may have been part of a commit hundreds or thousands of commits deep in the history of the project.

Git doesn’t traverse the entire object graph to check for possible bases because it’s too expensive to do so. For instance, walking the entire graph of a Linux kernel takes roughly 30 seconds.

Fortunately, there’s already a solution within Git: reachability bitmaps. Git has an optional on-disk data structure to record the sets of objects “reachable” from each commit. When this data is available, we can query it to quickly determine whether the client has a base object. This results in the server generating smaller packs that are produced more quickly for an overall faster fetch experience.

[source]

Custom alternates reference advertisement

Repository alternates are a tool that server administrators have at their disposal to reduce redundant information. When two repositories are known to share objects (like a fork and its parent), the fork can list the parent as an “alternate”, and any objects the fork doesn’t have itself, it can look for in its parent. This is helpful since we can avoid storing twice the vast number of objects shared between the fork and parent.

Likewise, a repository with alternates advertises “tips” it has when receiving a push. In other words, before writing from your computer to a remote, that remote will tell you what the tips of its branches are, so you can determine information that is already known by the remote, and therefore use less bandwidth. When a repository has alternates, the tips advertisement is the union of all local and alternate branch tips.

But what happens when computing the tips of an alternate is more expensive than a client sending redundant data? It makes the push so slow that we have disabled this feature for years at GitHub. In Git 2.20, repositories can hook into the way that they enumerate alternate tips, and make the corresponding transaction much faster.

[source]

Tidbits

Now that we’ve highlighted a handful of the changes in the past two releases, we want to share a summary of a few other interesting changes. As always, you can learn more by clicking the “source” link, or reading the documentation or release notes.

  • Have you ever tried to run git cherry-pick on a merge commit only to have it fail? You might have found that the fix involves passing -m1 and moved on. In fact, -m1 says to select the first parent as the mainline, and it replays the relevant commits. Prior to Git 2.21, passing this option on a non-merge commit caused an error, but now it transparently does what you meant. [source]

  • Veteran Git users from our last post might recall that git branch -l establishes a reflog for a newly created branch, instead of listing all branches. Now, instead of doing something you almost certainly didn’t mean, git branch -l will list all of your repository’s branches, keeping in line with other commands that accept -l. [source]

  • If you’ve ever been stuck or forgotten what a certain command or flag does, you might have run git --help (or git -h) to learn more. In Git 2.21, this invocation now follows aliases, and shows the aliased command’s helptext. [source]

  • In repositories with large on-disk checkouts, git status can take a long time to complete. In order to indicate that it’s making progress, the status command now displays a progress bar. [source]

  • Many parts of Git have historically been implemented as shell scripts, calling into tools written in C to do the heavy lifting. While this allowed rapid prototyping, the resulting tools could often be slow due to the overhead of running many separate programs. There are continuing efforts to move these scripts into C, affecting git submodule, git bisect, and git rebase. You may notice rebase in particular being much faster, due to the hard work of the Summer of Code students, Pratik Karki and Alban Gruin.

  • The -G option tells git log to only show commits whose diffs match a particular pattern. But until Git 2.21, it was searching binary files as if they were text, which made things slower and often produced confusing results. [source]

That’s all for now

We went through a few of the changes that have happened over the last couple of versions, but there’s a lot more to discover. Read the release notes for 2.21, or review the release notes for previous versions in the Git repository.

The post Highlights from Git 2.21 appeared first on The GitHub Blog.

Five years of the GitHub Bug Bounty program

Post Syndicated from philipturnbull original https://github.blog/2019-02-19-five-years-of-the-github-bug-bounty-program/

GitHub launched our Security Bug Bounty program in 2014, allowing us to reward independent security researchers for their help in keeping GitHub users secure. Over the past five years, we have been continuously impressed by the hard work and ingenuity of our researchers. Last year was no different and we were glad to pay out $165,000 to researchers from our public bug bounty program in 2018.

We’ve previously talked about our other initiatives to engage with researchers. In 2018, our researcher grants, private bug bounty programs, and a live-hacking event allowed us to reach even more independent security talent. These different ways of working with the community helped GitHub reach a huge milestone in 2018: $250,000 paid out to researchers in a single year.

We’re happy to share some of our highlights from the past year and introduce some big changes for the coming year: full legal protection for researchers, more GitHub properties eligible for rewards, and increased reward amounts.

2018 Highlights

GraphQL and API authorization researcher grant

Since the launch of our researcher grants program in 2017 we’ve been on the lookout for bug bounty researchers who show a specialty in particular features of our products. In mid-2018 @kamilhism submitted a series of vulnerabilities to the public bounty program showing his expertise in the authorization logic of our REST and GraphQL APIs. To support their future research, we provided Kamil with a fixed grant payment to perform a systematic audit of our API authorization logic. Kamil’s audit was excellent, uncovering and allowing us to fix an additional seven authorization flaws in our API.

H1-702

In August, GitHub took part in HackerOne’s H1-702 live-hacking event in Las Vegas. This brought together over 75 of the top researchers from HackerOne to focus on GitHub’s products for one evening of live-hacking. The event didn’t disappoint—GitHub’s security improved and nearly $75,000 was paid out for 43 vulnerabilities. This included one critical-severity vulnerability in GitHub Enterprise Server. We also met with our researchers in-person and received great feedback on how we could improve our bug bounty program.

GitHub Actions private bug bounty

In October, GitHub launched a limited public beta of GitHub Actions. As part of the limited beta, we also ran a private bug bounty program to complement our extensive internal security assessments. We sent out over 150 invitations to researchers from last year’s private program, all H1-702 participants, and invited a number of the best researchers that have worked with our public program. The private bounty program allowed us to uncover a number of vulnerabilities in GitHub Actions.

We also held an office-hours event so that the GitHub security team and researchers could meet. We took the opportunity to meet face-to-face with other researchers because it’s a great way to build a community and learn from each other. Two of our researchers, @not-an-aardvark and @ngaloggc, gave an overview of their submissions and shared details of how they approached the target with everyone.

Workflow improvements

We’ve been making refinements to our internal bug bounty workflow since we last announced it back in 2017.  Our ChatOps-based tools have continued to evolve over the past year as we find more ways to streamline the process. These aren’t just technical changes—each day we’ve had individual on-call first responders who were responsible for handling incoming bounty submissions. We’ve also added a weekly status meeting to review current submissions with all members of the Application Security team. These meetings allow the team to ensure that submissions are not stalled, work is correctly prioritized by engineering teams based on severity, and researchers are getting timely updates on their submissions.

A key success metric for our program is how much time it takes to validate a submission and triage that information to the relevant engineering team so remediation work can begin. Our workflow improvements have paid off and we’ve significantly reduced the average time to triage from four days in 2017 down to 19 hours. Likewise, we’ve reduced our average time to resolution from 16 days to six days. Keep in mind: for us to consider a submission as resolved, the issue has to either be fixed or properly prioritized and tracked, by the responsible engineering team.

We’ve continued to reach our target of replying to researchers in less than 24 hours on average. Most importantly for our researchers, we’ve also dropped our average time for rewarding a submission from 17 days in 2017 down to 11 days. We’re grateful for the effort that researchers invest in our program and we aim to reduce these times further over the next year.

2019 initiatives

Although our program has been running successfully for the past five years, we know that we can always improve. We’ve taken feedback from our researchers and are happy to announce three major changes to our program for 2019:

Keeping bounty program participants safe from the legal risks of security research is a high priority for GitHub. To make sure researchers are as safe as possible, we’ve added a robust set of Legal Safe Harbor terms to our site policy. Our new policies are based on CC0-licensed templates by GitHub’s Associate Corporate Counsel, @F-Jennings. These templates are a fork of EdOverflow’s Legal Bug Bounty repo, with extensive modifications based on broad discussions with security researchers and Amit Elazari’s general research in this field. The templates are also inspired by other best-practice safe harbor examples including Bugcrowd’s disclose.io project and Dropbox’s updated vulnerability disclosure policy.

Our new Legal Safe Harbor terms cover three main sources of legal risk:

  • Your research activity remains protected and authorized even if you accidentally overstep our bounty program’s scope. Our safe harbor now includes a firm commitment not to pursue civil or criminal legal action, or support any prosecution or civil action by others, for participants’ bounty program research activities. You remain protected even for good faith violations of the bounty policy.
  • We will do our best to protect you against legal risk from third parties who won’t commit to the same level of safe harbor protections. Our safe harbor terms now limit report-sharing with third parties in two ways. We will share only non-identifying information with third parties, and only after notifying you and getting that third party’s written commitment not to pursue legal action against you. Unless we get your written permission, we will not share identifying information with a third party.
  • You won’t be violating our site terms if it’s specifically for bounty research. For example, if your in-scope research includes reverse engineering, you can safely disregard the GitHub Enterprise Agreement’s restrictions on reverse engineering. Our safe harbor now provides a limited waiver for relevant parts of our site terms and policies. This protects against legal risk from DMCA anti-circumvention rules or similar contract terms that could otherwise prohibit necessary research tasks like reverse engineering or deobfuscating code.

Other organizations can look to these terms as an industry standard for safe harbor best practices—and we encourage others to freely adopt, use, and modify them to fit their own bounty programs. In creating these terms, we aim to go beyond the current standards for safe harbor programs and provide researchers with the best protection from criminal, civil, and third-party legal risks. The terms have been reviewed by expert security researchers, and are the product of many months of legal research and review of other legal safe harbor programs. Special thanks to MG, Mugwumpjones, and several other researchers for providing input on early drafts of @F-Jennings’ templates.

Expanded scope

Over the past five years, we’ve been steadily expanding the list of GitHub products and services that are eligible for reward. We’re excited to share that we are now increasing our bounty scope to reward vulnerabilities in all first party services hosted under our github.com domain. This includes GitHub Education, GitHub Learning Lab, GitHub Jobs, and our GitHub Desktop application. While GitHub Enterprise Server has been in scope since 2016, to further increase the security of our enterprise customers we are now expanding the scope to include Enterprise Cloud.

It’s not just about our user-facing systems. The security of our users’ data also depends on the security of our employees and our internal systems. That’s why we’re also including all first-party services under our employee-facing githubapp.com and github.net domains.

Increased rewards

We regularly assess our reward amounts against our industry peers. We also recognize that finding higher-severity vulnerabilities in GitHub’s products is becoming increasingly difficult for researchers and they should be rewarded for their efforts. That’s why we’ve increased our reward amounts at all levels:

  • Critical: $20,000–$30,000+
  • High: $10,000–$20,000
  • Medium: $4,000–$10,000
  • Low: $617–$2,000

Our broad ranges have served us well, but we’ve been consistently impressed by the ingenuity of researchers. To recognize that, we no longer have a maximum reward amount for critical vulnerabilities. Although we’ve listed $30,000 as a guideline amount for critical vulnerabilities, we’re reserving the right to reward significantly more for truly cutting-edge research.

Get involved

The bounty program remains a core part of GitHub’s security process and we’re learning a lot from our researchers. With our new initiatives, now is the perfect time to get involved. Details about our safe harbor, expanded scope, and increased awards are available on the GitHub Bug Bounty site.

Working with the community has been a great experience—we’re looking forward to triaging your submissions in the future!

The post Five years of the GitHub Bug Bounty program appeared first on The GitHub Blog.

An open source parser for GitHub Actions

Post Syndicated from Patrick Reynolds original https://github.blog/2019-02-07-an-open-source-parser-for-github-actions/

Since the beta release of GitHub Actions last October, thousands of users have added workflow files to their repositories. But until now, those files only work with the tools GitHub provided: the Actions editor, the Actions execution platform, and the syntax highlighting built into pull requests. To expand that universe, we need to release the parser and the specification for the Actions workflow language as open source. Today, we’re doing that.

We believe that tools beyond GitHub should be able to run workflows. We believe there should be programs to check, format, compose, and visualize workflow files. We believe that text editors can provide syntax highlighting and autocompletion for Actions workflows. And we believe all that can only happen if the Actions community is empowered to build these tools along with us. That can happen better and faster if there is a single language specification and a free parser implementation.

The first project to use the open source parser will be act, which is @nektos‘s tool for running Actions workflows in a local development environment.

The parser and language specification are both in actions/workflow-parser, which we’re sharing under an MIT license. As of today, there is a Go implementation, which is the same code that powers both the Actions UI and the Actions execution platform. The repository also contains a Javascript parser in development, along with syntax-highlighting configurations for Atom and Vim.

GitHub Actions

GitHub Actions is platform for automating software development workflows, from idea to production. Developers add a simple text file to their repository, .github/main.workflow, to describe automation. The workflow file describes how events like pushing code or opening and closing issues map to automation actions, implemented in any Docker container. Those automation actions have whatever powers you grant them: pushing commits to the repository, cutting a new release, building it through continuous integration, deploying it to staging in the cloud, testing the deployment, flipping it to production, and announcing it to the world — and any others you can build.

Every workflow begins with an event and runs through a set of actions to reach some target or goal. Those events and actions are described in a main.workflow file, which you can create and edit with the visual editor or any text editor you like. Here is a simple example:

workflow "when I push" {
  on = "push"
  resolves = "ci"
}

action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}

Whenever I push to a branch that contains that file, the when I push workflow executes. It resolves the target action ci, which runs ./script/cibuild in a golang:latest Docker container. I can add more workflow blocks to harness more events, and I can add more action blocks to run after the ci action or in parallel with it.

The Actions workflow language

All main.workflow files are written in the Actions workflow language, which is a subset of Hashicorp’s HCL. In fact, our parser builds on top of the open source hashicorp/hcl parser.

All Actions workflow files are valid HCL, but not all HCL files are valid workflows. The Actions workflow parser is stricter, allowing only a specific set of keywords and prohibiting nested objects, among other restrictions. The reason for that is a long-standing goal of Actions: making the Actions editor and the text representation of workflows equivalent and interchangeable. Any file you write in the graphical editor can be expressed in a main.workflow file, of course, but also: any main.workflow can be fully displayed and edited in the graphical editor. There is one exception to this: the graphical editor does not display comments. But it preserves them: changes you make in the graphical editor do not disturb comments you have added to your main.workflow file.

Contributing

We have two reasons for releasing the parser. First, we want to encourage the Actions community to build tools that generate and manipulate workflow files. The language specification should help developers understand the language, and the parser should save developers the trouble of writing their own.

Second, we welcome your contributions. If you find bugs, please open an issue or send a pull request. If you want to add a syntax highlighter for a new editor or implement the parser in another language, we welcome that. The only real limitation is on features that go beyond the parser to the rest of Actions. For broader questions and suggestions about the rest of Actions, reach out through support; we’re listening.

The post An open source parser for GitHub Actions appeared first on The GitHub Blog.

Atom understands your code better than ever before

Post Syndicated from Max Brunsfeld original https://github.blog/2018-10-31-atoms-new-parsing-system/

Text editors like Atom have many features that make code easier to read and write—syntax highlighting and code folding are two of the most important examples. For decades, all major text editors have implemented these kinds of features based on a very crude understanding of code, obtained by searching for simple, regular expression patterns. This approach has severely limited how helpful text editors can be.

At GitHub, we want to explore new ways of making programming intuitive and delightful, so we’ve developed a parsing system called Tree-sitter that will serve as a new foundation for code analysis in Atom. Tree-sitter makes it possible for Atom to parse your code while you type—maintaining a syntax tree at all times that precisely describes the structure of your code. We’ve enabled the new system by default in Atom, bringing a number of improvements.

Crisp syntax highlighting

Atom’s syntax highlighting is now based on the syntax trees provided by Tree-sitter. This lets us use color to outline your code’s structure more clearly than before. Notice the consistency with which fields, functions, keywords, types, and variables are highlighted across a variety of languages:

This animated GIF shows a snippet of C code rendered in Atom. It then switches to show an equivalent snippet of code written in several different languages: C++, Go, Rust, and TypeScript.

Reliable code folding

In most text editors, code folding is based on indentation: lines with greater indentation are considered to be nested more deeply than lines with less indentation. But this doesn’t always match the structure of our code and can make code folding useless in some files. With Tree-sitter, Atom folds code based on its syntax, which allows folding to work as expected, even for code like this:

This animated GIF shows a snippet of Ruby code, containing a heredoc with unindented text. Code folding is used to first collapse the block containing the heredoc, then collapse the method containing that block, and then collapse the class containing that method.

Syntax-aware selection

Atom also uses syntax trees as the basis for two new editing commands: Select Larger Syntax Node and Select Smaller Syntax Node, bound to Alt+Up and Alt+Down. These commands can make many editing tasks more efficient and fun, especially when used in combination with multiple cursors.

This animated GIF shows a snippet of JavaScript code in Atom. First, nothing is selected. Then, larger and larger constructs in the code are selected.

Speed

Parsing an entire source file can be time-consuming. This is why most IDEs wait to parse your code until you stop typing for a moment, and there is often a delay before syntax highlighting updates. We want to avoid these delays, so we designed Tree-sitter to parse your code incrementally: it keeps the syntax tree up to date as you edit your code without ever having to re-parse the entire file from scratch.

tree-editing

Language support

Currently, we use Tree-sitter to parse 11 languages: Bash, C, C++, ERB, EJS, Go, HTML, JavaScript, Python, Ruby, and TypeScript. And we’ve added Rust support on our Beta channel. If you’d like to help us bring the power of Tree-sitter to more languages, check out the Tree-sitter documentation and the grammar page of the Atom Flight Manual.

Feedback

Want to know more about Tree-sitter? Check out this talk from this year’s StrangeLoop conference.

If you write code and you’re interested in trying our new system, give the new version of Atom a try. We’d love to hear your feedback—tweet us at @AtomEditor or if you’ve run into a bug, open an issue.

The post Atom understands your code better than ever before appeared first on The GitHub Blog.

October 21 post-incident analysis

Post Syndicated from Jason Warner original https://github.blog/2018-10-30-oct21-post-incident-analysis/

Last week, GitHub experienced an incident that resulted in degraded service for 24 hours and 11 minutes. While portions of our platform were not affected by this incident, multiple internal systems were affected which resulted in our displaying of information that was out of date and inconsistent. Ultimately, no user data was lost; however manual reconciliation for a few seconds of database writes is still in progress. For the majority of the incident, GitHub was also unable to serve webhook events or build and publish GitHub Pages sites.

All of us at GitHub would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in GitHub and take pride in building resilient systems that enable our platform to remain highly available. With this incident, we failed you, and we are deeply sorry. While we cannot undo the problems that were created by GitHub’s platform being unusable for an extended period of time, we can explain the events that led to this incident, the lessons we’ve learned, and the steps we’re taking as a company to better ensure this doesn’t happen again.

Background

The majority of user-facing GitHub services are run within our own data center facilities. The data center topology is designed to provide a robust and expandable edge network that operates in front of several regional data centers that power our compute and storage workloads. Despite the layers of redundancy built into the physical and logical components in this design, it is still possible that sites will be unable to communicate with each other for some amount of time.

At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.

A high-level depiction of GitHub's network architecture, including two physical datacenters, 3 POPS, and cloud capacity in multiple regions connected via peering.

In the past, we’ve discussed how we use MySQL to store GitHub metadata as well as our approach to MySQL High Availability. GitHub operates multiple MySQL clusters varying in size from hundreds of gigabytes to nearly five terabytes, each with up to dozens of read replicas per cluster to store non-Git metadata, so our applications can provide pull requests and issues, manage authentication, coordinate background processing, and serve additional functionality beyond raw Git object storage. Different data across different parts of the application is stored on various clusters through functional sharding.

To improve performance at scale, our applications will direct writes to the relevant primary for each cluster, but delegate read requests to a subset of replica servers in the vast majority of cases. We use Orchestrator to manage our MySQL cluster topologies and handle automated failover. Orchestrator considers a number of variables during this process and is built on top of Raft for consensus. It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.

In the normal topology, all apps perform reads locally with low latency.

Incident timeline

2018 October 21 22:52 UTC

During the network partition described above, Orchestrator, which had been active in our primary data center, began a process of leadership deselection, according to Raft consensus. The US West Coast data center and US East Coast public cloud Orchestrator nodes were able to establish a quorum and start failing over clusters to direct writes to the US West Coast data center. Orchestrator proceeded to organize the US West Coast database cluster topologies. When connectivity was restored, our application tier immediately began directing write traffic to the new primaries in the West Coast site.

The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.

2018 October 21 22:54 UTC

Our internal monitoring systems began generating alerts indicating that our systems were experiencing numerous faults. At this time there were several engineers responding and working to triage the incoming notifications. By 23:02 UTC, engineers in our first responder team had determined that topologies for numerous database clusters were in an unexpected state. Querying the Orchestrator API displayed a database replication topology that only included servers from our US West Coast data center.

2018 October 21 23:07 UTC

By this point the responding team decided to manually lock our internal deployment tooling to prevent any additional changes from being introduced. At 23:09 UTC, the responding team placed the site into yellow status. This action automatically escalated the situation into an active incident and sent an alert to the incident coordinator. At 23:11 UTC the incident coordinator joined and two minutes later made the decision change to status red.

2018 October 21 23:13 UTC

It was understood at this time that the problem affected multiple database clusters. Additional engineers from GitHub’s database engineering team were paged. They began investigating the current state in order to determine what actions needed to be taken to manually configure a US East Coast database as the primary for each cluster and rebuild the replication topology. This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were the several seconds of writes that existed in the East Coast cluster that had not been replicated to the West Coast and prevented replication of new writes back to the East Coast.

Guarding the confidentiality and integrity of user data is GitHub’s highest priority. In an effort to preserve this data, we decided that the 30+ minutes of data written to the US West Coast data center prevented us from considering options other than failing-forward in order to keep user data safe. However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls. This decision would result in our service being unusable for many users. We believe that the extended degradation of service was worth ensuring the consistency of our users’ data.

In the invalid topology, replication from US West to US East is broken and apps are unable to read from current replicas as they depend on low latency to maintain transaction performance.

2018 October 21 23:19 UTC

It was clear through querying the state of the database clusters that we needed to stop running jobs that write metadata about things like pushes. We made an explicit choice to partially degrade site usability by pausing webhook delivery and GitHub Pages builds instead of jeopardizing data we had already received from users. In other words, our strategy was to prioritize data integrity over site usability and time to recovery.

2018 October 22 00:05 UTC

Engineers involved in the incident response team began developing a plan to resolve data inconsistencies and implement our failover procedures for MySQL. Our plan was to restore from backups, synchronize the replicas in both sites, fall back to a stable serving topology, and then resume processing queued jobs. We updated our status to inform users that we were going to be executing a controlled failover of an internal data storage system.

Overview of recovery plan was to fail forward, synchronize, fall back, then churn through backlogs before returning to green.

While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. The process to decompress, checksum, prepare, and load large backup files onto newly provisioned MySQL servers took the majority of time. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.

2018 October 22 00:41 UTC

A backup process for all affected MySQL clusters had been initiated by this time and engineers were monitoring progress. Concurrently, multiple teams of engineers were investigating ways to speed up the transfer and recovery time without further degrading site usability or risking data corruption.

2018 October 22 06:51 UTC

Several clusters had completed restoration from backups in our US East Coast data center and begun replicating new data from the West Coast. This resulted in slow site load times for pages that had to execute a write operation over a cross-country link, but pages reading from those database clusters would return up-to-date results if the read request landed on the newly restored replica. Other larger database clusters were still restoring.

Our teams had identified ways to restore directly from the West Coast to overcome throughput restrictions caused by downloading from off-site storage and were increasingly confident that restoration was imminent, and the time left to establishing a healthy replication topology was dependent on how long it would take replication to catch up. This estimate was linearly interpolated from the replication telemetry we had available and the status page was updated to set an expectation of two hours as our estimated time of recovery.

2018 October 22 07:46 UTC

GitHub published a blog post to provide more context. We use GitHub Pages internally and all builds had been paused several hours earlier, so publishing this took additional effort. We apologize for the delay. We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints.

2018 October 22 11:12 UTC

All database primaries established in US East Coast again. This resulted in the site becoming far more responsive as writes were now directed to a database server that was co-located in the same physical data center as our application tier. While this improved performance substantially, there were still dozens of database read replicas that were multiple hours delayed behind the primary. These delayed replicas resulted in users seeing inconsistent data as they interacted with our services. We spread the read load across a large pool of read replicas and each request to our services had a good chance of hitting a read replica that was multiple hours delayed.

In reality, the time required for replication to catch up had adhered to a power decay function instead of a linear trajectory. Due to increased write load on our database clusters as users woke up and began their workday in Europe and the US, the recovery process took longer than originally estimated.

2018 October 22 13:15 UTC

By now, we were approaching peak traffic load on GitHub.com. A discussion was had by the incident response team on how to proceed. It was clear that replication delays were increasing instead of decreasing towards a consistent state. We’d begun provisioning additional MySQL read replicas in the US East Coast public cloud earlier in the incident. Once these became available it became easier to spread read request volume across more servers. Reducing the utilization in aggregate across the read replicas allowed replication to catch up.

2018 October 22 16:24 UTC

Once the replicas were in sync, we conducted a failover to the original topology, addressing the immediate latency/availability concerns. As part of a conscious decision to prioritize data integrity over a shorter incident window, we kept the service status red while we began processing the backlog of data we had accumulated.

2018 October 22 16:45 UTC

During this phase of the recovery, we had to balance the increased load represented by the backlog, potentially overloading our ecosystem partners with notifications, and getting our services back to 100% as quickly as possible. There were over five million hook events and 80 thousand Pages builds queued.

As we re-enabled processing of this data, we processed ~200,000 webhook payloads that had outlived an internal TTL and were dropped. Upon discovering this, we paused that processing and pushed a change to increase that TTL for the time being.

To avoid further eroding the reliability of our status updates, we remained in degraded status until we had completed processing the entire backlog of data and ensured that our services had clearly settled back into normal performance levels.

2018 October 22 23:03 UTC

All pending webhooks and Pages builds had been processed and the integrity and proper operation of all systems had been confirmed. The site status was updated to green.

Next steps

Resolving data inconsistencies

During our recovery, we captured the MySQL binary logs containing the writes we took in our primary site that were not replicated to our West Coast site from each affected cluster. The total number of writes that were not replicated to the West Coast was relatively small. For example, one of our busiest clusters had 954 writes in the affected window. We are currently performing an analysis on these logs and determining which writes can be automatically reconciled and which will require outreach to users. We have multiple teams engaged in this effort, and our analysis has already determined a category of writes that have since been repeated by the user and successfully persisted. As stated in this analysis, our primary goal is preserving the integrity and accuracy of the data you store on GitHub.

Communication

In our desire to communicate meaningful information to you during the incident, we made several public estimates on time to repair based on the rate of processing of the backlog of data. In retrospect, our estimates did not factor in all variables. We are sorry for the confusion this caused and will strive to provide more accurate information in the future.

Technical initiatives

There are a number of technical initiatives that have been identified during this analysis. As we continue to work through an extensive post-incident analysis process internally, we expect to identify even more work that needs to happen.

  1. Adjust the configuration of Orchestrator to prevent the promotion of database primaries across regional boundaries. Orchestrator’s actions behaved as configured, despite our application tier being unable to support this topology change. Leader-election within a region is generally safe, but the sudden introduction of cross-country latency was a major contributing factor during this incident. This was emergent behavior of the system given that we hadn’t previously seen an internal network partition of this magnitude.

  2. We have accelerated our migration to a new status reporting mechanism that will provide a richer forum for us to talk about active incidents in crisper and clearer language. While many portions of GitHub were available throughout the incident, we were only able to set our status to green, yellow, and red. We recognize that this doesn’t give you an accurate picture of what is working and what is not, and in the future will be displaying the different components of the platform so you know the status of each service.

  3. In the weeks prior to this incident, we had started a company-wide engineering initiative to support serving GitHub traffic from multiple data centers in an active/active/active design. This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center failure without user impact. This is a major effort and will take some time, but we believe that multiple well-connected sites in a geography provides a good set of trade-offs. This incident has added urgency to the initiative.

  4. We will take a more proactive stance in testing our assumptions. GitHub is a fast growing company and has built up its fair share of complexity over the last decade. As we continue to grow, it becomes increasingly difficult to capture and transfer the historical context of trade-offs and decisions made to newer generations of Hubbers.

Organizational initiatives

This incident has led to a shift in our mindset around site reliability. We have learned that tighter operational controls or improved response times are insufficient safeguards for site reliability within a system of services as complicated as ours. To bolster those efforts, we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub.

Conclusion

We know how much you rely on GitHub for your projects and businesses to succeed. No one is more passionate about the availability of our services and the correctness of your data. We will continue to analyze this event for opportunities to serve you better and earn the trust you place in us.

The post October 21 post-incident analysis appeared first on The GitHub Blog.

Behind the scenes of GitHub Token Scanning

Post Syndicated from Patrick Toomey original https://github.blog/2018-10-17-behind-the-scenes-of-github-token-scanning/

Several years ago we started scanning all pushes to public repositories for GitHub OAuth tokens and personal access tokens. Now we’re extending this capability to include tokens from cloud service providers and additional credentials, such as unencrypted SSH private keys associated with a user’s GitHub account.

We live in amazing times for software development. Capabilities that were once only available to large technology companies are now accessible to the smallest of startups. Developers can leverage cloud services to quickly perform continuous integration testing, deploy their code to fully scalable infrastructure, accept credit card payments from customers, and nearly anything else you can imagine.

Composing cloud services like this is the norm going forward, but it comes with inherent security complexities. Each cloud service a developer typically uses requires one or more credentials, often in the form of API tokens. In the wrong hands, they can be used to access sensitive customer data—or vast computing resources for mining cryptocurrency, presenting significant risks to both users and cloud service providers.

diff --git a/config.yml b/config.yml
index 1110370..da7b5de 100644
--- a/config.yml
+++ b/config.yml
@@ -1,2 +1,2 @@
 production:
-  github_token: 4e862a8ec12ef0ad7c1337e8b16d98f4d764b8f6
+  github_token: <%= ENV["GITHUB_TOKEN"] %>

Fig 1: Your master branch might look innocent, but exposed credentials in your project’s history could prove costly.

Token Scanning 1.0

GitHub, and our users, aren’t immune to this problem. That is why we started scanning for GitHub OAuth tokens in public repositories several years ago. But, our users are not just GitHub customers. They are also customers of cloud infrastructure providers, cloud payment processing providers, and other cloud service providers that have become commonplace in modern development.

Our existing solution focused on GitHub OAuth tokens exclusively and was never designed for extensibility. The existing code leveraged hand-tuned assembly that was extremely fast at finding 40-hex character strings (the format of GitHub OAuth tokens). This bit of code was patched into Git and run inline whenever code was pushed to GitHub. It was an amazing piece of work, but could not support multiple credential formats. Our vision was to support all of the popular cloud service providers.

Token Scanning 2.0

So began our journey into the next generation of GitHub Token Scanning. The obvious path to a more extensible scanner is some form of regular expression support. However, scanning for credentials using typical regular expression libraries doesn’t scale from a performance perspective, as they optimize for a slightly different problem than the one we have.

The vast majority of regular expression libraries are designed to return the first match in a set of patterns. Given we have at least one pattern for each cloud service provider, we require all matches are returned and not just the first. The only way to ensure this with traditional libraries is to scan a given input once for each pattern. However, this increases the scan time dramatically for large repositories or large sets of patterns. Fortunately, scanning Git data for credentials is just a specific case of a general problem. For example, high-performance application-level firewalls similarly need to scan high-volume network traffic for sets of patterns to identify known viruses or malware. If you squint, scanning high-volume Git push data for credentials is a very similar problem.

Our research eventually lead us to a GitHub repository hosting the amazing Hyperscan library by Intel. This library is incredibly performant and provides exactly what we need. We will explore the technical details in more depth in a follow-up engineering post. But, in short, Hyperscan let us replace all of the assembly code patches to Git with a new standalone scanner, written in Go, that has scaled nicely.

In parallel with working on the implementation, we reached out to several cloud service providers we thought would be interested in testing out Token Scanning in a private beta. They were all enthusiastic to participate, as many of them had contacted us in the past looking for a solution to this widespread problem.

Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code). Each candidate credential is sent to the provider, including some basic metadata such as the repository name and the commit that introduced the credential. The provider can then validate the credential and decide if the credential should be revoked depending on the associated risks to the user or provider. Either way, the provider typically contacts the owner of the credential, letting them know what occurred and what action was taken.

Where we go from here

We have received amazing feedback from both providers and users during the private beta. Cloud service providers have told us that GitHub Token Scanning has been tremendously effective in helping them identify credentials before malicious users. And, while GitHub users were not aware that Token Scanning was in beta, we did take notice of a tweet from a GitHub user that included a number of enthusiastic exclamation marks and 🎉 emojis. This user was extremely grateful for having received a notification from a participating cloud service provider less than a minute after they had accidentally pushed a highly sensitive credential to a public repository.

During the beta we have scanned millions of public repository changes and identified millions of candidate credentials. As announced yesterday at GitHub Universe, Token Scanning is now in public beta, and supports an increasing number of cloud providers. We’re excited by the impact of Token Scanning today and have lots of ideas about how to make it even more powerful in the future. Dealing with credentials is an unavoidable part of modern development. With GitHub by your side, we hope to minimize the security impact of such accidents.

The post Behind the scenes of GitHub Token Scanning appeared first on The GitHub Blog.

Applying machine intelligence to GitHub security alerts

Post Syndicated from Ben Thompson original https://github.blog/2018-10-09-applying-machine-intelligence-to-security-alerts/

Last year, we released security alerts that track security vulnerabilities in Ruby and JavaScript packages. Since then, we’ve identified more than four million of these vulnerabilities and added support for Python. In our launch post, we mentioned that all vulnerabilities with CVE IDs are included in security alerts, but sometimes there are vulnerabilities that are not disclosed in the National Vulnerability Database. Fortunately, our collection of security alerts can be supplemented with vulnerabilities detected from activity within our developer community.

Leveraging the community

There are many places a project can publicize security fixes within a new version: the CVE feed, various mailing lists, and open source groups, or even within its release notes or changelog. Regardless of how projects share this information, some developers within the GitHub community will see the advisory and immediately bump their required versions of the dependency to a known safe version. If detected, we can use the information in these commits to generate security alerts for vulnerabilities which may not have been published in the CVE feed.

On an average day, the dependency graph can track around 10,000 commits to dependency files for any of our supported languages. We can’t manually process this many commits. Instead, we depend on machine intelligence to sift through them and extract those that might be related to a security release.

For this purpose, we created a machine learning model that scans text associated with public commits (the commit message and linked issues or pull requests) to filter out those related to possible security upgrades. With this smaller batch of commits, the model uses the diff to understand how required version ranges have changed. Then it aggregates across a specific timeframe to get a holistic view of all dependencies that a security release might affect. Finally, the model outputs a list of packages and version ranges it thinks require an alert and currently aren’t covered by any known CVE in our system.

Always quality focused

No machine learning model is perfect. While machine intelligence can sift through thousands of commits in an instant, this anomaly-detection algorithm will still generate false positives for packages where no security patch was released. Security alert quality is a focus for us, so we review all model output before the community receives an alert.

Learn more

Interested in learning more? Join us at GitHub Universe next week to explore the connections that push technology forward and keep projects secure through talks, trainings, and workshops. Tune in to the blog October 16-17 for more updates and announcements.

The post Applying machine intelligence to GitHub security alerts appeared first on The GitHub Blog.