Tag Archives: Engineering

Designing resilient systems beyond retries (Part 1): Rate-Limiting

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-1

This post is the first of a three-part series on going beyond retries to improve system resiliency. In this series, we will discuss other techniques and architectures that can be used as part of a strategy to improve resiliency. To start off the series, we will cover rate-limiting.

Software engineers aim for reliability: systems that have predictable and consistent behaviour in terms of performance and availability. In the electricity industry, reliability may equate to being able to keep the lights on. But just because a system has remained reliable up to a certain point does not mean that it will continue to be. This is where resiliency comes in: the ability to withstand or recover from problematic conditions or failure. Going back to our electricity analogy – resiliency is the ability to turn the lights back on quickly when, say, a natural disaster hits the power grid.

Why we value resiliency

Being resilient to many different failures is the best way to ensure a system is reliable and – more importantly – stays that way. At Grab, our architecture features hundreds of microservices, which are constantly stressed in an increasing number of ways at higher and higher volumes. Failures that would otherwise be rare or unusual become more likely as our scale increases. For that reason, we proactively focus on – and require our services to think about – resiliency, even if they have historically been very reliable.

As software systems evolve and become more complex, the number of potential failure modes that software engineers have to account for grows. Fortunately, so too have the techniques for dealing with them. The circuit-breaker pattern and retries are two such techniques commonly employed to improve resiliency specifically in the context of distributed systems. In pursuit of reliability, this is a fine start, but it would be wrong to assume that this will keep the service reliable forever. This article will discuss how you can use rate-limiting as part of a strategy to improve resilience, beyond retries.

Challenges with retries and circuit breakers

A common risk when introducing retries in a resiliency strategy is ‘retry storms’. Retries by definition increase the number of requests from the client, especially when the system is experiencing some kind of failure. If the server is not prepared to handle this increase in traffic, and is possibly already struggling to handle the load, it can quickly become overwhelmed. This defeats the purpose of introducing retries in the first place!

When using a circuit-breaker in combination with retries, the application has some form of safety net: too many failures and the circuit will open, preventing the retry storms. However, this can be dangerous to rely on. For one thing, it assumes that all clients have the correct circuit-breaker configurations. Knowing how to configure the circuit-breaker correctly is difficult because it requires knowledge of the downstream service’s configurations too.

In a large organization such as Grab with hundreds of microservices, it becomes increasingly difficult to coordinate and maintain the correct circuit-breaker configurations as the number of services increases.

Secondly, it is never a good idea for the server to depend on its clients for resiliency. The circuit-breaker could fail or simply be bypassed, and the server would have to deal with all requests the client makes.

Introducing rate-limiting

It is therefore desirable to have some form of rate-limiting/throttling as another line of defense. There are many strategies for rate-limiting to consider.

Types of thresholds for rate-limiting

The traditional approach to rate-limiting is a server-side check that monitors the rate of incoming requests; if the rate exceeds a certain threshold, an error is returned instead of processing the request. There are many algorithms for this, such as ‘leaky bucket’ and fixed or sliding windows. A key decision is where to set the thresholds: usually by client, by endpoint, or a combination of both.
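To make this concrete, here is a minimal fixed-window limiter in Go. This is a sketch only, not Grab's implementation; the type names and thresholds are made up for illustration.

package ratelimit

import (
    "sync"
    "time"
)

// FixedWindowLimiter illustrates the fixed-window algorithm mentioned above:
// it counts requests per key (e.g. a client ID or endpoint name) within the
// current time window and rejects anything above the threshold.
type FixedWindowLimiter struct {
    mu        sync.Mutex
    window    time.Duration
    threshold int
    counts    map[string]int
    windowEnd time.Time
}

func NewFixedWindowLimiter(window time.Duration, threshold int) *FixedWindowLimiter {
    return &FixedWindowLimiter{
        window:    window,
        threshold: threshold,
        counts:    make(map[string]int),
        windowEnd: time.Now().Add(window),
    }
}

// Allow reports whether the request identified by key is still within the
// current window's threshold.
func (l *FixedWindowLimiter) Allow(key string) bool {
    l.mu.Lock()
    defer l.mu.Unlock()

    now := time.Now()
    if now.After(l.windowEnd) {
        // Window expired: reset all counters and start a new window.
        l.counts = make(map[string]int)
        l.windowEnd = now.Add(l.window)
    }

    if l.counts[key] >= l.threshold {
        return false // over the limit: the caller should reject the request, e.g. with HTTP 429
    }
    l.counts[key]++
    return true
}

A production limiter would more likely use a sliding window or token bucket to avoid bursts at window boundaries, and would keep counters in a shared store for the global rate-limiting discussed later.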

Rate-limiting by client or user account is the approach taken by many public APIs: each client is allowed to make a certain number of requests over a period, say 1000 requests per hour, and once that number is exceeded, further requests are rejected until the time window resets. In this approach, the server must ensure that it has enough capacity (or can scale adequately) to handle the maximum allowed number of requests for each client. If new clients are added frequently, the overhead of maintaining and adjusting the limits may be significant. However, it can be a good way to guarantee a service-level agreement (SLA) with your clients.

An alternative to per-client thresholds is to use per-endpoint thresholds. This limit is applied across all clients and can be set according to the server’s true capacity using benchmarks. Compared with per-client limits this is easier to configure and more reliable in preventing the server from becoming overloaded. However, one misbehaving client may be able to consume the entire quota, blocking other clients of the service.

A rate-limiting strategy may use thresholds at several levels together, which is the best way to capture the benefits of both per-client and per-endpoint limits. For example, the following rules might be applied (in order); a minimal sketch of how they compose follows the list:

  • Per-client, per-endpoint: For example, client A accessing the sendEmail endpoint. It is not necessary to configure thresholds at this granularity, but may be useful for critical endpoints.
  • Per-client: In addition to any per-client per-endpoint settings, client A could have a global threshold of 1000 requests/hour to any API.
  • Per-endpoint: This is the server’s catch-all guard to guarantee that none of its endpoints become overloaded. If client limits are properly configured, this limit should never be reached.
  • Server-wide: Finally, a limit on the number of requests a server can handle in total. This is important because even if endpoints can meet their limits individually, they are never completely isolated: the server will have some overhead and limited resources for processing any kind of request, opening and closing network connections etc.
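A minimal sketch of how these layers could compose, assuming a generic Limiter abstraction; the field names and keys are illustrative, not Grab's implementation.

package ratelimit

// Limiter abstracts any of the algorithms discussed above (fixed window,
// sliding window, leaky bucket, ...).
type Limiter interface {
    Allow(key string) bool
}

// Limits groups the layered thresholds described in the list above.
type Limits struct {
    PerClientEndpoint Limiter
    PerClient         Limiter
    PerEndpoint       Limiter
    ServerWide        Limiter
}

// Allow applies the rules in order; a request must pass every layer.
func (l Limits) Allow(clientID, endpoint string) bool {
    return l.PerClientEndpoint.Allow(clientID+":"+endpoint) &&
        l.PerClient.Allow(clientID) &&
        l.PerEndpoint.Allow(endpoint) &&
        l.ServerWide.Allow("global")
}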

Local vs global rate-limiting

Another consideration is local vs global rate-limiting. In most deployments, backend servers are pooled together for resiliency. A naive rate-limiting solution might be implemented at the individual server instance level. This sounds intuitive because the thresholds can be calculated exactly according to the instance’s computing power, and it scales automatically as the number of instances increases. However, in a microservice architecture this is rarely correct, because the bottlenecks are unlikely to be so closely tied to individual instance hardware.

More often, the capacity is reached when a downstream resource is exhausted, such as a database, a third-party service or another microservice. If the rate-limiting is only enforced at the instance level, when the service scales, the pressure on these resources will increase and quickly overload them. Local rate-limiting’s effectiveness is limited.

Global rate-limiting on the other hand monitors thresholds and enforces limits across the entire backend server pool. This is usually achieved through the use of a centralized rate-limiting service to make the decisions about whether or not requests should be allowed to go through. While this is much more desirable, implementing such a service is not without challenges.

Considerations when implementing rate-limiting

Care must be taken to ensure the rate-limiting service does not become a single point of failure. The system should still function when the rate-limiter itself is experiencing problems (perhaps by falling back to a local limiter). Since the rate-limiter must be in the request path, it should not add significant latency because any latency would be multiplied across every endpoint being monitored. Grab’s own Quotas service is an example of a global rate-limiter which addresses these concerns.

Global rate-limiting with a central server
Global rate-limiting with a central server. The servers send information about the request volumes, and the rate-limiting service responds with the rate-limiting decisions. This is done asynchronously to avoid introducing a point of failure.

 

Generally, it is more important to implement rate-limiting at the server side. This is because, once again, assuming that clients have correct implementation and configurations is risky. However, there is a case to be made for rate-limiting on the client as well, especially if the clients can be trusted or share a common SDK.

With server-side limiting, the server still has to accept the initial connection, process the rate-limiting logic and return an appropriate error response. With sufficient load, this overhead can be enough to render the system unresponsive; an unintentional denial-of-service (DoS) effect.

Client-side limiting can be implemented by using a central service as described above or, more commonly, by utilizing response headers from the server. In this approach, the server response may include information about the client’s remaining quota and/or a timestamp at which the quota resets. If the client implements logic for these headers, it can avoid sending requests at all if it knows they will be rate-limited. The disadvantage is that the client-side logic becomes more complex and is another possible source of bugs, so this cost has to be weighed against the simpler server-only method.
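As a rough illustration, a client could interpret such headers as follows. The X-RateLimit-* names are a common convention rather than a standard, so your server may use different headers; this is a sketch, not a specific SDK's behaviour.

package client

import (
    "net/http"
    "strconv"
    "time"
)

// waitFromHeaders inspects rate-limit headers on a response and reports how
// long the client should wait before sending more requests.
func waitFromHeaders(resp *http.Response) (wait time.Duration, limited bool) {
    remaining, err := strconv.Atoi(resp.Header.Get("X-RateLimit-Remaining"))
    if err != nil || remaining > 0 {
        return 0, false // quota unknown or not exhausted: keep sending requests
    }
    // Quota exhausted: wait until the advertised reset time before retrying.
    resetUnix, err := strconv.ParseInt(resp.Header.Get("X-RateLimit-Reset"), 10, 64)
    if err != nil {
        return 0, true
    }
    return time.Until(time.Unix(resetUnix, 0)), true
}

A client SDK could consult this after each call and back off (or fail fast) instead of sending requests it knows will be rejected.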

Up next, Bulkheading, Load Balancing, and Fallbacks…

So we’ve taken a look at rate-limiting as a strategy for building resilient systems. I hope you found this article useful. Comments are always welcome.

In our next post, we will look at other resiliency techniques such as bulkheading (isolation), load balancing, and fallbacks.

Please stay tuned!

Context Deadlines and How to Set Them

Post Syndicated from Grab Tech original https://engineering.grab.com/context-deadlines-and-how-to-set-them

At Grab, our microservice architecture involves a huge amount of network traffic and inevitably, network issues will sometimes occur, causing API calls to fail or take longer than expected. We strive to make such incidents a non-event, by designing with the expectation of such incidents in mind. With the aid of Go’s context package, we have improved upon basic timeouts by passing timeout information along the request path. However, this introduces extra complexity, and care must be taken to ensure timeouts are configured in a way that is efficient and does not worsen problems. This article explains from the ground up a strategy for configuring timeouts and using context deadlines correctly, drawing from our experience developing microservices in a large scale and often turbulent network environment.

Timeouts

Timeouts are a fundamental concept in computer networking. Almost every kind of network communication has some kind of timeout associated with it, often configurable with a parameter. The idea is to place a time limit on some event happening, often a network response; after the limit has passed, the operation is aborted rather than waiting indefinitely. Examples of useful places to put timeouts include connecting to a database, making an HTTP request, or on idle connections in a pool.

Figure 1.1: How timeouts prevent long API calls

 

Timeouts allow a program to continue where it otherwise might hang, providing a better experience to the end user. Often the default way for programs to handle timeouts is to return an error, but this doesn’t have to be the case: there are several better alternatives for handling timeouts which we’ll cover later.

While they may sound like a panacea, timeouts must be configured carefully to be effective: too short a timeout will result in increased errors from a resource that could still be working normally, while too long a timeout risks consuming excess resources and delivering a poor user experience. Furthermore, timeouts have evolved over time with new concepts such as Go’s context package, and the trend towards distributed systems has raised the stakes: timeouts are more important, and can cause more damage if misused!

Why timeouts are useful

In the context of microservices, timeouts are useful as a defensive measure against misbehaving or faulty dependencies. It is a guarantee that no matter how badly the dependency is failing, your call will never take longer than the timeout setting (for example 1 second). With so many other things to worry about, that’s a really nice thing to have! So there’s an instant benefit to your service’s resiliency, even if you do nothing more than set the timeout.

However, a service can choose what to do when it encounters a timeout, which can make them even more useful. Generally there are three options:

  1. Return an error. This is the simplest, but unless you know there is error handling upstream, this can actually deliver the worst user experience.
  2. Return a fallback value. We can return a default value, a cached value, or fall back to a simpler computed value. Depending on the circumstances, this can offer a better user experience.
  3. Retry. In the best case, a retry will succeed and deliver the intended response to the caller, albeit with the added timeout delay. However, there are other complexities to consider for retries to be effective. For a full discussion on this topic, see Circuit Breaker vs Retries Part 1 and Circuit Breaker vs Retries Part 2.

At Grab, our services tend towards using retries wherever possible, to make minor errors as transparent as possible.

The main advantage of timeouts is that they give your service time to do something else, and this should be kept in mind when considering a good timeout value: not only do you want to allow the remote call time to complete (or not), but you need to allow enough time to handle the potential timeout as well.

Different types of timeouts

Not all timeouts are the same. There are different types of timeouts with crucial differences in semantics, and you should check the behaviour of the timeout settings in the library or resource you’re using before configuring them for production use.

In Go, there are three common classes of timeouts (a short sketch of each follows the list):

  • Network timeouts: These come from the net package and apply to the underlying network connection. These are the best to use when available, because you can be sure that the network call has been cancelled when the call returns to your function.
  • Context timeouts: Context is discussed later in this article, but for now just note that these timeouts are propagated to the server. Since the server is aware of the timeout, it can avoid wasted effort by abandoning computation after the timeout is reached.
  • Asynchronous timeouts: These occur when a goroutine is executed and abandoned after some time. This does not automatically cancel the goroutine (you can’t really cancel goroutines without extra handling), so it risks leaking the goroutine and other resources. This approach should be avoided in production unless combined with some other measures to provide cancellation or avoid leaking resources.
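A small sketch showing one example of each class; the values are illustrative and this is not code from the original article.

package timeouts

import (
    "context"
    "net"
    "net/http"
    "time"
)

// examples shows one instance of each timeout class discussed above.
func examples() {
    // 1. Network timeouts: enforced on the connection itself by the net and
    //    net/http packages.
    httpClient := &http.Client{
        Timeout: 1 * time.Second, // covers the whole request, including reading the body
        Transport: &http.Transport{
            DialContext: (&net.Dialer{Timeout: 200 * time.Millisecond}).DialContext,
        },
    }
    _ = httpClient

    // 2. Context timeouts: propagated across API boundaries, so the server
    //    can stop work once the deadline passes.
    ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
    defer cancel()
    _ = ctx

    // 3. Asynchronous timeouts: we stop waiting, but the goroutine keeps
    //    running and may leak resources. Avoid in production without extra
    //    cancellation handling.
    result := make(chan string, 1)
    go func() { result <- doSlowWork() }()
    select {
    case r := <-result:
        _ = r
    case <-time.After(1 * time.Second):
        // give up waiting; doSlowWork continues in the background
    }
}

func doSlowWork() string { return "done" }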

Dangers of poor timeout configuration for microservice calls

The benefits of using timeouts are enticing, but there’s no free lunch: relying on timeouts too heavily can lead to disastrous cascading failure scenarios. Worse, the effects of a poor timeout configuration often don’t become evident until it’s too late: it’s peak hour, traffic just reached an all-time high and… all your services froze up at the same time. Not good.

To demonstrate this effect, imagine a simple 3-service architecture where each service naively uses a default timeout of 1 second:

Figure 1.2: Example of how incorrect timeout configuration causes cascading failure

 

Service A’s timeout does not account for the fact that Service B calls C. If B itself is experiencing problems and takes 800ms to complete its work, then C effectively only has 200ms to complete before service A gives up. But since B’s timeout to C is also 1s, that means that C could be wasting up to 800ms of computational effort that ‘leaks’ – it has no chance of being used. Both B and C are blissfully unaware at first that anything is wrong – they happily return successful responses that A never receives!

This resource leak can soon become catastrophic, though: since the calls from A to B are timing out, A (or A’s clients) are likely to retry, causing the load on B to increase. This in turn causes the load on C to increase, and eventually all services will stop responding.

The same thing happens if B is healthy but C is experiencing problems: B’s calls to C will build up and cause B to become overloaded and fail too. This is a common cause of cascading failure.

How to set a good timeout

Given the importance of correctly configuring timeout values, the question remains as to how to decide upon a ‘correct’ timeout value. If the timeout is for an API call to another service, a good place to start would be that service’s service-level agreements (SLAs). Often SLAs are based on latency percentiles, which is a value below which a given percentage of latencies fall. For example, a system might have a 99th percentile (also known as P99) latency of 300ms; this would mean that 99% of latencies are below 300ms. A high-order percentile such as P99 or even P99.9 can be used as a ballpark worst-case value.

Let’s say a service (B)’s endpoint has a 99th percentile latency of 600ms. Setting the timeout for this call at 600ms guarantees that no call takes longer than 600ms; the slowest calls return errors instead, giving an error rate of at most 1% (assuming the service keeps to its SLA). This is an example of how the timeout can be combined with information about latencies to give predictable behaviour.

This idea can be taken further by considering retries too. If the median latency for this service is 50ms, then you could introduce a retry of 50ms for an overall timeout of 50ms + 600ms = 650ms:

Service B: P99 latency SLA = 600ms; median latency = 50ms

Service A: request timeout = 600ms; number of retries = 1; retry request timeout = 50ms; overall timeout = 50ms + 600ms = 650ms; chance of timeout after retry = 1% * 50% = 0.5%

Figure 1.3: Example timeout configuration settings based on latency data

 

This would still cut off the top 1% of latencies, while optimistically making another attempt for the median latency. This way, even for the 1% of calls that encounter a timeout, our service would still expect to return a successful response within 650ms more than half the time, for an overall success rate of 99.5%.

Context propagation

Go officially introduced the concept of context in Go 1.7, as a way of passing request-scoped information across server boundaries. This includes deadlines, cancellation signals and arbitrary values. Let’s ignore the last part for now and focus on deadlines and cancellations. Often, when setting a regular timeout on a remote call, the server side is unaware of the timeout. Even if the server is notified indirectly when the client closes the connection, it’s still not necessarily clear whether the client timed out or encountered another issue. This can lead to wasted resources, because without knowing the client timed out, the server often carries on regardless. Context aims to solve this problem by propagating the timeout and context information across API boundaries.

Figure 1.4: Context propagation cancels work on B and C

 

Server A sets a context timeout of 1 second. Since this information spans the entire request and gets propagated to C, C is always aware of the remaining time it has to do useful work – work that won’t get discarded. The remaining time can be defined as (1s − b), where b is the amount of time that server B spent processing before calling C. When the deadline is exceeded, the context is immediately cancelled, along with any child contexts that were created from the parent.

The context timeout can be a relative time (eg. 3 seconds from now) or an absolute time (eg. 7pm). In practice they are equivalent, and the absolute deadline can be queried from a timeout created with a relative time and vice-versa.
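For illustration, here are both forms in Go, together with querying the absolute deadline back; this is a sketch, not code from the original article.

package main

import (
    "context"
    "fmt"
    "time"
)

func main() {
    // Relative: 3 seconds from now.
    ctx1, cancel1 := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel1()

    // Absolute: a specific point in time (here, also 3 seconds from now).
    ctx2, cancel2 := context.WithDeadline(context.Background(), time.Now().Add(3*time.Second))
    defer cancel2()

    // Either way, the absolute deadline can be queried back from the context...
    deadline, ok := ctx1.Deadline()
    fmt.Println(deadline, ok) // prints the absolute timestamp, true

    // ...and converted back to a relative duration.
    remaining := time.Until(deadline)
    fmt.Println(remaining, ctx2 != nil)
}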

Another useful feature of contexts is cancellation. The client has the ability to cancel the request for any reason, which will immediately signal the server to stop working. When a context is cancelled manually, this is very similar to a context being cancelled when it exceeds the deadline. The main difference is that the error will be ‘context canceled’ instead of ‘context deadline exceeded’. This is a common cause of confusion, but context canceled is always caused by an upstream client, while deadline exceeded could be a deadline set upstream or locally.

The server must still listen for the ‘context done’ signal and implement cancellation logic, but at least it has the option of doing so, unlike with ordinary timeouts. The most common reason for cancelling a request is because the client encountered an error and no longer needs the response that the server is processing. However, this technique can also be used in request hedging, where concurrent duplicate requests are sent to the server to decrease the impact of an individual call experiencing latency. When the first response returns, the other requests are cancelled because they are no longer needed.
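As a rough sketch of request hedging with cancellation, assuming a hypothetical context-aware callServer function:

package client

import "context"

// hedgedCall fires two duplicate requests and returns the first response;
// cancelling the shared context signals the slower request to stop.
// callServer is a hypothetical, context-aware client function.
func hedgedCall(ctx context.Context, callServer func(context.Context) (string, error)) (string, error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // cancels the losing request once we return

    type result struct {
        value string
        err   error
    }
    results := make(chan result, 2) // buffered so the slower goroutine never blocks

    for i := 0; i < 2; i++ {
        go func() {
            v, err := callServer(ctx)
            results <- result{v, err}
        }()
    }

    r := <-results // first response wins
    return r.value, r.err
}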

Context can be seen as ‘distributed timeouts’ – an improvement to the concept of timeouts by propagating them. But while they achieve the same goal, they introduce other issues that must be considered.

Context propagation and timeout configuration

When propagating timeout information via context, there is no longer a static ‘timeout’ setting per call. This can complicate debugging: even if the client has correctly configured their own timeout as above, a context timeout could mean that either the remote downstream server is slow, or that an upstream client was slow and there was insufficient time remaining in the propagated context!

Let’s revisit the scenario from earlier, and assume that service A has set a context timeout of 1 second. If B is still taking 800ms, then the call to C will time out after 200ms. This changes things completely: although there is no longer the resource leak (because both B and C will terminate the call once the context timeout is exceeded), B will have an increase in errors whereas previously it would not (at least until it became overloaded). This may be worse than completing the request after A has given up, depending on the circumstances. There is also a dangerous interaction with circuit breakers which we will discuss in the next section.

If allowing the request to complete is preferable to cancelling it even in the event of a client timeout, the request should be made with a new context decoupled from the parent (ie. context.Background()). This ensures that the timeout is not propagated to the remote service. When doing this, it is still a good idea to set a timeout, to avoid waiting indefinitely for the request to complete.
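A minimal sketch of this decoupling, assuming a hypothetical context-aware process function:

package worker

import (
    "context"
    "time"
)

// completeRegardless runs process with a context that is deliberately NOT
// derived from the caller's context, so an upstream timeout or cancellation
// will not abort the work. A fresh timeout still bounds it.
func completeRegardless(process func(context.Context) error) error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()
    return process(ctx)
}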

Context and circuit-breakers

A circuit-breaker is a software library or function which monitors calls to external resources with the aim of preventing calls which are likely to fail, ‘short-circuiting’ them (hence the name). It is a good practice to use a circuit-breaker for all outgoing calls to dependencies, especially potentially unreliable ones. But when combined with context propagation, that raises an important question: should context timeouts or cancellation cause the circuit to open?

Let’s consider the options. If ‘yes’, this means the client will avoid wasting calls to the server if it’s repeatedly hitting the context timeout. This might seem desirable at first, but there are drawbacks too.

Pros:

  • Consistent behaviour with other server errors
  • Avoids making calls that are unlikely to succeed
  • It is obvious when things are going wrong
  • Client has more time to fall back to other behaviour
  • More lenient on misconfigured timeouts because circuit-breaking ensures that subsequent calls will fail fast, thus avoiding cascading failure

Cons:

  • Unpredictable
  • A misconfigured upstream client can cause the circuit to open for all other clients
  • Can be misinterpreted as a server error

It is generally better not to open the circuit when the context deadline set upstream is exceeded. The only timeout allowed to trigger the circuit-breaker should be the request timeout of the specific call for that circuit.

Pros:

  • More predictable
  • Circuit depends mostly on server health, not client
  • Clients are isolated

Cons:

  • May be confusing for clients who expect the circuit to open
  • Misconfigured timeouts are more likely to waste resources

Note that the above only applies to propagated contexts. If the context only spans a single individual call, then it is equivalent to a static request timeout, and such errors should cause circuits to open.

How to set context deadlines

Let’s recap some of the concepts covered in this article so far:

  • Timeouts are a time limit on an event taking place, such as a microservice completing an API call to another service.
  • Request timeouts refer to the timeout of a single individual request. When accounting for retries, an API call may include several request timeouts before completing successfully.
  • Context timeouts are introduced in Go to propagate timeouts across API boundaries.
  • A context deadline is an absolute timestamp at which the context is considered to be ‘done’, and work covered by this context should be cancelled when the deadline is exceeded.

Fortunately, there is a simple rule for correctly configuring context timeouts:

The upstream timeout must always be longer than the total downstream timeouts including retries.

The upstream timeout should be set at the ‘edge’ server and cascade throughout.

In our scenario, A is the edge server. Let’s say that B’s timeout to C is 1s, and it may retry at most once, after a delay of 500ms. The appropriate context timeout (CT) set from A can be calculated as follows:

CT(A) = (timeout to C * number of attempts) + (retry delay * number of retries)

CT(A) = (1s * 2) + (500ms * 1) = 2,500ms

Figure 1.5: Formula for calculating context timeouts

 

Extra time can be allocated for B’s processing time and to allow B to return a fallback response if appropriate.

Note that if A configures its timeout according to this rule, then many of the above issues disappear. There are no wasted resources, because B and C are given the maximum time to complete their requests successfully. There is no chance for B’s circuit-breaker to open unexpectedly, and cascading failure is mostly avoided: a failure in C will be handled and be returned by B, instead of A timing out as well.

A possible alternative would be to rely on context cancellation: allow A to set a shorter timeout, which cancels B and C if the timeout is exceeded. This is an acceptable approach to avoiding cascading failure (and cancellation should be implemented in any case), but it is less optimal than configuring timeouts according to the above formula. One reason is that there is no guarantee of the downstream services handling the timeout gracefully; as mentioned previously, the service must explicitly check for ctx.Done() and this is rarely followed in practice. It is also impractical to place checks at every point in the code, so there could be a considerable delay between the client cancellation and the server abandoning the processing.

A second reason not to set shorter timeouts is that it could lead to unexpected errors on the downstream services. Even if B and C are healthy, a shorter context timeout could lead to errors if A has timed out. Besides the problem of having to handle the cancelled requests, the errors could create noise in the logs, and more importantly could have been avoided. If the downstream services are healthy and responding within their SLA, there is no point in timing out earlier. An exception might be for the edge server (A) to allow for only 1 attempt or fewer retries than the downstream service actually performs. But this is tricky to configure and weakens the resiliency. If it is desirable to shorten the timeouts to decrease latency, it is better to start adjusting the timeouts of the downstream resources first, starting from the innermost service outwards.

A model implementation for using context timeouts in calls between microservices

We’ve touched on several useful concepts for improving resiliency in distributed systems: timeouts, context, circuit-breakers and retries. It is desirable to use all of them together in a good resiliency strategy. However, the actual implementation is far from trivial; finding the right order and configuration to use them effectively can seem like searching for the holy grail, and many teams go through a long process of trial and error, continuously improving their implementation. Let’s try to formally put together an ideal implementation, step by step.

Note that the code below is not a final or production-ready implementation. At Grab we have developed independent circuit-breaker and retry libraries, with many settings that can be configured for fine-tuning. However, it should serve as a guide for writing resilient client libraries.

Step 1: Context propagation

Context propagation code

The skeleton function signature includes a context object as the first parameter, which is the convention recommended for Go code. We check whether the context is already done before proceeding, in which case we ‘fail fast’ without wasting any further effort.
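The original code screenshot is not reproduced here; a minimal sketch of what this step describes, with placeholder Request and Response types, might look like:

package client

import "context"

// Request and Response are placeholder types for this sketch.
type Request struct{ Query string }
type Response struct{ Body string }

// CallService takes the context as its first parameter and fails fast if the
// caller has already given up, so no further effort is wasted.
func CallService(ctx context.Context, req Request) (Response, error) {
    if err := ctx.Err(); err != nil {
        return Response{}, err
    }
    // ... the remote call is added in the following steps ...
    return Response{}, nil
}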

Step 2: Create child context with request timeout

Child context with request timeout code

Our service has no control over the parent context. Indeed, it could have no deadline at all! Therefore it’s important to create a new context and timeout for our own outgoing request as well, using WithTimeout. It is mandatory to call the returned cancel function to ensure the context is properly cancelled and avoid a goroutine leak.
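Building on the step 1 sketch (same placeholder types; a time import and an illustrative requestTimeout constant are assumed), the body gains its own request-scoped timeout:

const requestTimeout = 1 * time.Second // illustrative value

func CallService(ctx context.Context, req Request) (Response, error) {
    // Step 1: fail fast if the caller has already given up.
    if err := ctx.Err(); err != nil {
        return Response{}, err
    }

    // Step 2: the parent context may have a distant deadline, or none at all,
    // so bound our own outgoing request explicitly.
    reqCtx, cancel := context.WithTimeout(ctx, requestTimeout)
    defer cancel() // mandatory: releases resources and avoids leaking the context

    // ... the remote call (steps 3 and 4) is made with reqCtx ...
    _ = reqCtx
    return Response{}, nil
}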

Step 3: Introduce circuit-breaker logic

Introduce circuit-breaker logic code

Next, we wrap our call to the external service in a circuit-breaker. The actual circuit-breaker implementation has been omitted for brevity (a rough sketch follows the list), but there are two important points to consider:

  • It should only consider opening the circuit-breaker when requestTimeout is reached, not on ctx.Done().
  • The circuit name should ideally be unique for this specific endpoint.

Introduce circuit-breaker logic code - 2
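A rough sketch of this guard, using a hypothetical breaker with Allow/Success/Failure methods; ErrCircuitOpen and doRemoteCall are stubs defined in the step 5 sketch, and real circuit-breaker libraries will differ.

// Inside CallService, wrapping the remote call from step 2:
if !breaker.Allow() {
    return Response{}, ErrCircuitOpen // fail fast while the circuit is open
}

resp, err := doRemoteCall(reqCtx, req)

switch {
case err == nil:
    breaker.Success()
case ctx.Err() != nil:
    // The parent context was cancelled or expired upstream: this is not the
    // server's fault, so it must not count towards opening the circuit.
default:
    // Includes our own requestTimeout expiring: counts as a failure.
    breaker.Failure()
}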

Step 4: Introduce retries

The last step is to add retries to our request in the case of error. This can be implemented as a simple for loop, but there are some key things to include in a complete retry implementation (see the sketch after this list):

  • ctx.Done() should be checked after each retry attempt to avoid wasting a call if the client has given up.
  • The request context should be cancelled before the next retry to avoid duplicate concurrent calls and goroutine leaks.
  • Not all kinds of requests should be retried.
  • A delay should be added before the next retry, using exponential backoff.
  • See Circuit Breaker vs Retries Part 2 for a thorough guide to implementing retries.
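A sketch of such a loop following the points above; maxAttempts, baseDelay and isRetryable are illustrative and defined in the step 5 sketch.

// Inside CallService:
var resp Response
var err error
for attempt := 0; attempt < maxAttempts; attempt++ {
    if ctxErr := ctx.Err(); ctxErr != nil {
        return Response{}, ctxErr // the caller has given up; stop retrying
    }

    reqCtx, cancel := context.WithTimeout(ctx, requestTimeout)
    resp, err = doRemoteCall(reqCtx, req)
    cancel() // release this attempt's context before starting the next one

    if err == nil || !isRetryable(err) {
        break
    }
    if attempt+1 < maxAttempts {
        // Exponential backoff between attempts: baseDelay, 2*baseDelay, ...
        time.Sleep(baseDelay * time.Duration(1<<attempt))
    }
}
return resp, err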

Step 5: The complete implementation

Complete implementation

And here we have arrived at our ‘ideal’ implementation of an external call including context handling and propagation, two levels of timeout (parent and request), circuit-breaking and retries. This should be sufficient for a good level of resiliency, avoiding wasted effort on both the client and server.
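Since the original screenshots are not reproduced here, the following is a self-contained sketch that puts the steps together. It is not Grab's actual library code: the Breaker interface, the helper stubs and all constants are illustrative.

package client

import (
    "context"
    "errors"
    "time"
)

// Placeholder types and helpers; a real client would use the service's own
// request/response structs and a real circuit-breaker library.
type Request struct{ Query string }
type Response struct{ Body string }

type Breaker interface {
    Allow() bool
    Success()
    Failure()
}

var ErrCircuitOpen = errors.New("circuit open")

const (
    requestTimeout = 1 * time.Second       // per-attempt timeout
    maxAttempts    = 2                     // 1 initial attempt + 1 retry
    baseDelay      = 50 * time.Millisecond // backoff unit
)

// CallService sketches the 'ideal' external call described above: context
// propagation, a per-request timeout, circuit-breaking and retries.
func CallService(ctx context.Context, breaker Breaker, req Request) (Response, error) {
    var resp Response
    var err error

    for attempt := 0; attempt < maxAttempts; attempt++ {
        // Step 1: fail fast if the caller has already given up.
        if ctxErr := ctx.Err(); ctxErr != nil {
            return Response{}, ctxErr
        }

        // Step 3: circuit-breaker guard.
        if !breaker.Allow() {
            return Response{}, ErrCircuitOpen
        }

        // Step 2: per-request timeout derived from the parent context.
        reqCtx, cancel := context.WithTimeout(ctx, requestTimeout)
        resp, err = doRemoteCall(reqCtx, req)
        cancel() // release this attempt's context before the next one

        switch {
        case err == nil:
            breaker.Success()
            return resp, nil
        case ctx.Err() != nil:
            // Upstream cancellation or deadline: return, but do not count it
            // against the circuit.
            return Response{}, ctx.Err()
        default:
            breaker.Failure()
        }

        // Step 4: retry only retryable errors, with exponential backoff.
        if !isRetryable(err) {
            break
        }
        if attempt+1 < maxAttempts {
            time.Sleep(baseDelay * time.Duration(1<<attempt))
        }
    }
    return resp, err
}

func doRemoteCall(ctx context.Context, req Request) (Response, error) {
    // The real network call (HTTP/gRPC) goes here and must honour ctx.
    return Response{}, nil
}

func isRetryable(err error) bool {
    // e.g. retry timeouts and 5xx responses, but not validation errors.
    return errors.Is(err, context.DeadlineExceeded)
}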

As a future enhancement, we could consider introducing a ‘minimum time per request’, which the retry loop should use to check for remaining time as well as ctx.Done() (but not instead – we need to account for client cancellation too). Of course metrics, logging and error handling should also be added as necessary.

Important Takeaways

To summarise, here are a few of the best practices for working with context timeouts:

Use SLAs and latency data to set effective timeouts

Having a default timeout value for everything doesn’t scale well. Use available information on SLAs and historic latency to set timeouts that give predictable results.

Understand the common error messages

The context canceled (context.Canceled) error occurs when the context is manually cancelled. This automatically cancels any child contexts attached to the parent. It is rare for this error to surface on the same service that triggered the cancellation; if cancel is called, it is usually because another error has been detected (such as a timeout) which would be returned instead. Therefore, context canceled is usually caused by an upstream error: either the client timed out and cancelled the request, or cancelled the request because it was no longer needed, or closed the connection (this typically results in a cancelled context from Go libraries).

The context deadline exceeded error occurs only when the time limit was reached. This could have been set locally (by the server processing the request) or by an upstream client. Unfortunately, it’s often difficult to distinguish between them, although they should generally be handled in the same way. If a more granular error is required, it is recommended to use child contexts and explicitly check them for ctx.Done(), as shown in our model implementation.
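For example, a small helper that distinguishes the two cases using the standard sentinel errors (errors.Is requires Go 1.13+; with older versions a direct equality check behaves the same for unwrapped errors):

package client

import (
    "context"
    "errors"
)

// classifyContextErr illustrates the two error cases described above.
func classifyContextErr(err error) string {
    switch {
    case errors.Is(err, context.Canceled):
        return "cancelled upstream: the client gave up or no longer needs the response"
    case errors.Is(err, context.DeadlineExceeded):
        return "deadline exceeded: a local request timeout or a propagated deadline was hit"
    default:
        return "not a context error"
    }
}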

Check for ctx.Done() before starting any significant work

Don’t enter an expensive block of code without checking the context; if the client has already given up, the work will be wasted.

Don’t open circuits for context errors

This leads to unpredictable behaviour, because there could be a number of reasons why the context might have been cancelled. Only context errors due to request timeouts originating from the local service should lead to circuit-breaker errors.

Set context timeouts at the edge service, using a cascading timeout budget

The upstream timeout must always be longer than the total downstream timeouts. Following this formula will help to avoid wasted effort and cascading failure.

In Conclusion

Go’s context package provides two extremely valuable tools that complement timeouts: deadline propagation and cancellation. This article has shown the benefits of using context timeouts and how to correctly configure them in a multi-server request path. Finally, we have discussed the relationship between context timeouts and circuit-breakers, proposing a model implementation for integrating them together in a common library.

If you have a Go server, chances are it’s already making heavy use of context. If you’re new to Go or had been confused by how context works, hopefully this article has helped to clarify misunderstandings. Otherwise, perhaps some of the topics covered will be useful in reviewing and improving your current context handling or circuit-breaker implementation.

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Post Syndicated from Grab Tech original https://engineering.grab.com/peak-shift-demand-travel-trends

Credits: Photo by rawpixel on Unsplash

 

Stuck in traffic in a Grab ride? Pass the time by opening your Grab app and checking out the Feed – just scroll down! You’ll find widgets for games, polls, videos, news and even food recommendations!

Beyond serving your everyday needs, we want to provide our users with information that is interesting, useful and relevant. That’s why we’re always coming up with new widgets.

Building each widget takes close collaboration across multiple teams – from Product Management to Design, Engineering, Behavioral Science, and Data Science and Analytics. Sounds like a lot of people, doesn’t it? But you’ll be surprised to hear that this behind-the-scenes collaboration works rapidly, usually in the span of one month – which means we often move from the ideation phase to product release in just a few weeks.

Travel Trends Widget

This fast-and-furious process is anchored on one word – “customer-centric”. And that’s how it all began with our “Travel Trends Widget” – a widget that provides passengers with an overview of historical supply and demand trends for their current location and nearby time periods.

Because we had so much fun developing this widget, we wanted to write a blog post to share with you what we did and how we did it!

Inspiration: Where it all started

Transport demand can be rather lumpy. Owing to organic patterns (e.g. office hours), a lot of passengers tend to request cars around the same time. In periods like this, the increase in demand can outpace the arrival of driver supply, increasing the waiting time for passengers.

Our goal at Grab is to make sure people get a ride when they want it and at the price they want, so we got to thinking about how we can ease this friction by leveraging our treasure trove – Big Data! – to help our passengers better plan their trips.

As we were looking at the data, we noticed that there is a seasonality to demand and supply: at certain times and days, imbalances appear, peak and disappear, and the process repeats itself. Studies say that humans in general, unless shown a compelling reason or benefit for change, are habitual beings subject to inertia. So we set out to achieve exactly that: To create a widget to surface information to our passengers that may help them alter their decisions on when they choose to book a ride, thereby redistributing some of the present peak demands to periods just before and after peak – also known as “peak shifting the demand”!

While this widget is the first-of-its-kind in the ride-hailing industry, “peak-shifting” was actually coined and introduced long ago!

London Transport Museum Trends

As you can see from this post from the London Transport Museum (Source: Transport for London), the London tube tried peak-shifting long before anyone else: the original ad from 1928 is displayed on the left, and an ad from 2015 is displayed on the right, comparing the trends to 1928.

Trends from a hotel in Beijing

You may also have seen something similar at the last hotel you stayed at. Notice here a poster in an elevator at a Beijing hotel, announcing the best times to eat breakfast in comfort and avoid the crowd. (Photo credits to Prashant, our Product Manager, who saw this on holiday.)

To apply “peak-shifting” and help our users better plan their trips, we decided to dig in and leverage our data. It was way more complex than we had initially thought, as market conditions can differ from day to day. This meant that generic statements like “5PM-8PM are peak hours and prices will be high” would not hold true. Contrary to general perception, we observed that even during peak hours, there are buckets of time when there is no surge or low surge.

For instance, plots 1 and 2 below show what a typical Monday and Tuesday surge look like in a given month. One of the key insights is that the surge trends during peak hours differ between Monday and Tuesday. This reinforces our initial hypothesis that every day is unique.

So we used machine learning techniques to build a forecasting widget which can help our users and give them the power to plan their trips beforehand. This widget is able to provide the pricing trends for the next 2 hours. So with a bit of flexibility, riders can ride the tide!

Grab trends

So how exactly does this widget work?!

Historical trends for Monday

It pulls together historically observed imbalances between supply and demand for the consumer’s current location and nearby time periods. Aggregated data is displayed to consumers in easily interpreted visualisations, so that they can plan to leave at times when there is more supply, and potentially save on fares.

How did we build the widget? Loop, agile working process, POC & workstream

Widget-building is an agile, collaborative, and simultaneous process. First, we started with analysis from the Product Analytics team, pulling out data on traffic trends, surge patterns, and behavioral insights of both passengers and drivers in Singapore.

When we noticed the existence of seasonality for each day of the week, we came up with more precise analytical and business questions to dig deeper into the data. Upon verification of our hypotheses, we decided to build a widget.

The Behavioural Science, UX (User Experience) Design and Product Management teams then joined in and started giving shape to the problem we were solving. Our Behavioural Scientists shared their expertise on how information, suggestions and choices should be presented to enable easy assimilation and beneficial action. Daily whiteboarding breakouts, endless back-and-forth conversations, and a healthy amount of challenge-and-accept culture ensured that we distilled the idea down to its core. We then presented the relevant information with just the right level of detail, and with the right amount of messaging, to allow users to take the intended action, i.e. shift their demand outside of peak periods if possible.

Our amazing regional Copywriting team then swung in to put our intent into words in 7 different languages for our users across South-East Asia. Simultaneously, our UX designers and Full-stack Engineers started exploring the best visual components to communicate data on time trends to users. More on this later, but suffice to say that plenty of ideas were explored and discarded in a collaborative process, which aimed to create something that’s intuitive and engaging while being robust and scalable to work across all types of devices.

While these designs made their way up to engineering, the Data Science team worked on finding the most rigorous method to deduce the historical trend of surge across all our cities and areas, and time periods within them. There were discussions on how to best store and update this data reliably so that the widget itself can access it with great performance.

Soon after, we went into the development process, and voila! We had the first iteration of the widget ready on our staging (internal testing) servers in just 2 weeks! This prototype was opened up to the core team for influx of feedback.

And just two weeks later, the widget made its way to our Singapore and Jakarta Feeds, accessible to the world at large! Feedback from our users started pouring in almost immediately (thanks to the rich feedback functionality that comes with each widget), ranging from great to sometimes not-so-great, and we listened to all of it with a keen ear! And thus began a new cycle of iterations and continuous improvement, more of which we will share in a subsequent post.

In the trenches with the creators: How multiple teams got together to make this come true

Various disciplines within our cross-functional team came together to whip up this widget, each contributing their expertise to the end product.

Using Behavioural Science to simplify choices and design good outcomes

Behavioural Science helped to explore many facets of consumer behaviour in order to plan and design the widget: understanding how consumers think and conceptualizing a widget that can be easily understood and used by the consumers.

While fares are governed entirely by market conditions, it’s important for us to explain the economics to customers. As a customer-centric company, we aim to make the consumers feel like they own their decisions, which they can take based on full information. And this is the role of Behavioral Scientists at Grab!

In guiding the customers through the information, the Behavioural Science team had the following three objectives in mind while building this Travel Trends widget:

  1. Offer transparency on the fares: By exposing our historic surge levels for a 4 hour period, we wanted to ensure that the passenger is aware of the surge levels and does not treat the fare as a nasty shock.
  2. Give information that helps them plan: By showing them surge levels for the future 2 hours, we wanted to help customers who have the flexibility, plan for a better time, hence, giving them the power to decide based on transparent information.
  3. Provide helpful tips: Every bar gives users tips on the conditions at that time and the immediate future. For instance, a low surge bar, followed by a high surge bar gives the tip “Psst… Leave now, It might get busy later!”, helping people understand the graph better and nudging them to take an action. If you are interested in saving fares, may we suggest tapping around all the bars to reveal the secret pro-tips?

Designing interfaces that lead to consumer success by abstracting complexity

The Design team is the one behind the colors and shapes that make up the widget you see and interact with! The team took inspiration from Google’s Popular Times.

Source/Credits: Google Live Popular Times

 

Right from the outset, our content and product teams were keen to surface additional information and actions with each bar to keep the widget interactive and useful. One of the early challenges was to arrive at the right gesture that invites the user to interact with and intuitively navigate the bars on the widget, but also does not conflict with other gestures (eg scrolling and scrubbing) that the user was pre-trained to perform on the feed. We found that tapping was simultaneously an unused and yet intuitive gesture that we could use for interaction with the bars.

We then went into rounds of iteration on the visual design of the widget. In this process, multiple stakeholders were involved ranging from Product to Content to Engineering. We had to overcome a number of constraints i.e. the limited canvas of a widget and the context of a user when she is exploring the feed. By re-using existing libraries and components, we managed to keep the development light and ship something fast.

GrabCar trends near you

Dozens of revisions and four iterations later, we landed on a design that we felt equipped the feature for its user-facing goal, and did so in a manner that was aesthetically appealing!

And finally we managed to deliver on the feature’s goal, by surfacing just the right level of detail in a manner that is intuitive yet effective in peak-shifting demand.

Bringing all of this to fruition through high performance engineering

Our Development Engineering team was in charge of developing the widget and making it available to our users in just a few weeks’ time – materialising the work of the other teams.

One of their challenges was to find the best way to process the vast amount of data (millions of database entries) so it could be visualised simply as bar charts. Grab’s engineers had to achieve this while keeping the feature performant and resilient.

There were two options in doing this:

a) Fetch the data directly from the DB for each API call; or

b) Store the data in an in-memory data structure that is refreshed periodically, so that API calls no longer have to hit the DB.

Since this feature was expected to receive a lot of traffic and thus a high QPS, we decided that the former option would be too costly. Ultimately, we chose the latter option since it is more performant and more scalable.
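As a rough illustration of the chosen approach (not Grab's actual implementation), a periodically refreshed in-memory snapshot might look like this in Go; the key/value shapes and refresh interval are made up.

package trends

import (
    "context"
    "sync"
    "time"
)

// Cache holds a periodically refreshed snapshot of pre-aggregated trend data,
// so API handlers read from memory instead of hitting the database.
type Cache struct {
    mu   sync.RWMutex
    data map[string][]int // e.g. area key -> surge level per time bucket
}

// StartRefreshing loads a fresh snapshot immediately and then on every tick,
// until ctx is cancelled. loadFromDB stands in for the real aggregation query.
func (c *Cache) StartRefreshing(ctx context.Context, loadFromDB func() (map[string][]int, error), every time.Duration) {
    go func() {
        ticker := time.NewTicker(every)
        defer ticker.Stop()
        for {
            if snapshot, err := loadFromDB(); err == nil {
                c.mu.Lock()
                c.data = snapshot
                c.mu.Unlock()
            }
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
            }
        }
    }()
}

// Get serves reads from the in-memory snapshot.
func (c *Cache) Get(areaKey string) []int {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.data[areaKey]
}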

At the frontend, the challenge was to cater to the intricate requests from our designers. We use chart libraries to increase our development speed, but not all of the requirements were readily supported by these libraries.

For instance, a library might make visualising charts easy, but not customising them. When our designers wanted an average line in dotted form, the library did not support this easily. The moving arrow pointers as you move between bars, and the bars changing colour when tapped – all of this required countless CSS tweaks.

CSS tweak on trends widget

Closing the product loop with user feedback and data driven insights

One of the most crucial parts of launching any product is to ensure that customers are engaging with the widget and finding it useful.

To understand what customers think about the widget, whether they find it useful and whether it is helping them to plan better, we delved into the huge mine of clickstream data.

User feedback on the trends widget

We found that 1 in 3 users who make a booking every day interact with the widget, and of these users, more than 70% have given the widget a positive rating. This validates our initial hypothesis that, given the option, our customers love the freedom to plan their trips and value a more transparent ecosystem.

These users also indicated what they like most about the widget: 61% gave a positive rating for usefulness, 20% were impressed by the design (kudos to our fantastic designer Ajmal!) and 13% by its usability.

Tweet about the widget

Beyond internal data, our widget made some rounds on social media channels. For example, here is a screenshot of what our users have to say on Twitter.

We closely track these metrics on user engagement and feedback to ensure that we keep improving and coming up with new iterations that help us serve our customers better.

Conclusion

We hope you enjoyed reading about how we went from ideation, through iterations to a finished widget in the hands of the user, all in 1 month! Many hands helped along the way. If you are interested in joining this hyper-proactive problem-solving team, please check out Grab’s career site!

And if you have feedback for us, we are here to listen! While we cannot be happier to see some positive reaction from the public, we are also thrilled to hear your suggestions and advice. Please leave us a memo using the Widget’s comment function!

Epilogue

We just released an upgrade to this widget which allows users to set reminders and be notified about availability of good fares in a time period of their choosing. We will keep a watch and come knocking! Go ahead, find the widget on your Grab feed, set a reminder and save on fares on your next ride!

Vulcanizer: a library for operating Elasticsearch

Post Syndicated from GitHub Engineering original https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/

At GitHub, we use Elasticsearch as the main technology backing our search services. In order to administer our clusters, we use ChatOps via Hubot. As of 2017, those commands were a collection of Bash and Ruby-based scripts.

Although this served our needs for a time, it was becoming increasingly apparent that these scripts lacked composability and reusability. It was also difficult to contribute back to the community by open sourcing any of these scripts, because they were specific to bespoke GitHub infrastructure.

Why build something new?

There are plenty of excellent Elasticsearch libraries, both official and community driven. For Ruby, GitHub has already released the Elastomer library and for Go we make use of the Elastic library by user olivere. However, these libraries focus primarily on indexing and querying data. This is exactly what an application needs to use Elasticsearch, but it’s not the same set of tools that operators of an Elasticsearch cluster need. We wanted a high-level API that corresponded to the common operations we took on a cluster, such as disabling allocation or draining the shards from a node. Our goal was a library that focused on these administrative operations and that our existing tooling could easily use.

Full speed ahead with Go…

We started looking into Go and were inspired by GitHub’s success with freno and orchestrator.

Go’s structure encourages the construction of composable software (self-contained, stateless components that can be selected and assembled), and we saw it as a good fit for this application.

… Into a wall

We initially scoped the project out to be a packaged chat app and planned to open source only what we were using internally. During implementation, however, we ran into a few problems:

  • GitHub uses a simple protocol based on JSON-RPC over HTTPS called ChatOps RPC. However, ChatOps RPC is not widely adopted outside of GitHub. This would make integration of our application into ChatOps infrastructure difficult for most parties.
  • The internal REST library our ChatOps commands relied on was not open sourced. Some of the dependencies of this REST library would also need to be open sourced. We’ve started the process of open sourcing this library and its dependencies, but it will take some time.
  • We relied on Consul for service discovery, which not everyone uses.

Based on these factors we decided to break out the core of our library into a separate package that we could open source. This would decouple the package from our internal libraries, Consul, and ChatOps RPC.

The package would only have a few goals:

  • Access the REST endpoints on a single host.
  • Perform an action.
  • Provide results of the action.

This module could then be open sourced without being tied to our internal infrastructure, so that anyone could use it with the ChatOps infrastructure, service discovery, or tooling they choose.

To that end, we wrote vulcanizer.

Vulcanizer

Vulcanizer is a Go library for interacting with an Elasticsearch cluster. It is not meant to be a full-fledged Elasticsearch client. Its goal is to provide a high-level API to help with common tasks that are associated with operating an Elasticsearch cluster such as querying health status of the cluster, migrating data off of nodes, updating cluster settings, and more.

Examples of the Go API

Elasticsearch is great in that almost all things you’d want to accomplish can be done via its HTTP interface, but you don’t want to write JSON by hand, especially during an incident. Below are a few examples of how we use Vulcanizer for common tasks and the equivalent curl commands. The Go examples are simplified and don’t show error handling.

Getting nodes of a cluster

You’ll often want to list the nodes in your cluster to pick out a specific node or to see how many nodes of each type you have in the cluster.

$ curl localhost:9200/_cat/nodes?h=master,role,name,ip,id,jdk
- mdi vulcanizer-node-123 172.0.0.1 xGIs 1.8.0_191
* mdi vulcanizer-node-456 172.0.0.2 RCVG 1.8.0_191

Vulcanizer exposes typed structs for these types of objects.

v := vulcanizer.NewClient("localhost", 9200)

nodes, err := v.GetNodes()

fmt.Printf("Node information: %#v\n", nodes[0])
// Node information: vulcanizer.Node{Name:"vulcanizer-node-123", Ip:"172.0.0.1", Id:"xGIs", Role:"mdi", Master:"-", Jdk:"1.8.0_191"}

Update the max recovery cluster setting

The index recovery speed is a common setting to update when you want balance time to recovery and I/O pressure across your cluster. The curl version has a lot of JSON to write.

$ curl -XPUT localhost:9200/_cluster/settings -d '{ "transient": { "indices.recovery.max_bytes_per_sec": "1000mb" } }'
{
  "acknowledged": true,
  "persistent": {},
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "1000mb"
      }
    }
  }
}

The Vulcanizer API is fairly simple and will also retrieve and return any existing setting for that key so that you can record the previous value.

v := vulcanizer.NewClient("localhost", 9200)
oldSetting, newSetting, err := v.SetSetting("indices.recovery.max_bytes_per_sec", "1000mb")
// "50mb", "1000mb", nil

Move shards on to and off of a node

To safely update a node, you can set allocation rules so that data is migrated off a specific node. In the Elasticsearch settings, this is a comma-separated list of node names, so you’ll need to be careful not to overwrite an existing value when updating it.

$ curl -XPUT localhost:9200/_cluster/settings -d '
{
  "transient" : {
    "cluster.routing.allocation.exclude._name" : "vulcanizer-node-123,vulcanizer-node-456"
  }
}'

The Vulcanizer API will safely add or remove nodes from the exclude settings so that shards won’t allocate on to a node unexpectedly.

v := vulcanizer.NewClient("localhost", 9200)

// Existing exclusion settings:
// vulcanizer-node-123,vulcanizer-node-456

exclusionSettings1, err := v.DrainServer("vulcanizer-node-789")
// vulcanizer-node-123,vulcanizer-node-456,vulcanizer-node-789

exclusionSettings2, err := v.FillOneServer("vulcanizer-node-456")
// vulcanizer-node-123,vulcanizer-node-789

Command-line application

Included is a small CLI application that leverages the library:

$ vulcanizer -h
Usage:
  vulcanizer [command]

Available Commands:
  allocation  Set shard allocation on the cluster.
  drain       Drain a server or see what servers are draining.
  fill        Fill servers with data, removing shard allocation exclusion rules.
  health      Display the health of the cluster.
  help        Help about any command
  indices     Display the indices of the cluster.
  nodes       Display the nodes of the cluster.
  setting     Interact with cluster settings.
  settings    Display all the settings of the cluster.
  snapshot    Interact with a specific snapshot.
  snapshots   Display the snapshots of the cluster.

Flags:
  -c, --cluster string      Cluster to connect to defined in config file
  -f, --configFile string   Configuration file to read in (default to "~/.vulcanizer.yaml")
  -h, --help                help for vulcanizer
      --host string         Host to connect to (default "localhost")
  -p, --port int            Port to connect to (default 9200)

Use "vulcanizer [command] --help" for more information about a command.

ChatOps

ChatOps is important for GitHub and our geographically distributed workforce. Vulcanizer enables us to build ChatOps tooling around Elasticsearch quickly and easily for common tasks:

jessbreckenridge:

.vulcan-go health testbed
Hubot:

The cluster is green: All primary and replica shards are allocated. This does NOT mean the cluster is otherwise healthy.
+--------+------------+------+------------+----------+
| STATUS | RELOCATING | INIT | UNASSIGNED | ACTIVE % |
+--------+------------+------+------------+----------+
| green  | 0          | 0    | 0          | 100      |
+--------+------------+------+------------+----------+
jessbreckenridge:

.vulcan-go nodes testbed
Hubot:

testbed nodes
+--------+------+------------------------+-----------+-----------+
| MASTER | ROLE |          NAME          |    JDK    |      IP   |
+--------+------+------------------------+-----------+-----------+
| *      | mi   | search-arbiter-aaaaaaa | 1.8.0_111 | 172.0.0.1 |
| -      | mi   | search-arbiter-bbbbbbb | 1.8.0_111 | 172.0.0.2 |
| -      | mi   | search-arbiter-ccccccc | 1.8.0_111 | 172.0.0.3 |
| -      | di   | search-storage-ddddddd | 1.8.0_111 | 172.0.0.4 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.5 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.6 |
+--------+------+------------------------+-----------+-----------+
nickcanz:

.vulcan-go drain status testbed
Hubot:

IP's filtered: None
Name Excluded: None
Host Excluded: None
nickcanz:

.vulcan-go drain server search-storage-zzzzzzz testbed
Hubot:

Server "search-storage-zzzzzzz" added to list of servers to drain.

Complete list servers draining:
    Ips: None
    Names: search-storage-zzzzzzz
    Hosts: None
nickcanz:

.vulcan-go settings testbed
Hubot:

testbed settings
No persistent settings are set.
+------------------------------------------+------------------------+
|            TRANSIENT SETTINGS            |         VALUE          |
+------------------------------------------+------------------------+
| cluster.routing.allocation.exclude._name | search-storage-zzzzzzz |
+------------------------------------------+------------------------+
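To give a feel for how commands like the ones in the transcript above come together, here is a rough sketch of the kind of handler that could sit behind a ".vulcan-go nodes" command. The chat framework, cluster lookup, and output formatting are assumptions on our part; only NewClient, GetNodes, and the Node fields come from the examples earlier in this post, and the import path is assumed from the public repository.

// Hypothetical sketch of a ChatOps-style command handler built on Vulcanizer.
package main

import (
	"fmt"

	"github.com/github/vulcanizer" // import path assumed
)

// nodesCommand returns a chat-friendly summary of the nodes in a cluster.
func nodesCommand(host string, port int) (string, error) {
	v := vulcanizer.NewClient(host, port)

	nodes, err := v.GetNodes()
	if err != nil {
		return "", err
	}

	out := fmt.Sprintf("%d nodes:\n", len(nodes))
	for _, n := range nodes {
		out += fmt.Sprintf("master=%s role=%s name=%s ip=%s jdk=%s\n",
			n.Master, n.Role, n.Name, n.Ip, n.Jdk)
	}
	return out, nil
}

func main() {
	summary, err := nodesCommand("localhost", 9200)
	if err != nil {
		panic(err)
	}
	fmt.Println(summary)
}

From there, wiring the function into Hubot or any other ChatOps framework is mostly plumbing.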

Closing

We stumbled a bit when we first started down this path, but the end result is best for everyone:

  • Having to regroup on exactly what functionality we wanted to open source forced us to make sure we were providing value to ourselves and the community, instead of just shipping something.
  • Internal tooling doesn’t always follow engineering best practices like proper release management, so developing Vulcanizer in the open provides an external pressure to make sure we follow all of the best practices.
  • Having all of the Elasticsearch functionality in its own library allows our internal applications to be very slim and isolated. Our different internal applications have a clear dependency on Vulcanizer instead of having different internal applications depend on each other or worse, trying to get ChatOps to talk to other ChatOps.

Visit the Vulcanizer repository to clone or contribute to the project. We have ideas for future development in the Vulcanizer roadmap.

The post Vulcanizer: a library for operating Elasticsearch appeared first on The GitHub Blog.

Structured Logging: The Best Friend You’ll Want When Things Go Wrong

Post Syndicated from Grab Tech original https://engineering.grab.com/structured-logging

Introduction

Every day, millions of people around Southeast Asia count on Grab to get themselves or what they need from point A to B in a safe, comfortable and reliable manner. In fact, we recently crossed our 3 billion transport rides milestone, gaining the last billion in a mere 6 months!

We take this responsibility very seriously, and as we continue to grow and expand, it’s important for us to maintain a sophisticated backend system capable of sustaining the scale needed to support all our customers in Southeast Asia. This backend system comprises multiple services that interact with each other in many different ways. As Grab evolves, maintaining them becomes a significantly larger and harder task as developers continuously develop new features.

To maintain these systems well, it’s important to have better observability: data that helps us understand what is happening in the system through good monitoring (metrics), event logs, and tracing for request-scoped data. Of these, logs provide the most complete picture of what happened within the system – and are typically the first and most frequently consulted point of contact. With good logs, the backend becomes much easier to understand, maintain, and debug. Without logs, or with bad logs, we have a recipe for disaster, making it nearly impossible to understand what’s happening.

In this article, we focus on a form of logging called structured logging. We discuss what it is, why it is better, and how we built a framework that integrates well with our current Elastic stack-based logging backend, allowing us to log better and more efficiently.

Structured Logging is a part of a larger endeavour which will enable us to reduce the Mean Time To Resolve (MTTR), helping developers to mitigate issues faster when outages happen.

What are Logs?

Logs are lines of text containing some information about an event that occurred in our system, and they serve a crucial function of helping us understand what’s happening in the backend. Logs are usually placed at points in the code where a significant event has happened (for example, a database operation succeeded or a passenger got assigned to a driver) or at any other place in the code that we are interested in observing.

The first thing that a developer would normally do when an error is reported is check the logs – sort of like walking through the history of the system and finding out what happened. Therefore, logs can be a developer’s best friend in times of service outages, errors, and failed builds.

Logs in today’s world have varying formats and features.

  • Log Format: These range from simple key-value based (like syslog) to quite structured and detailed (like JSON). Since logs are mostly meant for developer eyes, how detailed or structured a log is dictates how fast the developer can query and read it. The more structured the data, the larger the size per log line – but also the more queryable it is and the richer the information it contains.
  • Levelled Logging (or Log Levels): Logs with different severities can be logged at different levels. The visibility can be limited to a single level, limiting all logs only with a certain severity or above (for example, only logs WARN and above). Usually log levels are static in production environments, and finding DEBUG logs usually requires redeploying.
  • Log Aggregation Backend: Logs can have different log aggregation backends, which means different backends (i.e. Splunk, Kibana, etc.) decide what your logs might look like or what you might be able to do with them. Some might cost a lot more than others.
  • Causal Ordering: Logs might or might not preserve the exact time in which they are written. This is important, as how exact the time is dictates how accurately we can predict the sequence of events via logs.
  • Log Correlation: We serve countless requests from our backend services. Being able to see all the logs relevant to a particular request or a particular event helps us drill down to relevant  information for a specific request (e.g. for a specific passenger trying to book a ride).

Combine this with the plethora of logging libraries available and you can easily end up with a developer holding their head in confusion, unable to decide what to use. Each library also has its own set of advantages and disadvantages, so the discussion can quickly become subjective and polarized – it is therefore crucial to choose the appropriate library and backend pair for your applications.

We at Grab use different types of logging libraries. However, as requirements changed, we found ourselves re-evaluating our logging strategy.

The State of Logging at Grab

The number of Golang services at Grab has continuously grown. Most services used syslog-style key-value format logs, recognized as the most common format of logs for server-side applications due to its simplicity and ease for reading and writing. All these logs were made possible by a handful of common libraries, which were directly imported and used by different services.

We used a cloud-based SaaS vendor as a frontend for these logs, where application-emitted logs were routed to files and sent to our logging vendor, making it possible to view and query them in real time. Things were pretty great and frictionless for a long time.

However, as time went by, our logging bills started mounting to unprecedented levels and we found ourselves revisiting and re-evaluating how we did logging. A few issues surfaced:

  • Logging volume reduction efforts were successful to some extent – but were arduous and painful. Part of the reason was that almost all the logs were at a single log level – INFO.
Figure 1: Log Level Usage

This issue was not limited to a single service, but was pervasive across services. For mitigation, some services added sampling to their logs; some removed logs altogether. The latter is a recipe for disaster, so it was clear that we had to improve levelled logging.

  • The vendor was expensive for us at the time and also had a few concerns – primarily with limitations around DSL (query language). There were many good open source alternatives available – Elastic stack to name one. Our engineers felt confident that we could probably manage our logging infrastructure and manage the costs better – which led to the proposal and building of Elastic stack logging cluster. Elasticsearch is vastly more powerful and rich than our vendor at the time and our current libraries weren’t enough to fully leverage its capabilities, so we needed a library which can leverage structure in logs better and easily integrate with Elastic stack.
  • There were some minor issues in our logging libraries namely:
    • Singleton initialisation pattern that made unit-testing harder
    • Single logger interface that reduced the possibility of extending the core logging functionality as almost all the services imported the logger interface directly
    • No out-of-the-box support for multiple writers
  • If we were to write a library, we had to fix these issues – and also encourage usage of best practices.

  • Grab’s critical path (the number of services traversed by a single booking flow request) has grown in size. On average, a single booking request touches multiple microservices – each of which does something different. At the large scale at which we operate, it’s therefore necessary to easily view logs from all the services for a single request – however, this was not something the library did automatically. Hence, we also wanted to make log correlation easier and better.
  • Logs are events which happened at some point of time. The order in which these events occurred gives us a complete history of what happened in the system. However, the core logging library which formed the base of the logging across our Golang services didn’t preserve the log generation time (it instead used write time). This led to jumbling of logs which are generated in a span of a few microseconds – which not only makes the lives of our developers harder, but makes it near impossible to get an exact history of the system. This is why we wanted to also improve and enable causal ordering of logs – one of the key steps in understanding what’s happening in the system.

Why Change?

As mentioned, we knew there were issues with how we were logging. To best approach the problem and be able to solve it as much as possible without affecting existing infrastructure and services, it was decided to bootstrap a new library from the ground up. This library would solve known issues, as well as contain features which would not have been possible by modifying existing libraries. For a recap, here’s what we wanted to solve:

  • Improve levelled logging
  • Leverage structure in logs better
  • Easily integrate with Elastic stack
  • Encourage usage of best practices
  • Make log correlation easier and better
  • Improve and enable causal ordering of logs for a better understanding of the sequence of events in the system

Enter structured logging. Structured logging has found widespread adoption across the industry. It integrates easily with our Elastic stack backend and solves most of our pain points.

Structured Logging

Keeping our previous problems and requirements in mind, we bootstrapped a library in Golang, which has the following features:

Dynamic Log Levels

This allows us to change our initialized log levels at runtime from a configuration management system – something that was neither possible nor encouraged before.

This makes log levels much more meaningful: developers can deploy with the usual WARN or INFO log levels and, when things go wrong, update the level to DEBUG with just a configuration change to make their services output more logs while debugging. This also helps us keep our logging costs in check. We made integrating this with our configuration management system easy and straightforward.
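As a rough illustration of the idea (not our actual library), a logger can keep its minimum level in an atomic value so that a configuration management callback can flip it at runtime without redeploying:

// A minimal sketch of runtime-adjustable log levels using only the standard
// library; the configuration management hook itself is not shown.
package main

import (
	"fmt"
	"sync/atomic"
)

type Level int32

const (
	DebugLevel Level = iota
	InfoLevel
	WarnLevel
)

type Logger struct {
	level atomic.Int32 // current minimum level, safe to change from any goroutine
}

// SetLevel can be wired to a configuration management callback so the level
// changes at runtime.
func (l *Logger) SetLevel(lvl Level) { l.level.Store(int32(lvl)) }

func (l *Logger) log(lvl Level, msg string) {
	if int32(lvl) < l.level.Load() {
		return // below the configured threshold, skip the write
	}
	fmt.Println(msg)
}

func (l *Logger) Debug(msg string) { l.log(DebugLevel, msg) }
func (l *Logger) Info(msg string)  { l.log(InfoLevel, msg) }

func main() {
	logger := &Logger{}
	logger.SetLevel(WarnLevel) // deploy with WARN
	logger.Debug("not emitted")
	logger.SetLevel(DebugLevel) // flip to DEBUG from config during an incident
	logger.Debug("now emitted")
}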

Consistent Structure in Logs

Logs are inherently semi-structured – unlike a database schema, which is rigid, or freeform text, which has no structure at all. Our Elastic stack backend is primarily based on indices (somewhat like tables) with mappings (somewhat like a loose schema). For this, we needed to output logs in JSON with a consistent structure (for example, we cannot output an integer and a string under the same JSON field, because that would cause an indexing failure in Elasticsearch). We were also aware that one of our primary goals was keeping logging costs in check, and since it didn’t make sense to structure and index almost every field, we only added the structure that is useful to us.

To address this, we built a utility that allows us to add structure to our logs deterministically. It is built on top of a schema in which we can declare key-value pairs with a specific key name and type, generate code based on that schema, and use the generated code to make sure that fields are consistently formatted and don’t break. We call this schema (a collection of key name and type pairs) the Common Grab Log Schema (CGLS). We only add structure to CGLS that is important to us – everything included in CGLS gets formatted into its own field, and everything else goes into a single catch-all field in the generated JSON. This helps keep our structure consistent and easily usable with the Elastic stack.
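As a simplified, hypothetical illustration of the approach (the field names below are made up and are not the real CGLS), the generated code boils down to a struct whose schema'd fields map to dedicated, consistently typed JSON keys, with everything else folded into a single catch-all field:

// Illustrative only: a log entry with a few indexed fields and one catch-all.
package main

import (
	"encoding/json"
	"fmt"
)

type Entry struct {
	Timestamp int64             `json:"ts"`
	Level     string            `json:"level"`
	RequestID string            `json:"request_id"`
	Message   string            `json:"msg"`
	Extra     map[string]string `json:"extra,omitempty"` // unindexed catch-all
}

func main() {
	e := Entry{
		Timestamp: 1554102896123456789,
		Level:     "INFO",
		RequestID: "req-123",
		Message:   "driver assigned",
		Extra:     map[string]string{"vehicle": "car"},
	}
	b, _ := json.Marshal(e)
	fmt.Println(string(b))
}

Because the indexed keys always carry the same names and types, Elasticsearch mappings stay stable as services evolve.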

Figure 2: Overview of Common Grab Log Schema for Golang backend services

Plug and Play support with Grab-Kit

We made initialization and usage easy and out-of-the-box with our in-house support for Grab-Kit, so developers can just use it without making any drastic changes. As part of this integration, we also added automatic log correlation based on the request IDs present in traces, which ensures that all the logs generated for a particular request already carry that trace ID.
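The following is a minimal sketch of what request-scoped log correlation looks like in plain Go, assuming a middleware that pulls the request ID from an upstream tracing header (the header name is an assumption); Grab-Kit’s actual integration is more involved:

// Sketch: propagate a request ID through context and stamp it on every log line.
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
)

type ctxKey struct{}

// withRequestID would normally read the ID from tracing headers injected upstream.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID") // assumed header name
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// logf stamps every line with the request ID so all logs for one request can
// be pulled up together in Kibana.
func logf(ctx context.Context, format string, args ...interface{}) {
	id, _ := ctx.Value(ctxKey{}).(string)
	log.Printf("request_id=%s "+format, append([]interface{}{id}, args...)...)
}

func main() {
	h := withRequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		logf(r.Context(), "booking request received")
		fmt.Fprintln(w, "ok")
	}))
	http.ListenAndServe(":8080", h)
}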

Configurable Log Format

Our primary requirement was building a logger expressive and consistent enough to integrate well with the Elastic stack backend – without fancy log parsing downstream. The library is therefore expressive and configurable enough to allow any log format (we can write different log formats for different use cases – for example, a readable format in development settings and JSON output in production settings), with JSON output as the default. This ensures that we produce log output compatible with the Elastic stack, while remaining configurable enough for other use cases.

Support for Multiple Writes with Different Formats

As part of extending the library’s functionality, we needed enough configurability to send different logs to different places with different settings – for example, sending FATAL logs to Slack asynchronously in a readable format, while sending all the usual logs to our Elastic stack backend. The library supports chaining such “cores” to any arbitrary degree, so it can be used in these highly specialized cases as well.
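Here is a hedged sketch of the chained-core idea, with illustrative names rather than our real API: each core pairs a formatter, a destination, and a minimum level, and the logger fans every entry out to all cores that accept it.

// Sketch: a logger that writes each entry to several "cores" with their own
// format, destination, and level threshold.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"os"
)

type Formatter interface {
	Format(level, msg string) string
}

type jsonFormat struct{}

func (jsonFormat) Format(level, msg string) string {
	b, _ := json.Marshal(map[string]string{"level": level, "msg": msg})
	return string(b)
}

type readableFormat struct{}

func (readableFormat) Format(level, msg string) string {
	return fmt.Sprintf("[%s] %s", level, msg)
}

// core sends formatted lines at or above its minimum level to one writer.
type core struct {
	minLevel int
	format   Formatter
	out      io.Writer
}

type Logger struct{ cores []core }

func (l *Logger) log(level int, name, msg string) {
	for _, c := range l.cores {
		if level >= c.minLevel {
			fmt.Fprintln(c.out, c.format.Format(name, msg))
		}
	}
}

func main() {
	// All logs go to stdout as JSON (for the Elastic stack); FATAL-level logs
	// additionally go to a second writer (a stand-in for a Slack sender) in a
	// readable format.
	l := &Logger{cores: []core{
		{minLevel: 0, format: jsonFormat{}, out: os.Stdout},
		{minLevel: 4, format: readableFormat{}, out: os.Stderr},
	}}
	l.log(1, "INFO", "booking created")
	l.log(4, "FATAL", "payment service unreachable")
}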

Production-like Logging Environment in Development

Developers have been looking at console logs since the dawn of time, but structured JSON logs – normally reserved for production – are far more searchable. To bring this power into development and let developers see their logs in Kibana directly, we provide a dockerized Kibana that can be spun up locally to accept structured logs. Developers get to work with structured logs and browse them in Kibana – just like in production!

Having this library enabled us to do logging in a much better way. The most noticeable impact was that our simple access logs can now be queried better – with more filters and conditions.

Figure 3: Production-like Logging Environment in Development

Causal Ordering

Having an exact history of events makes debugging issues in production systems easier – one can just look at the history, quickly form a hypothesis about what’s wrong, and fix it. To this end, the structured logging library records an exact, nanosecond-precision timestamp at the moment the log call is made. Combined with the structured JSON format, this makes it possible to sort all the logs by this field, so we can see logs in the exact order in which they happened – achieving causal ordering in logs. This is an underplayed but highly powerful feature that makes debugging easier.
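The mechanism itself is simple; the sketch below (a toy example, not the library) records a nanosecond timestamp when the log call is made and sorts by it to recover the exact order of events:

// Sketch: nanosecond timestamps restore causal order even if entries are
// written or shipped out of order.
package main

import (
	"fmt"
	"sort"
	"time"
)

type entry struct {
	TsNano  int64  `json:"ts_nano"`
	Message string `json:"msg"`
}

func newEntry(msg string) entry {
	return entry{TsNano: time.Now().UnixNano(), Message: msg}
}

func main() {
	logs := []entry{newEntry("assign driver"), newEntry("notify passenger")}
	sort.Slice(logs, func(i, j int) bool { return logs[i].TsNano < logs[j].TsNano })
	fmt.Println(logs)
}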

Figure 4: Causal ordering of logs with Y’ALL

But Why Structured Logging?

Now that you know about the history and the reasons behind our logging strategy, let’s discuss the benefits that you reap from it.

At the outset, having well-defined, structured logs (like JSON) has multiple benefits, including but not limited to:

  • Better root cause analysis: With structured logs, we can ingest and perform more powerful queries which won’t be possible with simple unstructured logs. Developers can do more informative queries on finding the logs which are relevant to the situation. Not only this, log correlation and causal ordering make it possible to gain a better understanding of the distributed logs. Unlike unstructured data, where we are only limited to full-text or a handful of log types, structured logs take the possibility to a whole new level.
  • More transparency or better observability: With structured logs, you increase the visibility of what is happening with your system – since now you can log information in a better, more expressive way. This enables you to have a more transparent view of what is happening in the system and makes your systems easier to maintain and debug over longer periods of time.
  • Better consistency: With structured logs, you increase the structure present in your logs – and in turn make your logs more consistent as the systems evolve. This allows us to index our logs in a system like the Elastic stack more easily, as we can be sure that we are sticking to some structure. And with the adoption of a common schema, we can rest assured that we are all using the same structure.
  • Better standardization: Having a single, well-defined, structured way to do logging allows us to standardize logging – which reduces cognitive overhead of figuring out what happened in systems via logs and allows easier adoption. Instead of going through 100 different types of logs, you instead would only have a single format. This is also one of the goals of the library – standardizing the usage of the library across Golang backend services.

We get some additional benefits as well:

  • Dynamic Log Levels: This allows us to have meaningful log levels in our code – where we can deploy with baseline warning settings and switch to lower levels (debug logs) only when we need them. This helps keep our logging costs low, as well as reduces the noise that developers usually need to go through when debugging.
  • Future-proof Consistency in Logs: With the adoption of a common schema, we make sure that we stick with the same structure, even if say tomorrow our logging infrastructure changes – making us future-ready. Instead of manually specifying what to log, we can simply expose a function in our loggers.
  • Production-Like Logging Environment in Development: The dockerized Kibana allows developers to enjoy the same benefits as the production Kibana. This also encourages developers to use Elastic stack more and explore its features such as building dashboards based on the log data, having better watchers, and so on.

I hope you have enjoyed this article and found it useful. Comments and corrections are always welcome.

Happy Logging!

How we simplified our Data Ingestion & Transformation Process

Post Syndicated from Grab Tech original https://engineering.grab.com/data-ingestion-transformation-product-insights

Introduction

As Grab grew from a small startup to an organisation serving millions of customers and driver partners, making day-to-day data-driven decisions became paramount. We needed a system to efficiently ingest data from mobile apps and backend systems and then make it available for analytics and engineering teams.

Thanks to modern data processing frameworks, ingesting data isn’t a big issue. However, at Grab scale it is a non-trivial task. We had to prepare for two key scenarios:

  • Business growth, including organic growth over time and expected seasonality effects.
  • Any unexpected peaks due to unforeseen circumstances. Our systems have to be horizontally scalable.

We could ingest data in batches, in real time, or a combination of the two. When you ingest data in batches, you can import it at regularly scheduled intervals or when it reaches a certain size. This is very useful when processes run on a schedule, such as reports that run daily at a specific time. Typically, batched data is useful for offline analytics and data science.

On the other hand, real-time ingestion has significant business value, such as for reactive systems. For example, when a customer provides feedback for a Grab superapp widget, we re-rank widgets based on that customer’s likes or dislikes. Note that when information is very time-sensitive, you must monitor its data continuously.

This blog post describes how Grab built a scalable data ingestion system and how we went from prototyping with Spark Streaming to running a production-grade data processing cluster written in Golang.

Building the system without reinventing the wheel

The data ingestion system:

  1. Collects raw data as app events.
  2. Transforms the data into a structured format.
  3. Stores the data for analysis and monitoring.

In a previous blog post, we discussed dealing with batched data ETL with Spark. This post focuses on real-time ingestion.

We separated the data ingestion system into 3 layers: collection, transformation, and storage. This table and diagram highlight the tools used in each layer of our system’s first design.

Layer          | Tools
---------------+----------------------------------------
Collection     | Gateway, Kafka
Transformation | Go processing service, Spark Streaming
Storage        | TalariaDB

Our first design might seem complex, but we used battle-tested and common tools such as Apache Kafka and Spark Streaming. This let us get an end-to-end solution up and running quickly.

Collection layer

Our collection layer had two sub-layers:

  1. Our custom built API Gateway received HTTP requests from the mobile app. It simply decoded and authenticated HTTP requests, streaming the data to the Kafka queue.
  2. The Kafka queue decoupled the transformation layer (shown in the above figure as the processing service and Spark Streaming) from the collection layer (shown above as the Gateway service). We needed to retain raw data in the Kafka queue for fault tolerance of the entire system. Imagine an error where a data pipeline pollutes the data with flawed transformation code, or simply crashes. The Kafka queue saves us from data loss by letting us backfill the data.

Since it’s robust and battle-tested, we chose Kafka as our queueing solution. It perfectly met our requirements, such as high throughput and low latency. Although Kafka takes some operational effort such as self-hosting and monitoring, Grab has a proficient and dedicated team managing our Kafka cluster.

Transformation layer

There are many options for real-time data processing, including Spark Streaming, Flink, and Storm. Since we use Spark for all our batch processing, we decided to use Spark Streaming.

We deployed a Golang processing service between Kafka and Spark Streaming. This service converts the data from Protobuf to Avro. Instead of pointing Spark Streaming directly to Kafka, we used this processing service as an intermediary. This was because our Spark Streaming job was written in Python and Spark doesn’t natively support protobuf decoding.  We used Avro format, since Grab historically used it for archiving streaming data. Each raw event was enriched and batched together with other events. Batches were then uploaded to S3.

Storage layer

TalariaDB is a Grab-built time-series database. It ingests events as columnar ORC files, indexing them by event name and time. We use the same ORC format files for batch processing. TalariaDB also implements the Presto Thrift connector interface, so our users could query certain event types by time range by connecting Presto to the distributed cluster hosting TalariaDB.

Problems

Building and deploying our data pipeline’s MVP provided great value to our data analysts, engineers, and QA team. For example, our mobile app team could monitor any abnormal change in the real-time metrics, such as the screen load time for the latest released app version. The QA team could perform app side actions (book a ride, make payment, etc.) and check which events were triggered and received by the backend. The latency between the ingestion and the serving layer was only 4 minutes instead of the batch processing system’s 60 minutes. The streaming processing’s data showed good business value.

This prompted us to develop more features on top of our platform-collected real-time data. Very soon our QA engineers and the product analytics team used more and more of the real-time data processing system. They started instrumenting various mobile applications so more data started flowing in. However, as our ingested data increased, so did our problems. These were mostly related to operational complexity and the increased latency.

Operational complexity

Only a few team members could operate Spark Streaming and EMR. With more data and variable rates, our streaming jobs had scaling issues and failed occasionally, due to checkpoint issues when the cluster was under heavy load. Increasing the cluster size helped, but adding more nodes also increased the likelihood of losing cluster nodes. When we lost nodes, our latency went up and our already busy on-call engineers got more work.

Supporting native Protobuf

To simplify the architecture, we initially planned to bypass our Golang-written processing service for the real-time data pipeline. Our plan was to let Spark directly talk to the Kafka queue and send the output to S3. This required packaging the decoders for our protobuf messages for Python Spark jobs, which was cumbersome. We thought about rewriting our job in Scala, but we didn’t have enough experience with it.

Also, we’d soon hit some streaming limits from S3. Our Spark streaming job was consuming objects from S3, but the process was not continuous due to S3’s  eventual consistency. To avoid long pagination queries in the S3 API, we had to prefix the data with the hour in which it was ingested. This resulted in some data loss after processing by the Spark streaming. The loss happened because the new data would appear in S3 while Spark Streaming had already moved on to the next hour. We tried various tweaks, but it was just a bad design. As our data grew to over one terabyte per hour, our data loss grew with it.

Processing lag

On average, the time from our system ingesting an event to when it was available on the Presto was 4 to 6 minutes. We call that processing lag, as it happened due to our data processing. It was substantially worse under heavy loads, increasing to 8 to 13 minutes. While that wasn’t bad at this scale (a few TBs of data), it made some use cases impossible, such as monitoring. We needed to do better.

Simplifying the architecture and rewriting in Golang

After completing the MVP phase development, we noticed the Spark Streaming functionality we actually used was relatively trivial. In the Spark Streaming job, we only:

  • Partitioned the batch of events by event name.
  • Encoded the data in ORC format.
  • And uploaded to an S3 bucket.

To mitigate the problems mentioned above, we tried re-implementing the features in our existing Golang processing service. Besides consuming the data and publishing to an S3 bucket, the transformation service also needed to deal with event partitioning and ORC encoding.

One key problem we addressed was implementing a robust event partitioner with a large write throughput and low read latency. Fortunately, Golang has a nice concurrent map package. To further reduce the lock contention, we added sharding.

We made the changes and deployed the service to production, then discovered the service was now memory-bound, as we buffered data for 1 minute. We did thorough benchmarking and profiling of heap allocations to improve memory utilization. By iteratively reducing these inefficiencies and lowering CPU consumption, we made our data transformation more efficient.
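As a rough sketch of the partitioning approach (not our production code): events are hashed by name into shards guarded by their own locks, buffered for a fixed window, and then flushed. The ORC encoding and S3 upload are replaced here by a placeholder flush function.

// Sketch: a sharded event partitioner with periodic flushing.
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"time"
)

type event struct {
	Name    string
	Payload []byte
}

const shardCount = 16

type shard struct {
	mu      sync.Mutex
	batches map[string][]event // events grouped by event name
}

type partitioner struct {
	shards [shardCount]*shard
}

func newPartitioner() *partitioner {
	p := &partitioner{}
	for i := range p.shards {
		p.shards[i] = &shard{batches: make(map[string][]event)}
	}
	return p
}

// Add hashes the event name so that writes for different names rarely contend
// on the same lock.
func (p *partitioner) Add(e event) {
	h := fnv.New32a()
	h.Write([]byte(e.Name))
	s := p.shards[h.Sum32()%shardCount]
	s.mu.Lock()
	s.batches[e.Name] = append(s.batches[e.Name], e)
	s.mu.Unlock()
}

// Flush drains every shard; in the real pipeline this is where batches would
// be ORC-encoded and uploaded to S3.
func (p *partitioner) Flush() {
	for _, s := range p.shards {
		s.mu.Lock()
		for name, batch := range s.batches {
			fmt.Printf("flushing %d events for %q\n", len(batch), name)
		}
		s.batches = make(map[string][]event)
		s.mu.Unlock()
	}
}

func main() {
	p := newPartitioner()
	p.Add(event{Name: "app.metric1"})
	p.Add(event{Name: "app.metric2"})

	// Buffer for a fixed window (the post mentions one minute) before flushing.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	<-ticker.C
	p.Flush()
}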

Performance

After revamping the system, the elapsed time for a single event to travel from the gateway to our dashboard is about 1 minute. We also fixed the data loss issue. Finally, we significantly reduced our on-call workload by removing Spark Streaming.

Validation

At this point, we had both our old and new pipelines running in parallel. After drastically improving our performance, we needed to confirm we still got the same end results. This was done by running a query against each of the pipelines and comparing the results. Both systems were registered to the same Presto cluster.

We ran two SQL EXCEPT queries between the two pipelines, one in each direction. Both returned zero rows – the pipelines contained the same events – validating our new pipeline’s correctness.

select count(1) from ((
 select uuid, time from grab_x.realtime_new
 where event = 'app.metric1' and time between 1541734140 and 1541734200
) except (
 select uuid, time from grab_x.realtime_old
 where event = 'app.metric1' and time between 1541734140 and 1541734200
))

/* output: 0 */

Conclusions

Scaling a data ingestion system to handle hundreds of thousands of events per second was a non-trivial task. However, by iterating and constantly simplifying our overall architecture, we were able to efficiently ingest the data and drive down its lag to around one minute.

Spark Streaming was a great tool and gave us time to understand the problem. But, understanding what we actually needed to build and iteratively optimise the entire data pipeline led us to:

  • Replacing Spark Streaming with our new Golang-implemented pipeline.
  • Removing Avro encoding.
  • Removing an intermediary S3 step.

Differences between the old and new pipelines are:

            | Old Pipeline          | New Pipeline
------------+-----------------------+----------------
Languages   | Python, Go            | Go
Stages      | 4 services            | 3 services
Conversions | Protobuf → Avro → ORC | Protobuf → ORC
Lag         | 4-13 min              | 1 min

Systems usually become more and more complex over time, leading to tech debt and decreased performance. In our case, starting with more steps in the data pipeline was actually the simple solution, since we could re-use existing tools. But as we reduced processing stages, we’ve also seen fewer failures. By simplifying the problem, we improved performance and decreased operational complexity. At the end of the day, our data pipeline solves exactly our problem and does nothing else, keeping things fast.

Highlights from Git 2.21

Post Syndicated from Taylor Blau original https://github.blog/2019-02-24-highlights-from-git-2-21/

The open source Git project just released Git 2.21 with features and bug fixes from over 60 contributors. We last caught up with you on the latest Git releases when 2.19 was released. Here’s a look at some of the most interesting features and changes introduced since then.

Human-readable dates with --date=human

As part of its output, git log displays the date each commit was authored. Without making an alternate selection, timestamps will display in Git’s “default” format (for example, “Tue Feb 12 09:00:33 2019 -0800”).

That’s very precise, but a lot of those details are things that you might already know, or don’t care about. For example, if the commit happened today, you already know what year it is. Likewise, if a commit happened seven years ago, you don’t care which second it was authored. So what do you do? You could use --date=relative, which would give you output like 6 days ago, but often, you want to relate “six days ago” to a specific event, like “my meeting last Wednesday.” So, was six days ago Tuesday, or was it Wednesday that you were interested in?

Git 2.21 introduces a new date format that tells you exactly when something occurred with just the right amount of detail: --date=human. Here’s how git log looks with the new format:

git log --date=human example

That’s both more accurate than --date=relative and easier to consume than the full weight of --date=default.

But what about when you’re scripting? Here, you might want to frequently switch between the human and machine-readable formats while putting together a pipeline. Git 2.21 has an option suitable for this setting, too: --date=auto:human. When printing output to a pager, if Git is given this option it will act as if it had been passed --date=human. When otherwise printing output to a non-pager, Git will act as if no format had been given at all. If human isn’t quite your speed, you can combine auto with any other format of your choosing, like --date=auto:relative.

git log --date=auto:human example
[source]

Detecting case-insensitive path collisions

One commonly asked Git question is, “After cloning a repository, why does git status report some of the files as modified?” Quite often, the answer is that the repository contains a tree which cannot be represented on your file system. For instance, if it contains both file as well as FILE and your file system is case-insensitive, Git can only checkout one of those files. Worse, Git doesn’t actually detect this case during the clone; it simply writes out each path, unaware that the file system considers them to be the same file. You only find out something has gone wrong when you see a mystery modification.

The exact rules for when this occurs will vary from system to system. In addition to “folding” what we normally consider upper and lowercase characters in English, you may also see this from language-specific conversions, non-printing characters, or Unicode normalization.

In Git 2.20, git clone now detects and reports colliding groups during the initial checkout, which should remove some of the confusion. Unfortunately, Git can’t actually fix the problem for you. What the original committer put in the repository can’t be checked out as-is on your file system. So if you’re thinking about putting files into a multi-platform project that differ only in case, the best advice is still: don’t.

[source]

Performance improvements and other bits

Behind the scenes, a lot has changed over the last couple of Git releases, too. We’re dedicating this section to overview a few of these changes. Not all of them will impact your Git usage day-to-day, but some will, and all of the changes are especially important for server administrators.

Multi-pack indexes

Git stores objects (e.g., representations of the files, directories, and more that make up your Git repository) in both the “loose” and “packed” formats. A “loose” object is a compressed encoding of an object, stored in a file. A “packed” object is stored in a packfile, which is a collection of objects, written in terms of deltas of one another.

Because it can be costly to rewrite these packs every time a new object is added to the repository, repositories tend to accumulate many loose objects or individual packs over time. Eventually, these are reconciled during a “repack” operation. However, this reconciliation is not possible for larger repositories, like the Windows repository.

Instead of repacking, Git can now create a multi-pack index file, which is a listing of objects residing in multiple packs, removing the need to perform expensive repacks (in many cases).

[source]

Delta islands

An important optimization for Git servers is that the format for transmitted objects is the same as the heavily-compressed on-disk packfiles. That means that in many cases, Git can serve repositories to clients by simply copying bytes off disk without having to inflate individual objects.

But sometimes this assumption breaks down. Objects on disk may be stored as “deltas” against one another. When two versions of a file have similar content, we might store the full contents of one version (the “base”), but only the differences against the base for the other version. This creates a complication when serving a fetch. If object A is stored as a delta against object B, we can only send the client our on-disk version of A if we are also sending them B (or if we know they already have B). Otherwise, we have to reconstruct the full contents of A and re-compress it.

This happens rarely in many repositories where clients clone all of the objects stored by the server. But it can be quite common when multiple distinct but overlapping sets of objects are stored in the same packfile (for example, due to repository forks or unmerged pull requests). Git may store a delta between objects found only in two different forks. When someone clones one of the forks, they want only one of the objects, and we have to discard the delta.

Git 2.20 solves this by introducing the concept of “delta islands”. Repository administrators can partition the ref namespace into distinct “islands”, and Git will avoid making deltas between islands. The end result is a repository which is slightly larger on disk but is still able to serve client fetches much more cheaply.

[source 1, source 2]

Delta reuse with bitmaps

We already discussed the importance of reusing on-disk deltas when serving fetches, but how do we know when the other side has the base object they need to use the delta we send them? If we’re sending them the base, too, then the answer is easy. But if we’re not, how do we know if they have it?

That answer is deceptively simple: the client will have already told us which commits it has (so that we don’t bother sending them again). If they claim to have a commit which contains the base object, then we can re-use the delta. But there’s one hitch: we not only need to know about the commit they mentioned, but also the entire object graph. The base may have been part of a commit hundreds or thousands of commits deep in the history of the project.

Git doesn’t traverse the entire object graph to check for possible bases because it’s too expensive to do so. For instance, walking the entire graph of a Linux kernel takes roughly 30 seconds.

Fortunately, there’s already a solution within Git: reachability bitmaps. Git has an optional on-disk data structure to record the sets of objects “reachable” from each commit. When this data is available, we can query it to quickly determine whether the client has a base object. This results in the server generating smaller packs that are produced more quickly for an overall faster fetch experience.

[source]

Custom alternates reference advertisement

Repository alternates are a tool that server administrators have at their disposal to reduce redundant information. When two repositories are known to share objects (like a fork and its parent), the fork can list the parent as an “alternate”, and any objects the fork doesn’t have itself, it can look for in its parent. This is helpful since we can avoid storing twice the vast number of objects shared between the fork and parent.

Likewise, a repository with alternates advertises “tips” it has when receiving a push. In other words, before writing from your computer to a remote, that remote will tell you what the tips of its branches are, so you can determine information that is already known by the remote, and therefore use less bandwidth. When a repository has alternates, the tips advertisement is the union of all local and alternate branch tips.

But what happens when computing the tips of an alternate is more expensive than a client sending redundant data? It makes the push so slow that we have disabled this feature for years at GitHub. In Git 2.20, repositories can hook into the way that they enumerate alternate tips, and make the corresponding transaction much faster.

[source]

Tidbits

Now that we’ve highlighted a handful of the changes in the past two releases, we want to share a summary of a few other interesting changes. As always, you can learn more by clicking the “source” link, or reading the documentation or release notes.

  • Have you ever tried to run git cherry-pick on a merge commit only to have it fail? You might have found that the fix involves passing -m1 and moved on. In fact, -m1 says to select the first parent as the mainline, and it replays the relevant commits. Prior to Git 2.21, passing this option on a non-merge commit caused an error, but now it transparently does what you meant. [source]

  • Veteran Git users from our last post might recall that git branch -l establishes a reflog for a newly created branch, instead of listing all branches. Now, instead of doing something you almost certainly didn’t mean, git branch -l will list all of your repository’s branches, keeping in line with other commands that accept -l. [source]

  • If you’ve ever been stuck or forgotten what a certain command or flag does, you might have run git --help (or git -h) to learn more. In Git 2.21, this invocation now follows aliases, and shows the aliased command’s helptext. [source]

  • In repositories with large on-disk checkouts, git status can take a long time to complete. In order to indicate that it’s making progress, the status command now displays a progress bar. [source]

  • Many parts of Git have historically been implemented as shell scripts, calling into tools written in C to do the heavy lifting. While this allowed rapid prototyping, the resulting tools could often be slow due to the overhead of running many separate programs. There are continuing efforts to move these scripts into C, affecting git submodule, git bisect, and git rebase. You may notice rebase in particular being much faster, due to the hard work of the Summer of Code students, Pratik Karki and Alban Gruin.

  • The -G option tells git log to only show commits whose diffs match a particular pattern. But until Git 2.21, it was searching binary files as if they were text, which made things slower and often produced confusing results. [source]

That’s all for now

We went through a few of the changes that have happened over the last couple of versions, but there’s a lot more to discover. Read the release notes for 2.21, or review the release notes for previous versions in the Git repository.

The post Highlights from Git 2.21 appeared first on The GitHub Blog.

Five years of the GitHub Bug Bounty program

Post Syndicated from philipturnbull original https://github.blog/2019-02-19-five-years-of-the-github-bug-bounty-program/

GitHub launched our Security Bug Bounty program in 2014, allowing us to reward independent security researchers for their help in keeping GitHub users secure. Over the past five years, we have been continuously impressed by the hard work and ingenuity of our researchers. Last year was no different and we were glad to pay out $165,000 to researchers from our public bug bounty program in 2018.

We’ve previously talked about our other initiatives to engage with researchers. In 2018, our researcher grants, private bug bounty programs, and a live-hacking event allowed us to reach even more independent security talent. These different ways of working with the community helped GitHub reach a huge milestone in 2018: $250,000 paid out to researchers in a single year.

We’re happy to share some of our highlights from the past year and introduce some big changes for the coming year: full legal protection for researchers, more GitHub properties eligible for rewards, and increased reward amounts.

2018 Highlights

GraphQL and API authorization researcher grant

Since the launch of our researcher grants program in 2017, we’ve been on the lookout for bug bounty researchers who show a specialty in particular features of our products. In mid-2018, @kamilhism submitted a series of vulnerabilities to the public bounty program showing his expertise in the authorization logic of our REST and GraphQL APIs. To support his future research, we provided Kamil with a fixed grant payment to perform a systematic audit of our API authorization logic. Kamil’s audit was excellent, uncovering and allowing us to fix an additional seven authorization flaws in our API.

H1-702

In August, GitHub took part in HackerOne’s H1-702 live-hacking event in Las Vegas. This brought together over 75 of the top researchers from HackerOne to focus on GitHub’s products for one evening of live-hacking. The event didn’t disappoint—GitHub’s security improved and nearly $75,000 was paid out for 43 vulnerabilities. This included one critical-severity vulnerability in GitHub Enterprise Server. We also met with our researchers in-person and received great feedback on how we could improve our bug bounty program.

GitHub Actions private bug bounty

In October, GitHub launched a limited public beta of GitHub Actions. As part of the limited beta, we also ran a private bug bounty program to complement our extensive internal security assessments. We sent out over 150 invitations to researchers from last year’s private program, all H1-702 participants, and invited a number of the best researchers that have worked with our public program. The private bounty program allowed us to uncover a number of vulnerabilities in GitHub Actions.

We also held an office-hours event so that the GitHub security team and researchers could meet. We took the opportunity to meet face-to-face with other researchers because it’s a great way to build a community and learn from each other. Two of our researchers, @not-an-aardvark and @ngaloggc, gave an overview of their submissions and shared details of how they approached the target with everyone.

Workflow improvements

We’ve been making refinements to our internal bug bounty workflow since we last announced it back in 2017.  Our ChatOps-based tools have continued to evolve over the past year as we find more ways to streamline the process. These aren’t just technical changes—each day we’ve had individual on-call first responders who were responsible for handling incoming bounty submissions. We’ve also added a weekly status meeting to review current submissions with all members of the Application Security team. These meetings allow the team to ensure that submissions are not stalled, work is correctly prioritized by engineering teams based on severity, and researchers are getting timely updates on their submissions.

A key success metric for our program is how much time it takes to validate a submission and triage that information to the relevant engineering team so remediation work can begin. Our workflow improvements have paid off and we’ve significantly reduced the average time to triage from four days in 2017 down to 19 hours. Likewise, we’ve reduced our average time to resolution from 16 days to six days. Keep in mind: for us to consider a submission as resolved, the issue has to either be fixed or properly prioritized and tracked, by the responsible engineering team.

We’ve continued to reach our target of replying to researchers in less than 24 hours on average. Most importantly for our researchers, we’ve also dropped our average time for rewarding a submission from 17 days in 2017 down to 11 days. We’re grateful for the effort that researchers invest in our program and we aim to reduce these times further over the next year.

2019 initiatives

Although our program has been running successfully for the past five years, we know that we can always improve. We’ve taken feedback from our researchers and are happy to announce three major changes to our program for 2019:

Keeping bounty program participants safe from the legal risks of security research is a high priority for GitHub. To make sure researchers are as safe as possible, we’ve added a robust set of Legal Safe Harbor terms to our site policy. Our new policies are based on CC0-licensed templates by GitHub’s Associate Corporate Counsel, @F-Jennings. These templates are a fork of EdOverflow’s Legal Bug Bounty repo, with extensive modifications based on broad discussions with security researchers and Amit Elazari’s general research in this field. The templates are also inspired by other best-practice safe harbor examples including Bugcrowd’s disclose.io project and Dropbox’s updated vulnerability disclosure policy.

Our new Legal Safe Harbor terms cover three main sources of legal risk:

  • Your research activity remains protected and authorized even if you accidentally overstep our bounty program’s scope. Our safe harbor now includes a firm commitment not to pursue civil or criminal legal action, or support any prosecution or civil action by others, for participants’ bounty program research activities. You remain protected even for good faith violations of the bounty policy.
  • We will do our best to protect you against legal risk from third parties who won’t commit to the same level of safe harbor protections. Our safe harbor terms now limit report-sharing with third parties in two ways. We will share only non-identifying information with third parties, and only after notifying you and getting that third party’s written commitment not to pursue legal action against you. Unless we get your written permission, we will not share identifying information with a third party.
  • You won’t be violating our site terms if it’s specifically for bounty research. For example, if your in-scope research includes reverse engineering, you can safely disregard the GitHub Enterprise Agreement’s restrictions on reverse engineering. Our safe harbor now provides a limited waiver for relevant parts of our site terms and policies. This protects against legal risk from DMCA anti-circumvention rules or similar contract terms that could otherwise prohibit necessary research tasks like reverse engineering or deobfuscating code.

Other organizations can look to these terms as an industry standard for safe harbor best practices—and we encourage others to freely adopt, use, and modify them to fit their own bounty programs. In creating these terms, we aim to go beyond the current standards for safe harbor programs and provide researchers with the best protection from criminal, civil, and third-party legal risks. The terms have been reviewed by expert security researchers, and are the product of many months of legal research and review of other legal safe harbor programs. Special thanks to MG, Mugwumpjones, and several other researchers for providing input on early drafts of @F-Jennings’ templates.

Expanded scope

Over the past five years, we’ve been steadily expanding the list of GitHub products and services that are eligible for reward. We’re excited to share that we are now increasing our bounty scope to reward vulnerabilities in all first party services hosted under our github.com domain. This includes GitHub Education, GitHub Learning Lab, GitHub Jobs, and our GitHub Desktop application. While GitHub Enterprise Server has been in scope since 2016, to further increase the security of our enterprise customers we are now expanding the scope to include Enterprise Cloud.

It’s not just about our user-facing systems. The security of our users’ data also depends on the security of our employees and our internal systems. That’s why we’re also including all first-party services under our employee-facing githubapp.com and github.net domains.

Increased rewards

We regularly assess our reward amounts against our industry peers. We also recognize that finding higher-severity vulnerabilities in GitHub’s products is becoming increasingly difficult for researchers and they should be rewarded for their efforts. That’s why we’ve increased our reward amounts at all levels:

  • Critical: $20,000–$30,000+
  • High: $10,000–$20,000
  • Medium: $4,000–$10,000
  • Low: $617–$2,000

Our broad ranges have served us well, but we’ve been consistently impressed by the ingenuity of researchers. To recognize that, we no longer have a maximum reward amount for critical vulnerabilities. Although we’ve listed $30,000 as a guideline amount for critical vulnerabilities, we’re reserving the right to reward significantly more for truly cutting-edge research.

Get involved

The bounty program remains a core part of GitHub’s security process and we’re learning a lot from our researchers. With our new initiatives, now is the perfect time to get involved. Details about our safe harbor, expanded scope, and increased awards are available on the GitHub Bug Bounty site.

Working with the community has been a great experience—we’re looking forward to triaging your submissions in the future!

The post Five years of the GitHub Bug Bounty program appeared first on The GitHub Blog.

An open source parser for GitHub Actions

Post Syndicated from Patrick Reynolds original https://github.blog/2019-02-07-an-open-source-parser-for-github-actions/

Update: This blog post is no longer relevant with the update to GitHub Actions in August 2019. See the GitHub Actions documentation for more information.


Since the beta release of GitHub Actions last October, thousands of users have added workflow files to their repositories. But until now, those files only work with the tools GitHub provided: the Actions editor, the Actions execution platform, and the syntax highlighting built into pull requests. To expand that universe, we need to release the parser and the specification for the Actions workflow language as open source. Today, we’re doing that.

We believe that tools beyond GitHub should be able to run workflows. We believe there should be programs to check, format, compose, and visualize workflow files. We believe that text editors can provide syntax highlighting and autocompletion for Actions workflows. And we believe all that can only happen if the Actions community is empowered to build these tools along with us. That can happen better and faster if there is a single language specification and a free parser implementation.

The first project to use the open source parser will be act, which is @nektos’s tool for running Actions workflows in a local development environment.

The parser and language specification are both in actions/workflow-parser, which we’re sharing under an MIT license. As of today, there is a Go implementation, which is the same code that powers both the Actions UI and the Actions execution platform. The repository also contains a JavaScript parser in development, along with syntax-highlighting configurations for Atom and Vim.

GitHub Actions

GitHub Actions is a platform for automating software development workflows, from idea to production. Developers add a simple text file to their repository, .github/main.workflow, to describe automation. The workflow file describes how events like pushing code or opening and closing issues map to automation actions, implemented in any Docker container. Those automation actions have whatever powers you grant them: pushing commits to the repository, cutting a new release, building it through continuous integration, deploying it to staging in the cloud, testing the deployment, flipping it to production, and announcing it to the world — and any others you can build.

Every workflow begins with an event and runs through a set of actions to reach some target or goal. Those events and actions are described in a main.workflow file, which you can create and edit with the visual editor or any text editor you like. Here is a simple example:

workflow "when I push" {
  on = "push"
  resolves = "ci"
}

action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}

Whenever I push to a branch that contains that file, the when I push workflow executes. It resolves the target action ci, which runs ./script/cibuild in a golang:latest Docker container. I can add more workflow blocks to harness more events, and I can add more action blocks to run after the ci action or in parallel with it.

The Actions workflow language

All main.workflow files are written in the Actions workflow language, which is a subset of HashiCorp’s HCL. In fact, our parser builds on top of the open source hashicorp/hcl parser.

All Actions workflow files are valid HCL, but not all HCL files are valid workflows. The Actions workflow parser is stricter, allowing only a specific set of keywords and prohibiting nested objects, among other restrictions. The reason for that is a long-standing goal of Actions: making the Actions editor and the text representation of workflows equivalent and interchangeable. Any file you write in the graphical editor can be expressed in a main.workflow file, of course, but also: any main.workflow can be fully displayed and edited in the graphical editor. There is one exception to this: the graphical editor does not display comments. But it preserves them: changes you make in the graphical editor do not disturb comments you have added to your main.workflow file.
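
To make the “valid HCL, but not a valid workflow” distinction concrete, here is a rough sketch of that kind of keyword check in Python. It assumes the third-party pyhcl package (the official parser is the Go implementation in actions/workflow-parser), and the allowlist below is an approximation of the documented keywords rather than the full specification:

import hcl  # pip install pyhcl

# Keys taken from the documented workflow syntax; this illustrates the
# "stricter subset" idea, it is not the real validator.
ALLOWED_KEYS = {
    "workflow": {"on", "resolves"},
    "action": {"uses", "needs", "runs", "args", "env", "secrets"},
}

def check_workflow(text):
    doc = hcl.loads(text)  # any valid HCL parses here...
    problems = []
    for block, entries in doc.items():
        if block not in ALLOWED_KEYS:  # ...but only these block types are allowed
            problems.append(f"unknown block {block!r}")
            continue
        for name, attrs in entries.items():
            for key in attrs:
                if key not in ALLOWED_KEYS[block]:
                    problems.append(f"{block} {name!r}: unknown key {key!r}")
    return problems

example = '''
workflow "when I push" {
  on = "push"
  resolves = "ci"
}

action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}
'''
print(check_workflow(example) or "looks like a valid workflow subset")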

Contributing

We have two reasons for releasing the parser. First, we want to encourage the Actions community to build tools that generate and manipulate workflow files. The language specification should help developers understand the language, and the parser should save developers the trouble of writing their own.

Second, we welcome your contributions. If you find bugs, please open an issue or send a pull request. If you want to add a syntax highlighter for a new editor or implement the parser in another language, we welcome that. The only real limitation is on features that go beyond the parser to the rest of Actions. For broader questions and suggestions about the rest of Actions, reach out through support; we’re listening.

The post An open source parser for GitHub Actions appeared first on The GitHub Blog.

Atom understands your code better than ever before

Post Syndicated from Max Brunsfeld original https://github.blog/2018-10-31-atoms-new-parsing-system/

Text editors like Atom have many features that make code easier to read and write—syntax highlighting and code folding are two of the most important examples. For decades, all major text editors have implemented these kinds of features based on a very crude understanding of code, obtained by searching for simple, regular expression patterns. This approach has severely limited how helpful text editors can be.

At GitHub, we want to explore new ways of making programming intuitive and delightful, so we’ve developed a parsing system called Tree-sitter that will serve as a new foundation for code analysis in Atom. Tree-sitter makes it possible for Atom to parse your code while you type—maintaining a syntax tree at all times that precisely describes the structure of your code. We’ve enabled the new system by default in Atom, bringing a number of improvements.

Crisp syntax highlighting

Atom’s syntax highlighting is now based on the syntax trees provided by Tree-sitter. This lets us use color to outline your code’s structure more clearly than before. Notice the consistency with which fields, functions, keywords, types, and variables are highlighted across a variety of languages:

This animated GIF shows a snippet of C code rendered in Atom. It then switches to show an equivalent snippet of code written in several different languages: C++, Go, Rust, and TypeScript.

Reliable code folding

In most text editors, code folding is based on indentation: lines with greater indentation are considered to be nested more deeply than lines with less indentation. But this doesn’t always match the structure of our code and can make code folding useless in some files. With Tree-sitter, Atom folds code based on its syntax, which allows folding to work as expected, even for code like this:

This animated GIF shows a snippet of Ruby code, containing a heredoc with unindented text. Code folding is used to first collapse the block containing the heredoc, then collapse the method containing that block, and then collapse the class containing that method.

Syntax-aware selection

Atom also uses syntax trees as the basis for two new editing commands: Select Larger Syntax Node and Select Smaller Syntax Node, bound to Alt+Up and Alt+Down. These commands can make many editing tasks more efficient and fun, especially when used in combination with multiple cursors.

This animated GIF shows a snippet of JavaScript code in Atom. First, nothing is selected. Then, larger and larger constructs in the code are selected.

Speed

Parsing an entire source file can be time-consuming. This is why most IDEs wait to parse your code until you stop typing for a moment, and there is often a delay before syntax highlighting updates. We want to avoid these delays, so we designed Tree-sitter to parse your code incrementally: it keeps the syntax tree up to date as you edit your code without ever having to re-parse the entire file from scratch.
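
As a rough illustration of the reuse-the-old-tree idea, here is a sketch using the py-tree-sitter bindings. It assumes a local checkout of the tree-sitter-python grammar and the pre-0.22 binding API (the setup calls changed in later releases), and it only shows the shape of incremental parsing, not how Atom itself drives the C library:

from tree_sitter import Language, Parser

# Build a shared library from a local grammar checkout (paths are illustrative).
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

old_src = b"def add(a, b):\n    return a + b\n"
tree = parser.parse(old_src)

# Rename `add` to `add2`: one byte inserted at offset 7 (row 0, column 7).
new_src = b"def add2(a, b):\n    return a + b\n"
tree.edit(
    start_byte=7, old_end_byte=7, new_end_byte=8,
    start_point=(0, 7), old_end_point=(0, 7), new_end_point=(0, 8),
)

# Passing the edited old tree lets tree-sitter reuse unchanged subtrees
# instead of re-parsing the whole file from scratch.
new_tree = parser.parse(new_src, tree)
print(new_tree.root_node.type, new_tree.root_node.has_error)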

Language support

Currently, we use Tree-sitter to parse 11 languages: Bash, C, C++, ERB, EJS, Go, HTML, JavaScript, Python, Ruby, and TypeScript. And we’ve added Rust support on our Beta channel. If you’d like to help us bring the power of Tree-sitter to more languages, check out the Tree-sitter documentation and the grammar page of the Atom Flight Manual.

Feedback

Want to know more about Tree-sitter? Check out this talk from this year’s StrangeLoop conference.

If you write code and you’re interested in trying our new system, give the new version of Atom a try. We’d love to hear your feedback—tweet us at @AtomEditor or if you’ve run into a bug, open an issue.

The post Atom understands your code better than ever before appeared first on The GitHub Blog.

October 21 post-incident analysis

Post Syndicated from Jason Warner original https://github.blog/2018-10-30-oct21-post-incident-analysis/

Last week, GitHub experienced an incident that resulted in degraded service for 24 hours and 11 minutes. While portions of our platform were not affected by this incident, multiple internal systems were, which resulted in us displaying information that was out of date and inconsistent. Ultimately, no user data was lost; however, manual reconciliation of a few seconds of database writes is still in progress. For the majority of the incident, GitHub was also unable to serve webhook events or build and publish GitHub Pages sites.

All of us at GitHub would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in GitHub and take pride in building resilient systems that enable our platform to remain highly available. With this incident, we failed you, and we are deeply sorry. While we cannot undo the problems that were created by GitHub’s platform being unusable for an extended period of time, we can explain the events that led to this incident, the lessons we’ve learned, and the steps we’re taking as a company to better ensure this doesn’t happen again.

Background

The majority of user-facing GitHub services are run within our own data center facilities. The data center topology is designed to provide a robust and expandable edge network that operates in front of several regional data centers that power our compute and storage workloads. Despite the layers of redundancy built into the physical and logical components in this design, it is still possible that sites will be unable to communicate with each other for some amount of time.

At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.

A high-level depiction of GitHub's network architecture, including two physical data centers, three POPs, and cloud capacity in multiple regions connected via peering.

In the past, we’ve discussed how we use MySQL to store GitHub metadata as well as our approach to MySQL High Availability. GitHub operates multiple MySQL clusters varying in size from hundreds of gigabytes to nearly five terabytes, each with up to dozens of read replicas per cluster to store non-Git metadata, so our applications can provide pull requests and issues, manage authentication, coordinate background processing, and serve additional functionality beyond raw Git object storage. Different data across different parts of the application is stored on various clusters through functional sharding.

To improve performance at scale, our applications will direct writes to the relevant primary for each cluster, but delegate read requests to a subset of replica servers in the vast majority of cases. We use Orchestrator to manage our MySQL cluster topologies and handle automated failover. Orchestrator considers a number of variables during this process and is built on top of Raft for consensus. It’s possible for Orchestrator to implement topologies that applications are unable to support; therefore, care must be taken to align Orchestrator’s configuration with application-level expectations.

In the normal topology, all apps perform reads locally with low latency.

Incident timeline

2018 October 21 22:52 UTC

During the network partition described above, Orchestrator, which had been active in our primary data center, began a process of leadership deselection, according to Raft consensus. The US West Coast data center and US East Coast public cloud Orchestrator nodes were able to establish a quorum and start failing over clusters to direct writes to the US West Coast data center. Orchestrator proceeded to organize the US West Coast database cluster topologies. When connectivity was restored, our application tier immediately began directing write traffic to the new primaries in the West Coast site.

The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.

2018 October 21 22:54 UTC

Our internal monitoring systems began generating alerts indicating that our systems were experiencing numerous faults. At this time there were several engineers responding and working to triage the incoming notifications. By 23:02 UTC, engineers in our first responder team had determined that topologies for numerous database clusters were in an unexpected state. Querying the Orchestrator API displayed a database replication topology that only included servers from our US West Coast data center.

2018 October 21 23:07 UTC

By this point the responding team decided to manually lock our internal deployment tooling to prevent any additional changes from being introduced. At 23:09 UTC, the responding team placed the site into yellow status. This action automatically escalated the situation into an active incident and sent an alert to the incident coordinator. At 23:11 UTC the incident coordinator joined and two minutes later made the decision to change the status to red.

2018 October 21 23:13 UTC

It was understood at this time that the problem affected multiple database clusters. Additional engineers from GitHub’s database engineering team were paged. They began investigating the current state in order to determine what actions needed to be taken to manually configure a US East Coast database as the primary for each cluster and rebuild the replication topology. This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were several seconds of writes in the East Coast cluster that had not been replicated to the West Coast, which prevented replication of new writes back to the East Coast.

Guarding the confidentiality and integrity of user data is GitHub’s highest priority. In an effort to preserve this data, we decided that the 30+ minutes of data written to the US West Coast data center prevented us from considering options other than failing-forward in order to keep user data safe. However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls. This decision would result in our service being unusable for many users. We believe that the extended degradation of service was worth ensuring the consistency of our users’ data.

In the invalid topology, replication from US West to US East is broken and apps are unable to read from current replicas as they depend on low latency to maintain transaction performance.

2018 October 21 23:19 UTC

It was clear through querying the state of the database clusters that we needed to stop running jobs that write metadata about things like pushes. We made an explicit choice to partially degrade site usability by pausing webhook delivery and GitHub Pages builds instead of jeopardizing data we had already received from users. In other words, our strategy was to prioritize data integrity over site usability and time to recovery.

2018 October 22 00:05 UTC

Engineers involved in the incident response team began developing a plan to resolve data inconsistencies and implement our failover procedures for MySQL. Our plan was to restore from backups, synchronize the replicas in both sites, fall back to a stable serving topology, and then resume processing queued jobs. We updated our status to inform users that we were going to be executing a controlled failover of an internal data storage system.

Overview of recovery plan was to fail forward, synchronize, fall back, then churn through backlogs before returning to green.

While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. The process to decompress, checksum, prepare, and load large backup files onto newly provisioned MySQL servers took the majority of time. This procedure is tested daily at minimum, so the recovery time frame was well understood; however, until this incident we had never needed to fully rebuild an entire cluster from backup, and had instead been able to rely on other strategies such as delayed replicas.

2018 October 22 00:41 UTC

A backup process for all affected MySQL clusters had been initiated by this time and engineers were monitoring progress. Concurrently, multiple teams of engineers were investigating ways to speed up the transfer and recovery time without further degrading site usability or risking data corruption.

2018 October 22 06:51 UTC

Several clusters had completed restoration from backups in our US East Coast data center and begun replicating new data from the West Coast. This resulted in slow site load times for pages that had to execute a write operation over a cross-country link, but pages reading from those database clusters would return up-to-date results if the read request landed on the newly restored replica. Other larger database clusters were still restoring.

Our teams had identified ways to restore directly from the West Coast to overcome throughput restrictions caused by downloading from off-site storage and were increasingly confident that restoration was imminent; the time left to establish a healthy replication topology depended on how long it would take replication to catch up. This estimate was linearly extrapolated from the replication telemetry we had available, and the status page was updated to set an expectation of two hours as our estimated time of recovery.

2018 October 22 07:46 UTC

GitHub published a blog post to provide more context. We use GitHub Pages internally and all builds had been paused several hours earlier, so publishing this took additional effort. We apologize for the delay. We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints.

2018 October 22 11:12 UTC

All database primaries established in US East Coast again. This resulted in the site becoming far more responsive as writes were now directed to a database server that was co-located in the same physical data center as our application tier. While this improved performance substantially, there were still dozens of database read replicas that were multiple hours delayed behind the primary. These delayed replicas resulted in users seeing inconsistent data as they interacted with our services. We spread the read load across a large pool of read replicas and each request to our services had a good chance of hitting a read replica that was multiple hours delayed.

In reality, the time required for replication to catch up had adhered to a power decay function instead of a linear trajectory. Due to increased write load on our database clusters as users woke up and began their workday in Europe and the US, the recovery process took longer than originally estimated.
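
As a toy illustration of that gap (the numbers below are invented, not taken from the incident), a linear estimate based on early catch-up progress looks far more optimistic than a curve where each hour only closes a fixed fraction of the remaining lag:

# Illustrative only: why extrapolating replication lag linearly from early
# samples can badly underestimate recovery time when the catch-up curve
# flattens under daytime write load.

def linear_eta(t0, lag0, t1, lag1):
    """Time at which lag would hit zero if the early slope held."""
    rate = (lag0 - lag1) / (t1 - t0)   # hours of lag recovered per hour
    return t1 + lag1 / rate

# Hypothetical telemetry: (hour, hours of lag behind the primary).
samples = [(0, 8.0), (1, 6.0), (2, 4.8), (3, 4.0), (4, 3.5), (5, 3.1)]

print("linear estimate from first two samples:",
      round(linear_eta(*samples[0], *samples[1]), 1), "hours")

# What actually happens if each hour only closes ~20% of the remaining gap:
t, lag = samples[-1]
while lag > 0.05:
    t, lag = t + 1, lag * 0.8
print("catch-up under a 20%-per-hour decay:", t, "hours")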

2018 October 22 13:15 UTC

By now, we were approaching peak traffic load on GitHub.com. The incident response team discussed how to proceed. It was clear that replication delays were increasing instead of decreasing towards a consistent state. We’d begun provisioning additional MySQL read replicas in the US East Coast public cloud earlier in the incident. Once these became available, it became easier to spread read request volume across more servers. Reducing the aggregate utilization across the read replicas allowed replication to catch up.

2018 October 22 16:24 UTC

Once the replicas were in sync, we conducted a failover to the original topology, addressing the immediate latency/availability concerns. As part of a conscious decision to prioritize data integrity over a shorter incident window, we kept the service status red while we began processing the backlog of data we had accumulated.

2018 October 22 16:45 UTC

During this phase of the recovery, we had to balance the increased load represented by the backlog, potentially overloading our ecosystem partners with notifications, and getting our services back to 100% as quickly as possible. There were over five million hook events and 80 thousand Pages builds queued.

As we re-enabled processing of this data, we processed ~200,000 webhook payloads that had outlived an internal TTL and were dropped. Upon discovering this, we paused that processing and pushed a change to increase that TTL for the time being.

To avoid further eroding the reliability of our status updates, we remained in degraded status until we had completed processing the entire backlog of data and ensured that our services had clearly settled back into normal performance levels.

2018 October 22 23:03 UTC

All pending webhooks and Pages builds had been processed and the integrity and proper operation of all systems had been confirmed. The site status was updated to green.

Next steps

Resolving data inconsistencies

During our recovery, we captured the MySQL binary logs containing the writes we took in our primary site that were not replicated to our West Coast site from each affected cluster. The total number of writes that were not replicated to the West Coast was relatively small. For example, one of our busiest clusters had 954 writes in the affected window. We are currently performing an analysis on these logs and determining which writes can be automatically reconciled and which will require outreach to users. We have multiple teams engaged in this effort, and our analysis has already determined a category of writes that have since been repeated by the user and successfully persisted. As stated in this analysis, our primary goal is preserving the integrity and accuracy of the data you store on GitHub.

Communication

In our desire to communicate meaningful information to you during the incident, we made several public estimates on time to repair based on the rate of processing of the backlog of data. In retrospect, our estimates did not factor in all variables. We are sorry for the confusion this caused and will strive to provide more accurate information in the future.

Technical initiatives

There are a number of technical initiatives that have been identified during this analysis. As we continue to work through an extensive post-incident analysis process internally, we expect to identify even more work that needs to happen.

  1. Adjust the configuration of Orchestrator to prevent the promotion of database primaries across regional boundaries. Orchestrator’s actions behaved as configured, despite our application tier being unable to support this topology change. Leader-election within a region is generally safe, but the sudden introduction of cross-country latency was a major contributing factor during this incident. This was emergent behavior of the system given that we hadn’t previously seen an internal network partition of this magnitude.

  2. We have accelerated our migration to a new status reporting mechanism that will provide a richer forum for us to talk about active incidents in crisper and clearer language. While many portions of GitHub were available throughout the incident, we were only able to set our status to green, yellow, and red. We recognize that this doesn’t give you an accurate picture of what is working and what is not, and in the future will be displaying the different components of the platform so you know the status of each service.

  3. In the weeks prior to this incident, we had started a company-wide engineering initiative to support serving GitHub traffic from multiple data centers in an active/active/active design. This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center without user impact. This is a major effort and will take some time, but we believe that multiple well-connected sites in a geography provide a good set of trade-offs. This incident has added urgency to the initiative.

  4. We will take a more proactive stance in testing our assumptions. GitHub is a fast growing company and has built up its fair share of complexity over the last decade. As we continue to grow, it becomes increasingly difficult to capture and transfer the historical context of trade-offs and decisions made to newer generations of Hubbers.

Organizational initiatives

This incident has led to a shift in our mindset around site reliability. We have learned that tighter operational controls or improved response times are insufficient safeguards for site reliability within a system of services as complicated as ours. To bolster those efforts, we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub.

Conclusion

We know how much you rely on GitHub for your projects and businesses to succeed. No one is more passionate about the availability of our services and the correctness of your data. We will continue to analyze this event for opportunities to serve you better and earn the trust you place in us.

The post October 21 post-incident analysis appeared first on The GitHub Blog.

Behind the scenes of GitHub Token Scanning

Post Syndicated from Patrick Toomey original https://github.blog/2018-10-17-behind-the-scenes-of-github-token-scanning/

Several years ago we started scanning all pushes to public repositories for GitHub OAuth tokens and personal access tokens. Now we’re extending this capability to include tokens from cloud service providers and additional credentials, such as unencrypted SSH private keys associated with a user’s GitHub account.

We live in amazing times for software development. Capabilities that were once only available to large technology companies are now accessible to the smallest of startups. Developers can leverage cloud services to quickly perform continuous integration testing, deploy their code to fully scalable infrastructure, accept credit card payments from customers, and nearly anything else you can imagine.

Composing cloud services like this is the norm going forward, but it comes with inherent security complexities. Each cloud service a developer uses typically requires one or more credentials, often in the form of API tokens. In the wrong hands, they can be used to access sensitive customer data—or vast computing resources for mining cryptocurrency, presenting significant risks to both users and cloud service providers.

diff --git a/config.yml b/config.yml
index 1110370..da7b5de 100644
--- a/config.yml
+++ b/config.yml
@@ -1,2 +1,2 @@
 production:
-  github_token: 4e862a8ec12ef0ad7c1337e8b16d98f4d764b8f6
+  github_token: <%= ENV["GITHUB_TOKEN"] %>

Fig 1: Your master branch might look innocent, but exposed credentials in your project’s history could prove costly.

Token Scanning 1.0

GitHub, and our users, aren’t immune to this problem. That is why we started scanning for GitHub OAuth tokens in public repositories several years ago. But, our users are not just GitHub customers. They are also customers of cloud infrastructure providers, cloud payment processing providers, and other cloud service providers that have become commonplace in modern development.

Our existing solution focused on GitHub OAuth tokens exclusively and was never designed for extensibility. The existing code leveraged hand-tuned assembly that was extremely fast at finding 40-character hex strings (the format of GitHub OAuth tokens). This bit of code was patched into Git and run inline whenever code was pushed to GitHub. It was an amazing piece of work, but could not support multiple credential formats. Our vision was to support all of the popular cloud service providers.

Token Scanning 2.0

So began our journey into the next generation of GitHub Token Scanning. The obvious path to a more extensible scanner is some form of regular expression support. However, scanning for credentials using typical regular expression libraries doesn’t scale from a performance perspective, as they optimize for a slightly different problem than the one we have.

The vast majority of regular expression libraries are designed to return the first match in a set of patterns. Given that we have at least one pattern for each cloud service provider, we require that all matches be returned, not just the first. The only way to ensure this with traditional libraries is to scan a given input once for each pattern. However, this increases the scan time dramatically for large repositories or large sets of patterns. Fortunately, scanning Git data for credentials is just a specific case of a general problem. For example, high-performance application-level firewalls similarly need to scan high-volume network traffic for sets of patterns to identify known viruses or malware. If you squint, scanning high-volume Git push data for credentials is a very similar problem.
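
The requirement is easier to see with a toy example. The sketch below uses only Python’s standard re module to combine several patterns into one alternation and report every hit in a single pass; GitHub’s production scanner is the Go service built on Hyperscan described next, and the non-GitHub token format shown here is an invented placeholder rather than any real provider’s:

# A simplified, stdlib-only illustration of the "all matches, one pass"
# requirement. The second pattern is a made-up placeholder format.
import re

PATTERNS = {
    "github_oauth": r"\b[0-9a-f]{40}\b",        # 40 hex characters
    "fake_cloud_key": r"\bFAKE[0-9A-Z]{16}\b",  # invented provider format
}

# One alternation with a named group per provider, so a single scan over the
# push data reports every candidate credential, not just the first.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS.items()))

def scan(blob):
    for match in COMBINED.finditer(blob):
        yield match.lastgroup, match.group(), match.start()

push_data = "github_token: 4e862a8ec12ef0ad7c1337e8b16d98f4d764b8f6\nkey=FAKE0123456789ABCDEF\n"
for provider, token, offset in scan(push_data):
    print(provider, "candidate at byte", offset)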

Our research eventually led us to a GitHub repository hosting the amazing Hyperscan library by Intel. This library is incredibly performant and provides exactly what we need. We will explore the technical details in more depth in a follow-up engineering post. But, in short, Hyperscan let us replace all of the assembly code patches to Git with a new standalone scanner, written in Go, that has scaled nicely.

In parallel with working on the implementation, we reached out to several cloud service providers we thought would be interested in testing out Token Scanning in a private beta. They were all enthusiastic to participate, as many of them had contacted us in the past looking for a solution to this widespread problem.

Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code). Each candidate credential is sent to the provider, including some basic metadata such as the repository name and the commit that introduced the credential. The provider can then validate the credential and decide if the credential should be revoked depending on the associated risks to the user or provider. Either way, the provider typically contacts the owner of the credential, letting them know what occurred and what action was taken.

Where we go from here

We have received amazing feedback from both providers and users during the private beta. Cloud service providers have told us that GitHub Token Scanning has been tremendously effective in helping them identify credentials before malicious users do. And, while GitHub users were not aware that Token Scanning was in beta, we did take notice of a tweet from a GitHub user that included a number of enthusiastic exclamation marks and 🎉 emojis. This user was extremely grateful for having received a notification from a participating cloud service provider less than a minute after they had accidentally pushed a highly sensitive credential to a public repository.

During the beta we have scanned millions of public repository changes and identified millions of candidate credentials. As announced yesterday at GitHub Universe, Token Scanning is now in public beta, and supports an increasing number of cloud providers. We’re excited by the impact of Token Scanning today and have lots of ideas about how to make it even more powerful in the future. Dealing with credentials is an unavoidable part of modern development. With GitHub by your side, we hope to minimize the security impact of such accidents.

The post Behind the scenes of GitHub Token Scanning appeared first on The GitHub Blog.

Applying machine intelligence to GitHub security alerts

Post Syndicated from Ben Thompson original https://github.blog/2018-10-09-applying-machine-intelligence-to-security-alerts/

Last year, we released security alerts that track security vulnerabilities in Ruby and JavaScript packages. Since then, we’ve identified more than four million of these vulnerabilities and added support for Python. In our launch post, we mentioned that all vulnerabilities with CVE IDs are included in security alerts, but sometimes there are vulnerabilities that are not disclosed in the National Vulnerability Database. Fortunately, our collection of security alerts can be supplemented with vulnerabilities detected from activity within our developer community.

Leveraging the community

There are many places a project can publicize security fixes within a new version: the CVE feed, various mailing lists and open source groups, or even its release notes or changelog. Regardless of how projects share this information, some developers within the GitHub community will see the advisory and immediately bump their required versions of the dependency to a known safe version. If detected, we can use the information in these commits to generate security alerts for vulnerabilities which may not have been published in the CVE feed.

On an average day, the dependency graph can track around 10,000 commits to dependency files for any of our supported languages. We can’t manually process this many commits. Instead, we depend on machine intelligence to sift through them and extract those that might be related to a security release.

For this purpose, we created a machine learning model that scans text associated with public commits (the commit message and linked issues or pull requests) to filter out those related to possible security upgrades. With this smaller batch of commits, the model uses the diff to understand how required version ranges have changed. Then it aggregates across a specific timeframe to get a holistic view of all dependencies that a security release might affect. Finally, the model outputs a list of packages and version ranges it thinks require an alert and currently aren’t covered by any known CVE in our system.
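
To give a flavor of the “use the diff” step in isolation, here is a deliberately simplified sketch that pulls before-and-after version requirements for Ruby gems out of a Gemfile diff. The real pipeline is a trained model plus aggregation across many commits and ecosystems; this only illustrates extracting a version-range change from a single diff:

# Toy sketch of the diff-reading step only: find gems whose version
# requirement changed between the removed (-) and added (+) lines.
import re

GEM_LINE = re.compile(r"""^([+-])\s*gem\s+['"]([\w-]+)['"]\s*,\s*['"]([^'"]+)['"]""")

def version_bumps(diff_text):
    removed, added = {}, {}
    for line in diff_text.splitlines():
        m = GEM_LINE.match(line)
        if not m:
            continue
        sign, name, requirement = m.groups()
        (removed if sign == "-" else added)[name] = requirement
    # A package on both sides with a new requirement is a candidate bump.
    return {name: (removed[name], added[name])
            for name in removed.keys() & added.keys()
            if removed[name] != added[name]}

diff = """\
--- a/Gemfile
+++ b/Gemfile
-gem 'nokogiri', '~> 1.8.2'
+gem 'nokogiri', '~> 1.8.5'
"""
print(version_bumps(diff))  # {'nokogiri': ('~> 1.8.2', '~> 1.8.5')}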

Always quality focused

No machine learning model is perfect. While machine intelligence can sift through thousands of commits in an instant, this anomaly-detection algorithm will still generate false positives for packages where no security patch was released. Security alert quality is a focus for us, so we review all model output before the community receives an alert.

Learn more

Interested in learning more? Join us at GitHub Universe next week to explore the connections that push technology forward and keep projects secure through talks, trainings, and workshops. Tune in to the blog October 16-17 for more updates and announcements.

The post Applying machine intelligence to GitHub security alerts appeared first on The GitHub Blog.

Upgrading GitHub from Rails 3.2 to 5.2

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2018-09-28-upgrading-github-from-rails-3-2-to-5-2/

On August 15th GitHub celebrated a major milestone: our main application is now running on the latest version of Rails: 5.2.1! 🎉

In total, the project took a year and a half to upgrade from Rails 3.2 to Rails 5.2. Along the way we took time to clean up technical debt and improve the overall codebase. Below we’ll talk about how we upgraded Rails, the lessons we learned, and whether we’d do it again.

How did we do it?

Upgrading Rails on an application as large and as trafficked as GitHub is no small task. It takes careful planning, good organization, and patience. The upgrade started out as kind of a hobby; engineers would work on it when they had free time. There was no dedicated team. As we made progress and gained traction it became not only something we hoped we could do, but a priority.

Since GitHub is so important to our community, we can’t stop feature development or bug fixes in order to upgrade Rails.

Instead of using a long-running branch to upgrade Rails, we added the ability to dual boot the application in multiple versions of Rails. We created two lockfiles: Gemfile.lock for the current version and Gemfile_next.lock for the future version. Dual booting allows us to regularly deploy changes for the next version to GitHub without requiring long-running branches or altering how production works. We do this by conditionally loading the code.

if GitHub.rails_3_2?
  ## 3.2 code (i.e. production a year and a half ago)
elsif GitHub.rails_4_2?
  # 4.2 code
else
  # all 5.0+ future code, ensuring we never accidentally
  # fall back into an old version going forward
end

Each time we got a minor version of Rails green we’d make the CI job required for all pushes to the GitHub application and start work on the next version. While we worked on Rails 4.1 a CI job would run on every push for 3.2 and 4.0. When 4.1 was green we’d swap out 4.0 for 4.1 and get to work on 4.2. This allowed us to prevent regressions once a version of Rails was green, and gave engineers time to get used to writing code that worked in multiple versions of Rails.

The two versions that we deployed were 4.2 and 5.2. We deployed 4.2 because it was a huge milestone and was the first version of Rails that hadn’t been EOL’d yet (as an aside: we’d been backporting security fixes to 3.2 but not to 4.0+ so we couldn’t deploy 4.0 or 4.1. Rest assured your security is our top priority).

To roll out the Rails upgrade, we created a careful and iterative process. We’d first deploy to our testing environment and request volunteers from each team to click test their area of the codebase to find any regressions the test suite missed.

After fixing those regressions, we deployed in off-hours to a percentage of production servers. During each deploy we’d collect data about exceptions and performance of the site. With that information we’d fix bugs that came up and repeat those steps until the error rate was low enough to be considered equal to the previous version. We merged the upgrade once we could deploy to full production for 30 minutes at peak traffic with no visible impact.

This process allowed us to deploy 4.2 and 5.2 with minimal customer impact and no down time.

Lessons Learned

The Rails upgrade took a year and a half. This was for a few reasons. First, Rails upgrades weren’t always smooth, and some versions had major breaking changes. Rails improved the upgrade process for the 5 series, which meant that while 3.2 to 4.2 took a year, 4.2 to 5.2 only took five months.

Another reason is the GitHub codebase is 10 years old. Over the years technical debt builds and there’s bound to be gremlins lurking in the codebase. If you’re on an old version of Rails, your engineers will need to add more monkey patches or implement features that exist upstream.

Lastly, when we started it wasn’t clear what resources were needed to support the upgrade and since most of us had never done a Rails upgrade before, we were learning as we went. The project originally began with 1 full-time engineer and a small army of volunteers. We grew that team to 4 full-time engineers plus the volunteers. Each version bump meant we learned more and the next version went even faster.

Through this work we learned some important lessons that we hope make your next upgrade easier:

  • Upgrade early and upgrade often. The closer you are to a new version of Rails, the easier upgrades will be. This encourages your team to fix bugs in Rails instead of monkey-patching the application or reinventing features that exist upstream.
  • Keep upgrade infrastructure in place. There will always be a new version to upgrade to, so once you’re on a modern version of Rails add a build to run against the master branch. This will catch bugs in Rails and your application early, make upgrades easier, and increase your upstream contributions.
  • Upstream your tooling instead of rolling your own. The more you push upstream to gems or Rails, the less logic you need in your application. Save your application code for what truly makes your company special (i.e. Pull Requests), instead of tools to make your application run smoothly (i.e. concurrent testing libraries)
  • Avoid using private APIs in your frameworks. Rails has a lot of code that’s not private but isn’t documented on purpose. That code is subject to change without notice, so writing code that relies on those internals can easily break in an upgrade.
  • Address technical debt often. It’s easy to think “this is working, why mess with it”, but if no one knows how that code works, it can quickly become a bottleneck for upgrades. Try to prevent coupling your application logic too closely to your framework. Ensure that the line where your application ends and your framework begins is clear. You can do this by addressing technical debt before it becomes difficult to remove.
  • Do incremental upgrades. Each minor version of Rails provides the deprecation warnings for the next version. By upgrading from 3.2 to 4.0, 4.0 to 4.1, etc we were able to identify problems in the next version early and define clear milestones.
  • Keep up the momentum. Rails upgrades can seem daunting. Create ways in which your team can have quick wins to keep momentum going. Share the responsibility across teams so that everyone is familiar with the new version of the framework and prevent burnout. Once you’re on the newest version add a build to your app that periodically runs your suite against edge Rails so you can catch bugs in your code or your framework early.
  • Expect things to break. Upgrades are hard and in an application as large as GitHub things are bound to break. While we didn’t take the site down during the upgrade we had issues with CI, local development, slow queries, and other problems that didn’t show up in our CI builds or click testing.

Was it worth it?

Absolutely.

Upgrading Rails has allowed us to address technical debt in our application. Our test suite is now closer to vanilla Rails, we were able to remove StateMachine in favor of Active Record enums, and start replacing our job runner with Active Job. And that’s just the beginning.

Rails upgrades are a lot of hard work and can be time-consuming, but they also open up a ton of possibilities. We can push more of our tooling upstream to Rails, address areas of technical debt, and be one of the largest codebases running on the most recent version of Rails. Not only does this benefit us at GitHub, it benefits our customers and the open source community.

The post Upgrading GitHub from Rails 3.2 to 5.2 appeared first on The GitHub Blog.

Towards Natural Language Semantic Code Search

Post Syndicated from GitHub Engineering original https://github.blog/2018-09-18-towards-natural-language-semantic-code-search/

This blog post complements a live demonstration on our recently announced site: experiments.github.com

Motivation

Searching code on GitHub is currently limited to keyword search. This assumes that the user either knows the syntax or can anticipate what keywords might be in comments surrounding the code they are looking for. Our machine learning scientists have been researching ways to enable the semantic search of code.

To fully grasp the concept of semantic search, consider the below search query, “ping REST api and return results”:

Vector-space diagram

Note that the demonstrated semantic search returns reasonable results even though there are no keywords in common between the search query and the text (the code & comments found do not contain the words “Ping”, “REST” or “api”)! The implications of augmenting keyword search with semantic search are profound. For example, such a capability would expedite the process of on-boarding new software engineers onto projects and bolster the discoverability of code in general.

In this post, we want to share how we are leveraging deep learning to make progress towards this goal. We also share an open source example with code and data that you can use to reproduce these results!

Introduction

One of the key areas of machine learning research underway at GitHub is representation learning of entities, such as repos, code, issues, profiles and users. We have made significant progress towards enabling semantic search by learning representations of code that share a common vector space as text. For example, consider the below diagram:

Vector-space diagram

In the above example, Text 2 (blue) is a reasonable description of the code, whereas Text 1 (red) is not related to the code at all. Our goal is to learn representations where (text, code) pairs that describe the same concept are close neighbors, whereas unrelated (text, code) pairs are further apart. By representing text and code in the same vector space, we can vectorize a user’s search query and lookup the nearest neighbor that represents code. Below is a four-part description of the approach we are currently using to accomplish this task:

1. Learning Representations of Code

In order to learn a representation of code, we train a sequence-to-sequence model that learns to summarize code. A way to accomplish this for Python is to supply (code, docstring) pairs where the docstring is the target variable the model is trying to predict. One active area of research for us is incorporating domain specific optimizations like tree-based LSTMs, gated-graph networks and syntax-aware tokenization. Below is a screenshot that showcases the code summarizer model at work. In this example, there are two python functions supplied as input, and in both cases the model produces a reasonable summary of the code as output:

code summarizer

It should be noted that in the above examples, the model produces the summary by using the entire code blob, not merely the function name.
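
For readers who want to see what a (code, docstring) pair looks like in practice, here is a small sketch of the data-preparation idea using only Python’s standard library; the corpus construction and preprocessing behind the actual model are more involved than this:

# Harvest (code, docstring) pairs from Python source so the docstring can
# serve as the summarization target. Requires Python 3.8+ for
# ast.get_source_segment.
import ast

def code_docstring_pairs(source):
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:
                yield ast.get_source_segment(source, node), doc.splitlines()[0]

example = '''
def ping(url, timeout=5):
    """Ping a REST endpoint and return the decoded JSON body."""
    import requests
    return requests.get(url, timeout=timeout).json()
'''
for code, docstring in code_docstring_pairs(example):
    print(docstring, "->", len(code), "characters of code")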

Building a code summarizer is a very exciting project on its own; however, we can utilize the encoder from this model as a general-purpose feature extractor for code. After extracting the encoder, we can fine-tune it for the task of mapping code to the vector space of natural language.

We can evaluate this model objectively using the BLEU score. Currently we have been able to achieve a BLEU score of 13.5 on a holdout set of Python code, using the fairseq-py library for sequence-to-sequence models.

2. Learning Representations of Text Phrases

In addition to learning a representation for code, we needed to find a suitable representation for short phrases (like sentences found in Python docstrings). Initially, we experimented with the Universal Sentence Encoder, a pre-trained encoder for text that is available on TensorFlow Hub. While the embeddings from this encoder worked reasonably well, we found that it was advantageous to learn embeddings that were specific to the vocabulary and semantics of software development. One area of ongoing research involves evaluating different domain-specific corpuses for training our own model, ranging from GitHub issues to third party datasets.

To learn this representation of phrases, we trained a neural language model by leveraging the fast.ai library. This library gave us easy access to state of the art architectures such as AWD LSTMs, and to techniques such as cyclical learning rates with random restarts. We extracted representations of phrases from this model by summarizing the hidden states using the concat pooling approach found in this paper.
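
For readers unfamiliar with concat pooling, the idea is to combine the final hidden state with max- and mean-pools over the whole sequence. A minimal sketch, written directly in PyTorch rather than through fast.ai’s own API, looks something like this:

# Concat pooling: concatenate the final hidden state with element-wise max
# and mean pools over all time steps to get one fixed-size phrase vector.
import torch

def concat_pool(hidden_states):
    # hidden_states: (seq_len, batch, dim) output of the language model.
    last = hidden_states[-1]                    # final time step
    max_pool = hidden_states.max(dim=0).values  # max over time
    mean_pool = hidden_states.mean(dim=0)       # mean over time
    return torch.cat([last, max_pool, mean_pool], dim=1)  # (batch, 3 * dim)

phrase_states = torch.randn(12, 2, 400)   # 12 tokens, batch of 2, 400-d states
print(concat_pool(phrase_states).shape)   # torch.Size([2, 1200])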

One of the most challenging aspects of this exercise was to evaluate the quality of these embeddings. We are currently building a variety of downstream supervised tasks similar to those outlined here that will aid us in evaluating the quality of these embeddings objectively. In the meantime, we sanity check our embeddings by manually examining the similarity between similar phrases. The below screenshot illustrates examples where we search the vectorized docstrings for similarity against user-supplied phrases:

vec sim

3. Mapping Code Representations To The Same Vector-Space as Text

Next, we map the code representations we learned from the code summarization model (part 1) to the vector space of text. We accomplish this by fine-tuning the encoder of this model. The inputs to this model are still code blobs; however, the target the model now predicts is the vectorized version of the docstrings. These docstrings are vectorized using the approach discussed in the previous section.

Concretely, we perform multi-dimensional regression with cosine proximity loss to bring the hidden state of the encoder into the same vector-space as text.
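
A bare-bones sketch of that fine-tuning objective, using PyTorch as a stand-in since the exact training code isn’t shown here: project the code encoder’s hidden state into the docstring-embedding space and minimize one minus the cosine similarity against the precomputed text vectors.

# Regression with cosine proximity loss (PyTorch stand-in; tensors below are
# random placeholders for encoder states and docstring embeddings).
import torch
import torch.nn.functional as F

code_dim, text_dim, batch = 512, 400, 32
code_hidden = torch.randn(batch, code_dim)      # encoder hidden states
docstring_vecs = torch.randn(batch, text_dim)   # precomputed text vectors

projection = torch.nn.Linear(code_dim, text_dim)
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)

for step in range(100):
    optimizer.zero_grad()
    predicted = projection(code_hidden)
    # Perfectly aligned vectors give a loss of 0.
    loss = (1.0 - F.cosine_similarity(predicted, docstring_vecs, dim=1)).mean()
    loss.backward()
    optimizer.step()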

We are actively researching alternate approaches that directly learn a joint vector space of code and natural language, borrowing from some ideas outlined here.

4. Creating a Semantic Search System

Finally, after successfully creating a model that can vectorize code into the same vector-space as text, we can create a semantic search mechanism. In its most simple form, we can store the vectorized version of all code in a database, and perform nearest neighbor lookups to a vectorized search query.
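
In sketch form, the lookup step is just a cosine-similarity ranking; a production system would typically swap the brute-force scan below for an approximate nearest-neighbor index. The vectors and snippets here are toy stand-ins:

# Brute-force nearest-neighbor lookup over a small set of "code vectors".
import numpy as np

def top_k(query_vec, code_vecs, snippets, k=3):
    q = query_vec / np.linalg.norm(query_vec)
    m = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity against every snippet
    best = np.argsort(-scores)[:k]
    return [(snippets[i], float(scores[i])) for i in best]

code_vecs = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 1.0, 0.2, 0.0],
                      [0.1, 0.0, 0.9, 0.3]])
snippets = ["requests.get(url).json()", "sorted(items, key=len)", "open(path).read()"]
query = np.array([1.0, 0.0, 0.1, 0.0])  # e.g. the vectorized search phrase

print(top_k(query, code_vecs, snippets))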

Another active area of our research is determining the best way to augment existing keyword search with semantic results and how to incorporate additional information such as context and relevance. Furthermore, we are actively exploring ways to evaluate the quality of search results that will allow us to iterate quickly on this problem. We leave these topics for discussion in a future blog post.

Summary

The below diagram summarizes all the steps in our current semantic-search workflow:

code summarizer

We are exploring ways to improve almost every component of this approach, including data preparation, model architecture, evaluation procedures, and overall system design. What is described in this blog post is only a minimal example that scratches the surface.

Open Source Examples

Our open-source end-to-end tutorial contains a detailed walkthrough of the approach outlined in this blog, along with code and data you can use to reproduce the results.

This open source example (with some modifications) is also used as a tutorial for the kubeflow project, which is implemented here.

Limitations and Intended Use Case(s)

We believe that semantic code search will be most useful for targeted searches of code within specific entities such as repos, organizations or users as opposed to general purpose “how to” queries. The live demonstration of semantic code search hosted on our recently announced Experiments site does not allow users to perform targeted searches of repos. Instead, this demonstration is designed to share a taste of what might be possible and searches only a limited, static set of python code.

Furthermore, like all machine learning techniques, the efficacy of this approach is limited by the training data used. For example, the data used to train these models are (code, docstring) pairs. Therefore, search queries that closely resemble a docstring have the greatest chance of success. On the other hand, queries that do not resemble a docstring or contain concepts for which there is little data may not yield sensible results. Therefore, it is not difficult to challenge our live demonstration and discover the limitations of this approach. Nevertheless, our initial results indicate that this is an extremely fruitful area of research that we are excited to share with you.

There are many more use cases for semantic code search. For example, we could extend the ideas presented here to allow users to search for code using the language of their choice (French, Mandarin, Arabic, etc.) across many different programming languages simultaneously.

Get In Touch

This is an exciting time for the machine learning research team at GitHub and we are looking to expand. If our work interests you, please get in touch!

The post Towards Natural Language Semantic Code Search appeared first on The GitHub Blog.

Hello World Issue 5: Engineering

Post Syndicated from Russell Barnes original https://www.raspberrypi.org/blog/hello-world-issue-5/

Join us as we celebrate the Year of Engineering in the newest issue of Hello World, our magazine for computing and digital making educators.

 

Inspiring future engineers

We’ve brought together a wide range of experts to share their ideas and advice on how to bring engineering to your classroom — read issue 5 to find out the best ways to inspire the next generation.



Plus we’ve got plenty on GP and Scratch, we answer your latest questions, and we bring you our usual collection of useful features, guides, and lesson plans.

Highlights of issue 5 include:

  • The bluffers’ guide to putting together a tech-themed school trip
  • Inclusion, and coding for the visually impaired
  • Getting students interested in databases
  • Why copying may not always be a bad thing

How to get Hello World #5

Hello World is available as a free download under a Creative Commons license for everyone in the world who is interested in computer science and digital making education. Get the latest issue as a PDF file straight from the Hello World website.

We’re currently offering free print copies of the magazine to serving educators in the UK. This offer is open to teachers, Code Club and CoderDojo volunteers, teaching assistants, teacher trainers, and others who help children and young people learn about computing and digital making. Subscribe to have your free print magazine posted directly to your home, or subscribe digitally — 20,000 educators have already signed up to receive theirs!

Get in touch!

You could write for us about your experiences as an educator, and share your advice with the community. Wherever you are in the world, get in touch by emailing our editorial team about your article idea — we would love to hear from you!

Hello World magazine is a collaboration between the Raspberry Pi Foundation and Computing At School, which is part of the British Computer Society.

The post Hello World Issue 5: Engineering appeared first on Raspberry Pi.

Scanning snacks to your Wunderlist shopping list with Wunderscan

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/scanning-snacks-to-your-wunderlist-shopping-list/

Brian Carrigan found the remains of a $500 supermarket barcode scanner at a Scrap Exchange for $6.25, and decided to put it to use as a shopping list builder for his pantry.

Raspberry Pi Barcode Scanner Wunderscan Brian Carrigan

Upcycling from scraps

Brian wasn’t planning to build the Wunderscan. But when he stumbled upon the remains of a $500 Cubit barcode scanner at his local reuse center, his inner maker took hold of the situation.

It had been ripped from its connectors and had unlabeled wires hanging from it; a bit of hardware gore if such a thing exists. It was labeled on sale for $6.25, and a quick search revealed that it originally retailed at over $500… I figured I would try to reverse engineer it, and if all else fails, scrap it for the laser and motor.

Brian decided that the scanner, once refurbished with a Raspberry Pi Zero W and new wiring, would make a great addition to his home pantry as a shopping list builder using Wunderlist. “I thought a great use of this would be to keep near our pantry so that when we are out of a spice or snack, we could just scan the item and it would get posted to our shopping list.”

Reverse engineering

The datasheet for the Cubit scanner was available online, and Brian was able to discover the missing pieces required to bring the unit back to working order.

However, no wiring diagram was provided with the datasheet, so he was forced to figure out the power connections and signal output for himself using a bit of luck and an oscilloscope.

Now that the part was powered and working, all that was left was finding the RS232 transmit line. I used my oscilloscope to do this part and found it by scanning items and looking for the signal. It was not long before this wire was found and I was able to receive UPC codes.

Scanning codes and building (Wunder)lists

When the scanner reads a barcode, it sends the ASCII representation of a UPC code to the attached Raspberry Pi Zero W. Brian used the free UPC Database to convert each code to the name of the corresponding grocery item. Next, he needed to add it to the Wunderlist shopping list that his wife uses for grocery shopping.

Wunderlist provides an API token so users can incorporate list-making into their projects. With a little extra coding, Brian was able to convert the scanning of a pantry item’s barcode into a new addition to the family shopping list.
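
The glue logic is straightforward to sketch, though the version below is not Brian’s actual code (that lives on his blog): the serial device path, the UPC lookup endpoint, and the Wunderlist list ID are placeholders, and the Wunderlist v1 task call is shown as it was documented before the service was retired.

# Read UPC strings from the scanner's serial line, resolve them to product
# names, and post a task to a Wunderlist list. Endpoints and IDs below are
# placeholders, not a verified integration.
import serial    # pip install pyserial
import requests

WUNDERLIST_HEADERS = {"X-Access-Token": "YOUR_TOKEN", "X-Client-ID": "YOUR_CLIENT_ID"}
SHOPPING_LIST_ID = 123456789  # placeholder list id

def lookup_product(upc):
    # Placeholder lookup; swap in whichever UPC database you actually use.
    resp = requests.get(f"https://example-upc-db.invalid/api/{upc}")
    return resp.json().get("name", f"Unknown item ({upc})")

def add_to_wunderlist(title):
    requests.post("https://a.wunderlist.com/api/v1/tasks",
                  headers=WUNDERLIST_HEADERS,
                  json={"list_id": SHOPPING_LIST_ID, "title": title})

with serial.Serial("/dev/ttyUSB0", 9600, timeout=None) as port:
    while True:
        upc = port.readline().decode("ascii").strip()  # one scan per line
        if upc:
            add_to_wunderlist(lookup_product(upc))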

Curious as to how it all came together? You can find information on the project, including code and hardware configurations, on Brian’s blog. If you’ve built something similar, we’d love to see it in the comments below.

The post Scanning snacks to your Wunderlist shopping list with Wunderscan appeared first on Raspberry Pi.

A Day in the Life of Michele, Human Resources Coordinator at Backblaze

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/day-in-life-human-resources-coordinator/

Michele, HR Coordinator at Backblaze

Most of the time this blog is dedicated to cloud storage and computer backup topics, but we also want our readers to understand the culture and people at Backblaze who all contribute to keeping our company running and making it an enjoyable place to work. We invited our HR Coordinator, Michele, to talk about how she spends her day searching for great candidates to fill employment positions at Backblaze.

What’s a Typical Day for Michele at Backblaze?

After I’ve had a yummy cup of coffee (maybe with honey and a splash of half-and-half), I’ll generally start my day reviewing resumes and contacting potential candidates to set up an initial phone screen.

When I start the process of filling a position, I’ll spend a lot of time on the phone speaking with potential candidates. During a phone screen call we’ll chat about their experience, background and what they are ideally looking for in their next position. I also ask about what they like to do outside of work, and most importantly, how they feel about office dogs. A candidate may not always look great on paper, but could turn out to be a great cultural fit after speaking with them about their previous experience and what they’re passionate about.

Next, I push strong candidates to the subsequent steps with the hiring managers, which range from setting up a second phone screen, to setting up a Google hangout for completing coding tasks, to scheduling in-person interviews with the team.

At the end of the day after an in-person interview, I’ll check in with all the interviewers to debrief and decide how to proceed with the candidate. Everyone that interviewed the candidate will get together to give feedback. Is there a good cultural fit? Are they someone we’d like to work with? Keeping in contact with the candidates throughout the process and making sure they are organized and informed is a big part of my job. No one likes to wait around and wonder where they are in the process.

In between all the madness, I’ll put together offer letters, send out onboarding paperwork and links, and get all the necessary signatures to move forward.

On the candidate’s first day, I’ll go over benefits and the handbook and make sure everything is going smoothly in their overall orientation as they transition into their new role here at Backblaze!

What Makes Your Job Exciting?

  • I get to speak with many different types of people and see what makes them tick and if they’d be a good fit at Backblaze
  • The fast pace of the job
  • Being constantly kept busy with different tasks including supporting the FUN committee by researching venues and ideas for family day and the holiday party
  • I work on enjoyable projects like creating a people wall for new hires so we are able to put a face to the name
  • Getting to take a mini road trip up to Sacramento each month to check in with the data center employees
  • Constantly learning more and more about the job, the people, and the company

We’re growing rapidly and always looking for great people to join our team at Backblaze. Our team places a premium on open communications, being cleverly unconventional, and helping each other out.

Oh! We also offer competitive salaries, stock options, and amazing benefits.

Which Job Openings are You Currently Trying to Fill?

We are currently looking for the following positions. If you’re interested, please review the job description on our jobs page and then contact me at jobscontact@backblaze.com.

  • Engineering Director
  • Senior Java Engineer
  • Senior Software Engineer
  • Desktop and Laptop Windows Client Programmer
  • Senior Systems Administrator
  • Sales Development Representative

Thanks Michele!

The post A Day in the Life of Michele, Human Resources Coordinator at Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Continued: the answers to your questions for Eben Upton

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/eben-q-a-2/

Last week, we shared the first half of our Q&A with Raspberry Pi Trading CEO and Raspberry Pi creator Eben Upton. Today we follow up with all your other questions, including your expectations for a Raspberry Pi 4, Eben’s dream add-ons, and whether we really could go smaller than the Zero.

Live Q&A with Eben Upton, creator of the Raspberry Pi

With internet security becoming more necessary, will there be automated versions of VPN on an SD card?

There are already third-party tools which turn your Raspberry Pi into a VPN endpoint. Would we do it ourselves? Like the power button, it’s one of those cases where there are a million things we could do and so it’s more efficient to let the community get on with it.

Just to give a counterexample, while we don’t generally invest in optimising for particular use cases, we did invest a bunch of money into optimising Kodi to run well on Raspberry Pi, because we found that very large numbers of people were using it. So, if we find that we get half a million people a year using a Raspberry Pi as a VPN endpoint, then we’ll probably invest money into optimising it and feature it on the website as we’ve done with Kodi. But I don’t think we’re there today.

Have you ever seen any Pis running and doing important jobs in the wild, and if so, how does it feel?

It’s amazing how often you see them driving displays, for example in radio and TV studios. Of course, it feels great. There’s something wonderful about the geographic spread as well. The Raspberry Pi desktop is quite distinctive, both in its previous incarnation with the grey background and logo, and the current one where we have Greg Annandale’s road picture.

The PIXEL desktop on Raspberry Pi

And so it’s funny when you see it in places. Somebody sent me a video of them teaching in a classroom in rural Pakistan and in the background was Greg’s picture.

Raspberry Pi 4!?!

There will be a Raspberry Pi 4, obviously. We get asked about it a lot. I’m sticking to the guidance that I gave people that they shouldn’t expect to see a Raspberry Pi 4 this year. To some extent, the opportunity to do the 3B+ was a surprise: we were surprised that we’ve been able to get 200MHz more clock speed, triple the wireless and wired throughput, and better thermals, and still stick to the $35 price point.

We’re up against the wall from a silicon perspective; we’re at the end of what you can do with the 40nm process. It’s not that you couldn’t clock the processor faster, or put a larger processor which can execute more instructions per clock in there, it’s simply about the energy consumption and the fact that you can’t dissipate the heat. So we’ve got to go to a smaller process node and that’s an order of magnitude more challenging from an engineering perspective. There’s more effort, more risk, more cost, and all of those things are challenging.

With 3B+ out of the way, we’re going to start looking at this now. For the first six months or so we’re going to be figuring out exactly what people want from a Raspberry Pi 4. We’re listening to people’s comments about what they’d like to see in a new Raspberry Pi, and I’m hoping by early autumn we should have an idea of what we want to put in it and a strategy for how we might achieve that.

Could you go smaller than the Zero?

The challenge with Zero is that we’re periphery-limited. If you run your hand around the unit, there is no edge of that board that doesn’t have something there. So the question is: “If you want to go smaller than Zero, what feature are you willing to throw out?”

It’s a single-sided board, so you could certainly halve the PCB area if you fold the circuitry and use both sides, though you’d have to lose something. You could give up some GPIO and go back to 26 pins like the first Raspberry Pi. You could give up the camera connector, you could go to micro HDMI from mini HDMI. You could remove the SD card and just do USB boot. I’m inventing a product live on air! But really, you could get down to two thirds and lose a bunch of GPIO – it’s hard to imagine you could get to half the size.

What’s the one feature that you wish you could outfit on the Raspberry Pi that isn’t cost effective at this time? Your dream feature.

Well, more memory. There are obviously technical reasons why we don’t have more memory on there, but there are also market reasons. People ask “why doesn’t the Raspberry Pi have more memory?”, and my response is typically “go and Google ‘DRAM price’”. We’re used to the price of memory going down. And currently, we’re going through a phase where this has turned around and memory is getting more expensive again.

Machine learning would be interesting. There are machine learning accelerators which would be interesting to put on a piece of hardware. But again, they are not going to be used by everyone, so according to our method of pricing what we might add to a board, machine learning gets treated like a $50 chip. But that would be lovely to do.

Which citizen science projects using the Pi have most caught your attention?

I like the wildlife camera projects. We live out in the countryside in a little village, and we’re conscious of being surrounded by nature but we don’t see a lot of it on a day-to-day basis. So I like the nature cam projects, though, to my everlasting shame, I haven’t set one up yet. There’s a range of them, from very professional products to people taking a Raspberry Pi and a camera and putting them in a plastic box. So those are good fun.

The Raspberry Shake seismometer

And there’s Meteor Pi from the Cambridge Science Centre, that’s a lot of fun. And the seismometer Raspberry Shake – that sort of thing is really nice. We missed the recent South Wales earthquake; perhaps we should set one up at our Californian office.

How does it feel to go to bed every day knowing you’ve changed the world for the better in such a massive way?

What feels really good is that when we started this in 2006 nobody else was talking about it, but now we’re part of a very broad movement.

We were in a really bad way: we’d seen a collapse in the number of applicants applying to study Computer Science at Cambridge and elsewhere. In our view, this reflected a move away from seeing technology as ‘a thing you do’ to seeing it as a ‘thing that you have done to you’. It is problematic from the point of view of the economy, industry, and academia, but most importantly it damages the life prospects of individual children, particularly those from disadvantaged backgrounds. The great thing about STEM subjects is that you can’t fake being good at them. There are a lot of industries where your Dad can get you a job based on who he knows and then you can kind of muddle along. But if your dad gets you a job building bridges and you suck at it, after the first or second bridge falls down, then you probably aren’t going to be building bridges anymore. So access to STEM education can be a great driver of social mobility.

By the time we were launching the Raspberry Pi in 2012, there was this wonderful movement going on. Code Club, for example, and CoderDojo came along. Lots of different ways of trying to solve the same problem. What feels really, really good is that we’ve been able to do this as part of an enormous community. And some parts of that community became part of the Raspberry Pi Foundation – we merged with Code Club, we merged with CoderDojo, and we continue to work alongside a lot of these other organisations. So in the two seconds it takes me to fall asleep after my face hits the pillow, that’s what I think about.

We’re currently advertising a Programme Manager role in New Delhi, India. Did you ever think that Raspberry Pi would be advertising a role like this when you were bringing together the Foundation?

No, I didn’t.

But if you told me we were going to be hiring somewhere, India probably would have been top of my list because there’s a massive IT industry in India. When we think about our interaction with emerging markets, India, in a lot of ways, is the poster child for how we would like it to work. There have already been some wonderful deployments of Raspberry Pi, for example in Kerala, without our direct involvement. And we think we’ve got something that’s useful for the Indian market. We have a product, we have clubs, we have teacher training. And we have a body of experience in how to teach people, so we have a physical commercial product as well as a charitable offering that we think are a good fit.

It’s going to be massive.

What is your favourite BBC type-in listing?

There was a game called Codename: Druid. There is a famous game called Codename: Droid which was the sequel to Stryker’s Run, which was an awesome, awesome game. And there was a type-in game called Codename: Druid, which was at the bottom end of what you would consider a commercial game.

And I remember typing that in. And what was really cool about it was that the next month, the guy who wrote it did another article that talks about the memory map and which operating system functions used which bits of memory. So if you weren’t going to do disc access, which bits of memory could you trample on and know the operating system would survive.

See the full listing for Babbage versus Bugs in the Raspberry Pi 2018 Annual

I still like type-in listings. The Raspberry Pi 2018 Annual has a type-in listing that I wrote for a Babbage versus Bugs game. I will say that’s not the last type-in listing you will see from me in the next twelve months. And if you download the PDF, you could probably copy and paste it into your favourite text editor to save yourself some time.

The post Continued: the answers to your questions for Eben Upton appeared first on Raspberry Pi.

Ransomware Update: Viruses Targeting Business IT Servers

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/ransomware-update-viruses-targeting-business-it-servers/

Ransomware warning message on computer

As ransomware attacks have grown in number in recent months, the tactics and attack vectors also have evolved. While the primary method of attack used to be to target individual computer users within organizations with phishing emails and infected attachments, we’re increasingly seeing attacks that target weaknesses in businesses’ IT infrastructure.

How Ransomware Attacks Typically Work

In our previous posts on ransomware, we described the common vehicles used by hackers to infect organizations with ransomware viruses. Most often, trojan downloaders are distributed through malicious downloads and spam emails. The emails contain a variety of file attachments which, if opened, will download and run one of the many ransomware variants. Once a user’s computer is infected with a malicious downloader, it will retrieve additional malware, which frequently includes crypto-ransomware. After the files have been encrypted, a ransom payment is demanded of the victim in order to decrypt the files.

What’s Changed With the Latest Ransomware Attacks?

In 2016, a customized ransomware strain called SamSam began attacking servers, primarily in health care institutions. Unlike more conventional ransomware, SamSam is not delivered through downloads or phishing emails. Instead, the attackers behind SamSam use tools to identify unpatched servers running Red Hat’s JBoss enterprise products. Once the attackers have gained entry to one of these servers by exploiting vulnerabilities in JBoss, they use other freely available tools and scripts to collect credentials and gather information on networked computers. Then they deploy their ransomware to encrypt files on these systems before demanding a ransom. Gaining entry to an organization through its IT center rather than its endpoints makes this approach scalable and especially unsettling.

SamSam’s methodology is to scour the Internet for accessible and vulnerable JBoss application servers, especially ones used by hospitals. It’s not unlike a burglar rattling doorknobs in a neighborhood to find unlocked homes. When SamSam finds an unlocked home (an unpatched server), the software infiltrates the system. It is then free to spread across the company’s network by stealing passwords. As it traverses the network and systems, it encrypts files, preventing access until the victims pay the hackers a ransom, typically between $10,000 and $15,000. The low ransom amount has encouraged some victimized organizations to pay rather than incur the downtime required to wipe and reinitialize their IT systems.

The success of SamSam is due to its effectiveness rather than its sophistication. SamSam can enter and traverse a network without human intervention. Some organizations are learning too late that securing internet-facing services in their data center from attack is just as important as securing endpoints.

The typical steps in a SamSam ransomware attack are:

  1. Attackers gain access to a vulnerable server: they exploit vulnerable software or weak/stolen credentials.
  2. The attack spreads via remote access tools: attackers harvest credentials, create SOCKS proxies to tunnel traffic, and abuse RDP to install SamSam on more computers in the network.
  3. The ransomware payload is deployed: attackers run batch scripts to execute ransomware on compromised machines.
  4. A ransom demand is delivered, requiring payment to decrypt files: demand amounts vary from victim to victim, and the relatively low figures appear to be designed to encourage quick payment decisions.

What all the organizations successfully exploited by SamSam have in common is that they were running unpatched servers that made them vulnerable to SamSam. Some organizations had their endpoints and servers backed up, while others did not. Some of those without backups they could use to recover their systems chose to pay the ransom money.

Timeline of SamSam History and Exploits

Since its appearance in 2016, SamSam has been in the news with many successful incursions into healthcare, business, and government institutions.

March 2016
SamSam appears

SamSam campaign targets vulnerable JBoss servers
Attackers home in on healthcare organizations specifically, as they’re more likely to have unpatched JBoss machines.

April 2016
SamSam finds new targets

SamSam begins targeting schools and government.
After initial success targeting healthcare, attackers branch out to other sectors.

April 2017
New tactics include RDP

Attackers shift to targeting organizations with exposed RDP connections, and maintain focus on healthcare.
An attack on Erie County Medical Center costs the hospital $10 million over three months of recovery.
Erie County Medical Center attacked by SamSam ransomware virus

January 2018
Municipalities attacked

• Attack on Municipality of Farmington, NM.
• Attack on Hancock Health.
Hancock Regional Hospital notice following SamSam attack
• Attack on Adams Memorial Hospital.
• Attack on Allscripts (Electronic Health Records), which includes 180,000 physicians, 2,500 hospitals, and 7.2 million patients’ health records.

February 2018
Attack volume increases

• Attack on Davidson County, NC.
• Attack on Colorado Department of Transportation.
SamSam virus notification

March 2018
SamSam shuts down Atlanta

• Second attack on Colorado Department of Transportation.
• City of Atlanta suffers a devastating attack by SamSam.
The attack has far-reaching impacts — crippling the court system, keeping residents from paying their water bills, limiting vital communications like sewer infrastructure requests, and pushing the Atlanta Police Department to file paper reports.
Atlanta Ransomware outage alert
• SamSam campaign nets $325,000 in 4 weeks.
Infections spike as attackers launch new campaigns. Healthcare and government organizations are once again the primary targets.

How to Defend Against SamSam and Other Ransomware Attacks

The best way to respond to a ransomware attack is to avoid having one in the first place. Beyond that, making sure your valuable data is backed up and unreachable by a ransomware infection will ensure that your downtime and data loss are minimal or nonexistent if you are ever attacked.

In our previous post, How to Recover From Ransomware, we listed ten ways to protect your organization from ransomware.

  1. Use anti-virus and anti-malware software or other security policies to block known payloads from launching.
  2. Make frequent, comprehensive backups of all important files and isolate them from local and open networks. In a recent survey, 74% of cybersecurity professionals viewed data backup and recovery as by far the most effective way to respond to a successful ransomware attack. (A minimal backup-freshness check is sketched after this list.)
  3. Keep offline backups of data stored in locations inaccessible from any potentially infected computer, such as disconnected external storage drives or the cloud, which prevents them from being accessed by the ransomware.
  4. Install the latest security updates issued by software vendors of your OS and applications. Remember to patch early and patch often to close known vulnerabilities in operating systems, server software, browsers, and web plugins.
  5. Consider deploying security software to protect endpoints, email servers, and network systems from infection.
  6. Exercise cyber hygiene, such as using caution when opening email attachments and links.
  7. Segment your networks to keep critical computers isolated and to prevent the spread of malware in case of attack. Turn off unneeded network shares.
  8. Turn off admin rights for users who don’t require them. Give users the lowest system permissions they need to do their work.
  9. Restrict write permissions on file servers as much as possible.
  10. Educate yourself, your employees, and your family in best practices to keep malware out of your systems. Update everyone on the latest email phishing scams and social engineering tactics aimed at turning victims into abettors.
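
As a small illustration of point 2, here is a minimal sketch of a backup-freshness check that warns when the newest file in a backup location is older than a chosen threshold. The path and threshold are placeholders rather than recommendations; adapt the idea to whatever backup tooling you actually use, and keep in mind that a freshness check is no substitute for keeping those backups isolated from your network.

```python
# Minimal backup-freshness check (illustrative only).
# BACKUP_DIR and MAX_AGE_HOURS are placeholders; adapt them to your environment.
import sys
import time
from pathlib import Path
from typing import Optional

BACKUP_DIR = Path("/mnt/offline-backups")  # e.g. an external drive mounted only while backing up
MAX_AGE_HOURS = 24                         # warn if the newest backup is older than this

def newest_backup_age_hours(backup_dir: Path) -> Optional[float]:
    """Return the age in hours of the most recently modified backup file, or None if none exist."""
    files = [p for p in backup_dir.rglob("*") if p.is_file()]
    if not files:
        return None
    newest_mtime = max(p.stat().st_mtime for p in files)
    return (time.time() - newest_mtime) / 3600.0

def main() -> int:
    age = newest_backup_age_hours(BACKUP_DIR)
    if age is None:
        print(f"WARNING: no backups found in {BACKUP_DIR}")
        return 1
    if age > MAX_AGE_HOURS:
        print(f"WARNING: newest backup is {age:.1f} hours old (threshold {MAX_AGE_HOURS}h)")
        return 1
    print(f"OK: newest backup is {age:.1f} hours old")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```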

Please Tell Us About Your Experiences with Ransomware

Have you endured a ransomware attack or have a strategy to avoid becoming a victim? Please tell us of your experiences in the comments.

The post Ransomware Update: Viruses Targeting Business IT Servers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.