Tag Archives: Engineering

Data first, SLA always

Post Syndicated from Grab Tech original https://engineering.grab.com/data-first-sla-always

Introducing Trailblazer, the Data Engineering team’s solution for implementing change data capture (CDC) of all upstream databases. In this article, we explain why we needed to move away from periodic batch ingestion towards a real-time solution, and show how we achieved this through an end-to-end streaming pipeline.

Context

Our mission as Grab’s Data Engineering team is to fulfill 100% of SLAs for data availability to our downstream users. Our 40-person team is responsible for providing accurate and reliable data to data analysts and data scientists so that they can produce actionable reports that help Grab’s leadership team make data-driven decisions. We maintain data for a variety of business intelligence tools such as Tableau, Presto, and Holistics, as well as predictive algorithms for all of Grab.

We ingest data from multiple upstream sources, such as relational databases, Kafka, or third-party applications such as Salesforce or Zendesk. The majority of this source data lives in MySQL, and we run ETL pipelines to mirror any updates into our data lake. These pipelines are triggered on an hourly or daily basis and are powered by an in-house Loader application which performs Spark batch ingestion and loading of data from source to sink.

Problems with the Loader application started to surface when Grab’s data exceeded the petabyte threshold. As such, for larger tables the most practical method to ingest data was to perform ETL only on rows that were updated within a specified timeframe. This is akin to issuing the query

SELECT * FROM table WHERE updated >= [start_time] AND updated < [end_time]

Now imagine two situations. One, firing this query at a huge table without an updated field. Two, firing the same query at the huge table, this time without indexes on the updated field. In the first scenario, the query will never work and we can never perform incremental ingestion on the table based on a timed window. The second scenario carries the danger of creating high CPU load on the database replica that we are querying. Neither has an ideal outcome.

One other problem that we identified was the unpredictability of growth in data volume. Tables smaller than one gigabyte were ingested by fully scanning the table and overwriting the data in the data lake. This worked out well for us until the table size increased exponentially, at which point our Spark jobs failed due to JDBC timeouts. If we were only dealing with a handful of tables, this issue could have been addressed by switching our data ingestion strategy from full scan to a timed window.

When assessing the issue, we discovered that there were hundreds of tables running under the full scan strategy, all of them potentially crashing our data system, all time bombs silently waiting to explode.

The team urgently needed a new approach to ETL. Our Loader application was highly coupled to upstream table characteristics. We needed to find solutions that were truly scalable, which meant decoupling our pipelines from the upstream.

Change data capture (CDC)

Much like event sourcing, every change logged by the database is captured and streamed out for downstream applications to consume. This process is lightweight since any row-level update to the table is instantly captured by a real-time processor, avoiding the need for large chunked queries on the table. In addition, CDC works regardless of upstream table definition, so we do not need to worry about missing updated columns impacting our data migration process.

Binary logs (binlogs) are the CDC agents of MySQL. All updates, insertions, or deletions performed on a table are captured as a series of logged events containing the past state of the row and its newly modified state. Check out the binlogs reference to find out more.

In order to persist all binlogs generated upstream, our team created a Spark Structured Streaming application called Trailblazer. Trailblazer streams all MySQL binlogs to our data lake. These binlogs serve as a foundation for us to build Presto tables for data auditing and help to remove the direct dependency of our batch ETL jobs to the source MySQL.

Trailblazer is an amalgamation of various data streaming stacks. Binlogs are captured by Debezium, which runs on Kafka Connect clusters. All binlogs are sent to our Kafka cluster, which is managed by the Data Engineering Infrastructure team, and are streamed out to a real-time bucket via a Spark Structured Streaming application. Hourly or daily ETL compaction jobs ingest the change logs from the real-time bucket to materialize tables for downstream users to consume.

CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services

 

Some statistics

To date, we are streaming hundreds of tables across 60 Spark streaming jobs, and with the constant increase in Grab’s database instances, the numbers are expected to keep growing.

Designing Trailblazer streams

We built our streaming application using Spark structured streaming 2.3. Structured streaming was designed to remove the technical aspects of provisioning streams. Developers can focus on perfecting business logic without worrying about fundamentals such as checkpoint management or reading and writing to data sources.

Key architecture for Trailblazer streaming

 

In the design phase, we made sure to follow several key principles that helped in managing our streams.

Checkpoints have to be externally managed

Structured streaming manages checkpoints both in a local directory and in a ‘_metadata’ directory on S3 buckets, such that the state of the stream can be restored in the event of failure and restart.

This is all well and good, with two exceptions. First, changing the starting point of data ingestion meant ssh-ing into the machine and manipulating metadata, which could be extremely dangerous. Second, we could not assume cluster permanence, since clusters can die and be recreated with data erased from their local disks or the distributed file system.

Our solution was a workaround at the application level. All checkpoints are stored in temporary directories with the current timestamp appended to the path (e.g. /tmp/checkpoint/job_A/1560697200/… ). A linearly increasing timestamp guarantees that the same directory will never be reused by new instances of the stream. This is why we never restore state from the local disk; instead, we store all checkpoints in a highly available Redis cluster, with the Kafka topic as the key and a JSON of partition : offset pairs as the value.

Key

debz-schema-A.schema_A.table_B

Value

{"11":19183566,"12":19295602,"13":18992606[[a]](#cmnt1)[[b]](#cmnt2)[[c]](#cmnt3)[[d]](#cmnt4)[[e]](#cmnt5)[[f]](#cmnt6),"14":19269499,"15":19197199,"16":19060873,"17":19237853,"18":19107959,"19":19188181,"0":19193976,"1":19072585,"2":19205764,"3":19122454,"4":19231068,"5":19301523,"6":19287447,"7":19418871,"8":19152003,"9":19112431,"10":19151479}
Example of how offsets are stored in Redis as Key : Value pairs

 

Fortunately, Structured Streaming provides the StreamingQueryListener class, which we can use to register checkpoints after the completion of each microbatch.
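As a rough illustration of this bookkeeping (a sketch rather than our exact implementation; the Redis host, topic name, and helper names are assumptions), the snippet below shows how per-microbatch offsets could be written to and read back from Redis in the key/value shape shown above. In a listener-based setup, the write would happen in the onQueryProgress callback.

import json
import redis  # redis-py client; host and port are placeholders

r = redis.Redis(host="checkpoint-redis.internal", port=6379, decode_responses=True)

def save_offsets(topic: str, partition_offsets: dict) -> None:
    """Persist the latest consumed offset per partition after a microbatch."""
    r.set(topic, json.dumps(partition_offsets))

def load_starting_offsets(topic: str) -> str:
    """Build the JSON accepted by the Kafka source's startingOffsets option."""
    stored = r.get(topic)
    partition_offsets = json.loads(stored) if stored else {}
    # Spark's Kafka source expects {"topic": {"partition": offset, ...}};
    # in that format -2 means earliest and -1 means latest.
    return json.dumps({topic: {str(p): int(o) for p, o in partition_offsets.items()}})

# Example usage with the key shown above
save_offsets("debz-schema-A.schema_A.table_B", {"0": 19193976, "1": 19072585})
print(load_starting_offsets("debz-schema-A.schema_A.table_B"))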

Streams must handle 0, 1 or 1 million records

Scalability is at the heart of all well-designed applications. Spark streaming jobs are built for scalability in the face of varying data volumes.

In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing

 

There are a few settings that we can configure to influence the degree of scalability of a streaming app (a hedged configuration sketch follows this list):

  • spark.dynamicAllocation.enabled=true gives Spark autonomy to provision / revoke executors to suit the workload
  • spark.dynamicAllocation.maxExecutors controls the maximum job parallelism
  • maxOffsetsPerTrigger controls the maximum number of messages ingested from Kafka per microbatch
  • trigger controls the duration between microbatches and is a property of the DataStreamWriter class
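As a hedged sketch of how these knobs fit together (the brokers, topic, paths, and values are illustrative rather than our production settings), a PySpark job might wire them up like this:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("trailblazer-stream-example")
    # Dynamic allocation lets Spark provision and revoke executors with the workload.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)

binlogs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.internal:9092")  # placeholder brokers
    .option("subscribe", "debz-schema-A.schema_A.table_B")     # placeholder topic
    # Cap how many messages a single microbatch pulls from Kafka.
    .option("maxOffsetsPerTrigger", 100000)
    .load()
)

query = (
    binlogs.writeStream.format("parquet")
    .option("path", "s3a://realtime-bucket/table_B/")          # placeholder sink
    .option("checkpointLocation", "/tmp/checkpoint/job_A/1560697200/")
    # The trigger controls the duration between microbatches.
    .trigger(processingTime="30 seconds")
    .start()
)
query.awaitTermination()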

Data as key health indicator

Scaling the number of streaming jobs without prior collection of performance metrics is a bad idea. There is a high chance that you will discover a dead stream when checking your stream hours after initialization. I’ll cite Murphy’s law as proof.

Thus we vigilantly monitored our data streams. We used tools such as Datadog for metric monitoring, Slack for oncall issue reporting, PagerDuty for urgent cases, and our in-house data auditor as a service (DASH) for reporting count discrepancies between streamed and source data. More details on monitoring are discussed in a later section.

Streams are ephemeral

Streams may die due to a hundred and one reasons, so don’t blame yourself or your programming insecurities. Issues with upstream dependencies, such as a node within your Kafka cluster running out of disk space, could lead to partition unavailability which would crash the application. On one occasion, our streaming application was unable to resolve DNS when writing to AWS S3 storage. This amounted to multiple failures within our Spark job that eventually culminated in the termination of the stream.

In this case, allow the stream to shut down gracefully, send out your alerts, and have a mechanism in place to retry the failed stream. We run all streaming jobs on Airflow, and any failure of the stream is automatically retried through a new task issued by the scheduler.

If you have had experience with large scale management of streams, please leave a comment so we can continue this discussion!

Monitoring data streams

Here are some key features that were set up to monitor our streams.

Running : Active jobs ratio

The number of streaming jobs could increase in the future, thus becoming a challenge for the oncall team to track all jobs that are supposed to be up and running.

One proposal is to track the number of jobs in production against the number of jobs that are actually running. By querying MySQL tables, we can filter out all the jobs that are meant to be active. Since Trailblazer streams are spark-submit jobs managed by YARN, we can query YARN’s ResourceManager REST API to retrieve all the jobs that are running. We then construct a ratio of running : active jobs and report it to Datadog. If the ratio is not 1 for an extended duration, an alert is issued for the oncall to take action.

If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert
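A minimal sketch of this check is shown below, assuming a reachable YARN ResourceManager endpoint and a local Datadog agent; the metric name is made up and the MySQL lookup is stubbed out.

import requests
from datadog import statsd  # reports through the local Datadog agent

def count_running_spark_jobs(rm_host: str = "http://yarn-rm.internal:8088") -> int:
    """Count RUNNING Spark applications via the YARN ResourceManager REST API."""
    resp = requests.get(
        f"{rm_host}/ws/v1/cluster/apps",
        params={"states": "RUNNING", "applicationTypes": "SPARK"},
        timeout=10,
    )
    apps = resp.json().get("apps") or {}
    return len(apps.get("app", []))

def count_active_jobs() -> int:
    """Stub for the MySQL query that lists the streams meant to be active."""
    return 60

running, active = count_running_spark_jobs(), count_active_jobs()
# If this gauge stays below 1 for too long, a Datadog monitor pages the oncall.
statsd.gauge("trailblazer.running_to_active_ratio", running / max(active, 1))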

 

Microbatch runtime

We define a 30 second window for each microbatch and track the actual runtime using metrics reported by the query listener. A runtime that exceeds the designated window is a potential indicator that the streaming job is deprived of resources and needs to be scaled up.

Job liveliness

Each job reports its health by emitting a heartbeat count of 1. This heartbeat is created at the end of every microbatch via a query listener. This is useful in detecting stale jobs (jobs that are registered as RUNNING in YARN but are actually hung).

Kafka offset divergence

To ensure that the rate at which we consume messages keeps up with the rate at which producers write them, we sum up all presently ingested topic-partition offsets and compare that value to the sum of all topic-partition end offsets in Kafka. We then add alerting logic on top of these metrics to inform the oncall team if the difference between the two values grows too big.

It is important to track the offset divergence parameter as streams can be lagging. Should the rate of consumption fall below the rate of message production, we would run the risk of falling short of Kafka’s retention window, leading to data losses.
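The sketch below shows one way to compute this divergence, assuming the Redis checkpoint layout described earlier and the kafka-python client; hosts and topic names are placeholders.

import json
import redis
from kafka import KafkaConsumer, TopicPartition  # kafka-python

r = redis.Redis(host="checkpoint-redis.internal", decode_responses=True)
consumer = KafkaConsumer(bootstrap_servers="kafka.internal:9092")

def offset_divergence(topic: str) -> int:
    """Difference between Kafka's latest offsets and what the stream has ingested."""
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)   # TopicPartition -> latest offset
    ingested = json.loads(r.get(topic) or "{}")      # partition -> ingested offset
    return sum(end_offsets.values()) - sum(int(o) for o in ingested.values())

# Report this (e.g. to Datadog) and alert if it keeps growing: a growing gap means
# the stream is lagging and may fall outside Kafka's retention window.
print(offset_divergence("debz-schema-A.schema_A.table_B"))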

Hourly data checks

DASH runs hourly and serves as our first line of defence to detect any data quality issues within the streams. We issue queries to the source database and our streaming layer to confirm that the ID counts of data created within the last hour match.

DASH helps in the early detection of upstream issues. We have noticed cases where our Debezium connectors failed and our checker reported fewer records than expected because there were no incoming messages to Kafka.

DASH matches and mismatches reported to Slack

 

Materializing tables through compaction

Having CDC data in our data lake does not conclude our responsibilities. Batched compaction applies all captured CDC events and makes them available as Presto tables for downstream consumption. The job is set to trigger hourly and processes all changes to the database within the past hour. For example, changes to a record are visible in real time, but the latest state of the record will not be reflected until the next batch job runs. We addressed several issues with streaming during this phase.

Deduplication of data

Trailblazer was not built to deliver exactly-once guarantees. Instead, we ensure that duplicated CDC events are addressed during compaction.

Availability of all data until certain hour

We want to make sure that downstream pipelines use the output data of the hourly batch job only when the pipeline has all records for that hour. In case an event is processed late by streaming, the current pipeline waits until the data is complete. Here, we are consciously choosing consistency over availability for our downstream users. For example, missing a few insert booking records during peak hours due to consumer processing delay can generate wrong downstream results, leading to miscalculation of revenue. We want to start downstream processes only when the data for the hour or day is complete.

Need for latest state of each event

Our compaction job performs upserts on the data to ensure that our downstream users consume records in their latest state.
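A hedged sketch of the dedupe-and-upsert step is shown below; the input path, primary key column (id), binlog position column (binlog_pos), and operation column (op) are assumptions, not our actual schema.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("compaction-example").getOrCreate()

# Change logs landed by the streaming job for one hourly partition (placeholder path).
changes = spark.read.parquet("s3a://realtime-bucket/table_B/dt=2019-06-16/hour=14/")

latest = Window.partitionBy("id").orderBy(col("binlog_pos").desc())

compacted = (
    changes.withColumn("rn", row_number().over(latest))
    .where(col("rn") == 1)          # drop duplicates, keep the latest state of each row
    .where(col("op") != "delete")   # upsert semantics: deleted rows disappear
    .drop("rn")
)

compacted.write.mode("overwrite").parquet("s3a://datalake/table_B/")  # placeholder sink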

Future applications

Trailblazer is a milestone for the Data Engineering team as it represents our commitment to achieve large scale data streams to reduce latencies for our end users. Moving ahead, our team will be exploring how we can further optimize streaming jobs by analysing data trends over time and to build applications such as snapshot tables on top of the CDCs being streamed in our data lake.

How We Built A Logging Stack at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/how-built-logging-stack

And Solved Our Inhouse Logging Problem

Problem:

Let me take you back to a year ago at Grab, when we lacked any visualizations or metrics for our service logs, and when querying for a string from the last three days was something you only ran before heading out for a beverage.

When a service stops responding, Grab’s core problems were, and still are:

  • We need to know it happened before the customer does.
  • We need to know why it happened.
  • We need to solve our customers’ problems fast.

We had a hodgepodge of log-based solutions for developers when they needed to figure out the above, or why a driver never showed up, or why a customer wasn’t receiving our promised promotions. These included logs in a cloud-based storage service (which could take hours to retrieve), a SaaS provider constantly timing out on our queries, or even asking our SREs to fetch logs from the likely machines for the service engineer, a rather laborious process.

Here’s what we did with our logs to solve these problems.

Issues:

Our current size and growth rate ruled out several available logging systems. By size, we mean a LOT of data and a LOT of users who search through hundreds of billions of logs to generate reports. Or who track down that one user who managed to find that pesky corner case in our code.

When we started this project, we generated 25TB of logging data a day. Our first thought was “Do we really need all of these logs?”. To this day our feeling is “probably not”.

However, we can’t always define what another developer can and cannot do. Besides, this gave us an amazing opportunity to build something to allow for all that data!

Some of our SREs had used the ELK Stack (Elasticsearch / Logstash / Kibana). They thought it could handle our data and access loads, so it was our starting point.

How We Built a Multi-Petabyte Cluster:

Information Gathering:

It started with gathering numbers. How much data did we produce each day? How many days were retained? What’s a reasonable response time to wait for?

Before starting a project, understand your parameters. This helps you spec out your cluster, get buy-in from higher ups, and increase your success rate when rolling out a product used by the entire engineering organization. Remember, if it’s not better than what they have now, why will they switch?

A good starting point was opening the floor to our users. What features did they want? If we offered a visualization suite so they can see ERROR event spikes, would they use it? How about alerting them about SEGFAULTs? Hands down the most requested feature was speed; “I want an easy webUI that shows me the user ID when I search for it, and get all the results in <5 seconds!”

Getting Our Feet Wet:

New concerns always pop up during a project. We’re sure someone has correlated the time spent in R&D to the number of problems. We had an always-moving target, since as our proof of concept began, our daily log volume kept increasing.

Thankfully, using Elasticsearch as our data store meant we could fully utilize horizontal scaling. This let us start with a simple 5 node cluster as we built out our proof-of-concept (POC). Once we were ready to onboard more services, we could move into a larger footprint.

The specs at the time called for about 80 nodes to handle all our data. But if we designed our system correctly, we’d only need to increase the number of Elasticsearch nodes as we enrolled more customers. Our key operating metrics were CPU utilization, heap memory needed for the JVM, and total disk space.

Initial Design:

First, we set up tooling to use Ansible both to launch a machine and to install and configure Elasticsearch. Then we were ready to scale.

Our initial goal was to keep the design as simple as possible, opting to allow each node in our cluster to perform all responsibilities. In this setup, each node would act as all four of the available node types:

  • Ingest: Used for transforming and enriching documents before sending them to data nodes for indexing.
  • Coordinator: Proxy node for directing search and indexing requests.
  • Master: Used to control cluster-wide operations and maintain cluster state; master-eligible nodes form a quorum for elections.
  • Data: Nodes that hold the indexed data.

These were all design decisions made to move our proof of concept along, but in hindsight they might have created more headaches down the road with troubleshooting, indexing speed, and general stability. Remember to do your homework when spec’ing out your cluster.

It’s challenging to figure out why you are losing master nodes because someone filled up the field data cache performing a search. Separating your nodes can be a huge help in tracking down your problem.

We also decided to further reduce complexity by going with ingest nodes over Logstash. But at the time the documentation wasn’t great, so we had a lot of trial and error figuring out how they work, particularly compared to something more battle-tested like Logstash.

If you’re unfamiliar with ingest node design, they are lightweight proxies to your data nodes that accept a bulk payload, perform post-processing on documents, and then send the documents to be indexed by your data nodes. In theory, this helps keep your entire pipeline simple. And in Elasticsearch’s defense, ingest nodes have made massive improvements since we began.
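To make the idea concrete, here is a hedged sketch of registering and using an ingest pipeline with the Elasticsearch Python client; the host, pipeline id, and processors are illustrative, not our production pipeline.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-coordinator.internal:9200"])  # placeholder host

# Register a simple ingest pipeline: parse the log line and tag the environment.
es.ingest.put_pipeline(
    id="service-logs",
    body={
        "description": "Example pipeline: extract level/message and add a static tag",
        "processors": [
            {"grok": {"field": "message",
                      "patterns": ["%{LOGLEVEL:level} %{GREEDYDATA:msg}"]}},
            {"set": {"field": "environment", "value": "production"}},
        ],
    },
)

# Documents indexed with this pipeline are transformed on the ingest node
# before being handed to the data nodes for indexing.
es.index(index="logs-2019.06.16",
         body={"message": "ERROR payment timed out"},
         pipeline="service-logs")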

But adding more ingest nodes means ADDING MORE NODES! This can create a lot of chatter in your cluster and cause more complexity when troubleshooting problems. We’ve seen an ingest node failing in an odd way cause larger cluster concerns than just a failed bulk send request.

Monitoring:

This isn’t anything new, but we can’t overstate the usefulness of monitoring. Thankfully, we already had a robust tool called Datadog with an additional integration for Elasticsearch. Seeing your heap utilization over time, then breaking it into smaller graphs to display the field data cache or segment memory, has been a lifesaver. There’s nothing worse than a node falling over due to an OOM with no explanation and just hoping it doesn’t happen again.

At this point, we’ve built out several dashboards which visualize a wide range of metrics from query rates to index latency. They tell us if we sharply drop on log ingestion or if circuit breakers are tripping. And yes, Kibana has some nice monitoring pages for some cluster stats. But to know each node’s JVM memory utilization on a 400+ node cluster, you need a robust metric system.

Pitfalls:

Common Problems:

There are many blogs about the common problems encountered when creating an Elasticsearch cluster, and Elastic does a good job of keeping blog posts up to date. We strongly encourage you to read them. Of course, we ran into classic problems like ensuring our Java object pointers were compressed (hints: don’t exceed 31GB of heap for your JVM, and always confirm compressed oops are actually enabled).

But we also ran into some interesting problems that were less common. Let’s look at some major concerns you have to deal with at this scale.

Grab’s Problems:

Field Data Cache:

So, things are going well, all your logs are indexing smoothly, and suddenly you’re getting Out Of Memory (OOMs) events on your data nodes. You rush to find out what’s happening, as more nodes crash.

A visual representation of your JVM heap’s memory usage is very helpful here. You can always hit the Elasticsearch API, but after adding more than 5 nodes to your cluster this kind of breaks down. Also, you don’t just want to know what’s going on while a node is down; you want to know what happened before it died.

Using our graphs, we determined the field data cache went from virtually zero memory used in the heap to 20GB! This forced us to read up on how this value is set and, as of this writing, the default value is still 100% of the parent heap memory. Basically, this breaks down to allowing 70% of your total heap to be allocated to a single search in the form of field data.

Now, this should be a rare case and it’s very helpful to keep the field names and values in memory for quick lookup. But, if, like us, you have several trillion documents, you might want to watch out.

From our logs, we tracked down a user who was sorting by the _id field. We believe this is a design decision in how Kibana interacts with Elasticsearch. A good counter argument would be a user wants a quick memory lookup if they search for a document using the _id. But for us, this meant a user could load into memory every ID in the indices over a 14 day period.

The consequences? 20+GB of data loaded into the heap before the circuit breaker tripped. It then only took 2 queries at a time to knock a node over.

You can’t disable indexing that field, and you probably don’t want to. But you can prevent users from stumbling into this and disable the _id field in the Kibana advanced settings. And make sure you re-evaluate your circuit breakers. We drastically lowered the available field cache and removed any further issues.
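For reference, adjusting the field data circuit breaker is a one-line cluster settings call; the sketch below uses the Elasticsearch Python client with an illustrative limit, not the value we actually run with.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-coordinator.internal:9200"])  # placeholder host

# Tighten the field data circuit breaker cluster-wide (illustrative value).
es.cluster.put_settings(body={
    "persistent": {"indices.breaker.fielddata.limit": "20%"}
})

# Check what field data is currently loaded, broken down per field.
stats = es.indices.stats(metric="fielddata", fields="*")
print(stats["_all"]["total"]["fielddata"])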

Translog Compression:

At first glance, compression seems an obvious choice for shipping shards between nodes. Especially if you have the free clock cycles, why not minimize the bandwidth between nodes?

However, we found compression between nodes can drastically slow down shard transfers. By disabling compression, shipping time for a 50GB shard went from 1h to 20m. This was because Lucene segments are already compressed, a new issue we ran into at full force and are actively working with the community to fix. But it’s also a configuration to watch out for in your setup, especially if you want fast recovery of a shard.

Segment Memory:

Most of our issues involved the heap memory being exhausted. We can’t stress enough the importance of having visualizations around how the JVM is used. We learned this lesson the hard way around segment memory.

This is a prime example of why you need to understand your data when building a cluster. We were hitting a lot of OOMs and couldn’t figure out why. We had fixed the field cache issue, but what was using all our RAM?

There is a reason why having a 16TB data node might be a poorly spec’d machine. Digging into it, we realized we simply allocated too many shards to our nodes. Looking up the total segment memory used per index should give a good idea of how many shards you can put on a node before you start running out of heap space. We calculated on average our 2TB indices used about 5GB of segment memory spread over 30 nodes.

The numbers have since changed and our layout was tweaked, but we came up with calculations showing we could allocate about 8TB of shards to a node with 32GB of heap memory before running into issues. That’s if you really want to push it; we also use this metric to keep segment memory per node around 50%, which leaves enough memory to run queries without knocking out data nodes. Naturally this led us to ask “What is using all this segment memory per node?”

Index Mapping and Field Types:

Could we lower how much segment memory our indices used to cut our cluster operation costs? Using the segments data found in the ES cluster and some simple Python loops, we tracked down the total memory used per field in our index.
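The sketch below shows the flavor of those loops, summing Lucene segment memory per index from the _segments API (the per-field breakdown we describe needed additional digging); the host is a placeholder.

from collections import defaultdict
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-coordinator.internal:9200"])  # placeholder host

segments = es.indices.segments()
memory_per_index = defaultdict(int)

# Walk index -> shard -> shard copy -> segment and sum memory_in_bytes.
for index_name, index_data in segments["indices"].items():
    for shard_copies in index_data["shards"].values():
        for shard in shard_copies:
            for segment in shard["segments"].values():
                memory_per_index[index_name] += segment["memory_in_bytes"]

for index_name, used in sorted(memory_per_index.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{index_name}: {used / 1024 ** 3:.2f} GB of segment memory")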

We used a lot of segment memory for the _id field (but can’t do much about that). It also gave us a good breakdown of our other fields. And we realized we indexed fields in completely unnecessary ways. A few fields should have been integers but were keyword fields. We had fields no one would ever search against and which could be dropped from index memory.

Most importantly, this began our learning process of how tokens and analyzers work in Elasticsearch/Lucene.

Picking the Wrong Analyzer:

By default, we use Elasticsearch’s Standard Analyzer on all analyzed fields. It’s great, offering a very close approximation to how users search and it doesn’t explode your index memory like an N-gram tokenizer would.

But it does a few things we found unnecessary, and we thought we could save a significant amount of heap memory. For starters, it keeps the original tokens: the Standard Analyzer would break IDXVB56KLM into the tokens IDXVB, 56, and KLM. This usually works well, but it really hurts you if you have a lot of alphanumeric strings.

We never have a user search for a user ID as a partial value. It would be more useful to only return the entire match of an alphanumeric string. This has the added benefit of only storing the single token in our index memory. This modification alone stripped a whole 1GB off our index memory, or at our scale meant we could eliminate 8 nodes.

We can’t stress enough how cautious you need to be when changing analyzers on a production system. Throughout this process, end users were confused why search results were no longer returning or were returning weird results. There is a nice Kibana plugin that gives you a representation of how your tokens look with a different analyzer, or you can use the built-in ES tools to get the same understanding.
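The _analyze API is the quickest way to see those differences before touching a mapping. A small hedged sketch (placeholder host, built-in analyzer names):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-coordinator.internal:9200"])  # placeholder host

sample = "IDXVB56KLM"

# Compare how different built-in analyzers tokenize the same alphanumeric ID.
for analyzer in ("standard", "whitespace", "keyword"):
    result = es.indices.analyze(body={"analyzer": analyzer, "text": sample})
    tokens = [t["token"] for t in result["tokens"]]
    print(analyzer, tokens)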

Be Careful with Cloud Maintainers:

We realized that running a cluster at this scale is expensive. The hardware alone sets you back a lot, but our hidden bigger cost was cross traffic between availability zones.

Most cloud providers offer different “zones” for your machines to entice you to achieve a High-Availability environment. That’s a very useful thing to have, but you need to do a cost/risk analysis. If you migrate shards from HOT to WARM to COLD nodes constantly, you can really rack up a bill. This alone was about 30% of our total cluster cost, which wasn’t cheap at our scale.

We re-worked how our indices sat in the cluster. This let us create a different index for each zone and pin logging data so it never left the zone it was generated in. One small tweak to how we stored data cut our costs dramatically. Plus, it was a smaller scope for troubleshooting. We’d know a zone was misbehaving and could focus there vs. looking at everything.

Conclusion:

Running our own logging stack started as a challenge. We roughly knew the scale we were aiming for; it wasn’t going to be trivial or easy. A year later, we’ve gone from pipe-dream to production and immensely grown the team’s ELK stack knowledge.

We could probably fill 30 more pages with odd things we ran into, hacks we implemented, or times we wanted to pull our hair out. But we made it through, and we now provide a superior logging platform to our engineers at a significantly reduced cost, while maintaining stability.

There are many different ways we could have started knowing what we do now. For example, using Logstash over Ingest nodes, changing default circuit breakers, and properly using heap space to prevent node failures. But hindsight is 20/20 and it’s rare for projects to not change.

We suggest anyone wanting to revamp their centralized logging system look at the ELK solutions. There is a learning curve, but the scalability is outstanding and having subsecond lookup time for assisting a customer is phenomenal. But, before you begin, do your homework to save yourself weeks of troubleshooting down the road. In the end though, we’ve received nothing but praise from Grab engineers about their experiences with our new logging system.

Catwalk: Serving Machine Learning Models at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-serving-machine-learning-models-at-scale

Introduction

Grab’s unwavering ambition is to be the best Super App in Southeast Asia that adds value to the everyday for our customers. In order to achieve that, the customer experience must be flawless for each and every Grab service. Let’s take our frequently used ride-hailing service as an example. We want fair pricing for both drivers and passengers, accurate estimation of ETAs, effective detection of fraudulent activities, and ensured ride safety for our customers. The key to perfecting these customer journeys is artificial intelligence (AI).

Grab has a tremendous amount of data that we can leverage to solve complex problems such as fraudulent user activity, and to provide our customers personalized experiences on our products. One of the tools we are using to make sense of this data is machine learning (ML).

As Grab made giant strides towards increasingly using machine learning across the organization, more and more teams were organically building model serving solutions for their own use cases. Unfortunately, these model serving solutions required data scientists to understand the infrastructure underlying them. Moreover, there was a lot of overlap in the effort it took to build these model serving solutions.

That’s why we came up with Catwalk: an easy-to-use, self-serve, machine learning model serving platform for everyone at Grab.

Goals

To determine what we wanted Catwalk to do, we first looked at the typical workflow of our target audience – data scientists at Grab:

  • Build a trained model to solve a problem.
  • Deploy the model to their project’s particular serving solution. If this involves writing to a database, then the data scientists need to programmatically obtain the outputs, and write them to the database. If this involves running the model on a server, the data scientists require a deep understanding of how the server scales and works internally to ensure that the model behaves as expected.
  • Use the deployed model to serve users, and obtain feedback such as user interaction data. Retrain the model using this data to make it more accurate.
  • Deploy the retrained model as a new version.
  • Use monitoring and logging to check the performance of the new version. If the new version is misbehaving, revert back to the old version so that production traffic is not affected. Otherwise run an AB test between the new version and the previous one.

We discovered an obvious pain point – the process of deploying models requires additional effort and attention, which results in data scientists being distracted from their problem at hand. Apart from that, having many data scientists build and maintain their own serving solutions meant there was a lot of duplicated effort. With Grab increasingly adopting machine learning, this was a state of affairs that could not be allowed to continue.

To address the problems, we came up with Catwalk with goals to:

  1. Abstract away the complexities and expose a minimal interface for data scientists
  2. Prevent duplication of effort by creating an ML model serving platform for everyone in Grab
  3. Create a highly performant, highly available ML model serving platform with model versioning support, and integrate it with the existing monitoring systems at Grab
  4. Shorten time to market by making model deployment self-service

What is Catwalk?

In a nutshell, Catwalk is a platform where we run Tensorflow Serving containers on a Kubernetes cluster integrated with the observability stack used at Grab.

In the next sections, we are going to explain the two main components in Catwalk – Tensorflow Serving and Kubernetes, and how they help us obtain our outlined goals.

What is Tensorflow Serving?

Tensorflow Serving is an open-source ML model serving project by Google. In Google’s own words, “Tensorflow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Tensorflow Serving provides out-of-the-box integration with Tensorflow models, but can be easily extended to serve other types of models and data.”

Why Tensorflow Serving?

There are a number of ML model serving platforms in the market right now. We chose Tensorflow Serving because of these three reasons, ordered by priority:

  1. Highly performant. It has proven performance handling tens of millions of inferences per second at Google according to their website.
  2. Highly available. It has a model versioning system to make sure there is always a healthy version being served while loading a new version into its memory
  3. Actively maintained by the developer community and backed by Google

By default, Tensorflow Serving only supports models built with Tensorflow, but this is not a constraint because Grab is actively moving toward using Tensorflow.

How are we using Tensorflow Serving?

In this section, we will explain how we are using Tensorflow Serving and how it helps abstract away complexities for data scientists.

Here are the steps showing how we are using Tensorflow Serving to serve a trained model:

  1. Data scientists export the model using the tf.saved_model API and drop it into an S3 models bucket. The exported model is a folder containing model files that can be loaded into Tensorflow Serving (a hedged sketch of this step follows the diagram below).
  2. Data scientists are granted permission to manage their folder.
  3. We run Tensorflow Serving and point it to load the model files directly from the S3 models bucket. Tensorflow Serving supports loading models directly from S3 out of the box. The model is served!
  4. Data scientists come up with a retrained model. They export and upload it to their model folder.
  5. As Tensorflow Serving keeps watching the S3 models bucket for new models, it automatically loads the retrained model and serves. Depending on the model configuration, it can either gracefully replace the running model version with a newer version or serve multiple versions at the same time.
Tensorflow Serving Diagram
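A hedged sketch of the export-and-upload step is shown below; the model, bucket, and prefix are made up, and it uses the TF 2.x tf.saved_model.save API together with boto3 rather than any Catwalk-specific tooling.

import os
import boto3
import tensorflow as tf

# Train (or load) a model, then export it in SavedModel format under a version folder.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
export_dir = "/tmp/pricing/1"                 # "1" is the model version
tf.saved_model.save(model, export_dir)

# Upload the exported folder to the team's prefix in the S3 models bucket.
s3 = boto3.client("s3")
bucket, prefix = "catwalk-models", "pricing"  # placeholder bucket and prefix
for root, _, files in os.walk(export_dir):
    for name in files:
        local_path = os.path.join(root, name)
        key = os.path.join(prefix, os.path.relpath(local_path, "/tmp/pricing"))
        s3.upload_file(local_path, bucket, key)

# Tensorflow Serving pointed at s3://catwalk-models/pricing will pick up version 1
# (and any later versions) automatically.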

The only interface to data scientists is a path to their model folder in the S3 models bucket. To update their model, they upload exported models to their folder and the models will automatically be served. The complexities are gone. We’ve achieved one of the goals!

Well, not really…

Imagine you are going to run Tensorflow Serving to serve one model in a cloud provider, which means you need a compute resource from a cloud provider to run it. Running it on one box doesn’t provide high availability, so you need another box running the same model. Auto scaling is also needed in order to scale out based on the traffic. On top of these many boxes lies a load balancer. The load balancer evenly spreads incoming traffic to all the boxes, thus ensuring that there is a single point of entry for any clients, which can be abstracted away from the horizontal scaling. The load balancer also exposes an HTTP endpoint to external users. As a result, we form a Tensorflow Serving cluster that is ready to serve.

Next, imagine you have more models to deploy. You have three options

  1. Load the models into the existing cluster – having one cluster serve all models.
  2. Spin up a new cluster to serve each model – having multiple clusters, one cluster serves one model.
  3. Combination of 1 and 2 – having multiple clusters, one cluster serves a few models.

The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.

The second option will definitely work, but it doesn’t sound like an effective process, as you need to create a set of resources every time you have a new model to deploy. Additionally, how do you optimize the usage of resources? For example, there might be unutilized resources in some clusters that could potentially be shared by the rest.

The third option looks promising: you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is you have to manually manage it. Managing 100 models using 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause problems, as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other model won’t be able to serve anymore.

Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that is exactly what Kubernetes is meant to do!

So what is Kubernetes?

Kubernetes abstracts a cluster of physical/virtual hosts (such as EC2) into a cluster of logical hosts (pods in Kubernetes terms). It provides a container-centric management environment. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads.

Let’s look at some of the definitions of Kubernetes resources

Tensorflow Serving Diagram
  • Cluster – a cluster of nodes running Kubernetes.
  • Node – a node inside a cluster.
  • Deployment – a configuration that tells Kubernetes the desired state of an application. It also takes care of rolling out an update (canary, percentage rollout, etc.), rolling back, and horizontal scaling.
  • Pod – a single processing unit. In our case, Tensorflow Serving runs as a container in a pod. Pods can have CPU/memory limits defined.
  • Service – an abstraction layer that abstracts out a group of pods and exposes the application to clients.
  • Ingress – a collection of routing rules that govern how external users access services running in a cluster.
  • Ingress Controller – a controller responsible for reading the ingress information and processing that data accordingly such as creating a cloud-provider load balancer or spinning up a new pod as a load balancer using the rules defined in the ingress resource.

Essentially, we deploy resources to tell Kubernetes the desired state of our application, and Kubernetes will make sure that it is always the case.

How are we using Kubernetes?

In this section, we will walk you through how we deploy Tensorflow Serving in a Kubernetes cluster and how it makes managing model deployments very convenient.

We used a managed Kubernetes service to create a Kubernetes cluster, and manually provisioned compute resources as nodes. As a result, we have a Kubernetes cluster with nodes that are ready to run applications.

An application to serve one model consists of

  1. Two or more Tensorflow Serving pods that serve a model, with an autoscaler to scale pods based on resource consumption
  2. A load balancer to evenly spread incoming traffic to pods
  3. An exposed HTTP endpoint to external users

In order to deploy the application, we need to

  1. Deploy a deployment resource (sketched below) specifying:
      • the number of Tensorflow Serving pods
      • an S3 url for Tensorflow Serving to load model files from
  2. Deploy a service resource to expose it
  3. Deploy an ingress resource to define an HTTP endpoint url
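Below is a hedged sketch of the deployment step using the Kubernetes Python client; the namespace, labels, image tag, resource limits, and environment variables are illustrative (the stock tensorflow/serving image reads MODEL_NAME and MODEL_BASE_PATH, but your image and S3 credentials setup may differ).

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "pricing-tfs", "labels": {"app": "pricing-tfs"}},
    "spec": {
        "replicas": 2,  # two or more pods per model
        "selector": {"matchLabels": {"app": "pricing-tfs"}},
        "template": {
            "metadata": {"labels": {"app": "pricing-tfs"}},
            "spec": {
                "containers": [{
                    "name": "tensorflow-serving",
                    "image": "tensorflow/serving:1.13.0",  # placeholder version
                    "ports": [{"containerPort": 8501}],    # REST port
                    "env": [
                        {"name": "MODEL_NAME", "value": "pricing"},
                        # S3 url of the exported model files (placeholder bucket)
                        {"name": "MODEL_BASE_PATH", "value": "s3://catwalk-models"},
                    ],
                    "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
                }]
            },
        },
    },
}

client.AppsV1Api().create_namespaced_deployment(namespace="catwalk", body=deployment)
# A service and an ingress resource would be created the same way to expose the pods.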

Kubernetes then allocates Tensorflow Serving pods to the cluster, with the number of pods according to the value defined in the deployment resource. Pods can be allocated to any node inside the cluster; Kubernetes makes sure that the node it allocates a pod to has sufficient resources for what the pod needs. In case no node has sufficient resources, we can easily scale out the cluster by adding new nodes to it.

In order for the rules defined in the ingress resource to work, the cluster must have an ingress controller running, which is what guided our choice of load balancer. What an ingress controller does is simple: it keeps checking the ingress resource, creates a load balancer, and defines rules based on the rules in the ingress resource. Once the load balancer is configured, it will be able to redirect incoming requests to the Tensorflow Serving pods.

That’s it! We have a scalable Tensorflow Serving application that serves a model through a load balancer! In order to serve another model, all we need to do is to deploy the same set of resources but with the model’s S3 url and HTTP endpoint.

To illustrate what is running inside the cluster, let’s see how it looks when we deploy two applications: one serving a pricing model and another serving a fraud-check model. Each application is configured to have two Tensorflow Serving pods and is exposed at /v1/models/model.

Tensorflow Serving Diagram

There are two Tensorflow Serving pods that serve the fraud-check model, exposed through a load balancer. The same goes for the pricing model; the only differences are the model being served and the exposed HTTP endpoint url. The load balancer rules for the pricing and fraud-check models look like this:

If                              | Then forward to
--------------------------------|--------------------------------------------
Path is /v1/models/pricing      | pricing pod ip-1, pricing pod ip-2
Path is /v1/models/fraud-check  | fraud-check pod ip-1, fraud-check pod ip-2
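Once the ingress is in place, clients call the model over plain HTTP. A hedged example (the hostname and feature vector are placeholders; the input shape depends entirely on the model’s signature):

import requests

# The ingress routes /v1/models/pricing to the pricing pods.
url = "http://catwalk.internal/v1/models/pricing:predict"

# Tensorflow Serving's REST API expects an "instances" list.
payload = {"instances": [[1.0, 3.2, 0.0, 7.5]]}

resp = requests.post(url, json=payload, timeout=2)
print(resp.json())  # {"predictions": [...]}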

Stats and Logs

The last piece is how stats and logs work. Before getting to that, we need to introduce the DaemonSet. According to the documentation, a DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created.

We deployed datadog-agent and filebeat as a DaemonSet. As a result, we always have one datadog-agent pod and one filebeat pod in all nodes and they are accessible from Tensorflow Serving pods in the same node. Tensorflow Serving pods emit a stats event for every request to datadog-agent pod in the node it is running in.

Here is a sample of DataDog stats:

DataDog stats

And logs that we put in place:

Logs

Benefits Gained from Catwalk

Catwalk has become the go-to, centralized system to serve machine learning models. Data scientists are not required to take care of the serving infrastructure, so they can focus on what matters most: coming up with models to solve customer problems. They are only required to provide exported model files and an estimate of expected traffic so that sufficient resources can be prepared to run their model. In return, they are presented with an endpoint to make inference calls to their model, along with all the necessary tools for monitoring and debugging. Updating the model version is self-service, and the model improvement cycle is much shorter than before. We used to count in days; we now count in minutes.

Future Plans

Improvement on Automation

Currently, the first deployment of any model still needs some manual work from the platform team. We aim to automate this process entirely. We’ll work with our awesome CI/CD team, who are making the best use of Spinnaker.

Model serving on mobile devices

As a platform, we are looking at setting standards for model serving across Grab. This includes model serving on mobile devices as well. Tensorflow Serving also provides a Lite version to be used on mobile devices. It is a whole new paradigm with vastly different tradeoffs for machine learning practitioners. We are quite excited to set some best practices in this area.

gRPC support

Catwalk currently supports HTTP/1.1. We’ll hook Grab’s service discovery mechanism to open gRPC traffic, which TFS already supports.

If you are interested in building pipelines for machine learning related topics, and you share our vision of driving South East Asia forward, come join us!

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Post Syndicated from Kavita Ganesan original https://github.blog/2019-07-02-c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, only known to very small communities.

JavaScript is the top programming language on GitHub, followed by Java and HTML
Figure 1: Top 10 programming languages hosted by GitHub by repository count

One of the necessary challenges that GitHub faces is to be able to recognize these different languages. When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.

Despite the appearance, language recognition isn’t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” is commonly used to indicate many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance. 

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python and Keras with a TensorFlow backend, and is designed to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language.

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories internally crowdsourced. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, the coverage across languages was not uniform: some languages had only a handful of files and some files were just too sparsely populated. Augmenting our training set with some additional sources was therefore necessary, and it substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet a minimum qualifying criteria such as having a minimum number of forks, covering the target language and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist. 

Features: leveraging prior knowledge

Traditionally, for text classification problems with Neural Networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short Term Memory networks (LSTM) are often employed. However, given that programming languages have differences in vocabulary, commenting style, file extensions, structure, library import style, and other minor differences, we opted for a simpler approach that leverages all this information by extracting relevant features in tabular form to be fed to our classifier (a simplified extraction sketch follows this list). The features currently extracted are as follows:

  1. Top five special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons
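The sketch below is a simplified version of this kind of feature extraction, not OctoLingua’s actual code; the token and special-character definitions, counts, and marker set are assumptions.

import os
import re
from collections import Counter

SPECIAL = re.compile(r"[^\w\s]")
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
MARKERS = {"colon": ":", "semicolon": ";", "lbrace": "{", "rbrace": "}"}

def extract_features(path: str) -> dict:
    """Turn one source file into a row of tabular features."""
    with open(path, errors="ignore") as f:
        text = f.read()
    return {
        "top_special_chars": [c for c, _ in Counter(SPECIAL.findall(text)).most_common(5)],
        "top_tokens": [t for t, _ in Counter(TOKEN.findall(text)).most_common(20)],
        "extension": os.path.splitext(path)[1].lstrip("."),
        **{f"has_{name}": marker in text for name, marker in MARKERS.items()},
    }

print(extract_features("example.py"))  # placeholder path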

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with Tensorflow backend. 

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

image
Figure 2: The ANN Structure of our initial model (50 languages + 1 for “other”)
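As a hedged sketch of this architecture (the layer sizes, dropout rate, and feature dimension are assumptions, not the published configuration), a comparable two-layer network in Keras looks like this:

from tensorflow import keras

NUM_FEATURES = 100   # size of the tabular feature vector (assumption)
NUM_CLASSES = 51     # top 50 languages plus one "other" class

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(NUM_FEATURES,)),
    keras.layers.Dropout(0.5),                       # regularization between the layers
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()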

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data during training to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 Score (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source). 

Here we show three tests. The first test is with the test set untouched in any way. The second test uses the same set of test files with file extension information removed, and the third test also uses the same set of files but this time with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file may have a “.java” extension).

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains a good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e. file extension), whereas Linguist fails as soon as the information on file extensions is altered.

image
Figure 3: Performance of OctoLingua vs. Linguist on the same test set

 

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time. 

image
Figure 4: Performance of OctoLingua with different percentage of file extensions removed on our three test variations

Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features. 

Supporting a new language

Adding a new language in OctoLingua is fairly straightforward. It starts with obtaining a bulk of files in the new language (we can do this programmatically as described in data sources). These files are split into a training and a test set and then run through our preprocessor and feature extractor. This new train and test set is added to our existing pool of training and testing data. The new testing set allows us to verify that the accuracy of our model remains acceptable.

image
Figure 5: Adding a new language with OctoLingua

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but it does not yet support all coding languages on our platform. Aside from broadening language support—which would be rather straightforward—we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn’t be too far-fetched to take the model to the stage where it can reliably detect and classify embedded languages.

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering—all of this aimed at supporting developers in their day to day development work in addition to helping them write quality code.  If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!

Authors

The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.

Atom editor is now faster

Post Syndicated from Rafael Oleza original https://github.blog/2019-06-12-atom-editor-is-now-faster/

The Atom Team improved a few of Atom’s most common features—they’re now dramatically faster and ready to help you be even more productive.

Fuzzy Finder

The fuzzy finder is one of the most popular features in Atom, and it’s used by nearly every user to quickly open files by name. We wanted to make it even better by speeding up the process. Atom version 1.38 introduces an experimental fast mode that brings drastic speed improvements to any project. With this mode, indexing a medium or large project is roughly six times faster than when using the standard mode.

Medium to large projects are crawled six times quicker using fast mode.

This change also made the process to display filtered results in Atom about 11 times faster. You’ll notice the difference in speed as soon as you start searching for a query.

Medium to large projects are filtered 11 times quicker using fast mode.

What’s next

Our goal is to make the new experimental mode the default option on Atom version 1.39 when it’s released next month. Until then, switch to this mode by clicking the try experimental fast mode option in the fuzzy finder.

The next Atom version 1.39 will also introduce performance improvements to the find and replace package. We’re adding a new mode to make searching for files between 10 and 20 times faster than the current mode.

Thank you

We’d like to thank the open source community since it has allowed us to build and improve Atom over the years. More specifically, we’ve made these performance improvements by building on the shoulders of two big giants:

Try out Atom.io

The post Atom editor is now faster appeared first on The GitHub Blog.

Direct instruction marking in Ruby 2.6

Post Syndicated from Aaron Patterson original https://github.blog/2019-06-04-direct-instruction-marking-in-ruby-2-6/

We recently upgraded GitHub to use the latest version of Ruby 2.6. Ruby 2.6 contains an optimization for reducing memory usage. We’ve found it to reduce the “post-boot live heap” by about 3 percent. The “post-boot live heap” is the set of objects still referenced and not garbage collected after booting our Rails application, but before accepting any requests.

Ruby’s virtual machine

MRI (Matz’s Ruby Implementation) uses a stack-based virtual machine. Each instruction manipulates a stack, and that stack can be thought of as a sort of “scratch space”. For example, the program 3 + 5 could be represented with this instruction sequence:

push 3
push 5
add

As the virtual machine executes, each instruction manipulates the stack:

Ruby VM Stack Manipulation

Abstract Syntax Trees in Ruby

Before the virtual machine has something to execute, the code must go through a few different processing phases. The code is tokenized, parsed, turned into an AST (or Abstract Syntax Tree), and finally the AST is converted to byte code. The byte code is what the virtual machine will eventually execute.

Let’s look at an example Ruby program:

"hello" + "world"

This code is turned into an AST, which is a tree data structure. Each node in the tree is represented internally by an object called a T_NODE object. The tree for this code will look like this:

Abstract Syntax Tree of hello + world

Some of the T_NODE objects reference literals. In this case some T_NODE objects reference the string literals “hello” and “world”. Those string literals are allocated via Ruby’s Garbage Collector. They are just like any other string in Ruby, except that they happen to be allocated as the code is being parsed.

Translating the AST to instructions

Instructions and their operands can only be represented as integers. The instruction sequences for a program are just a list of integers, and it’s up to the virtual machine to interpret the meaning of those integers.

The compilation process produces the list of integers that the virtual machine will interpret. To compile a program, we simply walk the tree translating nodes and their operands to integers.  The program "hello" + "world" will result in three instructions: two push operations and one add operation. The push instructions have "hello" and "world" as operands.

There are a fixed number of instructions, and we can represent each instruction with an integer. In this case, let’s use the number 7 for push and the number 9 for add. But how can we convert the strings “hello” and “world” to integers?

Half translation of AST to instructions

In order to represent these strings in the instruction sequences, the compiler will add the address of the Ruby object that represents each string. The virtual machine knows that the operand to the push instruction is actually the address of a Ruby object, and will act appropriately. This means that the final instructions will look like this:

Full translation of AST to instructions

After the AST is fully processed, it is thrown away and only the instruction sequences remain:

Instructions after AST is gone

 

Literal liveness in Ruby 2.5

Instruction Sequences are Ruby objects and are managed via Ruby’s garbage collector. As mentioned earlier, string literals are also Ruby objects and are managed by the garbage collector. If the string literals are not marked, they could be collected, and the instruction sequences would point to an invalid address.

To prevent these literal objects from being collected, Ruby 2.5 would maintain a “mark array”. The mark array is simply a Ruby array that contains references to all literals referenced for that set of instruction sequences:

InstructionSequences with Mark Array

Both the mark array and the instructions contain references to the string literals found in the code. But the instruction sequences depend on the mark array to keep the literals from being collected.

Literal liveness in Ruby 2.6+

Ruby 2.6 introduced a patch that eliminates this mark array. When instruction sequences are marked, rather than marking an array, it disassembles the instructions and marks instruction operands that were allocated via the garbage collector. This disassembly process means that the mark array can be completely eliminated:

Instruction Sequences without Mark Array

We found that this reduced the number of live objects in our heap by three percent after the application starts.

Performance

Of course, disassembling instructions is more expensive than iterating an array. We found that only 30 percent of instruction sequences actually contain references to objects that need marking. In order to prevent needless disassembly, instructions that contain objects allocated from Ruby’s garbage collector are flagged at compile time, and only those instruction sequences are disassembled during mark time. On top of this, instruction sequence objects typically become “old”. This means that thanks to Ruby’s generational garbage collector, they are examined very infrequently. As a result, we observed memory reduction with zero cost to throughput.

Have a good day!

The post Direct instruction marking in Ruby 2.6 appeared first on The GitHub Blog.

React Native in GrabPay

Post Syndicated from Grab Tech original https://engineering.grab.com/react-native-in-grabpay

Overview

It wasn’t too long ago that Grab formed a new team, GrabPay, to improve the cashless experience in Southeast Asia and to venture into the promising mobile payments arena. To support the work, Grab also decided to open a new R&D center in Bangalore.

It was an exciting journey for the team from the very beginning, as it gave us the opportunity to experiment with new cutting edge technologies. Our first release was the GrabPay Merchant App, the first all React Native Grab App. Its success gave us the confidence to use React Native to optimize the Grab PAX app.

React Native is an open source mobile application framework. It lets developers use React (a JavaScript library for building user interfaces) with native platform capabilities. Its two big advantages are:

  • We could make cross-platform mobile apps and components completely in JavaScript.
  • Its hot reloading feature significantly reduced development time.

This post describes our work on developing React Native components for Grab apps, the challenges faced during implementation, what we learned from other internal React Native projects, and our future roadmap.

Before embarking on our work with React Native, these were the goals we set out. We wanted to:

  • Have reusable code between Android and iOS as well as across various Grab apps (Driver app, Merchant app, etc.).
  • Have a single codebase to minimize the effort needed to modify and maintain our code long term.
  • Match the performance and standards of existing Grab apps.
  • Use as few Engineering resources as possible.

Challenges

Many Grab teams located across Southeast Asia and in the United States support the App platform. It was hard to convince all of them to add React Native as a project dependency and write new feature code with React Native. In particular, adding the React Native dependency significantly increases a project’s binary size.

But the initial cost was worth it. We now have only a few modules, all written in React Native:

  • Express
  • Transaction History
  • Postpaid Billpay

As there is only one codebase instead of two, the modules take half the maintenance resources. Debugging is faster with React Native’s hot reloading. And it’s much easier and faster to implement one of our modules in another app, such as DAX.

Another challenge was creating a universally acceptable format for a bridging library to communicate between existing code and React Native modules. We had to define fixed guidelines to create new bridges and define communication protocols between React Native modules and existing code.

Invoking a module written in React Native from a Native Module (written in a standard computer language such as Swift or Kotlin) should follow certain guidelines. Once all Grab’s tech families reached consensus on solutions to these problems, we started making our bridges and doing the groundwork to use React Native.

Foundation

On the native side, we used the Grablet architecture to add our React Native modules. Grablet gave us a wonderful opportunity to scale our Grab platform so it could be used by any tech family to plug and play their module. And the module could be in any of Native, React Native, Flutter, or Web.

We also created a framework encapsulating all the project’s React Native Binaries. This simplified the React Native Upgrade process. Dependencies for the framework are react, react-native, and react-native-event-bridge.

We had some internal proof of concept projects for determining React Native’s performance on different devices, as discussed here. Many teams helped us make an extensive set of JS bridges for React Native in Android and iOS. Oleksandr Prokofiev wrote this bridge creation example:

public final class DeviceKitModule: NSObject, RCTBridgeModule {
  private let deviceKit: DeviceKitService

  public init(deviceKit: DeviceKitService) {
    self.deviceKit = deviceKit
    super.init()
  }
  public static func moduleName() -> String {
    return "DeviceKitModule"
  }
  public func methodsToExport() -> [RCTBridgeMethod] {
    let methods: [RCTBridgeMethod?] = [
      buildGetDeviceID()
      ]
    return methods.compactMap { $0 }
  }

  private func buildGetDeviceID() -> BridgeMethodWrapper? {
    return BridgeMethodWrapper("getDeviceID", { [weak self] (_: [Any], _, resolve) in
      let value = self?.deviceKit.getDeviceID()
      resolve(value)
    })
  }
}

GrabPay Components and React Native

The GrabPay Merchant App gave us a good foundation for React Native in terms of

  • Component libraries
  • Networking layer and api middleware
  • Real world data for internal assessment of performance and stability

We used this knowledge to build the Transaction History and GrabPay Digital Marketplace components inside the Grab PAX app with React Native.

Component Library

We selected particularly useful components from the Merchant App codebase such as GPText, GPTextInput, GPErrorView, and GPActivityIndicator. We expanded that selection to a common (internal) component library of approximately 20 stateless and stateful components.

API Calls

We used to make API calls using axios (now deprecated). We now make calls from the native side using bridges that return a promise and make API calls using an existing framework. This helped us remove the dependency on getting an access token from Native-Android or Native-iOS to make the calls. It also helped us optimize the API requests, as suggested by Parashuram from Facebook’s React Native team.

Locale

We use React Localize Redux for all our translations and moment for our date-time conversions based on the device’s current locale. We currently support translation in five languages: English, Chinese Simplified, Bahasa Indonesia, Malay, and Vietnamese. This Swift code shows how we get the device’s current locale from the native side of the React Native bridge.

public func methodsToExport() -> [RCTBridgeMethod] {
    let methods: [RCTBridgeMethod?] = [
      BridgeMethodWrapper("getLocaleIdentifier", { (_, _, resolver) in
        let localeIdentifier = self.locale.getLocaleIdentifier()
        resolver(localeIdentifier)
      })]
    return methods.compactMap { $0 }
  }

Redux

Redux is an extremely lightweight, predictable state container that behaves consistently in every environment. We use Redux with React Native to manage application state.

For in-app navigation we use react-navigation. It is very flexible in adapting to both the Android and iOS navigation and gesture recognition styles.

End Product

After setting up our foundation bridges and porting skeleton boilerplate code from the GrabPay Merchant App, we wrote two payments modules using React Native: GrabPay Digital Marketplace (also known as BillPay) and Transaction History.

Grab app - Selecting a company

The iOS version is on the left and the Android version is on the right. Not only do their UIs look identical, their code is identical too. A single codebase lets us debug faster, deliver quicker, and maintain smaller (codebases; apologies, but the joke was necessary).

Grab app - Company selected

We launched BillPay first in Indonesia, then in Vietnam and Malaysia. So far, it’s been a very stable product with little to no downtime.

Transaction History started in Singapore and is now rolling out in other countries.

Flow For BillPay

BillPay Flow

The above shows BillPay’s flow.

  1. We start with the first screen, called Biller List. It shows all the postpaid billers available for the current region. For now, we show Billers based on which country the user is in. The user selects a biller.
  2. We then ask for the customer ID (or prefill it if the user has paid this bill before). The amount is either fetched from the backend or filled in by the user, depending on the region and biller type.
  3. Next, the user confirms all the entered details before they pay the dues.
  4. Finally, the user sees their bill payment receipt. It comes directly from the biller, and so it’s a valid proof of payment.

Our React Native version keeps the same experience as our natively developed app and helps users pay their bills seamlessly and hassle-free.

Future

We are moving code to TypeScript to reduce compile-time bugs and clean up our code. In addition to reducing native dependencies, we will refactor modules as needed. We also aim for 100% unit test code coverage. But most importantly, we plan to open source our component library as soon as we feel it is stable.

Preventing Pipeline Calls from Crashing Redis Clusters

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-pipeline-calls-from-crashing-redis-clusters

Introduction

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed and required a replacement. During this period, only roughly 1 in 21 calls to Apollo, a primary transport booking service, succeeded. This brought Grab rides down significantly for the one minute it took the Redis Cluster to self-recover. This behavior was completely unexpected and defeated the purpose of having multiple replicas.

This blog post describes Grab’s outage post-mortem findings.

Understanding the infrastructure

With Grab’s continuous growth, our services must handle large amounts of data traffic involving high processing power for reading and writing operations. To address this significant growth, reduce handler latency, and improve overall performance, many of our services use Redis – a common in-memory data structure storage – as a cache, database, or message broker. Furthermore, we use a Redis Cluster, a distributed implementation of Redis, for shorter latency and higher availability.

Apollo is our driver-side state machine. It is on almost all requests’ critical path and is a primary component for booking transport and providing great service for customer bookings. It stores individual driver availability in an AWS ElastiCache Redis Cluster, letting our booking service efficiently assign jobs to drivers. It’s critical to keep Apollo running and available 24/7.

Apollo's infrastructure

Because of Apollo’s significance, its Redis Cluster has 3 shards, each with 2 slaves. It hashes all keys and, according to the hash value, divides them into three partitions. Each partition has two replicas to increase reliability.

We use the Go-Redis client, a popular Redis library, to direct all written queries to the master nodes (which then write to their slaves) to ensure consistency with the database.

Master and slave nodes in the Redis Cluster

For reading related queries, engineers usually turn on the ReadOnly flag and turn off the RouteByLatency flag. These effectively turn on ReadOnlyFromSlaves in the Grab gredis3 library, so the client directs all reading queries to the slave nodes instead of the master nodes. This load distribution frees up master node CPU usage.

Client reading and writing from/to the Redis Cluster

When designing a system, we consider potential hardware outages and network issues. We also think of ways to ensure our Redis Cluster is highly efficient and available; setting the above-mentioned flags help us achieve these goals.

Ideally, this Redis Cluster configuration would not cause issues even if a master or slave node breaks. Apollo should still function smoothly. So, why did that February Apollo outage happen? Why did a single down slave node cause a 95+% call failure rate to the Redis Cluster during the dim-out time?

Let’s start by discussing how to construct a local Redis Cluster step by step, then try and replicate the outage. We’ll look at the reasons behind the outage and provide suggestions on how to use a Redis Cluster client in Go.

How to set up a local Redis Cluster

1. Download and install Redis from here.

2. Set up configuration files for each node. For example, in Apollo, we have 9 nodes, so we need to create 9 files like this, each with a different port number (x).

// file_name: node_x.conf (do not include this line in file)

port 600x

cluster-enabled yes

cluster-config-file cluster-node-x.conf

cluster-node-timeout 5000

appendonly yes

appendfilename node-x.aof

dbfilename dump-x.rdb

3. Initiate each node in an individual terminal tab with:

$PATH/redis-4.0.9/src/redis-server node_1.conf

4. Use this Ruby script to create a Redis Cluster. (Each master has two slaves.)

$PATH/redis-4.0.9/src/redis-trib.rb create --replicas 2 127.0.0.1:6001 ..... 127.0.0.1:6009

>>> Performing Cluster Check (using node 127.0.0.1:6001)

M: 7b4a5d9a421d45714e533618e4a2b3becc5f8913 127.0.0.1:6001

   slots:0-5460 (5461 slots) master

   2 additional replica(s)

S: 07272db642467a07d515367c677e3e3428b7b998 127.0.0.1:6007

   slots: (0 slots) slave

   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8

S: 65a9b839cd18dcae9b5c4f310b05af7627f2185b 127.0.0.1:6004

   slots: (0 slots) slave

   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913

M: 05363c0ad70a2993db893434b9f61983a6fc0bf8 127.0.0.1:6003

   slots:10923-16383 (5461 slots) master

   2 additional replica(s)

S: a78586a7343be88393fe40498609734b787d3b01 127.0.0.1:6006

   slots: (0 slots) slave

   replicates 72306f44d3ffa773810c810cfdd53c856cfda893

S: e94c150d910997e90ea6f1100034af7e8b3e0cdf 127.0.0.1:6005

   slots: (0 slots) slave

   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8

M: 72306f44d3ffa773810c810cfdd53c856cfda893 127.0.0.1:6002

   slots:5461-10922 (5462 slots) master

   2 additional replica(s)

S: ac6ffbf25f48b1726fe8d5c4ac7597d07987bcd7 127.0.0.1:6009

   slots: (0 slots) slave

   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913

S: bc56b2960018032d0707307725766ec81e7d43d9 127.0.0.1:6008

   slots: (0 slots) slave

   replicates 72306f44d3ffa773810c810cfdd53c856cfda893

[OK] All nodes agree about slots configuration.

5. Finally, we try to send queries to our Redis Cluster, e.g.

$PATH/redis-4.0.9/src/redis-cli -c -p 6001 hset driverID 100 state available updated_at 11111

What happens when nodes become unreachable?

Redis Cluster Server

A Redis Cluster remains accessible as long as the majority of its masters are reachable and each unreachable master has at least one reachable slave. It can survive even if a few nodes fail.

Let’s say we have N masters, each with K slaves, and T random nodes become unreachable. This algorithm estimates the Redis Cluster’s availability:

if T <= K:
        availability = 100%
else:
        availability = 100% - (1/(N*K - T))

If you successfully built your own Redis Cluster locally, try killing any node with a simple Ctrl-C. The Redis Cluster broadcasts to all nodes that the killed node is now unreachable, so other nodes no longer direct traffic to that port.

If you bring this node back up, all nodes know it’s reachable again. If you kill a master node, the Redis Cluster promotes a slave node to a temp master for writing queries.

$PATH/redis-4.0.9/src/redis-server node_x.conf

With this information, we can’t answer the big question of why a single slave node failure caused an over 95% failure rate in the Apollo outage. Per the above theory, the Redis Cluster should still be 100% available. So, the Redis Cluster server could properly handle an outage, and we concluded it wasn’t the failure rate’s cause. So we looked at the client side and Apollo’s queries.

Golang Redis Cluster Client & Apollo Queries

Apollo’s client side is based on the Go-Redis Library.

During the Apollo outage, we found some code returned many errors during certain pipeline GET calls. When Apollo tried to send a pipeline of HMGET calls to its Redis Cluster, the pipeline returned errors.

First, we looked at the pipeline implementation code in the Go-Redis library. In the function defaultProcessPipeline, the code assigns each command to a Redis node in this line: err := c.mapCmdsByNode(cmds, cmdsMap).

func (c *ClusterClient) mapCmdsByNode(cmds []Cmder, cmdsMap *cmdsMap) error {
        state, err := c.state.Get()
        if err != nil {
                setCmdsErr(cmds, err)
                return err
        }

        cmdsAreReadOnly := c.cmdsAreReadOnly(cmds)
        for _, cmd := range cmds {
                var node *clusterNode
                var err error
                if cmdsAreReadOnly {
                        _, node, err = c.cmdSlotAndNode(cmd)
                } else {
                        slot := c.cmdSlot(cmd)
                        node, err = state.slotMasterNode(slot)
                }
                if err != nil {
                        return err
                }
                cmdsMap.mu.Lock()
                cmdsMap.m[node] = append(cmdsMap.m[node], cmd)
                cmdsMap.mu.Unlock()
        }
        return nil
}

Next, since the readOnly flag is on, we look at the cmdSlotAndNode function. As mentioned earlier, you can get better performance by setting readOnlyFromSlaves to true, which sets RouteByLatency to false. By doing this, RouteByLatency will not take priority and the master does not receive the read commands.

func (c *ClusterClient) cmdSlotAndNode(cmd Cmder) (int, *clusterNode, error) {
        state, err := c.state.Get()
        if err != nil {
                return 0, nil, err
        }

        cmdInfo := c.cmdInfo(cmd.Name())
        slot := cmdSlot(cmd, cmdFirstKeyPos(cmd, cmdInfo))

        if c.opt.ReadOnly && cmdInfo != nil && cmdInfo.ReadOnly {
                if c.opt.RouteByLatency {
                        node, err := state.slotClosestNode(slot)
                        return slot, node, err
                }

                if c.opt.RouteRandomly {
                        node := state.slotRandomNode(slot)
                        return slot, node, nil
                }

                node, err := state.slotSlaveNode(slot)
                return slot, node, err
        }

        node, err := state.slotMasterNode(slot)
        return slot, node, err
}

Now, let’s try and better understand the outage.

  1. When a slave becomes unreachable, all commands assigned to that slave node fail.
  2. We found in Grab’s Redis library code that a single error in all cmds could cause the entire pipeline to fail.
  3. In addition, engineers return a failure in their code if err != nil. This explains the high failure rate during the outage.
func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

Our next question was, “Why did it take almost one minute for Apollo to recover?”.  The Redis Cluster broadcasts instantly to its other nodes when one node is unreachable. So we looked at how the client assigns jobs.

When the Redis Cluster client loads the node states, it only refreshes the state once a minute. So there’s a maximum one minute delay of state changes between the client and server. Within that minute, the Redis client kept sending queries to that unreachable slave node.

func (c *clusterStateHolder) Get() (*clusterState, error) {
        v := c.state.Load()
        if v != nil {
                state := v.(*clusterState)
                if time.Since(state.createdAt) > time.Minute {
                        c.LazyReload()
                }
                return state, nil
        }
        return c.Reload()
}

What happened to the write queries? Did we lose new data during that one-minute gap? That’s a very good question! The answer is no, since all write queries only went to the master nodes, and the Redis Cluster client keeps a watcher on the master nodes. So, whenever any master node becomes unreachable, the client is aware of the change and of the current state. See the Watcher code.

How to use Go Redis safely?

Redis Cluster Client

One way to avoid a potential outage like our Apollo outage is to create another Redis Cluster client for pipelining only and with a true RouteByLatency value. The Redis Cluster determines the latency according to ping calls to its server.

In this case, all pipelining queries would read through the master nodes if the latency is less than 1ms (code), and as long as the majority side of the partitions is alive, the client will get the expected results. More load would go to the master nodes with this setting, so be careful about CPU usage in the master nodes when you make the change.
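
As a rough illustration of this suggestion, the sketch below creates a separate go-redis cluster client used only for pipeline calls, with RouteByLatency enabled; the addresses are placeholders and Grab’s internal gredis3 wrapper is not shown.

package main

import "github.com/go-redis/redis"

// newPipelineOnlyClient is a minimal sketch: a dedicated cluster client for
// pipeline calls, routing each read-only command to the lowest-latency node,
// whether that is a master or a slave.
func newPipelineOnlyClient(addrs []string) *redis.ClusterClient {
        return redis.NewClusterClient(&redis.ClusterOptions{
                Addrs:          addrs, // e.g. the cluster's seed node addresses
                ReadOnly:       true,
                RouteByLatency: true,
        })
}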

Pipeline Usage

In some cases, the master nodes might not be able to handle that much traffic. Another way to mitigate the impact of an outage is to check for errors on individual queries when errors happen in a pipeline call.

In Grab’s Redis Cluster library, the function Pipeline(PipelineReadOnly) returns a response with an error for each individual reply.

func (c *clientImpl) Pipeline(ctx context.Context, argsList [][]interface{}) ([]gredisapi.ReplyPair, error) {
        defer c.stats.Duration(statsPkgName, metricElapsed, time.Now(), c.getTags(tagFunctionPipeline)...)
        pipe := c.wrappedClient.Pipeline()
        cmds := make([]goredis.Cmder, len(argsList))
        for i, args := range argsList {
                cmd := goredis.NewCmd(args...)
                cmds[i] = cmd
                _ = pipe.Process(cmd)
        }
        _, _ = pipe.Exec()
        return c.wrappedClient.getResultFromCommands(cmds)
}

func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

type ReplyPair struct {
        Value interface{}
        Err   error
}

Instead of returning nil or an error message when err != nil, we could check for errors for each result so successful queries are not affected. This might have minimized the outage’s business impact.
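
For illustration, a caller-side sketch along those lines could look like the following, assuming the ReplyPair type shown above and a hypothetical handle callback:

// consumePipelineReplies inspects each reply individually instead of failing
// the whole batch on the first error, so successful queries are still used.
func consumePipelineReplies(replies []gredisapi.ReplyPair, handle func(interface{})) (failed int) {
        for _, reply := range replies {
                if reply.Err != nil {
                        failed++ // skip only the commands that errored
                        continue
                }
                handle(reply.Value)
        }
        return failed
}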

Go Redis Cluster Library

One way to fix the Redis Cluster library is to reload the nodes’ status when an error happens. In the go-redis library, defaultProcessor has this logic, which can be applied to defaultProcessPipeline.

In Conclusion

We’ve shown how to build a local Redis Cluster server, explained how Redis Clusters work, and identified its potential risks and solutions. Redis Cluster is a great tool to optimize service performance, but there are potential risks when using it. Please carefully consider our points about how to best use it. If you have any questions, please ask them in the comments section.

Loki, a dynamic mock server for HTTP/TCP testing

Post Syndicated from Grab Tech original https://engineering.grab.com/loki-dynamic-mock-server-http-tcp-testing

Background

In a previous article we introduced Mockers – an innovative tool for local box testing at Grab. Mockers used a Shift Left testing strategy, making testing more effective and cheaper for development teams. Mockers’ popularity and success motivated us to create Loki – a one-stop dynamic mock server for local box testing of mobile apps.

There are some unique challenges in mobile apps testing at Grab. End-to-end testing of an app is difficult due to high dependency on backend services and other apps. The staging environment, which hosts a plethora of backend services, is tough to manage and maintain. Issues such as staging downtime, configuration mismatches, and data corruption add to the testing woes. Moreover, our apps are fairly complex, utilizing multiple transport protocols such as HTTP, HTTPS, and TCP for various business flows.

The business flows are also complex, requiring exhaustive setup such as credit card payment setup, location spoofing, etc., resulting in high maintenance costs for automated testing. Loki simulates these flows and developers can easily test use cases that take longer to set up in a real backend staging.

Loki is our attempt to address challenges in mobile app testing by turning every developer local box into a full fledged pseudo backend environment where all mobile workflows can be tested without any external dependencies. It mocks backend services on developer local boxes, decoupling the mobile apps from real backend services, which provides several advantages such as:

No need to deploy frequently to staging

Testing is blocked if the app receives a bad response from staging. In these cases, code changes have to be deployed on staging to fix issues before resuming tests. In contrast, using Loki lets developers continue testing without any immediate need to deploy code changes to staging.

Allows parallel frontend and backend development

Loki acts as a mock backend service when the real backend is still evolving. It lets the frontend development run in parallel with backend development.

Overcome time limitations

In a one week regression-and-release scenario, testing time is limited. However, the application UI rendering and functionality still needs reasonable testing. Loki lets developers concentrate on testing in the available time instead of fixing dependencies on backend services.

Loki – Grab’s solution to simplify mobile apps testing

At Grab, we have multiple mobile apps that are dependent on each other. For example, our Passenger and Driver apps are two sides of a coin; the driver gets a job card only when a passenger requests a booking. These apps are developed by different teams, each with its own release cycle. This can make it tricky to confidently and repeatedly test the whole business flow across apps. Apps also depend on multiple backend services to execute a booking or food order and communicate over different protocols.

Here’s a look at how our mobile apps interact with backend services over different protocols:

Mobile app interaction with backend services

Loki is a dynamic mock server, written in Golang, running in a Docker container on the local box or in CI. It is easy to set up and run through standard Docker commands. In the context of mobile app testing, it plays the role of backend services, so you no longer need to set up an extensive staging environment.

The Loki architecture looks like this:

Loki architecture

The technical challenges we had to overcome

We wanted a comprehensive mocking solution so that teams don’t need to integrate multiple tools to achieve independent testing. It turned out that mocking TCP was most challenging because:

  • It is a long running client-server connection, and it doesn’t follow an HTTP-like request/response pattern.
  • Messages can be sent to the app without an incoming request as well, hence we had to expose a way via Loki to set a mock expectation which can send messages to the app without any request triggering it.
  • As TCP is a long running connection, we needed a way to delimit incoming requests so we know when we can truncate and deserialize the incoming request into JSON.

We engineered the Loki backend to support both HTTP and TCP protocols on different ports. Yet, the mock expectations are set up using RESTful APIs over HTTP for both protocols. A single point of entry for setting expectations made it more intuitive for our developers.

An in-memory cron implementation pushes scheduled messages to the app over a TCP connection. This enabled testing of complex use cases such as drivers getting new job cards, driver and passenger chat workflows, etc. The delimiter for TCP protocol is configurable at start up, so each team can decide when to truncate the request.

To enable Loki on our CI, we had to reduce its memory footprint. Hence, we built Loki with pluggable storages. MySQL is used when running on local and on CI we switch seamlessly to in-memory cache or Redis.

For testing apps locally, developers must validate complex use cases such as:

  • Payment related flows, which require the response to include the same payment ID as sent in the request. This is a case of simple mapping of request fields in the response JSON.

  • Flows requiring runtime logic execution. For example, a job card sent to a driver must have a valid timestamp, requiring runtime computation on Loki.

To support these cases and many more, we added JavaScript injection capability to Loki. So, when we set an expectation for an HTTP request/response pair or for TCP events, we can specify JavaScript for computing the dynamic response. This is executed in a sandbox by an in-house JS execution library.

Grab follows a transactional workflow for bookings. Over the life of a ride, bookings go through different statuses. So, Loki had to address multiple HTTP requests to the same endpoint returning different responses. This feature is required for successfully mocking a whole ride end-to-end.

Loki uses an HTTP API “httpTimesAndOrder” for this feature. For example, using “httpTimesAndOrder”, you can configure the same status endpoint (/ride/status) to return different ride statuses such as “PICKING” for the first five requests, “IN_RIDE” for the next three requests, and so on.

Now, let’s look at how to use Loki to mock HTTP requests and TCP events.

Mocking HTTP requests

To mock HTTP requests, developers first point their app to send requests to the Loki mock server. Then, they set up expectations for all requests sent to the Loki mock server.

Loki mock server

For example, the Passenger app calls an HTTP dependency GET /closeby/drivers/ to get nearby drivers. To mock it with Loki, you set an expected response on the Loki mock server. When the GET /closeby/drivers/ request is actually made from the Passenger app, Loki returns the set response.

This snippet shows how to set an expected response for the GET /closeby/drivers/ request:

Loki API: POST `/api/v1/expectations`

Request Body :

{
  "uriToMock": "/closeby/drivers",
  "method": "GET",
  "response": {
    "drivers": [
      1001,
      1002,
      1010
    ]
  }
}
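
For instance, a test could register this expectation with a plain HTTP call. The Go sketch below assumes Loki is reachable on localhost:8080; the host and port are illustrative, not Loki’s documented defaults.

package main

import (
        "bytes"
        "fmt"
        "net/http"
)

// setDriversExpectation registers the mock response shown above with Loki.
func setDriversExpectation() error {
        payload := []byte(`{
          "uriToMock": "/closeby/drivers",
          "method": "GET",
          "response": {"drivers": [1001, 1002, 1010]}
        }`)
        resp, err := http.Post("http://localhost:8080/api/v1/expectations",
                "application/json", bytes.NewReader(payload))
        if err != nil {
                return err
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 300 {
                return fmt.Errorf("loki returned status %d", resp.StatusCode)
        }
        return nil
}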

Workflow for setting expectations and receiving responses

Workflow for setting expectations and receiving responses

Mocking TCP events

Developers point their app to Loki over a TCP connection and set up the TCP expectations. Loki then generates scheduled events such as sending push messages (job cards, notifications, etc) to the apps pointing at Loki.

For example, if the Driver app, after it starts, wants to get a job card, you can set an expectation in Loki to push a job card over the TCP connection to the Driver app after a scheduled time interval.

This snippet shows how to set the TCP expectation and schedule a push message:

Loki API: POST `/api/v1/tcp/expectations/pushmessage`

Request Body :

{
  "name": "samplePushMsg",
  "msgSequence": [
    {
      "messages": {
        "body": {
          "jobCardID": 1001
        }
      }
    },
    {
      "messages": {
        "body": {
          "jobCardID": 1002
        }
      }
    }
  ],
  "schedule": "@every 1m"
}

Workflow for scheduling a push message over TCP

Workflow for scheduling a push message over TCP

Some example use cases

Now that you know about Loki, let’s look at some example use cases.

Generating a custom response at runtime

Our first example is customizing a runtime response for both HTTP and TCP requests. This is helpful when developers need dynamic responses to requests. For example, you can add parameters from the request URL or request body to the runtime response.

It’s simple to implement this with a JavaScript function. Assume you want to embed a message parameter in the request URL to the response. To do this, you first use a POST method to set up the expectation (in JSON format) for the request on Loki:

Loki API: POST `/api/v1/feature/expectations`

Request Body :

{
  "expectations": [{
    "name": "Sample call",
    "desc": "v1/test/{name}",
    "tags": "v1/test/{name}",
    "resource": "/v1/test?name=user1",
    "verb": "POST",
    "response": {
      "body": "{ \"msg\": \"Hi \"}",
      "status": 200
    },
    "clientOptions": {
"javascript": "function main(req, resp) { var url = req.RequestURI; var captured = /name=([^&]+)/.exec(url)[1]; resp.msg =  captured ? resp.msg + captured : resp.msg + 'myDefaultValue'; return resp }"
    },
    "isActive": 1
  }]
}

When Loki receives the request, the JavaScript function in the clientOptions key adds the name parameter to the response at runtime. For example, this is the request’s fixed response:

{
    "msg": "Hi "
}

But, after using the JavaScript function to add the URL parameter, the dynamic response is:

{
    "msg": "Hi user1"
}

Similarly, you can use JavaScript to add other dynamic responses such as modifying the response’s JSON array, adding parameters to push messages, etc.

Defining a response sequence for mocked API endpoints

Here’s another interesting example – defining the response sequence for API endpoints.

A response sequence is useful when you need different responses from the same API endpoint. For example, a status endpoint should return different ride statuses such as ‘allocating’, ‘allocated’, ‘picking’, etc. depending on the stage of a ride.

To do this, developers set up their HTTP expectations on Loki. Then, they easily define the response sequence for an API endpoint using a Loki POST method.

In this example:

  • times – specifies the number of times the same response is returned.
  • after – specifies one or more expectations that must match before a specified expectation is matched.

Here, the expectations are matched in this sequence when a request is made to an endpoint – Allocating > Allocated > Pickuser > Completed. Further, Completed is set to two times, so Loki returns this response two times.

Loki API: POST `/api/v1/feature/sequence`

Request Body :
  "httpTimesAndOrder": [
      {
          "name": "Allocating",
          "times": 1
      },
      {
          "name": "Allocated",
          "times": 1,
          "after": ["Allocating"]
      },
      {
          "name": "Pickuser",
          "times": 1,
          "after": ["Allocated"]
      },
      {
          "name": "Completed",
          "times": 2,
          "after": ["Pickuser"]
      }
  ]
}

In conclusion

Since Loki’s inception, we have set up a full-range CI with proper end-to-end app UI tests and, to a great extent, decoupled our app releases from the staging backend. This improved delivery cycles, helped us catch bugs faster, and allowed more exhaustive testing. Moreover, both developers and QAs can easily play with apps to perform exploratory testing as well as manual functional validations. Teams are also using Loki to run automated scripts (Espresso and XCUItests) for validating the mobile app pages.

Loki’s adoption is growing steadily at Grab. With our frequent release of new mobile app features, Loki helps teams meet our high quality bar and achieve huge productivity gains.

If you have any feedback or questions on Loki, please leave a comment.

Designing resilient systems beyond retries (Part 3): Architecture Patterns and Chaos Engineering

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-3

This post is the third of a three-part series on going beyond retries and circuit breakers to improve system resiliency. This whole series covers techniques and architectures that can be used as part of a strategy to improve resiliency. In this article, we will focus on architecture patterns and chaos engineering to reduce, prevent, and test resiliency.

Reducing failure through architecture patterns

Resiliency is all about preparing for and handling failure. So the most effective way to improve resiliency is undoubtedly to reduce the possible ways in which failure can occur, and several architectural patterns have emerged with this aim in mind. Unfortunately these are easier to apply when designing new systems and less relevant to existing ones, but if resiliency is still an issue and no other techniques are helping, then refactoring the system is a good approach to consider.

Idempotency

One popular pattern for improving resiliency is the concept of idempotency. Strictly speaking, an idempotent endpoint is one which always returns the same result given the same parameters, no matter how many times it is called. However, the definition is usually extended to mean it returns the results and has no side-effects, or any side-effects are only executed once. The main benefit of making endpoints idempotent is that they are always safe to retry, so it complements the retry technique to make it more effective. It also means there is less chance of the system getting into an inconsistent or worse state after experiencing failure.

If an operation has side-effects but cannot distinguish unique calls with its current parameters, it can be made to be idempotent by adding an idempotency key parameter. The classic example is money: a ‘transfer money to X’ operation may legitimately occur multiple times with the same parameters, but making the same call twice would be a mistake, so it is not idempotent. A client would not be able to retry a call that timed out, because it does not know whether or not the server processed the request. However, if the client generates and sends a unique ID as an idempotency key parameter, then it can safely retry. The server can then use this information to determine whether to process the request (if it sees the request for the first time) or return the result of the previous operation.

Using idempotency keys can guarantee idempotency for endpoints with side-effects
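
As a minimal sketch of the server side of this pattern (the request type and in-memory store are illustrative; a real implementation would persist keys durably and expire them):

package main

import "sync"

type TransferRequest struct {
        IdempotencyKey string
        ToAccount      string
        AmountCents    int64
}

type transferServer struct {
        mu        sync.Mutex
        processed map[string]string // idempotency key -> result of the first call
}

func newTransferServer() *transferServer {
        return &transferServer{processed: make(map[string]string)}
}

// Transfer executes the side-effect at most once per idempotency key; a retry
// with the same key simply replays the original result.
func (s *transferServer) Transfer(req TransferRequest) string {
        s.mu.Lock()
        defer s.mu.Unlock()
        if result, ok := s.processed[req.IdempotencyKey]; ok {
                return result // safe retry: no second transfer happens
        }
        result := s.executeTransfer(req)
        s.processed[req.IdempotencyKey] = result
        return result
}

func (s *transferServer) executeTransfer(req TransferRequest) string {
        // ... move the money exactly once ...
        return "ok"
}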

 

Asynchronous responses

A second pattern is making use of asynchronous responses. Rather than relying on a successful call to a dependency which may fail, a service may complete its own work and return a successful or partial response to the client. The client would then have to receive the response in an alternate way, either by polling (‘pull’) until the result is ready or the response being ‘pushed’ from the server when it completes.

From a resiliency perspective, this guarantees that the downstream errors do not affect the endpoint. Furthermore, the risk of the dependency causing latency or consuming resources goes away, and it can be retried in the background until it succeeds. The disadvantage is that this works against the ‘fail fast’ principle, since the call might be retried indefinitely without ever failing. It might not be clear to the client what to do in this case.

Not all endpoints have to be made asynchronous, and the decision to be synchronous or not could be made by the endpoint dynamically, depending on the service health. Work that can be made asynchronous is known as deferrable work, and utilizing this information can save resources and allow the more critical endpoints to complete. For example, a fraud system may decide whether or not a newly registered user should be allowed to use the application, but such decisions are often complex and costly. Rather than slow down the registration process for every user and create a poor first impression, the decision can be made asynchronously. When the fraud-decision system is available, it picks up the task and processes it. If the user is then found to be fraudulent, their account can be deactivated at that point.
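
The fraud example could be sketched like this; the in-process channel stands in for what would usually be a durable message queue, and all names are illustrative.

package main

import "log"

type RegistrationService struct {
        fraudQueue chan string // user IDs awaiting a fraud decision
}

func NewRegistrationService() *RegistrationService {
        s := &RegistrationService{fraudQueue: make(chan string, 1024)}
        go s.fraudWorker()
        return s
}

// Register returns as soon as the account is created; the fraud decision is
// deferrable work that is processed later in the background.
func (s *RegistrationService) Register(userID string) error {
        // ... create the account ...
        select {
        case s.fraudQueue <- userID:
        default:
                log.Printf("fraud queue full, deferring user %s to a periodic sweep", userID)
        }
        return nil
}

func (s *RegistrationService) fraudWorker() {
        for userID := range s.fraudQueue {
                // ... call the fraud-decision system; deactivate the account if fraudulent ...
                _ = userID
        }
}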

Preventing disaster through chaos engineering

It is famously understood that disaster recovery is worthless unless it’s tested regularly. There are dozens of stories of employees diligently performing backups every day only to find that when they actually needed to restore from it, the backups were empty. The same thing applies to resiliency, albeit with less spectacular consequences.

The emerging best practice for testing resiliency is chaos engineering. This practice, made famous by Netflix’s Chaos Monkey, is the idea of deliberately causing parts of a system to fail in order to test (and subsequently improve) its resiliency. There are many different kinds of chaos engineering that vary in scope, from simulating an outage in an entire AWS region to injecting latency into a single endpoint. A chaos engineering strategy may include multiple types of failure, to build confidence in the ability of various parts of the system to withstand failure.

Chaos engineering has evolved since its inception, ironically becoming less ‘chaotic’, despite the name. Shutting off parts of a system without a clear plan is unlikely to provide much value, but is practically guaranteed to frustrate your customers – and upper management! Since it is recommended to experiment on production, minimizing the blast radius of chaos experiments, at least at the beginning, is crucial to avoid unnecessary impact to the system.

Chaos experiment process

The basic process for conducting a chaos experiment is as follows:

  1. Define how to measure a ‘steady state’, in order to confirm that the system is currently working as expected.
  2. Decide on a ‘control group’ (which does not change) and an ‘experiment group’ from the pool of backend servers.
  3. Hypothesize that the steady state will not change during the experiment.
  4. Introduce a failure in one component or aspect of the system in the experiment group, such as the network connection to the database.
  5. Attempt to disprove the hypothesis by analyzing the difference in metrics between the control and experiment groups.

If the hypothesis is disproved, then the parts of the system which failed are candidates for improvement. After making changes, the experiments are run again, and gradually confidence in the system should improve.

Chaos experiments should ideally mimic real-world scenarios that could actually happen, such as a server shutting down or a network connection being disconnected. These events do not necessarily have to be directly related to failure – ordinary events such as auto-scaling or a change in server hardware or VM type can be experimented with, as they could still potentially affect the steady state.

Finally, it is important to automate as much of the chaos experiment process as possible. From setting up the control group to starting the experiment and measuring the results, to automatically disabling the experiment if the impact to production has exceeded the blast radius, the investment in automating them will save valuable engineering time and allow for experiments to eventually be run continuously.

Conclusion

Retries are a useful and important part of building resilient software systems. However, they only solve one part of the resiliency problem, namely recovery. Recovery via retries is only possible under certain conditions and could potentially exacerbate a system failure if other safeguards aren’t also in place. Some of these safeguards and other resiliency patterns have been discussed in this article.

The excellent Hystrix library combines multiple resiliency techniques, such as circuit-breaking, timeouts and bulkheading, in a single place. But even Hystrix cannot claim to solve all resiliency issues, and it would not be wise to rely on a single library completely. However, just as it can’t be recommended to only use Hystrix, suddenly introducing all of the above patterns isn’t advisable either. There is a point of diminishing returns with adding more; more techniques means more complexity, and more possible things that could go wrong.

Rather than implement all of the resiliency patterns described above, it is recommended to selectively apply patterns that complement each other and cover existing gaps that have previously been identified. For example, an existing retry strategy can be enhanced by gradually switching to idempotent endpoints, improving the coverage of API calls that can be retried.

A microservice architecture is a good foundation for building a resilient system, but it requires careful planning and implementation to achieve. By identifying the possible ways in which a system can fail, then evaluating and applying the tried-and-tested patterns to withstand them, a reliable system can become one that is truly resilient.

I hope you found this series useful. Comments are always welcome.

Designing resilient systems beyond retries (Part 2): Bulkheading, Load Balancing, and Fallbacks

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-2

This post is the second of a three-part series on going beyond retries to improve system resiliency. We previously discussed rate-limiting as a strategy to improve resiliency. In this article, we will cover these techniques: bulkheading, load balancing, and fallbacks.

Introducing Bulkheading (Isolation)

Bulkheading is a fundamental pattern which underpins many other resiliency techniques, especially where microservices are concerned, so it’s worth introducing first. The term actually comes from an ancient technique in ship building, where a ship’s hull would be partitioned into several watertight compartments. If one of the compartments has a leak, then the water fills just that compartment and is contained, rather than flooding the entire ship. We can apply this principle to software applications and microservices: by isolating failures to individual components, we can prevent a single failure from cascading and bringing down the entire system.

Bulkheads also help to prevent single points of failure, by reducing the impact of any failures so services can maintain some level of service.

Level of bulkheads

It is important to note that bulkheads can be applied at multiple levels in software architecture. The two highest levels of bulkheads are at the infrastructure level. The first is hardware isolation: in a cloud environment, this usually means isolating regions or availability zones. The second is isolating the operating system, which has become a widespread technique with the popularity of virtual machines and now containerization. Previously, it was common for multiple applications to run on a single (very powerful) dedicated server. Unfortunately, this meant that a rogue application could wreak havoc on the entire system in a number of ways, from filling the disk with logs to consuming memory or other resources.

Isolation can be achieved by applying bulkheading at multiple levels

 

This article focuses on resiliency from the application perspective, so below the system level is process-level isolation. In practical terms, this isolation prevents an application crash from affecting multiple system components. By moving those components into separate processes (or microservices), certain classes of application-level failures are prevented from causing cascading failure.

At the lowest level, and perhaps the form of bulkheading most familiar to software engineers, are connection pools and thread pools. While these techniques are commonly employed for performance reasons (reusing resources is cheaper than acquiring new ones), they also put a finite limit on the number of connections or concurrent threads that an operation is allowed to consume. This ensures that if the load of a particular operation suddenly increases unexpectedly (such as due to external load or downstream latency), the impact is contained to a partial failure.

Bulkheading support in the Hystrix library

The Hystrix library for Go supports a form of bulkheading through its MaxConcurrentRequests parameter. This is conveniently tied to the circuit name, meaning that different levels of isolation can be achieved by choosing an appropriate circuit name. A good rule of thumb is to use a different circuit name for each operation or API call. This ensures that if just one particular endpoint of a remote service is failing, the other circuits are still free to be used for the remaining healthy endpoints, achieving failure isolation.
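
A small sketch of that rule of thumb with hystrix-go; the circuit name, limits, and endpoint are illustrative rather than Grab’s actual configuration.

package main

import (
        "errors"

        "github.com/afex/hystrix-go/hystrix"
)

func init() {
        // One circuit per remote operation: its concurrency limit acts as a bulkhead.
        hystrix.ConfigureCommand("driver-service.get-profile", hystrix.CommandConfig{
                Timeout:               500, // milliseconds
                MaxConcurrentRequests: 50,  // at most 50 in-flight calls to this endpoint
        })
}

func getDriverProfile(driverID string) (profile string, err error) {
        err = hystrix.Do("driver-service.get-profile", func() error {
                // ... call the remote endpoint and populate profile ...
                profile = "stub response for " + driverID
                return nil
        }, func(e error) error {
                // Fallback when the circuit is open or the bulkhead is full.
                return errors.New("driver profile temporarily unavailable: " + e.Error())
        })
        return profile, err
}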

Load balancing

Global rate-limiting with a central server

 

Load balancing is where network traffic from a client may be served by one of many backend servers. You can think of load balancers as traffic cops who distribute traffic on the road to prevent congestion and overload. Assuming the traffic is distributed evenly on the network, this effectively increases the computing power of the backend. Adding capacity like this is a common way to handle an increase in load from the clients, such as when a website becomes more popular.

Almost always, load balancers provide high availability for the application. When there is just a single backend server, this server is a ‘single point of failure’, because if it is ever unavailable, there are no servers remaining to serve the clients. However, if there is a pool of backend servers behind a load balancer, the impact is reduced. If there are 4 backend servers and only 1 is unavailable, evenly distributed requests would only fail 25% of the time instead of 100%. This is already an improvement, but modern load balancers are more sophisticated.

Usually, load balancers will include some form of health check. This is a mechanism that monitors whether servers in the pool are ‘healthy’, i.e. able to serve requests. Implementations vary: the check can be active, such as sending periodic ‘pings’, or passive, monitoring responses and removing failing backend instances from the pool.

As with rate-limiting, there are many strategies for load balancing to consider.

There are four main types of load balancer to choose from, each with their own pros and cons:

  • Proxy. This is perhaps the most well-known form of load-balancer, and is the method used by Amazon’s Elastic Load Balancer. The proxy sits on the boundary between the backend servers and the public clients, and therefore also doubles as a security layer: the clients do not know about or have direct access to the backend servers. The proxy will handle all the logic for load balancing and health checking. It is a very convenient and popular approach because it requires no special integration with the client or server code. They also typically perform ‘SSL termination’, decrypting incoming HTTPS traffic and using HTTP to communicate with the backend servers.
  • Client-side. This is where the client performs all of the load-balancing itself, often using a dedicated library built for the purpose. Compared with the proxy, it is more performant because it avoids an extra network ‘hop’. However, there is a significant cost in developing and maintaining the code, which is necessarily complex, and any bugs have serious consequences.
  • Lookaside. This is a hybrid approach where the majority of the load-balancing logic is handled by a dedicated service, but it does not proxy; the client still makes direct connections to the backend. This reduces the burden of the client-side library while maintaining high performance; however, the load-balancing service becomes another potential point of failure.
  • Service mesh with sidecar. A service mesh is an all-in-one solution for service communication, with many popular open-source products available. They usually include a sidecar, which is a proxy that sits on the same server as the application to route network traffic. Like the traditional proxy load balancer, this handles many concerns of load-balancing for free. However, there is still an extra network hop, and there can be a significant development cost to integrate with existing systems for logging, reporting and so on, so this must be weighed against building a client-side solution in-house.

Comparison of load-balancer architectures

Grab’s load-balancing implementation

At Grab, we have built our own internal client-side solution called CSDP, which uses the distributed key-value store etcd as its backend store.

Fallbacks

There are scenarios when simply retrying a failed API call doesn’t work. If the remote server is completely down or only returning errors, no amount of retries are going to help; the failure is unrecoverable. When recovery isn’t an option, mitigation is an alternative. This is related to the concept of graceful degradation: sometimes it is preferable to return a less optimal response than fail completely, especially for user-facing applications where user experience is important.

One such mitigation strategy is fallbacks. This is a broad topic with many different sub-strategies, but here are a few of the most common:

Fail silently

Starting with the easiest to implement, one basic fallback strategy is fail silently. This means returning an empty or null response when an error is encountered, as if the call had succeeded. If the data being requested is not critical functionality then this can be considered: missing part of a UI is less noticeable than an error page! For example, UI bubbles showing unread notifications are a common feature. But if the service providing the notifications is failing and the bubble shows 0 instead of N notifications, the user’s experience is unlikely to be significantly affected.

Local computation

A second fallback strategy when a downstream dependency is failing could be to compute the value locally instead. This could mean either returning a default (static) value, or using a simple formula to compute the response. For example, a marketplace application might have a service to calculate shipping costs. If it is unavailable, then using a default price might be acceptable. Or even $0 – users are unlikely to complain about errors that benefit them, and it’s better than losing business!

Cached values

Similarly, cached values are often used as fallbacks. If the service isn’t available to calculate the most up to date value, returning a stale response might be better than returning nothing. If an application is already caching the value with a short expiration to optimize performance, it can be reused as a fallback cache by setting two expiration times: one for normal circumstances, and another when the service providing the response has failed.
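
A minimal sketch of this dual-expiry idea is shown below; the type names and TTL values are illustrative, not a prescribed implementation:

```go
package cache

import (
	"errors"
	"sync"
	"time"
)

// entry keeps two expiry times: a short one for normal freshness, and a
// longer one that is only honoured when the downstream service is failing.
type entry struct {
	value      string
	freshUntil time.Time // normal cache expiry
	staleUntil time.Time // extended expiry used as a fallback
}

type FallbackCache struct {
	mu   sync.Mutex
	data map[string]entry
}

func NewFallbackCache() *FallbackCache {
	return &FallbackCache{data: map[string]entry{}}
}

func (c *FallbackCache) Put(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	c.data[key] = entry{
		value:      value,
		freshUntil: now.Add(30 * time.Second), // normal circumstances
		staleUntil: now.Add(10 * time.Minute), // only used if the service is down
	}
}

// Get returns a cached value. If allowStale is true (the downstream call
// failed), values past their normal expiry are still returned.
func (c *FallbackCache) Get(key string, allowStale bool) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.data[key]
	if !ok {
		return "", errors.New("cache miss")
	}
	now := time.Now()
	if now.Before(e.freshUntil) || (allowStale && now.Before(e.staleUntil)) {
		return e.value, nil
	}
	return "", errors.New("cache expired")
}
```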

Backup service

Finally, if the response is too complex to compute locally or if major functionality of the application is required to have a fallback, then an entirely new service can act as a fallback; a backup service. Such a service is a big investment, so to make it worthwhile some trade-offs must be accepted. The backup service should be considerably simpler than the service it is intended to replace; if it is too complex then it will require constant testing and maintenance, not to mention documentation and training to make sure it is well understood within the engineering team. Also, a complex system is more likely to fail when activated. Usually such systems will have very few or no dependencies, and certainly should not depend on any parts of the original system, since they could have failed, rendering the backup system useless.

Grab’s fallback implementation

At Grab, we make use of various fallback strategies in our services. For example, our microservice framework Grab-Kit has built-in support for returning cached values when a downstream service is unresponsive. We’ve even built a backup service to replicate our core functionality, so we can continue to serve customers despite severe technical difficulties!

Up next, Architecture Patterns and Chaos Engineering…

We’ve covered various techniques in designing reliable and resilient systems in the previous articles. I hope you found them useful. Comments are always welcome.

In our next post, we will look at ways to prevent and reduce failures through architecture patterns and testing.

Please stay tuned!

Designing resilient systems beyond retries (Part 1): Rate-Limiting

Post Syndicated from Grab Tech original https://engineering.grab.com/beyond-retries-part-1

This post is the first of a three-part series on going beyond retries to improve system resiliency. In this series, we will discuss other techniques and architectures that can be used as part of a strategy to improve resiliency. To start off the series, we will cover rate-limiting.

Software engineers aim for reliability: systems that have predictable and consistent behaviour in terms of performance and availability. In the electricity industry, reliability may equate to being able to keep the lights on. But just because a system has remained reliable up until a certain point does not mean that it will continue to be. This is where resiliency comes in: the ability to withstand or recover from problematic conditions or failure. Going back to our electricity analogy – resiliency is the ability to turn the lights back on quickly when, say, a natural disaster hits the power grid.

Why we value resiliency

Being resilient to many different failures is the best way to ensure a system is reliable and – more importantly – stays that way. At Grab, our architecture features hundreds of microservices, which are constantly stressed in an increasing number of different ways at higher and higher volumes. Failures that would otherwise be rare or unusual become more likely as our scale increases. For that reason, we proactively focus on – and require our services to think about – resiliency, even if they have historically been very reliable.

As software systems evolve and become more complex, the number of potential failure modes that software engineers have to account for grows. Fortunately, so too have the techniques for dealing with them. The circuit-breaker pattern and retries are two such techniques commonly employed to improve resiliency specifically in the context of distributed systems. In pursuit of reliability, this is a fine start, but it would be wrong to assume that this will keep the service reliable forever. This article will discuss how you can use rate-limiting as part of a strategy to improve resilience, beyond retries.

Challenges with retries and circuit breakers

A common risk when introducing retries in a resiliency strategy is ‘retry storms’. Retries by definition increase the number of requests from the client, especially when the system is experiencing some kind of failure. If the server is not prepared to handle this increase in traffic, and is possibly already struggling to handle the load, it can quickly become overwhelmed. This is counter-productive to introducing retries in the first place!

When using a circuit-breaker in combination with retries, the application has some form of safety net: too many failures and the circuit will open, preventing the retry storms. However, this can be dangerous to rely on. For one thing, it assumes that all clients have the correct circuit-breaker configurations. Knowing how to configure the circuit-breaker correctly is difficult because it requires knowledge of the downstream service’s configurations too.

In a large organization such as Grab with hundreds of microservices, it becomes increasingly difficult to coordinate and maintain the correct circuit-breaker configurations as the number of services increases.

Secondly, it is never a good idea for the server to depend on its clients for resiliency. The circuit-breaker could fail or simply be bypassed, and the server would have to deal with all requests the client makes.

Introducing rate-limiting

It is therefore desirable to have some form of rate-limiting/throttling as another line of defense. There are many strategies for rate-limiting to consider.

Types of thresholds for rate-limiting

The traditional approach to rate-limiting is to implement a server-side check which monitors the rate of incoming requests and if it exceeds a certain threshold, an error will be returned instead of processing the request. There are many algorithms such as ‘leaky bucket’, fixed/sliding window and so on. A key decision is where to set the thresholds: usually by client, endpoint, or a combination of both.

Rate-limiting by client or user account is the approach taken by many public APIs: Each client is allowed to make a certain number of requests over a period, say 1000 requests per hour, and once that number is exceeded then their requests will be rejected until the time window resets. In this approach, the server must ensure that it has enough capacity (or can scale adequately) to handle the maximum allowed number of requests for each client. If new clients are added frequently, the overhead of maintaining and adjusting the limits may be significant. However, it can be a good way to guarantee a service-level agreement (SLA) with your clients.

An alternative to per-client thresholds is to use per-endpoint thresholds. This limit is applied across all clients and can be set according to the server’s true capacity using benchmarks. Compared with per-client limits this is easier to configure and more reliable in preventing the server from becoming overloaded. However, one misbehaving client may be able to consume the entire quota, blocking other clients of the service.
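
To make this concrete, here is a minimal sketch of a fixed-window limiter written against the Go standard library only; the same structure works per-client, per-endpoint, or per client-endpoint pair depending on the key passed in. The ‘leaky bucket’ and sliding-window algorithms mentioned above would be drop-in refinements:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fixedWindowLimiter is a minimal fixed-window counter keyed by an arbitrary
// string (e.g. client ID, endpoint name, or "client:endpoint").
type fixedWindowLimiter struct {
	mu       sync.Mutex
	window   time.Duration
	limit    int
	counts   map[string]int
	resetsAt time.Time
}

func newFixedWindowLimiter(limit int, window time.Duration) *fixedWindowLimiter {
	return &fixedWindowLimiter{
		window:   window,
		limit:    limit,
		counts:   map[string]int{},
		resetsAt: time.Now().Add(window),
	}
}

// Allow reports whether the caller identified by key may proceed.
func (l *fixedWindowLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if time.Now().After(l.resetsAt) { // window elapsed: reset all counters
		l.counts = map[string]int{}
		l.resetsAt = time.Now().Add(l.window)
	}
	if l.counts[key] >= l.limit {
		return false
	}
	l.counts[key]++
	return true
}

func main() {
	perClient := newFixedWindowLimiter(1000, time.Hour) // e.g. 1000 requests/hour per client
	if !perClient.Allow("client-A") {
		fmt.Println("429 Too Many Requests")
	}
}
```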

A rate-limiting strategy may use different levels of thresholds, and this is the best approach to get the benefits of both per-client and per-endpoint thresholds. For example, the following rules might be applied (in order):

  • Per-client, per-endpoint: For example, client A accessing the sendEmail endpoint. It is not necessary to configure thresholds at this granularity, but may be useful for critical endpoints.
  • Per-client: In addition to any per-client per-endpoint settings, client A could have a global threshold of 1000 requests/hour to any API.
  • Per-endpoint: This is the server’s catch-all guard to guarantee that none of its endpoints become overloaded. If client limits are properly configured, this limit should never be reached.
  • Server-wide: Finally, a limit on the number of requests a server can handle in total. This is important because even if endpoints can meet their limits individually, they are never completely isolated: the server will have some overhead and limited resources for processing any kind of request, opening and closing network connections etc.

Local vs global rate-limiting

Another consideration is local vs global rate-limiting. As we saw in the previous section, backend servers are usually pooled together for resiliency. A naive rate-limiting solution might be implemented at the individual server instance level. This sounds intuitive because the thresholds can be calculated exactly according to the instance’s computing power, and it scales automatically as the number of instances increases. However, in a microservice architecture, this is rarely correct as the bottlenecks are unlikely to be so closely tied to individual instance hardware.

More often, the capacity is reached when a downstream resource is exhausted, such as a database, a third-party service or another microservice. If the rate-limiting is only enforced at the instance level, when the service scales, the pressure on these resources will increase and quickly overload them. Local rate-limiting’s effectiveness is limited.

Global rate-limiting on the other hand monitors thresholds and enforces limits across the entire backend server pool. This is usually achieved through the use of a centralized rate-limiting service to make the decisions about whether or not requests should be allowed to go through. While this is much more desirable, implementing such a service is not without challenges.

Considerations when implementing rate-limiting

Care must be taken to ensure the rate-limiting service does not become a single point of failure. The system should still function when the rate-limiter itself is experiencing problems (perhaps by falling back to a local limiter). Since the rate-limiter must be in the request path, it should not add significant latency because any latency would be multiplied across every endpoint being monitored. Grab’s own Quotas service is an example of a global rate-limiter which addresses these concerns.

Global rate-limiting with a central server. The servers send information about the request volumes, and the rate-limiting service responds with the rate-limiting decisions. This is done asynchronously to avoid introducing a point of failure.

Generally, it is more important to implement rate-limiting at the server side. This is because, once again, assuming that clients have correct implementation and configurations is risky. However, there is a case to be made for rate-limiting on the client as well, especially if the clients can be trusted or share a common SDK.

With server-side limiting, the server still has to accept the initial connection, process the rate-limiting logic and return an appropriate error response. With sufficient load, this overhead can be enough to render the system unresponsive; an unintentional denial-of-service (DoS) effect.

Client-side limiting can be implemented by using a central service as described above or, more commonly, utilizing response headers from the server. In this approach, the server response may include information about the client’s remaining quota and/or a timestamp at which the quota is reset. If the client implements logic for these headers, it can avoid sending requests at all if it knows they will be rate-limited. The disadvantage of this is that the client-side logic becomes more complex and another possible source of bugs, so this cost has to be considered against the simpler server-only method.
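
Below is a minimal sketch of the header-based approach. The header names (X-RateLimit-Remaining, X-RateLimit-Reset) are common conventions rather than anything mandated by a particular server, so treat them as assumptions and substitute whatever your server actually returns:

```go
package quota

import (
	"net/http"
	"strconv"
	"sync"
	"time"
)

// quotaTracker remembers the server's advertised quota so the client can
// avoid sending requests that are certain to be rejected.
type quotaTracker struct {
	mu        sync.Mutex
	remaining int
	resetAt   time.Time
}

// observe updates the tracker from a response.
func (q *quotaTracker) observe(resp *http.Response) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if v, err := strconv.Atoi(resp.Header.Get("X-RateLimit-Remaining")); err == nil {
		q.remaining = v
	}
	if ts, err := strconv.ParseInt(resp.Header.Get("X-RateLimit-Reset"), 10, 64); err == nil {
		q.resetAt = time.Unix(ts, 0)
	}
}

// shouldSend reports whether it is worth making the request at all:
// either quota remains, or the quota window has already reset.
func (q *quotaTracker) shouldSend() bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.remaining > 0 || time.Now().After(q.resetAt)
}
```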

Up next, Bulkheading, Load Balancing, and Fallbacks…

So we’ve taken a look at rate-limiting as a strategy for having resilient systems. I hope you found this article useful. Comments are always welcome.

In our next post, we will look at the other resiliency techniques such as bulkheading (isolation), load balancing, and fallbacks.

Please stay tuned!

Context Deadlines and How to Set Them

Post Syndicated from Grab Tech original https://engineering.grab.com/context-deadlines-and-how-to-set-them

At Grab, our microservice architecture involves a huge amount of network traffic and inevitably, network issues will sometimes occur, causing API calls to fail or take longer than expected. We strive to make such incidents a non-event, by designing with the expectation of such incidents in mind. With the aid of Go’s context package, we have improved upon basic timeouts by passing timeout information along the request path. However, this introduces extra complexity, and care must be taken to ensure timeouts are configured in a way that is efficient and does not worsen problems. This article explains from the ground up a strategy for configuring timeouts and using context deadlines correctly, drawing from our experience developing microservices in a large scale and often turbulent network environment.

Timeouts

Timeouts are a fundamental concept in computer networking. Almost every kind of network communication will have some kind of timeout associated with it, often configurable with a parameter. The idea is to place a time limit on some event happening, often a network response; after the limit has passed, the operation is aborted rather than waiting indefinitely. Examples of useful places to put timeouts include connecting to a database, making a HTTP request or on idle connections in a pool.

Figure 1.1: How timeouts prevent long API calls

Timeouts allow a program to continue where it otherwise might hang, providing a better experience to the end user. Often the default way for programs to handle timeouts is to return an error, but this doesn’t have to be the case: there are several better alternatives for handling timeouts which we’ll cover later.

While they may sound like a panacea, timeouts must be configured carefully to be effective: too short a timeout will result in increased errors from a resource which could still be working normally, and too long a timeout will risk consuming excess resources and a poor user experience. Furthermore, timeouts have evolved over time with new concepts such as Go’s context package, and the trend towards distributed systems has raised the stakes: timeouts are more important, and can cause more damage if misused!

Why timeouts are useful

In the context of microservices, timeouts are useful as a defensive measure against misbehaving or faulty dependencies. It is a guarantee that no matter how badly the dependency is failing, your call will never take longer than the timeout setting (for example 1 second). With so many other things to worry about, that’s a really nice thing to have! So there’s an instant benefit to your service’s resiliency, even if you do nothing more than set the timeout.

However, a service can choose what to do when it encounters a timeout, which can make them even more useful. Generally there are three options:

  1. Return an error. This is the simplest, but unless you know there is error handling upstream, this can actually deliver the worst user experience.
  2. Return a fallback value. We can return a default value, a cached value, or fall back to a simpler computed value. Depending on the circumstances, this can offer a better user experience.
  3. Retry. In the best case, a retry will succeed and deliver the intended response to the caller, albeit with the added timeout delay. However, there are other complexities to consider for retries to be effective. For a full discussion on this topic, see Circuit Breaker vs Retries Part 1 and Circuit Breaker vs Retries Part 2.

At Grab, our services tend towards using retries wherever possible, to make minor errors as transparent as possible.

The main advantage of timeouts is that they give your service time to do something else, and this should be kept in mind when considering a good timeout value: not only do you want to allow the remote call time to complete (or not), but you need to allow enough time to handle the potential timeout as well.

Different types of timeouts

Not all timeouts are the same. There are different types of timeouts with crucial differences in semantics, and you should check the behaviour of the timeout settings in the library or resource you’re using before configuring them for production use.

In Go, there are three common classes of timeouts:

  • Network timeouts: These come from the net package and apply to the underlying network connection. These are the best to use when available, because you can be sure that the network call has been cancelled when the call returns to your function.
  • Context timeouts: Context is discussed later in this article, but for now just note that these timeouts are propagated to the server. Since the server is aware of the timeout, it can avoid wasted effort by abandoning computation after the timeout is reached.
  • Asynchronous timeouts: These occur when a goroutine is executed and abandoned after some time. This does not automatically cancel the goroutine (you can’t really cancel goroutines without extra handling), so it risks leaking the goroutine and other resources. This approach should be avoided in production unless combined with some other measures to provide cancellation or avoid leaking resources.
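
To make the distinction between the first two classes concrete, here is a small sketch that configures both a network-level timeout (on the dialer and transport) and a context timeout on the same HTTP call; the URL is a placeholder:

```go
package main

import (
	"context"
	"net"
	"net/http"
	"time"
)

func main() {
	// Network timeouts: enforced by the transport on the underlying connection.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout: 2 * time.Second, // time allowed to establish the TCP connection
			}).DialContext,
			ResponseHeaderTimeout: 2 * time.Second, // time allowed to receive response headers
		},
	}

	// Context timeout: covers the whole call and, with a context-aware server,
	// is propagated so the server can stop working once the client gives up.
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.internal/api", nil)
	resp, err := client.Do(req)
	if err != nil {
		return // timed out or failed
	}
	defer resp.Body.Close()
}
```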

Dangers of poor timeout configuration for microservice calls

The benefits of using timeouts are enticing, but there’s no free lunch: relying on timeouts too heavily can lead to disastrous cascading failure scenarios. Worse, the effects of a poor timeout configuration often don’t become evident until it’s too late: it’s peak hour, traffic just reached an all-time high and… all your services froze up at the same time. Not good.

To demonstrate this effect, imagine a simple 3-service architecture where each service naively uses a default timeout of 1 second:

Figure 1.2: Example of how incorrect timeout configuration causes cascading failure

Service A’s timeout does not account for the fact that Service B calls C. If B itself is experiencing problems and takes 800ms to complete its work, then C effectively only has 200ms to complete before service A gives up. But since B’s timeout to C is also 1s, that means that C could be wasting up to 800ms of computational effort that ‘leaks’ – it has no chance of being used. Both B and C are blissfully unaware at first that anything is wrong – they happily return successful responses that A never receives!

This resource leak can soon become catastrophic, though: since A’s calls to B are timing out, A (or A’s clients) are likely to retry, causing the load on B to increase. This in turn causes the load on C to increase, and eventually all services will stop responding.

The same thing happens if B is healthy but C is experiencing problems: B’s calls to C will build up and cause B to become overloaded and fail too. This is a common cause of cascading failure.

How to set a good timeout

Given the importance of correctly configuring timeout values, the question remains as to how to decide upon a ‘correct’ timeout value. If the timeout is for an API call to another service, a good place to start would be that service’s service-level agreements (SLAs). Often SLAs are based on latency percentiles, which is a value below which a given percentage of latencies fall. For example, a system might have a 99th percentile (also known as P99) latency of 300ms; this would mean that 99% of latencies are below 300ms. A high-order percentile such as P99 or even P99.9 can be used as a ballpark worst-case value.

Let’s say a service (B)’s endpoint has a 99th percentile latency of 600ms. Setting the timeout for this call at 600ms would guarantee that no calls take longer than 600ms, while returning errors for the rest and accepting an error rate of at most 1% (assuming the service is keeping to their SLA). This is an example of how the timeout can be combined with information about latencies to give predictable behaviour.

This idea can be taken further by considering retries too. If the median latency for this service is 50ms, then you could introduce a retry of 50ms for an overall timeout of 50ms + 600ms = 650ms:

Service B:

  • P99 latency SLA = 600ms
  • Median latency = 50ms

Service A:

  • Request timeout = 600ms
  • Number of retries = 1
  • Retry request timeout = 50ms
  • Overall timeout = 50ms + 600ms = 650ms
  • Chance of timeout after retry = 1% * 50% = 0.5%

Figure 1.3: Example timeout configuration settings based on latency data

This would still cut off the top 1% of latencies, while optimistically making another attempt for the median latency. This way, even for the 1% of calls that encounter a timeout, our service would still expect to return a successful response within 650ms more than half the time, for an overall success rate of 99.5%.

Context propagation

Go officially introduced the concept of context in Go 1.7, as a way of passing request-scoped information across server boundaries. This includes deadlines, cancellation signals and arbitrary values. Let’s ignore the last part for now and focus on deadlines and cancellations. Often, when setting a regular timeout on a remote call, the server side is unaware of the timeout. Even if the server is notified indirectly when the client closes the connection, it’s still not necessarily clear whether the client timed out or encountered another issue. This can lead to wasted resources, because without knowing the client timed out, the server often carries on regardless. Context aims to solve this problem by propagating the timeout and context information across API boundaries.

Figure 1.4: Context propagation cancels work on B and C

Server A sets a context timeout of 1 second. Since this information spans the entire request and gets propagated to C, C is always aware of the remaining time it has to do useful work – work that won’t get discarded. The remaining time can be defined as (1 – b), where b is the amount of time that server B spent processing before calling C. When the deadline is exceeded, the context is immediately cancelled, along with any child contexts that were created from the parent.

The context timeout can be a relative time (eg. 3 seconds from now) or an absolute time (eg. 7pm). In practice they are equivalent, and the absolute deadline can be queried from a timeout created with a relative time and vice-versa.
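
For example, the following sketch creates one context with a relative timeout and one with an absolute deadline, and shows how the absolute deadline and remaining budget can be queried back:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// Relative: 3 seconds from now.
	ctx1, cancel1 := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel1()

	// Absolute: a fixed point in time.
	ctx2, cancel2 := context.WithDeadline(context.Background(), time.Now().Add(3*time.Second))
	defer cancel2()

	// Either way, the absolute deadline can be queried back, and the
	// remaining relative budget recovered with time.Until.
	if deadline, ok := ctx1.Deadline(); ok {
		fmt.Println("remaining:", time.Until(deadline))
	}
	_ = ctx2
}
```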

Another useful feature of contexts is cancellation. The client has the ability to cancel the request for any reason, which will immediately signal the server to stop working. When a context is cancelled manually, this is very similar to a context being cancelled when it exceeds the deadline. The main difference is the error message will be ‘context cancelled’ instead of ‘context deadline exceeded’. This is a common cause of confusion, but context cancelled is always caused by an upstream client, while deadline exceeded could be a deadline set upstream or locally.

The server must still listen for the ‘context done’ signal and implement cancellation logic, but at least it has the option of doing so, unlike with ordinary timeouts. The most common reason for cancelling a request is because the client encountered an error and no longer needs the response that the server is processing. However, this technique can also be used in request hedging, where concurrent duplicate requests are sent to the server to decrease the impact of an individual call experiencing latency. When the first response returns, the other requests are cancelled because they are no longer needed.

Context can be seen as ‘distributed timeouts’ – an improvement to the concept of timeouts by propagating them. But while they achieve the same goal, they introduce other issues that must be considered.

Context propagation and timeout configuration

When propagating timeout information via context, there is no longer a static ‘timeout’ setting per call. This can complicate debugging: even if the client has correctly configured their own timeout as above, a context timeout could mean that either the remote downstream server is slow, or that an upstream client was slow and there was insufficient time remaining in the propagated context!

Let’s revisit the scenario from earlier, and assume that service A has set a context timeout of 1 second. If B is still taking 800ms, then the call to C will time out after 200ms. This changes things completely: although there is no longer the resource leak (because both B and C will terminate the call once the context timeout is exceeded), B will have an increase in errors whereas previously it would not (at least until it became overloaded). This may be worse than completing the request after A has given up, depending on the circumstances. There is also a dangerous interaction with circuit breakers which we will discuss in the next section.

If allowing the request to complete is preferable to cancelling it even in the event of a client timeout, the request should be made with a new context decoupled from the parent (i.e. context.Background()). This ensures that the timeout is not propagated to the remote service. When doing this, it is still a good idea to set a timeout, to avoid waiting indefinitely for the request to complete.
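
A minimal sketch of this decoupling, with an illustrative local timeout and a stand-in for the downstream call:

```go
package main

import (
	"context"
	"time"
)

// doWork stands in for the downstream call; its signature is illustrative.
func doWork(ctx context.Context) error { return nil }

// completeRegardless runs the downstream call on a context that is
// deliberately NOT derived from the caller's context, so an upstream timeout
// or cancellation does not abort it. A local timeout is still set so the
// call cannot hang forever.
func completeRegardless(parent context.Context) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = parent // the caller's deadline is intentionally ignored here
	return doWork(ctx)
}

func main() {
	_ = completeRegardless(context.Background())
}
```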

Context and circuit-breakers

A circuit-breaker is a software library or function which monitors calls to external resources with the aim of preventing calls which are likely to fail, ‘short-circuiting’ them (hence the name). It is a good practice to use a circuit-breaker for all outgoing calls to dependencies, especially potentially unreliable ones. But when combined with context propagation, that raises an important question: should context timeouts or cancellation cause the circuit to open?

Let’s consider the options. If ‘yes’, this means the client will avoid wasting calls to the server if it’s repeatedly hitting the context timeout. This might seem desirable at first, but there are drawbacks too.

Pros:

  • Consistent behaviour with other server errors
  • Avoids making calls that are unlikely to succeed
  • It is obvious when things are going wrong
  • Client has more time to fall back to other behaviour
  • More lenient on misconfigured timeouts because circuit-breaking ensures that subsequent calls will fail fast, thus avoiding cascading failure

Cons:

  • Unpredictable
  • A misconfigured upstream client can cause the circuit to open for all other clients
  • Can be misinterpreted as a server error

It is generally better not to open the circuit when the context deadline set upstream is exceeded. The only timeout allowed to trigger the circuit-breaker should be the request timeout of the specific call for that circuit.

Pros:

  • More predictable
  • Circuit depends mostly on server health, not client
  • Clients are isolated

Cons:

  • May be confusing for clients who expect the circuit to open
  • Misconfigured timeouts are more likely to waste resources

Note that the above only applies to propagated contexts. If the context only spans a single individual call, then it is equivalent to a static request timeout, and such errors should cause circuits to open.

How to set context deadlines

Let’s recap some of the concepts covered in this article so far:

  • Timeouts are a time limit on an event taking place, such as a microservice completing an API call to another service.
  • Request timeouts refer to the timeout of a single individual request. When accounting for retries, an API call may include several request timeouts before completing successfully.
  • Context timeouts are introduced in Go to propagate timeouts across API boundaries.
  • A context deadline is an absolute timestamp at which the context is considered to be ‘done’, and work covered by this context should be cancelled when the deadline is exceeded.

Fortunately, there is a simple rule for correctly configuring context timeouts:

The upstream timeout must always be longer than the total downstream timeouts including retries.

The upstream timeout should be set at the ‘edge’ server and cascade throughout.

In our scenario, A is the edge server. Let’s say that B’s timeout to C is 1s, and it may retry at most once, after a delay of 500ms. The appropriate context timeout (CT) set from A can be calculated as follows:

CT(A) = (timeout to C * number of attempts) + (retry delay * number of retries)

CT(A) = (1s * 2) + (500ms * 1) = 2,500ms

Figure 1.5: Formula for calculating context timeouts

Extra time can be allocated for B’s processing time and to allow B to return a fallback response if appropriate.
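
The same budget can be expressed as a small helper; the function name and the extra processing allowance are our own illustrative additions, not part of any Grab library:

```go
package main

import (
	"fmt"
	"time"
)

// contextBudget computes an upstream context timeout from the downstream
// call's settings, following the rule that the upstream timeout must cover
// all downstream attempts plus retry delays, with optional headroom for the
// intermediate service's own processing.
func contextBudget(perCallTimeout time.Duration, retries int, retryDelay, processing time.Duration) time.Duration {
	attempts := retries + 1
	return perCallTimeout*time.Duration(attempts) +
		retryDelay*time.Duration(retries) +
		processing
}

func main() {
	// B calls C with a 1s timeout, at most 1 retry after a 500ms delay.
	ct := contextBudget(1*time.Second, 1, 500*time.Millisecond, 0)
	fmt.Println(ct) // 2.5s, matching the formula in Figure 1.5
}
```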

Note that if A configures its timeout according to this rule, then many of the above issues disappear. There are no wasted resources, because B and C are given the maximum time to complete their requests successfully. There is no chance for B’s circuit-breaker to open unexpectedly, and cascading failure is mostly avoided: a failure in C will be handled and be returned by B, instead of A timing out as well.

A possible alternative would be to rely on context cancellation: allow A to set a shorter timeout, which cancels B and C if the timeout is exceeded. This is an acceptable approach to avoiding cascading failure (and cancellation should be implemented in any case), but it is less optimal than configuring timeouts according to the above formula. One reason is that there is no guarantee of the downstream services handling the timeout gracefully; as mentioned previously, the service must explicitly check for ctx.Done() and this is rarely followed in practice. It is also impractical to place checks at every point in the code, so there could be a considerable delay between the client cancellation and the server abandoning the processing.

A second reason not to set shorter timeouts is that it could lead to unexpected errors on the downstream services. Even if B and C are healthy, a shorter context timeout could lead to errors if A has timed out. Besides the problem of having to handle the cancelled requests, the errors could create noise in the logs, and more importantly could have been avoided. If the downstream services are healthy and responding within their SLA, there is no point in timing out earlier. An exception might be for the edge server (A) to allow for only 1 attempt or fewer retries than the downstream service actually performs. But this is tricky to configure and weakens the resiliency. If it is desirable to shorten the timeouts to decrease latency, it is better to start adjusting the timeouts of the downstream resources first, starting from the innermost service outwards.

A model implementation for using context timeouts in calls between microservices

We’ve touched on several useful concepts for improving resiliency in distributed systems: timeouts, context, circuit-breakers and retries. It is desirable to use all of them together in a good resiliency strategy. However, the actual implementation is far from trivial; finding the right order and configuration to use them effectively can seem like searching for the holy grail, and many teams go through a long process of trial and error, continuously improving their implementation. Let’s try to formally put together an ideal implementation, step by step.

Note that the code below is not a final or production-ready implementation. At Grab we have developed independent circuit-breaker and retry libraries, with many settings that can be configured for fine-tuning. However, it should serve as a guide for writing resilient client libraries.

Step 1: Context propagation

Context propagation code

The skeleton function signature includes a context object as the first parameter, which is the best practice intended by Google. We check whether the context is already done before proceeding, in which case we ‘fail fast’ without wasting any further effort.
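
Since the original code screenshot is not reproduced here, the following is a sketch of what such a skeleton might look like; the function and type names are illustrative:

```go
package client

import (
	"context"
	"errors"
)

// Response is a placeholder for whatever the remote call returns.
type Response struct{}

// CallExternalService accepts the caller's context as the first parameter
// and fails fast if that context is already done, so no effort is wasted on
// a request the client has given up on.
func CallExternalService(ctx context.Context, req string) (*Response, error) {
	select {
	case <-ctx.Done():
		return nil, ctx.Err() // context cancelled or deadline exceeded
	default:
	}

	// ... the actual call is added in the next steps ...
	return nil, errors.New("not implemented")
}
```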

Step 2: Create child context with request timeout

Child context with request timeout code

Our service has no control over the parent context. Indeed, it could have no deadline at all! Therefore it’s important to create a new context and timeout for our own outgoing request as well, using WithTimeout. It is mandatory to call the returned cancel function to ensure the context is properly cancelled and avoid a goroutine leak.
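
Again as a sketch (the timeout value and names are illustrative):

```go
package client

import (
	"context"
	"time"
)

const requestTimeout = 1 * time.Second // illustrative per-request timeout

func callWithRequestTimeout(ctx context.Context, call func(context.Context) error) error {
	// Derive a child context so our own request timeout applies even when
	// the parent context has a longer deadline, or none at all.
	reqCtx, cancel := context.WithTimeout(ctx, requestTimeout)
	defer cancel() // always cancel to release resources and avoid goroutine leaks

	return call(reqCtx)
}
```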

Step 3: Introduce circuit-breaker logic

Introduce circuit-breaker logic code

Next, we wrap our call to the external service in a circuit-breaker. The actual circuit-breaker implementation has been omitted for brevity, but there are two important points to consider:

  • It should only consider opening the circuit-breaker when requestTimeout is reached, not on ctx.Done().
  • The circuit name should ideally be unique for this specific endpoint.

Introduce circuit-breaker logic code - 2
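
A sketch of one way to keep upstream-context errors out of the circuit statistics, assuming the github.com/afex/hystrix-go package; Grab's actual implementation uses internal libraries, so treat this only as an illustration of the two points above:

```go
package client

import (
	"context"
	"errors"

	"github.com/afex/hystrix-go/hystrix"
)

var errParentContextDone = errors.New("parent context done")

// callWithBreaker wraps the outgoing request in a circuit-breaker whose name
// is unique to this endpoint. Errors caused by the caller's own context being
// done are separated out so they do not count towards opening the circuit.
func callWithBreaker(ctx context.Context, circuitName string, call func(context.Context) error) error {
	// If the caller has already given up, report it without involving the breaker.
	if ctx.Err() != nil {
		return ctx.Err()
	}

	err := hystrix.Do(circuitName, func() error {
		callErr := call(ctx)
		if ctx.Err() != nil {
			// The parent context expired or was cancelled mid-call: swallow
			// the error here so the circuit statistics are not polluted.
			return nil
		}
		return callErr // request timeouts and server errors do count
	}, nil)

	if ctx.Err() != nil {
		return errParentContextDone
	}
	return err
}
```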

Step 4: Introduce retries

The last step is to add retries to our request in the case of error. This can be implemented as a simple for loop, but there are some key things to include in a complete retry implementation:

  • ctx.Done() should be checked after each retry attempt to avoid wasting a call if the client has given up.
  • The request context should be cancelled before the next retry to avoid duplicate concurrent calls and goroutine leaks.
  • Not all kinds of requests should be retried.
  • A delay should be added before the next retry, using exponential backoff.
  • See Circuit Breaker vs Retries Part 2 for a thorough guide to implementing retries.

Step 5: The complete implementation

Complete implementation
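
As the original code image is not reproduced here, the following sketch pulls the previous steps together: context check, per-attempt request timeout, per-endpoint circuit-breaker, and a retry loop with exponential backoff. All names and values are illustrative, and a production version would also classify which errors are actually retryable:

```go
package client

import (
	"context"
	"time"

	"github.com/afex/hystrix-go/hystrix"
)

const (
	maxAttempts    = 2                      // 1 original attempt + 1 retry (illustrative)
	requestTimeout = 600 * time.Millisecond // per-attempt timeout (illustrative)
	baseBackoff    = 50 * time.Millisecond
)

// Do performs a resilient call to a remote endpoint: it propagates the
// caller's context, applies a per-attempt request timeout, wraps each attempt
// in a per-endpoint circuit-breaker, and retries with exponential backoff.
func Do(ctx context.Context, circuitName string, call func(context.Context) error) error {
	var lastErr error

	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Fail fast if the caller has already given up.
		if err := ctx.Err(); err != nil {
			return err
		}

		// Child context so each attempt gets its own request timeout and is
		// cancelled before the next attempt starts.
		reqCtx, cancel := context.WithTimeout(ctx, requestTimeout)

		err := hystrix.Do(circuitName, func() error {
			callErr := call(reqCtx)
			if ctx.Err() != nil {
				return nil // don't let upstream-context errors open the circuit
			}
			return callErr
		}, nil)
		cancel()

		if err == nil && ctx.Err() == nil {
			return nil // success
		}
		if ctx.Err() != nil {
			return ctx.Err() // the caller's deadline was exceeded; stop retrying
		}
		lastErr = err

		if attempt == maxAttempts-1 {
			break // no retries left
		}

		// Exponential backoff before the next attempt, abandoned early if the
		// caller's context is done.
		backoff := baseBackoff << attempt
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return lastErr
}
```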

And here we have arrived at our ‘ideal’ implementation of an external call including context handling and propagation, two levels of timeout (parent and request), circuit-breaking and retries. This should be sufficient for a good level of resiliency, avoiding wasted effort on both the client and server.

As a future enhancement, we could consider introducing a ‘minimum time per request’, which the retry loop should use to check for remaining time as well as ctx.Done() (but not instead – we need to account for client cancellation too). Of course metrics, logging and error handling should also be added as necessary.

Important Takeaways

To summarise, here are a few of the best practices for working with context timeouts:

Use SLAs and latency data to set effective timeouts

Having a default timeout value for everything doesn’t scale well. Use available information on SLAs and historic latency to set timeouts that give predictable results.

Understand the common error messages

The context canceled (context.Canceled) error occurs when the context is manually cancelled. This automatically cancels any child contexts attached to the parent. It is rare for this error to surface on the same service that triggered the cancellation; if cancel is called, it is usually because another error has been detected (such as a timeout) which would be returned instead. Therefore, context canceled is usually caused by an upstream error: either the client timed out and cancelled the request, or cancelled the request because it was no longer needed, or closed the connection (this typically results in a cancelled context from Go libraries).

The context deadline exceeded error occurs only when the time limit was reached. This could have been set locally (by the server processing the request) or by an upstream client. Unfortunately, it’s often difficult to distinguish between them, although they should generally be handled in the same way. If a more granular error is required, it is recommended to use child contexts and explicitly check them for ctx.Done(), as shown in our model implementation.
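
If you do need to branch on the two errors, a small helper using errors.Is keeps the distinction in one place; this is a sketch, not prescribed handling:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

func classify(err error) string {
	switch {
	case errors.Is(err, context.Canceled):
		// Almost always triggered upstream: the client gave up or closed the connection.
		return "cancelled by caller"
	case errors.Is(err, context.DeadlineExceeded):
		// A time limit was hit: either our own request timeout or a deadline
		// propagated from upstream.
		return "deadline exceeded"
	default:
		return "other error"
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(classify(ctx.Err())) // "cancelled by caller"
}
```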

Check for ctx.Done() before starting any significant work

Don’t enter an expensive block of code without checking the context; if the client has already given up, the work will be wasted.

Don’t open circuits for context errors

This leads to unpredictable behaviour, because there could be a number of reasons why the context might have been cancelled. Only context errors due to request timeouts originating from the local service should lead to circuit-breaker errors.

Set context timeouts at the edge service, using a cascading timeout budget

The upstream timeout must always be longer than the total downstream timeouts. Following this formula will help to avoid wasted effort and cascading failure.

In Conclusion

Go’s context package provides two extremely valuable tools that complement timeouts: deadline propagation and cancellation. This article has shown the benefits of using context timeouts and how to correctly configure them in a multi-server request path. Finally, we have discussed the relationship between context timeouts and circuit-breakers, proposing a model implementation for integrating them together in a common library.

If you have a Go server, chances are it’s already making heavy use of context. If you’re new to Go or had been confused by how context works, hopefully this article has helped to clarify misunderstandings. Otherwise, perhaps some of the topics covered will be useful in reviewing and improving your current context handling or circuit-breaker implementation.

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Post Syndicated from Grab Tech original https://engineering.grab.com/peak-shift-demand-travel-trends

Credits: Photo by rawpixel on Unsplash

Stuck in traffic in a Grab ride? Pass the time by opening your Grab app and checking out the Feed – just scroll down! You’ll find widgets for games, polls, videos, news and even food recommendations!

Beyond serving your everyday needs, we want to provide our users with information that is interesting, useful and relevant. That’s why we’re always coming up with new widgets.

Building each widget takes close collaboration across multiple different teams – from Product Management to Design, Engineering, Behavioral Science, and Data Science and Analytics. Sounds like a lot of people, doesn’t it? But you’ll be surprised to hear that this behind-the-scenes collaboration works rapidly, usually in the span of one month! Which means we’re often moving from ideation phase to product release in just a few weeks.

Travel Trends Widget

This fast-and-furious process is anchored on one word – “customer-centric”. And that’s how it all began with our  “Travel Trends Widget” – a widget that provides passengers with an overview of historical supply and demand trends for their current location and nearby time periods.

Because we had so much fun developing this widget, we wanted to write a blog post to share with you what we did and how we did it!

Inspiration: Where it all started

Transport demand can be rather lumpy. Owing to organic patterns (e.g. office hours), a lot of passengers tend to request for cars around the same time. In periods like this, the increase in demand could outpace the arrival of driver supply, increasing the waiting time for passengers.

Our goal at Grab is to make sure people get a ride when they want it and at the price they want, so we got to thinking about how we can ease this friction by leveraging our treasure trove – Big Data! – to help our passengers better plan their trips.

As we were looking at the data, we noticed that there is a seasonality to demand and supply: at certain times and days, imbalances appear, peak and disappear, and the process repeats itself. Studies say that humans in general, unless shown a compelling reason or benefit for change, are habitual beings subject to inertia. So we set out to achieve exactly that: To create a widget to surface information to our passengers that may help them alter their decisions on when they choose to book a ride, thereby redistributing some of the present peak demands to periods just before and after peak – also known as “peak shifting the demand”!

While this widget is the first-of-its-kind in the ride-hailing industry, “peak-shifting” was actually coined and introduced long ago!

London Transport Museum Trends

As you can see from this post from the London Transport Museum (Source: Transport for London), London tube tried peak-shifting long before anyone else: Original Ad from 1928 displayed on the left, and Ad from 2015 displayed on the right, comparing the trends to 1928.

Trends from a hotel in Beijing

You may also have seen something similar at the last hotel you stayed at. Notice here a poster in an elevator at a Beijing hotel, announcing the best times to eat breakfast in comfort and avoid the crowd. (Photo credits to Prashant, our Product Manager, who saw this on holiday.)

To apply “peak-shifting” and help our users better plan their trips, we decided to dig in and leverage our data. It was way more complex than we had initially thought, as market conditions could be different on different days. This meant that generic statements like “5PM-8PM are peak hours and prices will be high” would not hold true. Contrary to general perception, we observed that even during peak hours, there are buckets of time when there is no surge or low surge.

For instance, plots 1 and 2 below show what typical Monday and Tuesday surges look like in a given month. One of the key insights is that the surge trend during peak hours on Monday differs from that on Tuesday. It reinforces our initial hypothesis that every day is unique.

So we used machine learning techniques to build a forecasting widget which can help our users and give them the power to plan their trips beforehand. This widget is able to provide the pricing trends for the next 2 hours. So with a bit of flexibility, riders can ride the tide!

Grab trends

So how exactly does this widget work?!

Historical trends for Monday

It pulls together historically observed imbalances between supply and demand for the consumer’s current location and nearby time periods. Aggregated data is displayed to consumers in easily interpreted visualisations, so that they can plan to leave at times when there is more supply, and with potentially more savings on fares.

How did we build the widget? Loop, agile working process, POC & workstream

Widget-building is an agile, collaborative, and simultaneous process. First, we started the process with analysis from Product Analytics team, pulling out data on traffic trends, surge patterns, and behavioral insights of both passengers and drivers in Singapore.

When we noticed the existence of seasonality for each day of the week, we came up with more precise analytical and business questions to dig deeper into the data. Once our hypotheses were verified, we decided to build a widget.

The Behavioural Science, UX (User Experience) Design and Product Management teams then joined in and started giving shape to the problem we were solving. Our Behavioural Scientists shared their expertise on how information, suggestions and choices should be presented to enable easy assimilation and beneficial action. Daily whiteboarding breakouts, endless back-and-forth conversations, and a healthy amount of challenge-and-accept culture ensured that we distilled the idea down to its core. We then presented the relevant information with just the right level of detail, and with the right amount of messaging, to allow users to take the intended action, i.e. shift their demand outside of peak periods if possible.

Our amazing regional Copywriting team then swung in to put our intent into words in 7 different languages for our users across South-East Asia. Simultaneously, our UX designers and Full-stack Engineers started exploring the best visual components to communicate data on time trends to users. More on this later, but suffice to say that plenty of ideas were explored and discarded in a collaborative process, which aimed to create something that’s intuitive and engaging while being robust and scalable to work across all types of devices.

While these designs made their way up to engineering, the Data Science team worked on finding the most rigorous method to deduce the historical trend of surge across all our cities and areas, and time periods within them. There were discussions on how to best store and update this data reliably so that the widget itself can access it with great performance.

Soon after, we went into the development process, and voila! We had the first iteration of the widget ready on our staging (internal testing) servers in just 2 weeks! This prototype was opened up to the core team for influx of feedback.

And just two weeks later, the widget made its way to our Singapore and Jakarta Feeds, accessible to the world at large! Feedback from our users started pouring in almost immediately (thanks to the rich feedback functionality that comes with each widget), ranging from great to sometimes not-so-great, and we listened to all of it with a keen ear! And thus began a new cycle of iterations and continuous improvement, more of which we will share in a subsequent post.

In the trenches with the creators: How multiple teams got together to make this come true

Various disciplines within our cross-functional team came together to whip up this widget, each contributing their expertise to the end product.

Using Behavioural Science to simplify choices and design good outcomes

Behavioural Science helped to explore many facets of consumer behaviour in order to plan and design the widget: understanding how consumers think and conceptualizing a widget that can be easily understood and used by the consumers.

While fares are governed entirely by market conditions, it’s important for us to explain the economics to customers. As a customer-centric company, we aim to make the consumers feel like they own their decisions, which they can take based on full information. And this is the role of Behavioral Scientists at Grab!

In guiding the customers through the information, Behavioural Science team had the following three objectives in mind while building this Travel Trends widget:

  1. Offer transparency on the fares: By exposing our historic surge levels for a 4 hour period, we wanted to ensure that the passenger is aware of the surge levels and does not treat the fare as a nasty shock.
  2. Give information that helps them plan: By showing them surge levels for the future 2 hours, we wanted to help customers who have the flexibility, plan for a better time, hence, giving them the power to decide based on transparent information.
  3. Provide helpful tips: Every bar gives users tips on the conditions at that time and the immediate future. For instance, a low surge bar, followed by a high surge bar gives the tip “Psst… Leave now, It might get busy later!”, helping people understand the graph better and nudging them to take an action. If you are interested in saving fares, may we suggest tapping around all the bars to reveal the secret pro-tips?

Designing interfaces that lead to consumer success by abstracting complexity

The Design team is the one behind the colors and shapes that make up the widget that you see and interact with! The team took inspiration from Google’s Popular Times.

Source/Credits: Google Live Popular Times

Right from the outset, our content and product teams were keen to surface additional information and actions with each bar to keep the widget interactive and useful. One of the early challenges was to arrive at the right gesture that invites the user to interact with and intuitively navigate the bars on the widget, but does not conflict with other gestures (e.g. scrolling and scrubbing) that the user was pre-trained to perform on the feed. We found that tapping was simultaneously an unused and yet intuitive gesture that we could use for interaction with the bars.

We then went into rounds of iteration on the visual design of the widget. In this process, multiple stakeholders were involved ranging from Product to Content to Engineering. We had to overcome a number of constraints i.e. the limited canvas of a widget and the context of a user when she is exploring the feed. By re-using existing libraries and components, we managed to keep the development light and ship something fast.

GrabCar trends near you

Dozens of revisions and four iterations later, we landed with a design that we felt equipped the feature for its user-facing goal, and did so in a manner which was aesthetically appealing!

And finally we managed to deliver on the feature’s goal, by surfacing just the right level of detail in a manner that is intuitive yet effective in peak-shifting demand.

Bringing all of this to fruition through high performance engineering

Our Development Engineering team was in charge of developing the widget and making it available to our users in just a few weeks’ time – materialising the work of the other teams.

One of their challenges was to find the best way to process the vast amount of data (millions of database entries) so it can be visualized simply as bar charts. Grab’s engineers had to achieve this while making sure performance is as resilient as possible.

There were two options in doing this:

a) Fetch the data directly from the DB for each API call; or

b) Store the data in an in-memory data structure that is refreshed on a schedule, so that an API call triggered by a user no longer has to hit the DB.

Considering that this feature would likely attract a lot of traffic, and thus a high QPS, we decided that the former option would be too costly. Ultimately, we chose the latter option since it is more performant and more scalable.
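
As a rough sketch of the latter option (the names, data shapes and refresh interval are all illustrative, not Grab's actual implementation):

```go
package trends

import (
	"sync"
	"time"
)

// TrendStore keeps the pre-aggregated trend data in memory and refreshes it
// on a schedule, so API calls never hit the database directly.
type TrendStore struct {
	mu     sync.RWMutex
	trends map[string][]int // e.g. area ID -> hourly surge buckets
}

// NewTrendStore loads the data once, then refreshes it in the background.
func NewTrendStore(load func() map[string][]int, every time.Duration) *TrendStore {
	s := &TrendStore{trends: load()}
	go func() {
		for range time.Tick(every) {
			fresh := load() // one DB read per interval, shared by all requests
			s.mu.Lock()
			s.trends = fresh
			s.mu.Unlock()
		}
	}()
	return s
}

// Get serves a request entirely from memory.
func (s *TrendStore) Get(areaID string) []int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.trends[areaID]
}
```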

At the frontend, the challenge was to cater to the intricate requests from our designers. We use chart libraries to increase our development speed, and not all of the requirements were readily supported by these libraries.

For instance, the library we chose makes visualising charts easy, but not so much customising them. When the designers wanted the average line shown in a dotted form, the library did not support this easily. The moving arrow pointers as you move between bars, and the colors of the bars changing when tapped – all of this required countless CSS tweaks.

CSS tweak on trends widget

Closing the product loop with user feedback and data driven insights

One of the most crucial parts of launching any product is to ensure that customers are engaging with the widget and finding it useful.

To understand what customers think about the widget, whether they find it useful and whether it is helping them to plan better,  we delved into the huge mine of clickstream data.

User feedback on the trends widget

We found that 1 in 3 users who make a booking every day interact with the widget. And of these people, more than 70% have given the widget a positive rating. This validates our initial hypothesis that, given the option, our customers will love the freedom to plan their trips and will appreciate a more transparent ecosystem.

These users also indicate the things they like most about the widget. 61% of users gave positive rating for usefulness, 20% were impressed by the design (Kudos to our fantastic designer Ajmal!!) and 13% for usability.

Tweet about the widget

Beyond internal data, our widget made some rounds on social media channels. For example, here is a screenshot of what our users have to say on Twitter.

We closely track these metrics on user engagement and feedback to ensure that we keep improving and coming up with new iterations which helps us to serve our customers in a better way.

Conclusion

We hope you enjoyed reading about how we went from ideation, through iterations to a finished widget in the hands of the user, all in 1 month! Many hands helped along the way. If you are interested in joining this hyper-proactive problem-solving team, please check out Grab’s career site!

And if you have feedback for us, we are here to listen! While we could not be happier to see some positive reactions from the public, we are also thrilled to hear your suggestions and advice. Please leave us a memo using the widget’s comment function!

Epilogue

We just released an upgrade to this widget which allows users to set reminders and be notified about the availability of good fares in a time period of their choosing. We will keep watch and come knocking! Go ahead, find the widget on your Grab feed, set a reminder and save on fares on your next ride!

Vulcanizer: a library for operating Elasticsearch

Post Syndicated from GitHub Engineering original https://github.blog/2019-03-05-vulcanizer-a-library-for-operating-elasticsearch/

At GitHub, we use Elasticsearch as the main technology backing our search services. In order to administer our clusters, we use ChatOps via Hubot. As of 2017, those commands were a collection of Bash and Ruby-based scripts.

Although this served our needs for a time, it was becoming increasingly apparent that these scripts lacked composability and reusability. It was also difficult to contribute back to the community by open sourcing any of these scripts, because they were specific to GitHub’s bespoke infrastructure.

Why build something new?

There are plenty of excellent Elasticsearch libraries, both official and community driven. For Ruby, GitHub has already released the Elastomer library and for Go we make use of the Elastic library by user olivere. However, these libraries focus primarily on indexing and querying data. This is exactly what an application needs to use Elasticsearch, but it’s not the same set of tools that operators of an Elasticsearch cluster need. We wanted a high-level API that corresponded to the common operations we took on a cluster, such as disabling allocation or draining the shards from a node. Our goal was a library that focused on these administrative operations and that our existing tooling could easily use.

Full speed ahead with Go…

We started looking into Go and were inspired by GitHub’s success with freno and orchestrator.

Go’s structure encourages the construction of composable software (self-contained, stateless components that can be selected and assembled), and we saw it as a good fit for this application.

… Into a wall

We initially scoped the project out to be a packaged chat app and planned to open source only what we were using internally. During implementation, however, we ran into a few problems:

  • GitHub uses a simple protocol based on JSON-RPC over HTTPS called ChatOps RPC. However, ChatOps RPC is not widely adopted outside of GitHub. This would make integration of our application into ChatOps infrastructure difficult for most parties.
  • The internal REST library our ChatOps commands relied on was not open sourced. Some of the dependencies of this REST library would also need to be open sourced. We’ve started the process of open sourcing this library and its dependencies, but it will take some time.
  • We relied on Consul for service discovery, which not everyone uses.

Based on these factors we decided to break out the core of our library into a separate package that we could open source. This would decouple the package from our internal libraries, Consul, and ChatOps RPC.

The package would only have a few goals:

  • Access the REST endpoints on a single host.
  • Perform an action.
  • Provide results of the action.

This module could then be open sourced without being tied to our internal infrastructure, so that anyone could use it with the ChatOps infrastructure, service discovery, or tooling they choose.

To that end, we wrote vulcanizer.

Vulcanizer

Vulcanizer is a Go library for interacting with an Elasticsearch cluster. It is not meant to be a full-fledged Elasticsearch client. Its goal is to provide a high-level API to help with common tasks that are associated with operating an Elasticsearch cluster such as querying health status of the cluster, migrating data off of nodes, updating cluster settings, and more.

Examples of the Go API

Elasticsearch is great in that almost all things you’d want to accomplish can be done via its HTTP interface, but you don’t want to write JSON by hand, especially during an incident. Below are a few examples of how we use Vulcanizer for common tasks and the equivalent curl commands. The Go examples are simplified and don’t show error handling.

Getting nodes of a cluster

You’ll often want to list the nodes in your cluster to pick out a specific node or to see how many nodes of each type you have in the cluster.

$ curl localhost:9200/_cat/nodes?h=master,role,name,ip,id,jdk
- mdi vulcanizer-node-123 172.0.0.1 xGIs 1.8.0_191
* mdi vulcanizer-node-456 172.0.0.2 RCVG 1.8.0_191

Vulcanizer exposes typed structs for these types of objects.

v := vulcanizer.NewClient("localhost", 9200)

nodes, err := v.GetNodes()

fmt.Printf("Node information: %#v\n", nodes[0])
// Node information: vulcanizer.Node{Name:"vulcanizer-node-123", Ip:"172.0.0.1", Id:"xGIs", Role:"mdi", Master:"-", Jdk:"1.8.0_191"}

Update the max recovery cluster setting

The index recovery speed is a common setting to update when you want to balance time to recovery against I/O pressure across your cluster. The curl version has a lot of JSON to write.

$ curl -XPUT localhost:9200/_cluster/settings -d '{ "transient": { "indices.recovery.max_bytes_per_sec": "1000mb" } }'
{
  "acknowledged": true,
  "persistent": {},
  "transient": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "1000mb"
      }
    }
  }
}

The Vulcanizer API is fairly simple and will also retrieve and return any existing setting for that key so that you can record the previous value.

v := vulcanizer.NewClient("localhost", 9200)
oldSetting, newSetting, err := v.SetSetting("indices.recovery.max_bytes_per_sec", "1000mb")
// "50mb", "1000mb", nil

Move shards on to and off of a node

To safely update a node, you can set allocation rules so that data is migrated off a specific node. In the Elasticsearch settings, this is a comma-separated list of node names, so you’ll need to be careful not to overwrite an existing value when updating it.

$ curl -XPUT localhost:9200/_cluster/settings -d '
{
  "transient" : {
    "cluster.routing.allocation.exclude._name" : "vulcanizer-node-123,vulcanizer-node-456"
  }
}'

The Vulcanizer API will safely add or remove nodes from the exclude settings so that shards won’t allocate on to a node unexpectedly.

v := vulcanizer.NewClient("localhost", 9200)

// Existing exclusion settings:
// vulcanizer-node-123,vulcanizer-node-456

exclusionSettings1, err := v.DrainServer("vulcanizer-node-789")
// vulcanizer-node-123,vulcanizer-node-456,vulcanizer-node-789

exclusionSettings2, err := v.FillOneServer("vulcanizer-node-456")
// vulcanizer-node-123,vulcanizer-node-789
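
Under the hood, that kind of safety boils down to treating the comma-separated setting as a set rather than overwriting the string. A rough Go sketch of the idea (not Vulcanizer's actual implementation) might look like this:

package exclude

import "strings"

// addExclusion returns the exclusion list with name added, preserving
// any nodes that are already being drained.
func addExclusion(current, name string) string {
    for _, n := range strings.Split(current, ",") {
        if n == name {
            return current // already excluded, nothing to do
        }
    }
    if current == "" {
        return name
    }
    return current + "," + name
}

// removeExclusion returns the exclusion list with name removed,
// leaving all other drained nodes untouched.
func removeExclusion(current, name string) string {
    var kept []string
    for _, n := range strings.Split(current, ",") {
        if n != "" && n != name {
            kept = append(kept, n)
        }
    }
    return strings.Join(kept, ",")
}

In these terms, a drain would write back the result of addExclusion and a fill the result of removeExclusion, which is why the existing exclusions in the example above survive both calls.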

Command-line application

Included is a small CLI application that leverages the library:

$ vulcanizer -h
Usage:
  vulcanizer [command]

Available Commands:
  allocation  Set shard allocation on the cluster.
  drain       Drain a server or see what servers are draining.
  fill        Fill servers with data, removing shard allocation exclusion rules.
  health      Display the health of the cluster.
  help        Help about any command
  indices     Display the indices of the cluster.
  nodes       Display the nodes of the cluster.
  setting     Interact with cluster settings.
  settings    Display all the settings of the cluster.
  snapshot    Interact with a specific snapshot.
  snapshots   Display the snapshots of the cluster.

Flags:
  -c, --cluster string      Cluster to connect to defined in config file
  -f, --configFile string   Configuration file to read in (default to "~/.vulcanizer.yaml")
  -h, --help                help for vulcanizer
      --host string         Host to connect to (default "localhost")
  -p, --port int            Port to connect to (default 9200)

Use "vulcanizer [command] --help" for more information about a command.

ChatOps

ChatOps is important for GitHub and our geographically distributed workforce. Vulcanizer enables us to build ChatOps tooling around Elasticsearch quickly and easily for common tasks:

jessbreckenridge

.vulcan-go health testbed
Hubot

The cluster is green: All primary and replica shards are allocated. This does NOT mean the cluster is otherwise healthy.
+--------+------------+------+------------+----------+
| STATUS | RELOCATING | INIT | UNASSIGNED | ACTIVE % |
+--------+------------+------+------------+----------+
| green  | 0          | 0    | 0          | 100      |
+--------+------------+------+------------+----------+
jessbreckenridge

.vulcan-go nodes testbed
Hubot

testbed nodes
+--------+------+------------------------+-----------+-----------+
| MASTER | ROLE |          NAME          |    JDK    |      IP   |
+--------+------+------------------------+-----------+-----------+
| *      | mi   | search-arbiter-aaaaaaa | 1.8.0_111 | 172.0.0.1 |
| -      | mi   | search-arbiter-bbbbbbb | 1.8.0_111 | 172.0.0.2 |
| -      | mi   | search-arbiter-ccccccc | 1.8.0_111 | 172.0.0.3 |
| -      | di   | search-storage-ddddddd | 1.8.0_111 | 172.0.0.4 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.5 |
| -      | di   | search-storage-eeeeeee | 1.8.0_111 | 172.0.0.6 |
+--------+------+------------------------+-----------+-----------+
nickcanz

.vulcan-go drain status testbed
Hubot

IP's filtered: None
Name Excluded: None
Host Excluded: None
nickcanz

.vulcan-go drain server search-storage-zzzzzzz testbed
Hubot

Server "search-storage-zzzzzzz" added to list of servers to drain.

Complete list servers draining:
    Ips: None
    Names: search-storage-zzzzzzz
    Hosts: None
nickcanz

.vulcan-go settings testbed
Hubot

testbed settings
No persistent settings are set.
+------------------------------------------+------------------------+
|            TRANSIENT SETTINGS            |         VALUE          |
+------------------------------------------+------------------------+
| cluster.routing.allocation.exclude._name | search-storage-zzzzzzz |
+------------------------------------------+------------------------+

Closing

We stumbled a bit when we first started down this path, but the end result is best for everyone:

  • Since we had to regroup about what exact functionality we wanted to open source, we made sure we were providing value to ourselves and the community instead of just shipping something.
  • Internal tooling doesn’t always follow engineering best practices like proper release management, so developing Vulcanizer in the open provides an external pressure to make sure we follow all of the best practices.
  • Having all of the Elasticsearch functionality in its own library allows our internal applications to be very slim and isolated. Our different internal applications have a clear dependency on Vulcanizer instead of having different internal applications depend on each other or worse, trying to get ChatOps to talk to other ChatOps.

Visit the Vulcanizer repository to clone or contribute to the project. We have ideas for future development in the Vulcanizer roadmap.


Structured Logging: The Best Friend You’ll Want When Things Go Wrong

Post Syndicated from Grab Tech original https://engineering.grab.com/structured-logging

Introduction

Every day, millions of people around Southeast Asia count on Grab to get themselves, or what they need, from point A to B in a safe, comfortable and reliable manner. In fact, we very recently crossed our 3 billion transport rides milestone, gaining the last billion in a mere 6 months!

We take this responsibility very seriously, and as we continue to grow and expand, it’s important for us to maintain a sophisticated backend system that is capable of sustaining the kind of scale needed to support all our customers in Southeast Asia. This backend system comprises multiple services that interact with each other in many different ways. As Grab evolves, maintaining them becomes a significantly larger and harder task as developers continuously develop new features.

To maintain these systems well, it’s important to have better observability: data that helps us understand what is happening in the system through good monitoring (metrics), event logs, and tracing for request-scoped data. Of these, logs provide the most complete picture of what happened within the system – and are typically the first and most heavily used point of contact. With good logs, the backend becomes much easier to understand, maintain, and debug. Without logs, or with bad logs, we have a recipe for disaster; it becomes nearly impossible to understand what’s happening.

In this article, we focus on a form of logging called structured logging. We discuss what it is, why it is better, and how we built a framework that integrates well with our current Elastic stack-based logging backend, allowing us to do logging better and more efficiently.

Structured Logging is a part of a larger endeavour which will enable us to reduce the Mean Time To Resolve (MTTR), helping developers to mitigate issues faster when outages happen.

What are Logs?

Logs are lines of texts containing some information about some event that occurred in our system, and they serve a crucial function of helping us understand what’s happening in the backend. Logs are usually placed at points in the code where a significant event has happened (for example, some database operation succeeded or a passenger got assigned to a driver) or at any other place in the code that we are interested in observing.

The first thing that a developer would normally do when an error is reported is check the logs – sort of like walking through the history of the system and finding out what happened. Therefore, logs can be a developer’s best friend in times of service outages, errors, and failed builds.

Logs in today’s world have varying formats and features.

  • Log Format: These range from simple key-value based (like syslog) to quite structured and detailed (like JSON). Since logs are mostly meant for developer eyes, how detailed or structured a log is dictates how fast the developer can query the logs, as well as read them. The more structured the data is – the larger the size is per log line, although it’s more queryable and contains richer information.
  • Levelled Logging (or Log Levels): Logs with different severities can be logged at different levels. The visibility can be limited to a single level, limiting all logs only with a certain severity or above (for example, only logs WARN and above). Usually log levels are static in production environments, and finding DEBUG logs usually requires redeploying.
  • Log Aggregation Backend: Logs can have different log aggregation backends, which means different backends (i.e. Splunk, Kibana, etc.) decide what your logs might look like or what you might be able to do with them. Some might cost a lot more than others.
  • Causal Ordering: Logs may or may not preserve the exact time at which they were written. This is important, as the precision of that timestamp dictates how accurately we can reconstruct the sequence of events from the logs.
  • Log Correlation: We serve countless requests from our backend services. Being able to see all the logs relevant to a particular request or a particular event helps us drill down to the relevant information for a specific request (e.g. for a specific passenger trying to book a ride).

Combine this with the plethora of logging libraries available and you easily end up with a developer holding their head in confusion, unable to decide what to use. Each library also has its own set of advantages and disadvantages, so the discussion can quickly become subjective and polarised – which is why it is crucial to choose the appropriate library and backend pair for your applications.

We at Grab use different types of logging libraries. However, as requirements changed, we found ourselves re-evaluating our logging strategy.

The State of Logging at Grab

The number of Golang services at Grab has continuously grown. Most services used syslog-style key-value format logs, recognized as the most common format of logs for server-side applications due to its simplicity and ease for reading and writing. All these logs were made possible by a handful of common libraries, which were directly imported and used by different services.

We used a cloud-based SaaS vendor as a frontend for these logs, where application-emitted logs were routed to files and sent to our logging vendor, making it possible to view and query them in real time. Things were pretty great and frictionless for a long time.

However, as time went by, our logging bills started mounting to unprecedented levels and we found ourselves revisiting and re-evaluating how we did logging. A few issues surfaced:

  • Logging volume reduction efforts were successful to some extent – but were arduous and painful. Part of the reason was that almost all the logs were at a single log level – INFO.
Figure 1: Log Level Usage

This issue was not limited to a single service, but pervasive across services. For mitigation, some services added sampling to logs, some removed logs altogether. The latter is only a recipe for disaster, so it was known that we had to improve levelled logging.

  • The vendor was expensive for us at the time, and we also had a few concerns – primarily limitations around its DSL (query language). There were many good open source alternatives available – the Elastic stack, to name one. Our engineers felt confident that we could manage our own logging infrastructure and control the costs better – which led to the proposal and building of an Elastic stack logging cluster. Elasticsearch is vastly more powerful and rich than our vendor at the time, and our existing libraries weren’t enough to fully leverage its capabilities, so we needed a library that could leverage structure in logs better and integrate easily with the Elastic stack.
  • There were some minor issues in our logging libraries namely:
    • Singleton initialisation pattern that made unit-testing harder
    • Single logger interface that reduced the possibility of extending the core logging functionality as almost all the services imported the logger interface directly
    • No out-of-the-box support for multiple writers
  • If we were going to write a library, we had to fix these issues – and also encourage the use of best practices.

  • Grab’s critical path (the number of services traversed by a single booking flow request) has grown in size. On average, a single booking request touches multiple microservices – each of which does something different. At the scale at which we operate, it’s therefore necessary to easily view logs from all the services involved in a single request – however, this was not something the library did automatically. Hence, we also wanted to make log correlation easier and better.
  • Logs are events which happened at some point in time. The order in which these events occurred gives us a complete history of what happened in the system. However, the core logging library which formed the base of logging across our Golang services didn’t preserve the log generation time (it used the write time instead). This led to jumbling of logs generated within a span of a few microseconds – which not only makes the lives of our developers harder, but makes it nearly impossible to get an exact history of the system. This is why we also wanted to improve and enable causal ordering of logs – one of the key steps in understanding what’s happening in the system.

Why Change?

As mentioned, we knew there were issues with how we were logging. To best approach the problem and be able to solve it as much as possible without affecting existing infrastructure and services, it was decided to bootstrap a new library from the ground up. This library would solve known issues, as well as contain features which would not have been possible by modifying existing libraries. For a recap, here’s what we wanted to solve:

  • Improve levelled logging
  • Leverage structure in logs better
  • Easily integrate with Elastic stack
  • Encourage usage of best practices
  • Make log correlation easier and better
  • Improve and enable causal ordering of logs for a better understanding of service distribution

Enter Structured Logging. Structured Logging has been quite popular around the world, finding widespread adoption. It was easily integrable with our Elastic stack backend and would also solve most of our pain points.

Structured Logging

Keeping our previous problems and requirements in mind, we bootstrapped a library in Golang, which has the following features:

Dynamic Log Levels

This allows us to change our initialised log levels at runtime from a configuration management system – something which was neither possible nor encouraged before.

This makes log levels actually meaningful now – developers can deploy with the usual WARN or INFO log levels, and when things go wrong, a single configuration change lets them update the level to DEBUG and make their services output more logs for debugging. This also helps us keep our logging costs in check. We made it easy and straightforward to integrate this with our configuration management system.
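
To illustrate the mechanism – this is a simplified stand-in, not the internal library – the logger can read its level through an atomic value on every call, so a configuration-management watcher can flip it at runtime without a redeploy:

package dynlog

import (
    "log"
    "sync/atomic"
)

// Levels, lowest to highest severity.
const (
    DebugLevel int32 = iota
    InfoLevel
    WarnLevel
)

// Logger checks an atomically stored level on every call, so the level
// can be changed at runtime by a configuration watcher.
type Logger struct {
    level atomic.Int32
}

// SetLevel is what a configuration callback would invoke when the
// config value changes (e.g. from WARN to DEBUG during an incident).
func (l *Logger) SetLevel(level int32) { l.level.Store(level) }

func (l *Logger) Debug(msg string) {
    if l.level.Load() <= DebugLevel {
        log.Println("DEBUG:", msg)
    }
}

func (l *Logger) Warn(msg string) {
    if l.level.Load() <= WarnLevel {
        log.Println("WARN:", msg)
    }
}

The next log call after SetLevel picks up the new level, with no restart required.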

Consistent Structure in Logs

Logs are inherently semi-structured – unlike a database schema, which is rigid, or freeform text, which has no structure at all. Our Elastic stack backend is primarily based on indices (sort of like tables) with mappings (sort of like a loose schema). For this, we needed to output logs in JSON with a consistent structure (for example, we cannot output an integer and a string under the same JSON field, because that will cause an indexing failure in Elasticsearch). Also, we were aware that one of our primary goals was keeping our logging costs in check, and since it didn’t make sense to structure and index almost every field, adding only the structure which is useful to us made sense.

To address this, we built a utility that allows us to add structure to our logs deterministically. It is built on top of a schema in which we can declare key-value pairs with a specific key name and type, generate code based on that, and use the generated code to make sure that fields are consistently formatted and don’t break. We called this schema (a collection of key name and type pairs) the Common Grab Log Schema (CGLS). We only add structure to CGLS which is important – everything included in CGLS gets formatted into its own field, and everything else gets folded into a single field in the generated JSON. This helps keep our structure consistent and easily usable with the Elastic stack.
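
As a purely illustrative picture of the approach, a generated entry might expose typed JSON keys for the fields declared in the schema and fold everything else into a single free-form field, keeping the Elasticsearch mapping stable. The field names below are made up and are not the actual CGLS:

package cgls

import "encoding/json"

// Entry mirrors a tiny, hypothetical slice of a common log schema:
// typed keys for fields we want indexed, plus a single catch-all map
// so the Elasticsearch mapping never sees surprise types.
type Entry struct {
    Timestamp int64             `json:"ts"`
    Level     string            `json:"level"`
    TraceID   string            `json:"trace_id,omitempty"`
    Message   string            `json:"msg"`
    Extra     map[string]string `json:"extra,omitempty"` // unindexed, free-form
}

// Marshal renders the entry as a single JSON log line.
func Marshal(e Entry) (string, error) {
    b, err := json.Marshal(e)
    return string(b), err
}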

Figure 2: Overview of Common Grab Log Schema for Golang backend services

Plug and Play support with Grab-Kit

We made the initialization and use easy and out-of-the-box with our in-house support for Grab-Kit, so developers can just use it without making any drastic changes. Also, as part of this integration, we added automatic log correlation based on request IDs present in traces, which ensured that all the logs generated for a particular request already have that trace ID.
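
Conceptually, the correlation piece behaves like middleware that extracts the request or trace ID and carries it through the context, so every log call for that request is tagged automatically. The sketch below is a generic stand-in rather than the actual Grab-Kit integration, and the header name is illustrative:

package correlate

import (
    "context"
    "log"
    "net/http"
)

type ctxKey struct{}

// WithRequestID wraps a handler and attaches the request ID to the
// request context so every log line for this request can carry it.
func WithRequestID(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        id := r.Header.Get("X-Request-ID") // illustrative header name
        ctx := context.WithValue(r.Context(), ctxKey{}, id)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

// Logf prefixes the log line with the request ID found in the context.
func Logf(ctx context.Context, format string, args ...interface{}) {
    id, _ := ctx.Value(ctxKey{}).(string)
    log.Printf("request_id=%s "+format, append([]interface{}{id}, args...)...)
}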

Configurable Log Format

Our primary requirement was building a logger expressive and consistent enough to integrate well with the Elastic stack backend – without resorting to fancy log parsing downstream. The library is therefore expressive and configurable enough to allow any log format (we can add different formats for future use cases – for example, a readable format in development and JSON output in production), with JSON output as the default. This ensures that we can produce log output which is compatible with the Elastic stack, yet remain configurable enough for different use cases.

Support for Multiple Writers with Different Formats

As part of extending the library’s functionality, we needed enough configurability to be able to send different logs to different places at different settings. For example, sending FATAL logs to Slack asynchronously in some readable format, while sending all the usual logs to our Elastic stack backend. This library includes support for chaining such “cores” to any arbitrary degree possible – making sure that this logger can be used in such highly specialized cases as well.
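
One way to picture the chaining is a logger that fans each entry out to several sinks, each with its own minimum level and formatter – say, JSON to a file destined for the Elastic stack and a plain-text line to an alerting hook for FATAL entries. This is a simplified sketch of the idea, not the library's actual core interface:

package multisink

import (
    "fmt"
    "io"
)

// Sink pairs a writer with its own format function and minimum level.
type Sink struct {
    MinLevel int
    Format   func(level int, msg string) string
    Out      io.Writer
}

// Logger fans every entry out to all configured sinks.
type Logger struct {
    Sinks []Sink
}

// Log writes the entry to every sink whose level threshold it meets.
func (l *Logger) Log(level int, msg string) {
    for _, s := range l.Sinks {
        if level >= s.MinLevel {
            fmt.Fprintln(s.Out, s.Format(level, msg))
        }
    }
}

In practice each sink could also buffer or write asynchronously, which is what makes a case like Slack-for-FATAL workable without slowing down the request path.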

Production-like Logging Environment in Development

Developers have been reading console logs since the dawn of time, but structured JSON logs – normally reserved for production – are far more searchable and powerful. To let developers tap into this power locally and see their logs in Kibana directly, we provide a dockerised version of Kibana which can be spun up locally to accept structured logs. This allows developers to use structured logs and see them in Kibana – just like in production!

Having this library enabled us to do logging in a much better way. The most noticeable impact was that our simple access logs can now be queried better – with more filters and conditions.

Figure 3: Production-like Logging Environment in Development

Causal Ordering

Having an exact history of events makes debugging issues in production systems easier – one can just look at the history, quickly form a hypothesis about what’s wrong, and fix it. To this end, the structured logging library records the exact log-generation timestamp in nanoseconds. Combined with the structured JSON-like format, this makes it possible to sort all the logs by this field – so we can see logs in the exact order in which they happened, achieving causal ordering of logs. This is an underrated but highly powerful feature that makes debugging easier.
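
The essential detail is capturing the generation timestamp at nanosecond precision at the moment the log call happens, so that sorting on that field downstream recovers the true order of events. A minimal sketch:

package causal

import (
    "encoding/json"
    "time"
)

// event is a stripped-down log entry carrying the generation time in
// nanoseconds, captured when the log call is made rather than when the
// line is flushed.
type event struct {
    TsNanos int64  `json:"ts_ns"`
    Message string `json:"msg"`
}

// Line builds the JSON log line, so entries written microseconds apart
// can still be totally ordered by ts_ns downstream.
func Line(msg string) ([]byte, error) {
    return json.Marshal(event{
        TsNanos: time.Now().UnixNano(),
        Message: msg,
    })
}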

Figure 4: Causal ordering of logs with Y’ALL

But Why Structured Logging?

Now that you know about the history and the reasons behind our logging strategy, let’s discuss the benefits that you reap from it.

On the outset, having logs well-defined and structured (like JSON) has multiple benefits, including but not limited to:

  • Better root cause analysis: With structured logs, we can ingest and perform more powerful queries that wouldn’t be possible with simple unstructured logs. Developers can run more informative queries to find the logs relevant to the situation. On top of this, log correlation and causal ordering make it possible to gain a better understanding of distributed logs. Unlike unstructured data, where we are limited to full-text search or a handful of log types, structured logs take the possibilities to a whole new level.
  • More transparency or better observability: With structured logs, you increase the visibility of what is happening with your system – since now you can log information in a better, more expressive way. This enables you to have a more transparent view of what is happening in the system and makes your systems easier to maintain and debug over longer periods of time.
  • Better consistency: With structured logs, you increase the structure present in your logs – and in turn make your logs more consistent as the systems evolve. This allows us to index our logs in a system like the Elastic stack more easily, as we can be sure that we are sticking to some structure. Also, with the adoption of a common schema, we can rest assured that we are all using the same structure.
  • Better standardization: Having a single, well-defined, structured way to do logging allows us to standardize logging – which reduces cognitive overhead of figuring out what happened in systems via logs and allows easier adoption. Instead of going through 100 different types of logs, you instead would only have a single format. This is also one of the goals of the library – standardizing the usage of the library across Golang backend services.

We get some additional benefits as well:

  • Dynamic Log Levels: This allows us to have meaningful log levels in our code – where we can deploy with baseline warning settings and switch to lower levels (debug logs) only when we need them. This helps keep our logging costs low, as well as reduces the noise that developers usually need to go through when debugging.
  • Future-proof Consistency in Logs: With the adoption of a common schema, we make sure that we stick with the same structure, even if say tomorrow our logging infrastructure changes – making us future-ready. Instead of manually specifying what to log, we can simply expose a function in our loggers.
  • Production-Like Logging Environment in Development: The dockerized Kibana allows developers to enjoy the same benefits as the production Kibana. This also encourages developers to use Elastic stack more and explore its features such as building dashboards based on the log data, having better watchers, and so on.

I hope you have enjoyed this article and found it useful. Comments and corrections are always welcome.

Happy Logging!

How we simplified our Data Ingestion & Transformation Process

Post Syndicated from Grab Tech original https://engineering.grab.com/data-ingestion-transformation-product-insights

Introduction

As Grab grew from a small startup to an organisation serving millions of customers and driver partners, making day-to-day data-driven decisions became paramount. We needed a system to efficiently ingest data from mobile apps and backend systems and then make it available for analytics and engineering teams.

Thanks to modern data processing frameworks, ingesting data isn’t a big issue. However, at Grab scale it is a non-trivial task. We had to prepare for two key scenarios:

  • Business growth, including organic growth over time and expected seasonality effects.
  • Any unexpected peaks due to unforeseen circumstances. Our systems have to be horizontally scalable.

We could ingest data in batches, in real time, or a combination of the two. When you ingest data in batches, you can import it at regularly scheduled intervals or when it reaches a certain size. This is very useful when processes run on a schedule, such as reports that run daily at a specific time. Typically, batched data is useful for offline analytics and data science.

On the other hand, real-time ingestion has significant business value, for example in reactive systems. When a customer provides feedback for a Grab superapp widget, we re-rank widgets based on that customer’s likes or dislikes. Note that when information is very time-sensitive, you must monitor its data continuously.

This blog post describes how Grab built a scalable data ingestion system and how we went from prototyping with Spark Streaming to running a production-grade data processing cluster written in Golang.

Building the system without reinventing the wheel

The data ingestion system:

  1. Collects raw data as app events.
  2. Transforms the data into a structured format.
  3. Stores the data for analysis and monitoring.

In a previous blog post, we discussed dealing with batched data ETL with Spark. This post focuses on real-time ingestion.

We separated the data ingestion system into 3 layers: collection, transformation, and storage. This table and diagram highlights the tools used in each layer in our system’s first design.

Layer           Tools
Collection      Gateway, Kafka
Transformation  Go processing service, Spark Streaming
Storage         TalariaDB

Our first design might seem complex, but we used battle-tested and common tools such as Apache Kafka and Spark Streaming. This let us get an end-to-end solution up and running quickly.

Collection layer

Our collection layer had two sub-layers:

  1. Our custom built API Gateway received HTTP requests from the mobile app. It simply decoded and authenticated HTTP requests, streaming the data to the Kafka queue.
  2. The Kafka queue decoupled the transformation layer (shown in the above figure as the processing service and Spark Streaming) from the collection layer (shown above as the Gateway service). We needed to retain raw data in the Kafka queue for fault tolerance of the entire system. Imagine an error where a data pipeline pollutes the data with flawed transformation code, or simply crashes. The Kafka queue saves us from data loss, because the retained raw data lets us backfill.

Since it’s robust and battle-tested, we chose Kafka as our queueing solution. It perfectly met our requirements, such as high throughput and low latency. Although Kafka takes some operational effort such as self-hosting and monitoring, Grab has a proficient and dedicated team managing our Kafka cluster.

Transformation layer

There are many options for real-time data processing, including Spark Streaming, Flink, and Storm. Since we use Spark for all our batch processing, we decided to use Spark Streaming.

We deployed a Golang processing service between Kafka and Spark Streaming. This service converts the data from Protobuf to Avro. Instead of pointing Spark Streaming directly to Kafka, we used this processing service as an intermediary. This was because our Spark Streaming job was written in Python and Spark doesn’t natively support protobuf decoding.  We used Avro format, since Grab historically used it for archiving streaming data. Each raw event was enriched and batched together with other events. Batches were then uploaded to S3.
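
For a rough sense of the shape of such a processing service, here is a generic consume-and-batch skeleton in Go. It uses the segmentio/kafka-go client purely for illustration; the broker address, topic and group names are made up, and the Protobuf-to-Avro conversion and S3 upload are deliberately elided:

package main

import (
    "context"
    "log"
    "time"

    "github.com/segmentio/kafka-go"
)

func main() {
    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"localhost:9092"}, // illustrative broker address
        GroupID: "ingestion-transformer",    // illustrative consumer group
        Topic:   "app-events",               // illustrative topic
    })
    defer r.Close()

    msgs := make(chan []byte, 1024)

    // Consume raw event payloads off Kafka in the background.
    go func() {
        for {
            m, err := r.ReadMessage(context.Background())
            if err != nil {
                log.Println("read error:", err)
                continue
            }
            msgs <- m.Value
        }
    }()

    // Buffer events and flush once a minute; a real service would convert
    // each payload and upload the batch (e.g. to S3) at this point.
    var batch [][]byte
    flush := time.NewTicker(time.Minute)
    defer flush.Stop()
    for {
        select {
        case m := <-msgs:
            batch = append(batch, m)
        case <-flush.C:
            log.Printf("flushing %d events", len(batch))
            batch = nil
        }
    }
}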

Storage layer

TalariaDB is a Grab-built time-series database. It ingests events as columnar ORC files, indexing them by event name and time. We use the same ORC format files for batch processing. TalariaDB also implements the Presto Thrift connector interface, so our users could query certain event types by time range. They did this by connecting Presto to the distributed cluster hosting TalariaDB.

Problems

Building and deploying our data pipeline’s MVP provided great value to our data analysts, engineers, and QA team. For example, our mobile app team could monitor any abnormal change in real-time metrics, such as the screen load time for the latest released app version. The QA team could perform app-side actions (book a ride, make a payment, etc.) and check which events were triggered and received by the backend. The latency between ingestion and the serving layer was only 4 minutes, instead of the batch processing system’s 60 minutes. The streaming pipeline’s data showed good business value.

This prompted us to develop more features on top of our platform-collected real-time data. Very soon our QA engineers and the product analytics team used more and more of the real-time data processing system. They started instrumenting various mobile applications so more data started flowing in. However, as our ingested data increased, so did our problems. These were mostly related to operational complexity and the increased latency.

Operational complexity

Only a few team members could operate Spark Streaming and EMR. With more data and variable rates, our streaming jobs had scaling issues and failed occasionally. This was due to checkpoint issues when the cluster was under heavy load. Increasing the cluster size helped, but adding more nodes also increased the likelihood of losing cluster nodes. When we lost nodes, our latency went up and our already busy on-call engineers had more work.

Supporting native Protobuf

To simplify the architecture, we initially planned to bypass our Golang-written processing service for the real-time data pipeline. Our plan was to let Spark directly talk to the Kafka queue and send the output to S3. This required packaging the decoders for our protobuf messages for Python Spark jobs, which was cumbersome. We thought about rewriting our job in Scala, but we didn’t have enough experience with it.

Also, we’d soon hit some streaming limits from S3. Our Spark Streaming job was consuming objects from S3, but the process was not continuous due to S3’s eventual consistency. To avoid long pagination queries in the S3 API, we had to prefix the data with the hour in which it was ingested. This resulted in some data loss after processing by Spark Streaming. The loss happened because new data would appear in S3 after Spark Streaming had already moved on to the next hour. We tried various tweaks, but it was just a bad design. As our data grew to over one terabyte per hour, our data loss grew with it.

Processing lag

On average, the time from our system ingesting an event to when it was available in Presto was 4 to 6 minutes. We call that processing lag, as it happened due to our data processing. It was substantially worse under heavy loads, increasing to 8 to 13 minutes. While that wasn’t bad at this scale (a few TBs of data), it made some use cases impossible, such as monitoring. We needed to do better.

Simplifying the architecture and rewriting in Golang

After completing the MVP phase development, we noticed the Spark Streaming functionality we actually used was relatively trivial. In the Spark Streaming job, we only:

  • Partitioned the batch of events by event name.
  • Encoded the data in ORC format.
  • And uploaded to an S3 bucket.

To mitigate the problems mentioned above, we tried re-implementing the features in our existing Golang processing service. Besides consuming the data and publishing to an S3 bucket, the transformation service also needed to deal with event partitioning and ORC encoding.

One key problem we addressed was implementing a robust event partitioner with a large write throughput and low read latency. Fortunately, Golang has a nice concurrent map package. To further reduce the lock contention, we added sharding.
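
A condensed sketch of that sharding idea, using plain maps with a per-shard mutex (the production partitioner additionally handles ORC encoding and flush scheduling, which are omitted here):

package partition

import (
    "hash/fnv"
    "sync"
)

const shardCount = 64

// shard guards one slice of the event-name keyspace with its own lock,
// so writers for different event names rarely contend.
type shard struct {
    mu     sync.Mutex
    events map[string][][]byte
}

// Partitioner groups raw events by event name across multiple shards.
type Partitioner struct {
    shards [shardCount]shard
}

func NewPartitioner() *Partitioner {
    p := &Partitioner{}
    for i := range p.shards {
        p.shards[i].events = make(map[string][][]byte)
    }
    return p
}

func (p *Partitioner) shardFor(eventName string) *shard {
    h := fnv.New32a()
    h.Write([]byte(eventName))
    return &p.shards[h.Sum32()%shardCount]
}

// Add buffers an event under its event name.
func (p *Partitioner) Add(eventName string, payload []byte) {
    s := p.shardFor(eventName)
    s.mu.Lock()
    s.events[eventName] = append(s.events[eventName], payload)
    s.mu.Unlock()
}

// Drain returns everything buffered so far and resets the shards,
// ready for the batch to be encoded (e.g. to ORC) and uploaded.
func (p *Partitioner) Drain() map[string][][]byte {
    out := make(map[string][][]byte)
    for i := range p.shards {
        s := &p.shards[i]
        s.mu.Lock()
        for name, evs := range s.events {
            out[name] = append(out[name], evs...)
        }
        s.events = make(map[string][][]byte)
        s.mu.Unlock()
    }
    return out
}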

We made the changes, deployed the service to production, and discovered our service was now memory-bound, as we buffered data for one minute. We did thorough benchmarking and profiling of heap allocations to improve memory utilisation. By iteratively reducing inefficiencies and lowering CPU consumption, we made our data transformation more efficient.
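
For reference, Go's built-in profiling tooling for this kind of heap analysis can be wired up roughly as follows (a generic sketch, not necessarily how our service exposes it); heap and allocation profiles can then be pulled with go tool pprof while the service is under load:

package profiling

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

// StartPprof exposes heap, CPU and allocation profiles over HTTP so they
// can be inspected with, for example:
//   go tool pprof http://localhost:6060/debug/pprof/heap
func StartPprof() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}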

Performance

After revamping the system, the elapsed time for a single event to travel from the gateway to our dashboard is about 1 minute. We also fixed the data loss issue. Finally, we significantly reduced our on-call workload by removing Spark Streaming.

Validation

At this point, we had both our old and new pipelines running in parallel. After drastically improving our performance, we needed to confirm we still got the same end results. This was done by running a query against each of the pipelines and comparing the results. Both systems were registered to the same Presto cluster.

We ran the SQL except query below between the two pipelines, in both orders. Both directions returned zero rows, confirming both pipelines captured the same events and validating our new pipeline’s correctness.

select count(1) from ((
 select uuid, time from grab_x.realtime_new
 where event = 'app.metric1' and time between 1541734140 and 1541734200
) except (
 select uuid, time from grab_x.realtime_old
 where event = 'app.metric1' and time between 1541734140 and 1541734200
))

/* output: 0 */

Conclusions

Scaling a data ingestion system to handle hundreds of thousands of events per second was a non-trivial task. However, by iterating and constantly simplifying our overall architecture, we were able to efficiently ingest the data and drive down its lag to around one minute.

Spark Streaming was a great tool and gave us time to understand the problem. But, understanding what we actually needed to build and iteratively optimise the entire data pipeline led us to:

  • Replacing Spark Streaming with our new Golang-implemented pipeline.
  • Removing Avro encoding.
  • Removing an intermediary S3 step.

Differences between the old and new pipelines are:

                Old Pipeline             New Pipeline
Languages       Python, Go               Go
Stages          4 services               3 services
Conversions     Protobuf → Avro → ORC    Protobuf → ORC
Lag             4-13 min                 1 min

Systems usually become more and more complex over time, leading to tech debt and decreased performance. In our case, starting with more steps in the data pipeline was actually the simple solution, since we could re-use existing tools. But as we reduced processing stages, we’ve also seen fewer failures. By simplifying the problem, we improved performance and decreased operational complexity. At the end of the day, our data pipeline solves exactly our problem and does nothing else, keeping things fast.

Highlights from Git 2.21

Post Syndicated from Taylor Blau original https://github.blog/2019-02-24-highlights-from-git-2-21/

The open source Git project just released Git 2.21 with features and bug fixes from over 60 contributors. We last caught up with you on the latest Git releases when 2.19 was released. Here’s a look at some of the most interesting features and changes introduced since then.

Human-readable dates with --date=human

As part of its output, git log displays the date each commit was authored. Unless you select an alternate format, timestamps display in Git’s “default” format (for example, “Tue Feb 12 09:00:33 2019 -0800”).

That’s very precise, but a lot of those details are things that you might already know, or don’t care about. For example, if the commit happened today, you already know what year it is. Likewise, if a commit happened seven years ago, you don’t care which second it was authored. So what do you do? You could use --date=relative, which would give you output like 6 days ago, but often, you want to relate “six days ago” to a specific event, like “my meeting last Wednesday.” So, was six days ago Tuesday, or was it Wednesday that you were interested in?

Git 2.21 introduces a new date format that tells you exactly when something occurred with just the right amount of detail: --date=human. Here’s how git log looks with the new format:

git log --date=human example

That’s both more accurate than --date=relative and easier to consume than the full weight of --date=default.

But what about when you’re scripting? Here, you might want to frequently switch between the human and machine-readable formats while putting together a pipeline. Git 2.21 has an option suitable for this setting, too: --date=auto:human. When printing output to a pager, if Git is given this option it will act as if it had been passed --date=human. When otherwise printing output to a non-pager, Git will act as if no format had been given at all. If human isn’t quite your speed, you can combine auto with any other format of your choosing, like --date=auto:relative.

git log --date=auto:human example[source]

Detecting case-insensitive path collisions

One commonly asked Git question is, “After cloning a repository, why does git status report some of the files as modified?” Quite often, the answer is that the repository contains a tree which cannot be represented on your file system. For instance, if it contains both file as well as FILE and your file system is case-insensitive, Git can only checkout one of those files. Worse, Git doesn’t actually detect this case during the clone; it simply writes out each path, unaware that the file system considers them to be the same file. You only find out something has gone wrong when you see a mystery modification.

The exact rules for when this occurs will vary from system to system. In addition to “folding” what we normally consider upper and lowercase characters in English, you may also see this from language-specific conversions, non-printing characters, or Unicode normalization.

In Git 2.20, git clone now detects and reports colliding groups during the initial checkout, which should remove some of the confusion. Unfortunately, Git can’t actually fix the problem for you. What the original committer put in the repository can’t be checked out as-is on your file system. So if you’re thinking about putting files into a multi-platform project that differ only in case, the best advice is still: don’t.

[source]

Performance improvements and other bits

Behind the scenes, a lot has changed over the last couple of Git releases, too. We’re dedicating this section to overview a few of these changes. Not all of them will impact your Git usage day-to-day, but some will, and all of the changes are especially important for server administrators.

Multi-pack indexes

Git stores objects (e.g., representations of the files, directories, and more that make up your Git repository) in both the “loose” and “packed” formats. A “loose” object is a compressed encoding of an object, stored in a file. A “packed” object is stored in a packfile, which is a collection of objects, written in terms of deltas of one another.

Because it can be costly to rewrite these packs every time a new object is added to the repository, repositories tend to accumulate many loose objects or individual packs over time. Eventually, these are reconciled during a “repack” operation. However, this reconciliation is not possible for larger repositories, like the Windows repository.

Instead of repacking, Git can now create a multi-pack index file, which is a listing of objects residing in multiple packs, removing the need to perform expensive repacks (in many cases).

[source]

Delta islands

An important optimization for Git servers is that the format for transmitted objects is the same as the heavily-compressed on-disk packfiles. That means that in many cases, Git can serve repositories to clients by simply copying bytes off disk without having to inflate individual objects.

But sometimes this assumption breaks down. Objects on disk may be stored as “deltas” against one another. When two versions of a file have similar content, we might store the full contents of one version (the “base”), but only the differences against the base for the other version. This creates a complication when serving a fetch. If object A is stored as a delta against object B, we can only send the client our on-disk version of A if we are also sending them B (or if we know they already have B). Otherwise, we have to reconstruct the full contents of A and re-compress it.

This happens rarely in many repositories where clients clone all of the objects stored by the server. But it can be quite common when multiple distinct but overlapping sets of objects are stored in the same packfile (for example, due to repository forks or unmerged pull requests). Git may store a delta between objects found only in two different forks. When someone clones one of the forks, they want only one of the objects, and we have to discard the delta.

Git 2.20 solves this by introducing the concept of “delta islands”. Repository administrators can partition the ref namespace into distinct “islands”, and Git will avoid making deltas between islands. The end result is a repository which is slightly larger on disk but is still able to serve client fetches much more cheaply.

[source 1, source 2]

Delta reuse with bitmaps

We already discussed the importance of reusing on-disk deltas when serving fetches, but how do we know when the other side has the base object they need to use the delta we send them? If we’re sending them the base, too, then the answer is easy. But if we’re not, how do we know if they have it?

That answer is deceptively simple: the client will have already told us which commits it has (so that we don’t bother sending them again). If they claim to have a commit which contains the base object, then we can re-use the delta. But there’s one hitch: we not only need to know about the commit they mentioned, but also the entire object graph. The base may have been part of a commit hundreds or thousands of commits deep in the history of the project.

Git doesn’t traverse the entire object graph to check for possible bases because it’s too expensive to do so. For instance, walking the entire graph of a Linux kernel takes roughly 30 seconds.

Fortunately, there’s already a solution within Git: reachability bitmaps. Git has an optional on-disk data structure to record the sets of objects “reachable” from each commit. When this data is available, we can query it to quickly determine whether the client has a base object. This results in the server generating smaller packs that are produced more quickly for an overall faster fetch experience.

[source]

Custom alternates reference advertisement

Repository alternates are a tool that server administrators have at their disposal to reduce redundant information. When two repositories are known to share objects (like a fork and its parent), the fork can list the parent as an “alternate”, and any objects the fork doesn’t have itself, it can look for in its parent. This is helpful since we can avoid storing twice the vast number of objects shared between the fork and parent.

Likewise, a repository with alternates advertises “tips” it has when receiving a push. In other words, before writing from your computer to a remote, that remote will tell you what the tips of its branches are, so you can determine information that is already known by the remote, and therefore use less bandwidth. When a repository has alternates, the tips advertisement is the union of all local and alternate branch tips.

But what happens when computing the tips of an alternate is more expensive than a client sending redundant data? It makes the push so slow that we have disabled this feature for years at GitHub. In Git 2.20, repositories can hook into the way that they enumerate alternate tips, and make the corresponding transaction much faster.

[source]

Tidbits

Now that we’ve highlighted a handful of the changes in the past two releases, we want to share a summary of a few other interesting changes. As always, you can learn more by clicking the “source” link, or reading the documentation or release notes.

  • Have you ever tried to run git cherry-pick on a merge commit only to have it fail? You might have found that the fix involves passing -m1 and moved on. In fact, -m1 says to select the first parent as the mainline, and it replays the relevant commits. Prior to Git 2.21, passing this option on a non-merge commit caused an error, but now it transparently does what you meant. [source]

  • Veteran Git users from our last post might recall that git branch -l establishes a reflog for a newly created branch, instead of listing all branches. Now, instead of doing something you almost certainly didn’t mean, git branch -l will list all of your repository’s branches, keeping in line with other commands that accept -l. [source]

  • If you’ve ever been stuck or forgotten what a certain command or flag does, you might have run git --help (or git -h) to learn more. In Git 2.21, this invocation now follows aliases, and shows the aliased command’s helptext. [source]

  • In repositories with large on-disk checkouts, git status can take a long time to complete. In order to indicate that it’s making progress, the status command now displays a progress bar. [source]

  • Many parts of Git have historically been implemented as shell scripts, calling into tools written in C to do the heavy lifting. While this allowed rapid prototyping, the resulting tools could often be slow due to the overhead of running many separate programs. There are continuing efforts to move these scripts into C, affecting git submodule, git bisect, and git rebase. You may notice rebase in particular being much faster, due to the hard work of the Summer of Code students, Pratik Karki and Alban Gruin.

  • The -G option tells git log to only show commits whose diffs match a particular pattern. But until Git 2.21, it was searching binary files as if they were text, which made things slower and often produced confusing results. [source]

That’s all for now

We went through a few of the changes that have happened over the last couple of versions, but there’s a lot more to discover. Read the release notes for 2.21, or review the release notes for previous versions in the Git repository.


Five years of the GitHub Bug Bounty program

Post Syndicated from philipturnbull original https://github.blog/2019-02-19-five-years-of-the-github-bug-bounty-program/

GitHub launched our Security Bug Bounty program in 2014, allowing us to reward independent security researchers for their help in keeping GitHub users secure. Over the past five years, we have been continuously impressed by the hard work and ingenuity of our researchers. Last year was no different and we were glad to pay out $165,000 to researchers from our public bug bounty program in 2018.

We’ve previously talked about our other initiatives to engage with researchers. In 2018, our researcher grants, private bug bounty programs, and a live-hacking event allowed us to reach even more independent security talent. These different ways of working with the community helped GitHub reach a huge milestone in 2018: $250,000 paid out to researchers in a single year.

We’re happy to share some of our highlights from the past year and introduce some big changes for the coming year: full legal protection for researchers, more GitHub properties eligible for rewards, and increased reward amounts.

2018 Highlights

GraphQL and API authorization researcher grant

Since the launch of our researcher grants program in 2017, we’ve been on the lookout for bug bounty researchers who show a specialty in particular features of our products. In mid-2018, @kamilhism submitted a series of vulnerabilities to the public bounty program showing his expertise in the authorization logic of our REST and GraphQL APIs. To support his future research, we provided Kamil with a fixed grant payment to perform a systematic audit of our API authorization logic. Kamil’s audit was excellent, uncovering and allowing us to fix an additional seven authorization flaws in our API.

H1-702

In August, GitHub took part in HackerOne’s H1-702 live-hacking event in Las Vegas. This brought together over 75 of the top researchers from HackerOne to focus on GitHub’s products for one evening of live-hacking. The event didn’t disappoint—GitHub’s security improved and nearly $75,000 was paid out for 43 vulnerabilities. This included one critical-severity vulnerability in GitHub Enterprise Server. We also met with our researchers in-person and received great feedback on how we could improve our bug bounty program.

GitHub Actions private bug bounty

In October, GitHub launched a limited public beta of GitHub Actions. As part of the limited beta, we also ran a private bug bounty program to complement our extensive internal security assessments. We sent out over 150 invitations to researchers from last year’s private program, all H1-702 participants, and invited a number of the best researchers that have worked with our public program. The private bounty program allowed us to uncover a number of vulnerabilities in GitHub Actions.

We also held an office-hours event so that the GitHub security team and researchers could meet. We took the opportunity to meet face-to-face with other researchers because it’s a great way to build a community and learn from each other. Two of our researchers, @not-an-aardvark and @ngaloggc, gave an overview of their submissions and shared details of how they approached the target with everyone.

Workflow improvements

We’ve been making refinements to our internal bug bounty workflow since we last announced it back in 2017.  Our ChatOps-based tools have continued to evolve over the past year as we find more ways to streamline the process. These aren’t just technical changes—each day we’ve had individual on-call first responders who were responsible for handling incoming bounty submissions. We’ve also added a weekly status meeting to review current submissions with all members of the Application Security team. These meetings allow the team to ensure that submissions are not stalled, work is correctly prioritized by engineering teams based on severity, and researchers are getting timely updates on their submissions.

A key success metric for our program is how much time it takes to validate a submission and triage that information to the relevant engineering team so remediation work can begin. Our workflow improvements have paid off, and we’ve significantly reduced the average time to triage from four days in 2017 down to 19 hours. Likewise, we’ve reduced our average time to resolution from 16 days to six days. Keep in mind: for us to consider a submission resolved, the issue has to either be fixed or properly prioritized and tracked by the responsible engineering team.

We’ve continued to reach our target of replying to researchers in less than 24 hours on average. Most importantly for our researchers, we’ve also dropped our average time for rewarding a submission from 17 days in 2017 down to 11 days. We’re grateful for the effort that researchers invest in our program and we aim to reduce these times further over the next year.

2019 initiatives

Although our program has been running successfully for the past five years, we know that we can always improve. We’ve taken feedback from our researchers and are happy to announce three major changes to our program for 2019:

Legal safe harbor

Keeping bounty program participants safe from the legal risks of security research is a high priority for GitHub. To make sure researchers are as safe as possible, we’ve added a robust set of Legal Safe Harbor terms to our site policy. Our new policies are based on CC0-licensed templates by GitHub’s Associate Corporate Counsel, @F-Jennings. These templates are a fork of EdOverflow’s Legal Bug Bounty repo, with extensive modifications based on broad discussions with security researchers and Amit Elazari’s general research in this field. The templates are also inspired by other best-practice safe harbor examples including Bugcrowd’s disclose.io project and Dropbox’s updated vulnerability disclosure policy.

Our new Legal Safe Harbor terms cover three main sources of legal risk:

  • Your research activity remains protected and authorized even if you accidentally overstep our bounty program’s scope. Our safe harbor now includes a firm commitment not to pursue civil or criminal legal action, or support any prosecution or civil action by others, for participants’ bounty program research activities. You remain protected even for good faith violations of the bounty policy.
  • We will do our best to protect you against legal risk from third parties who won’t commit to the same level of safe harbor protections. Our safe harbor terms now limit report-sharing with third parties in two ways. We will share only non-identifying information with third parties, and only after notifying you and getting that third party’s written commitment not to pursue legal action against you. Unless we get your written permission, we will not share identifying information with a third party.
  • You won’t be violating our site terms if your activity is specifically for bounty research. For example, if your in-scope research includes reverse engineering, you can safely disregard the GitHub Enterprise Agreement’s restrictions on reverse engineering. Our safe harbor now provides a limited waiver for relevant parts of our site terms and policies. This protects against legal risk from DMCA anti-circumvention rules or similar contract terms that could otherwise prohibit necessary research tasks like reverse engineering or deobfuscating code.

Other organizations can look to these terms as an industry standard for safe harbor best practices—and we encourage others to freely adopt, use, and modify them to fit their own bounty programs. In creating these terms, we aim to go beyond the current standards for safe harbor programs and provide researchers with the best protection from criminal, civil, and third-party legal risks. The terms have been reviewed by expert security researchers, and are the product of many months of legal research and review of other legal safe harbor programs. Special thanks to MG, Mugwumpjones, and several other researchers for providing input on early drafts of @F-Jennings’ templates.

Expanded scope

Over the past five years, we’ve been steadily expanding the list of GitHub products and services that are eligible for reward. We’re excited to share that we are now increasing our bounty scope to reward vulnerabilities in all first-party services hosted under our github.com domain. This includes GitHub Education, GitHub Learning Lab, GitHub Jobs, and our GitHub Desktop application. GitHub Enterprise Server has been in scope since 2016; to further increase the security of our enterprise customers, we are now expanding the scope to include GitHub Enterprise Cloud.

It’s not just about our user-facing systems. The security of our users’ data also depends on the security of our employees and our internal systems. That’s why we’re also including all first-party services under our employee-facing githubapp.com and github.net domains.

Increased rewards

We regularly assess our reward amounts against our industry peers. We also recognize that finding higher-severity vulnerabilities in GitHub’s products is becoming increasingly difficult for researchers and they should be rewarded for their efforts. That’s why we’ve increased our reward amounts at all levels:

  • Critical: $20,000–$30,000+
  • High: $10,000–$20,000
  • Medium: $4,000–$10,000
  • Low: $617–$2,000

Our broad ranges have served us well, but we’ve been consistently impressed by the ingenuity of researchers. To recognize that, we no longer have a maximum reward amount for critical vulnerabilities. Although we’ve listed $30,000 as a guideline amount for critical vulnerabilities, we’re reserving the right to reward significantly more for truly cutting-edge research.

Get involved

The bounty program remains a core part of GitHub’s security process, and we’re learning a lot from our researchers. With our new initiatives, now is the perfect time to get involved. Details about our safe harbor, expanded scope, and increased rewards are available on the GitHub Bug Bounty site.

Working with the community has been a great experience—we’re looking forward to triaging your submissions in the future!

The post Five years of the GitHub Bug Bounty program appeared first on The GitHub Blog.

An open source parser for GitHub Actions

Post Syndicated from Patrick Reynolds original https://github.blog/2019-02-07-an-open-source-parser-for-github-actions/

Since the beta release of GitHub Actions last October, thousands of users have added workflow files to their repositories. But until now, those files have only worked with the tools GitHub provides: the Actions editor, the Actions execution platform, and the syntax highlighting built into pull requests. To expand that universe, we need to release the parser and the specification for the Actions workflow language as open source. Today, we’re doing that.

We believe that tools beyond GitHub should be able to run workflows. We believe there should be programs to check, format, compose, and visualize workflow files. We believe that text editors can provide syntax highlighting and autocompletion for Actions workflows. And we believe all that can only happen if the Actions community is empowered to build these tools along with us. That can happen better and faster if there is a single language specification and a free parser implementation.

The first project to use the open source parser will be act, which is @nektos’s tool for running Actions workflows in a local development environment.

The parser and language specification are both in actions/workflow-parser, which we’re sharing under an MIT license. As of today, there is a Go implementation, which is the same code that powers both the Actions UI and the Actions execution platform. The repository also contains a JavaScript parser in development, along with syntax-highlighting configurations for Atom and Vim.

GitHub Actions

GitHub Actions is a platform for automating software development workflows, from idea to production. Developers add a simple text file to their repository, .github/main.workflow, to describe automation. The workflow file describes how events like pushing code or opening and closing issues map to automation actions, implemented in any Docker container. Those automation actions have whatever powers you grant them: pushing commits to the repository, cutting a new release, building it through continuous integration, deploying it to staging in the cloud, testing the deployment, flipping it to production, and announcing it to the world — and any others you can build.

Every workflow begins with an event and runs through a set of actions to reach some target or goal. Those events and actions are described in a main.workflow file, which you can create and edit with the visual editor or any text editor you like. Here is a simple example:

workflow "when I push" {
  on = "push"
  resolves = "ci"
}

action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}

Whenever I push to a branch that contains that file, the when I push workflow executes. It resolves the target action ci, which runs ./script/cibuild in a golang:latest Docker container. I can add more workflow blocks to harness more events, and I can add more action blocks to run after the ci action or in parallel with it.
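
As a hedged sketch of such an extension (the lint and deploy script paths below are hypothetical and not part of the original example), actions without a needs attribute can run in parallel, while an action with needs runs only after the listed actions have completed:

workflow "when I push" {
  on = "push"
  resolves = ["deploy"]
}

# "ci" and "lint" declare no needs, so they can run in parallel.
action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}

action "lint" {
  uses = "docker://golang:latest"
  runs = "./script/lint"        # hypothetical lint script
}

# "deploy" runs only after both "ci" and "lint" succeed.
action "deploy" {
  uses = "docker://golang:latest"
  runs = "./script/deploy"      # hypothetical deploy script
  needs = ["ci", "lint"]
}

Resolving "deploy" pulls in its needs dependencies transitively, so a single resolves target can fan out into a larger graph of parallel and sequential actions.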

The Actions workflow language

All main.workflow files are written in the Actions workflow language, which is a subset of HashiCorp’s HCL. In fact, our parser builds on top of the open source hashicorp/hcl parser.

All Actions workflow files are valid HCL, but not all HCL files are valid workflows. The Actions workflow parser is stricter, allowing only a specific set of keywords and prohibiting nested objects, among other restrictions. The reason for that is a long-standing goal of Actions: making the Actions editor and the text representation of workflows equivalent and interchangeable. Any file you write in the graphical editor can be expressed in a main.workflow file, of course, but also: any main.workflow can be fully displayed and edited in the graphical editor. There is one exception to this: the graphical editor does not display comments. But it preserves them: changes you make in the graphical editor do not disturb comments you have added to your main.workflow file.
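
To make the “valid HCL, but stricter” relationship concrete, here is a minimal Go sketch that decodes the example workflow with the upstream hashicorp/hcl (v1) library directly. It is an illustration, not the actions/workflow-parser API; the real parser layers the keyword, type, and nesting restrictions described above on top of this kind of HCL decoding.

package main

// A minimal sketch, not the actions/workflow-parser API: it decodes a
// main.workflow with the upstream hashicorp/hcl (v1) library to show that
// workflow files are plain HCL. The real parser adds the stricter keyword
// and nesting checks described above.

import (
	"fmt"
	"log"

	"github.com/hashicorp/hcl"
)

const workflowSrc = `
workflow "when I push" {
  on = "push"
  resolves = "ci"
}

action "ci" {
  uses = "docker://golang:latest"
  runs = "./script/cibuild"
}
`

func main() {
	// Generic decode into a map; a strict workflow parser would instead
	// validate allowed keywords, value types, and nesting while decoding.
	var out map[string]interface{}
	if err := hcl.Decode(&out, workflowSrc); err != nil {
		log.Fatalf("not valid HCL: %v", err)
	}

	// The top-level keys are the block types found in the file.
	for blockType := range out {
		fmt.Println("found top-level block type:", blockType) // "workflow", "action"
	}
}

Tool authors who want the strict behavior should depend on actions/workflow-parser itself; decoding raw HCL as above only confirms that a file is syntactically valid.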

Contributing

We have two reasons for releasing the parser. First, we want to encourage the Actions community to build tools that generate and manipulate workflow files. The language specification should help developers understand the language, and the parser should save developers the trouble of writing their own.

Second, we welcome your contributions. If you find bugs, please open an issue or send a pull request. If you want to add a syntax highlighter for a new editor or implement the parser in another language, we welcome that. The only real limitation is on features that go beyond the parser to the rest of Actions. For broader questions and suggestions about the rest of Actions, reach out through support; we’re listening.

The post An open source parser for GitHub Actions appeared first on The GitHub Blog.