Tag Archives: Engineering

Preventing App Performance Degradation Due to Sudden Ride Demand Spikes

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-app-performance-degradation-due-to-sudden-ride-demand-spikes

In Southeast Asia, when it rains, it pours. It’s a major mood dampener, especially if you are stuck outside when the rain starts: you are about to have an awful day.

In the early days of Grab, if the rains came at the wrong time, like during morning rush hour, then we engineers were also in for a terrible day.

In those days, demand for Grab’s ride services grew much faster than our ability to scale our tech system, and this often meant clocking late nights just to ensure our system could handle the ever-growing demand. When there was a massive, sudden spike in ride bookings, our system often struggled to manage the load.

There were also other contributors to demand spikes, for example when public transport services broke down, or when a major event such as an international concert ended and event-goers all needed a ride at the same time.

Upon reflection, we realized there were two integral aspects to these incidents.

Firstly, they were localized events. The increase in demand came from a particular geographical location; in some cases a very small area. These localized events had the potential to cause so much load on our system that it impacted the experience of other users outside the geolocation.

Secondly, the underlying problem was a lack of drivers (supply) in that particular geographical area.

At Grab, our goal has always been to get everyone a ride when and where they needed it, but in this situation, it was just not possible. We needed to find a way to ensure this localized demand spike did not affect our ability to meet the needs of other users.

Enter the Spampede Filter

The Spampede (a play on the words spam and stampede) filter was inspired by another concept you may have read about on this blog: circuit breakers.

In software, as in electronics, circuit breakers are designed to protect a system by short-circuiting in the face of adverse conditions.

Let’s break this down.

There are two key concepts here: short-circuiting and adverse conditions.

Firstly, short-circuiting: in this context, it means performing minimal processing on a particular booking and, by doing so, reducing the overall load on the system. Secondly, adverse conditions: here, we refer to a large number of unfulfilled requests for a particular service, from a small geographical area, within a short time window. With these two concepts in mind, we devised the following process.

Spampede Design

First, we needed to track unallocated requests in a location-aware manner. To do this, we converted the requested pickup location of an unallocated request using the Geohash Integer algorithm.

After the conversion, the resulting value is an exact location. We can convert this location into a “bucket” or area by reducing the precision.

This method is by no means smart or aware of the local geography, but it is incredibly CPU efficient and requires no external resources like network API calls.
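
As an illustration of this bucketing, here is a minimal Go sketch. The geohash package and the precision value are assumptions made for the example; they are not necessarily what Grab uses in production.

// A minimal sketch of the location bucketing described above, for illustration only.
package spampede

import "github.com/mmcloughlin/geohash"

// Fewer bits means a larger "bucket" area; the value below is an illustrative choice.
const bucketBits = 36

// locationBucket reduces a pickup location to a coarse area identifier.
func locationBucket(lat, lng float64) uint64 {
    full := geohash.EncodeInt(lat, lng) // full-precision 64-bit geohash integer
    return full >> (64 - bucketBits)    // keep only the most significant bits
}

Right-shifting the full-precision geohash integer is what makes the bucketing so cheap: it is a single CPU operation, with no lookups or network calls.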

Now that we can track unallocated requests, we needed a way for the tracking to be time-aware. After all, traffic conditions, driver locations, and passenger demand are continually changing. We could have implemented something precise like a sliding window sum, but that would have introduced a lot of complexity and a significantly higher CPU and memory cost.

Using the Unix timestamp, we converted the current time to a “bucket” of time with the straightforward formula:

time_bucket = floor(unix_timestamp / bs)

where bs is the size of the time buckets in seconds

With location and time buckets calculated, we can track the unallocated bookings using Redis. We could have used any data store, but Redis was familiar and battle-proven to us.

To do this, we first constructed the Redis key by combining the service type, the geographic location bucket, and the time bucket. With this key, we called the INCR command, which increments the value stored at that key and returns the new value.

If the value returned is 1, this indicates that this is the first value stored for this bucket combination, and we would then make a second call, this time to EXPIRE. With this second call, we would set a time to live (TTL) on the Redis item, allowing the data to be self-cleaning.

You will notice that we are blindly calling increment and only making a second call if needed. This pattern is more efficient and resource-friendly than using a more traditional, load-check-store pattern.
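
To make the increment-then-expire pattern concrete, here is a minimal sketch using the go-redis client; the key layout and the TTL are assumptions for illustration, not Grab’s production configuration.

// A sketch of the blind increment-then-expire pattern using go-redis.
package spampede

import (
    "context"
    "fmt"
    "time"

    "github.com/redis/go-redis/v9"
)

func trackUnallocated(ctx context.Context, rdb *redis.Client, serviceType string, locBucket, timeBucket uint64) error {
    // The key combines the service type, the location bucket, and the time bucket.
    key := fmt.Sprintf("spampede:%s:%d:%d", serviceType, locBucket, timeBucket)

    // Blindly increment; INCR creates the key with value 1 if it does not exist.
    count, err := rdb.Incr(ctx, key).Result()
    if err != nil {
        return err
    }

    // Only the very first write for this bucket needs a TTL, making the data self-cleaning.
    if count == 1 {
        return rdb.Expire(ctx, key, 10*time.Minute).Err() // TTL chosen for illustration
    }
    return nil
}

Because INCR creates the key if it does not exist, the hot path stays at a single Redis round trip; the extra EXPIRE call happens only once per bucket.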

The next step was the configuration: specifically, setting how many unallocated bookings could happen in a particular location and time bucket before the circuit opened. For this, we decided on Redis again; we could have used anything, but we were already using Redis and, as mentioned previously, were quite familiar with it.

Finally, the last piece. We introduced code at the beginning of our booking processing, most importantly before any calls to other services and before any significant processing was done. This code compared the location, time, and requested service against the currently configured Spampede settings and the count of previously unallocated bookings. If the maximum had already been reached, we immediately stopped processing the booking.
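
Here is a sketch of that short-circuit check at the start of booking processing; the fail-open behaviour and the threshold argument are simplifications made for the example rather than a description of Grab’s actual implementation.

// A sketch of the Spampede gate evaluated before any expensive booking processing.
package spampede

import (
    "context"
    "fmt"

    "github.com/redis/go-redis/v9"
)

// circuitOpen reports whether the filter should reject this booking
// before any further processing or downstream calls are made.
func circuitOpen(ctx context.Context, rdb *redis.Client, serviceType string, locBucket, timeBucket uint64, maxUnallocated int64) bool {
    key := fmt.Sprintf("spampede:%s:%d:%d", serviceType, locBucket, timeBucket)

    count, err := rdb.Get(ctx, key).Int64()
    if err == redis.Nil {
        return false // no unallocated bookings recorded for this bucket yet
    }
    if err != nil {
        return false // fail open: never block bookings because the check itself failed
    }
    return count >= maxUnallocated
}

Failing open when the check itself errors is a deliberate choice in this sketch: a safety mechanism should not become a new reason to reject bookings.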

This might sound harsh: to immediately refuse a booking request without even trying to fulfill it. But the goal of the Spampede filter is to prevent excessive, localized demand from impacting all of the users of the system.

Conclusion

Reading about this as a programmer, it probably feels strange to intentionally drop bookings and impact the business this way.

After all, we want nothing more than to help people get to where they need to be. This process is a system safety mechanism to ensure that the system stays alive and able to do just that.

I would be remiss if I didn’t highlight that the critical software-engineering takeaway here is a combination of the observer effect and the underlying goals of the CAP theorem. Observing a system will influence that system, due to the cost of instrumentation and monitoring.

Generally, the higher the accuracy or consistency of the monitoring and limits, the higher the resource cost.

In this case, we have intentionally chosen the most resource-efficient options and traded accuracy for more throughput.

Plumbing At Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/plumbing-at-scale

When you open the Grab app and hit book, a series of events are generated that define your personalised experience with us: booking state machines kick into motion, driver partners are notified, reward points are computed, your feed is generated, etc. While it is important for you to know that a request has been received, a lot happens asynchronously in our back-end services.

As custodians and builders of the streaming platform at Grab operating at massive scale (think terabytes of data ingress each hour), the Coban team’s mission is to provide a NoOps, managed platform for seamless, secure access to event streams in real-time, for every team at Grab.

Coban Sewu Waterfall In Indonesia
Coban Sewu Waterfall In Indonesia. (Streams, get it?)

Streaming systems are often at the heart of event-driven architectures, and what starts as a need for a simple message bus for asynchronous processing of events quickly evolves into one that requires more sophisticated stream processing paradigms.
Earlier this year, we saw common patterns of event processing emerge across our Go backend ecosystem, including:

  • Filtering and mapping stream events of one type to another
  • Aggregating events into time windows and materializing them back to the event log or to various types of transactional and analytics databases

Generally, a class of problems surfaced which could be elegantly solved through an event sourcing[1] platform with a stream processing framework built over it, similar to the Keystone platform at Netflix[2].

This article details our journey building and deploying an event sourcing platform in Go, building a stream processing framework over it, and then scaling it (reliably and efficiently) to service over 300 billion events a week.

Event Sourcing

Event sourcing is an architectural pattern where changes to an application state are stored as a sequence of events, which can be replayed, recomputed, and queried for state at any time. An implementation of the event sourcing pattern typically has three parts to it:

  • An event log
  • Processor selection logic: The logic that selects which chunk of domain logic to run based on an incoming event
  • Processor domain logic: The domain logic that mutates an application’s state
Event Sourcing
Event Sourcing

Event sourcing is a building block on which architectural patterns such as Command Query Responsibility Segregation[3], serverless systems, and stream processing pipelines are built.
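
As a minimal, self-contained illustration of the three parts listed above, here is a sketch in Go; the event types and the state are hypothetical examples rather than Grab’s actual domain model.

// A minimal event sourcing sketch: an event log, processor selection, and domain logic.
package eventsourcing

// Event is an immutable record appended to the event log.
type Event struct {
    Type    string // e.g. "BookingCreated", "BookingCompleted"
    Payload string
}

// BookingState is the application state rebuilt by replaying events.
type BookingState struct {
    Created   int
    Completed int
}

// apply is the processor selection logic: it picks the chunk of domain logic
// to run based on the incoming event, then mutates the state.
func (s *BookingState) apply(e Event) {
    switch e.Type {
    case "BookingCreated":
        s.Created++
    case "BookingCompleted":
        s.Completed++
    }
}

// Replay recomputes state at any point in time by folding over the event log.
func Replay(log []Event) BookingState {
    var s BookingState
    for _, e := range log {
        s.apply(e)
    }
    return s
}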

The Case For Stream Processing

Below are some use cases serviced by stream processing, built on event sourcing.

Asynchronous State Management

A pub-sub system allows for change events from one service to be fanned out to multiple interested subscribers without letting any one subscriber block the progress of others. Abstracting the event log and centralising it democratises access to this log to all back-end services. It enables the back-end services to apply changes from this centralised log to their own state, independent of downstream services, and/or publish their state changes to it.

Time Windowed Aggregations

Time-windowed aggregates are a common requirement for machine learning models (as features) as well as analytics. For example, personalising the Grab app landing page requires counting your interaction with various widget elements in recent history, not any one event in particular. Similarly, an analyst may not be interested in the details of a singular booking in real-time, but in building demand heatmaps segmented by geohashes. For latency-sensitive lookups, especially for the personalisation example, pre-aggregations are preferred instead of post-aggregations.
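
As a rough illustration of pre-aggregation for the heatmap example, the following sketch counts bookings per geohash and per one-minute tumbling window; the key shape is illustrative rather than a production schema.

// A small sketch of pre-aggregating demand counts by geohash and time window.
package aggregation

import "time"

type BookingEvent struct {
    Geohash   string
    CreatedAt time.Time
}

type windowKey struct {
    Geohash string
    Window  int64 // start of a 1-minute tumbling window (unix seconds)
}

// DemandCounts pre-aggregates bookings so that lookups at serving time are O(1),
// instead of aggregating raw events on every query (post-aggregation).
func DemandCounts(events []BookingEvent) map[windowKey]int {
    counts := make(map[windowKey]int)
    for _, e := range events {
        k := windowKey{
            Geohash: e.Geohash,
            Window:  e.CreatedAt.Unix() / 60 * 60, // truncate to the minute
        }
        counts[k]++
    }
    return counts
}

Serving a heatmap then becomes a cheap map lookup instead of an aggregation over raw events at query time.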

Stream Joins, Filtering, Mapping

Event logs are typically sharded by some notion of topics to logically divide events of interest around a theme (booking events, profile updates, etc.). Building bigger topics out of smaller ones, and smaller ones out of bigger ones, are common ways to compose “substreams” of the log of interest directed towards specific services. For example, a promo service may only be interested in listening to booking events for promotional bookings.

Realtime Business Intelligence

Outputs of stream processing workloads are also plugged into realtime Business Intelligence (BI) and stream analytics solutions, as raw data for visualizations on operations dashboards.

Archival

For offline analytics, as well as reconciliation and disaster recovery, having an archive in a cold store helps for certain mission critical streams.

Platform Requirements

Any processing platform for event sourcing and stream processing has certain expectations around its functionality.

Scaling and Elasticity

Stream/Event Processing pipelines need to be elastic and responsive to changes in traffic patterns, especially considering that user activity (rides, food, deliveries, payments) varies dramatically during the course of a day or week. A spike in food orders on rainy days shouldn’t cause indefinite order processing latencies.

NoOps

For a platform team, it’s important that users can easily onboard and manage their pipeline lifecycles, at their preferred cadence. To scale effectively, the process of scaffolding, configuring, and deploying pipelines needs to be standardised, and infrastructure managed. Both the platform and users are able to leverage common standards of telemetry, configuration, and deployment strategies, and users benefit from a lack of infrastructure management overhead.

Multi-Tenancy

Our platform has quickly scaled to support hundreds of pipelines. Workload isolation, independent processing uptime guarantees, and resource allocation and cost audit are important requirements necessitating multi-tenancy, which helps amortize platform overhead costs.

Resiliency

Whether latency sensitive or latency tolerant, all workloads have certain expectations on processing uptime. From a user’s perspective, there must be guarantees on pipeline uptimes and data completeness, upper bounds on processing delays, instrumentation for alerting, and self-healing properties of the platform for remediation.

Tunable Tradeoffs

Some pipelines are latency sensitive, and rely on processing completeness seconds after event ingress. Other pipelines are latency tolerant, and can tolerate disruption to processing lasting in tens of minutes. A one size fits all solution is likely to be either cost inefficient or unreliable. Having a way for users to make these tradeoffs consciously becomes important for ensuring efficient processing guarantees at a reasonable cost. Similarly, in the case of upstream failures or unavailability, being able to tune failure modes (like wait, continue, or retry) comes in handy.

Stream Processing Framework

While basic event sourcing covers simple use cases like archival, more complicated ones benefit from a common framework that shifts the mental model for processing from per event processing to stream pipeline orchestration.
Given that Go is a “paved road” for back-end development at Grab, and we have service code and bindings for streaming data in a mono-repository, we built a Go framework with a subset of capabilities provided by other streaming frameworks like Flink[4].

Logic Blocks In A Stream Processing Pipeline
Logic Blocks In A Stream Processing Pipeline

Capabilities

Some capabilities built into the framework include:

  • Deduplication: Enables pipelines to idempotently reprocess data in case of rewinds/replays, and provides some processing guarantees within a time window for certain use cases including sinking to datastores.
  • Filtering and Mapping: An ability to filter a source stream data and map them onto target streams.
  • Aggregation: An ability to generate and execute aggregation logic such as sum, avg, max, and min in a window.
  • Windowing: An ability to window processing into tumbling, sliding, and session windows.
  • Join: An ability to combine two streams together with certain join keys in a window.
  • Processor Chaining: Various functionalities can be chained to build more complicated pipelines from simpler ones. For example: filter a large stream into a smaller one, aggregate it over a time window, and then map it to a new stream.
  • Rewind: The ability to rewind the processing logic by a few hours through configuration.
  • Replay: The ability to replay archived data into the same or a separate pipeline via configuration.
  • Sinks: A number of connectors to standard Grab stores are provided, with concerns of auth, telemetry, etc. managed in the runtime.
  • Error Handling: Providing an easy way to indicate whether to wait, skip, and/or retry in case of upstream failures is an important tuning parameter that users need for making sensible tradeoffs in dimensions of backpressure, latency, correctness, etc.
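
The framework’s own API is not shown here, so purely to illustrate processor chaining (filtering a stream and then aggregating it over a tumbling window), below is a self-contained sketch using plain Go channels; it is not the Coban framework API.

// Processor chaining illustrated with channels: filter a stream, then window-count it.
package main

import (
    "fmt"
    "time"
)

type event struct {
    Service string
    At      time.Time
}

// filter keeps only events for a given service, producing a smaller stream.
func filter(in <-chan event, service string) <-chan event {
    out := make(chan event)
    go func() {
        defer close(out)
        for e := range in {
            if e.Service == service {
                out <- e
            }
        }
    }()
    return out
}

// windowCount aggregates the filtered stream into counts per tumbling window.
func windowCount(in <-chan event, window time.Duration) map[int64]int {
    counts := make(map[int64]int)
    for e := range in {
        bucket := e.At.Truncate(window).Unix()
        counts[bucket]++
    }
    return counts
}

func main() {
    in := make(chan event)
    go func() {
        defer close(in)
        now := time.Now()
        in <- event{Service: "booking", At: now}
        in <- event{Service: "profile", At: now}
        in <- event{Service: "booking", At: now.Add(time.Minute)}
    }()
    fmt.Println(windowCount(filter(in, "booking"), time.Minute))
}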

Architecture

Coban Platform
Coban Platform

Our event log is primarily a bunch of critical Kafka clusters, which are being polled by various pipelines deployed by service teams on the platform for incoming events. Each pipeline is an isolated deployment, has an identity, and the ability to connect to various upstream sinks to materialise results into, including the event log itself.
There is also a metastore available as an intermediate store for processing pipelines, so the pipelines themselves are stateless with their lifecycle completely beholden to the whims of their owners.

Anatomy of a Processing Pipeline

Anatomy Of A Stream Processing Pod
Anatomy Of A Stream Processing Pod

Each stream processing pod (the smallest unit of a pipeline’s deployment) has three top level components:

  • Triggers: An interface that connects directly to the source of the data and converts it into an event channel.
  • Runtime: This is the app’s entrypoint and the orchestrator of the pod. It manages the worker pools, triggers, event channels, and lifecycle events.
  • The Pipeline plugin: The plugin is provided by the user, and conforms to a contract that the platform team publishes. It contains the domain logic for the pipeline and houses the pipeline orchestration defined by a user based on our Stream Processing Framework.
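
The published contract is not reproduced in this post; the following is a hypothetical sketch of what such a contract could look like, intended only to illustrate the split of responsibilities between the runtime and the plugin.

// A hypothetical sketch of a pipeline plugin contract; not the actual published contract.
package pipeline

import "context"

// Event is a single record delivered by a trigger's event channel.
type Event struct {
    Key   string
    Value []byte
}

// Pipeline is the contract a user's plugin implements: the runtime owns triggers,
// worker pools, and lifecycle; the plugin owns the domain logic.
type Pipeline interface {
    // Setup lets the plugin declare its orchestration (sources, processors, sinks).
    Setup(ctx context.Context) error
    // Process is invoked by the runtime's worker pool for each incoming event.
    Process(ctx context.Context, e Event) error
    // Teardown is called on lifecycle events such as shutdown or rescheduling.
    Teardown(ctx context.Context) error
}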

Deployment Infrastructure

Our deployment infrastructure heavily leverages Kubernetes on AWS. After a (pretty high) initial cost for infrastructure set up, we’ve found scaling to hundreds of pipelines a breeze with the Kubernetes provided controls. We package our stateless pipeline workloads into Kubernetes deployments, with each pod containing a unit of a stream pipeline, with sidecars that integrate them with our monitoring systems. Other cluster-wide tooling deployed (usually as DaemonSets) deals with metric collection, log ingestion, and autoscaling. We currently use the Horizontal Pod Autoscaler[5] to manage traffic elasticity, and the Cluster Autoscaler[6] to manage worker node scaling.

Kubernetes
A Typical Kubernetes Set Up On AWS

Metastore

Some pipelines require storage for use cases ranging from deduplication to stores for materialised results of time windowed aggregations. All our pipelines have access to clusters of ScyllaDB instances (which we use as our internal store), made available to pipeline authors via interfaces in the Stream Processing Framework. Results of these aggregations are then made available to backend services via our GrabStats service, which is a thin query layer over the latest pipeline results.

Compute Isolation

A nice property of packaging pipelines as Kubernetes deployments is a good degree of compute workload isolation for pipelines. While node resources of pipeline pods are still shared (and there are potential noisy neighbour issues on matters like logging throughput), the pods of various pipelines can be scheduled and rescheduled across a wide range of nodes safely and swiftly, with minimal impact on the pods of other pipelines.

Redundancy

Stateless processing pods mean we can set up backup or redundant Kubernetes clusters in hot-hot, hot-warm, or hot-cold modes. We use this to ensure high processing availability despite limited control plane guarantees from any single cluster (EKS SLAs for the Kubernetes control plane guarantee only 99.9% uptime today[7]). Transparent to our users, we make the deployment systems aware of multiple available targets for scheduling.

Availability vs Cost

As alluded to in the “Platform Requirements” section, having a way of trading off availability for cost becomes important when the requirements and criticality of each processing pipeline are very different. Given that AWS Spot instances are a lot cheaper[8] than on-demand ones, we use user-annotated Kubernetes priority classes to determine deployment targets for pipelines. Latency-tolerant pipelines are scheduled on Spot instances, which are routinely 40-90% cheaper than the on-demand instances on which latency-sensitive pipelines run. The caveat is that Spot instances occasionally disappear, and these workloads are disrupted until a replacement node can be found for their scheduling.

What’s Next?

  • Expand the ecosystem of triggers to support custom sources of data i.e. the “event log”, as well as push based (RPC driven) versus just pull based triggers
  • Build a control plane for API integration with pipeline lifecycle management
  • Move some workloads to use the Vertical Pod Autoscaler[9] in Kubernetes instead of horizontal scaling, as most of our workloads have a limit on parallelism (which is their partition count in Kafka topics)
  • Move from Go plugins for pipelines to plugins over RPC, like what HashiCorp does[10], to enable processing logic in non-Go languages.
  • Use either pod gossip or a service mesh with a control plane to set up quotas for shared infrastructure usage per pipeline. This is to protect upstream dependencies and the metastore from surges in event backlogs.
  • Improve availability guarantees for pipeline pods by occasionally redistributing/rescheduling pods across nodes in our Kubernetes cluster to prevent entire workloads being served out of a few large nodes.

Authored By Karan Kamath on behalf of the Coban team at Grab:
Zezhou Yu, Ryan Ooi, Hui Yang, Yuguang Xiao, Ling Zhang, Roy Kim, Matt Hino, Jump Char, Lincoln Lee, Jason Cusick, Shrinand Thakkar, Dean Barlan, Shivam Dixit, Shubham Badkur, Fahad Pervaiz, Andy Nguyen, Ravi Tandon, Ken Fishkin, and Jim Caputo.


Footnotes

Coban Sewu Waterfall Photo by Dwinanda Nurhanif Mujito on Unsplash

Cover Photo by tian kuan on Unsplash

  1. https://martinfowler.com/eaaDev/EventSourcing.html 

  2. https://medium.com/netflix-techblog/keystone-real-time-stream-processing-platform-a3ee651812a 

  3. https://martinfowler.com/bliki/CQRS.html 

  4. https://flink.apache.org 

  5. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ 

  6. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler 

  7. https://aws.amazon.com/eks/sla/ 

  8. https://aws.amazon.com/ec2/pricing/ 

  9. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler 

  10. https://github.com/hashicorp/go-plugin 

Journey to a Faster Everyday Super App Where Every Millisecond Counts

Post Syndicated from Grab Tech original https://engineering.grab.com/journey-to-a-faster-everyday-super-app

Introduction

At Grab, we are moving faster than ever. In 2019 alone, we released dozens of new features in the Grab passenger app. With our goal to delight users in Southeast Asia with a powerful everyday super app, the app’s performance became one of the most critical components in delivering that experience to our users.

This post narrates the journey of our performance improvement efforts on the Grab passenger app. It highlights how we were able to reduce the time spent starting the app by more than 60%, while preventing regressions introduced by new features. We use the p95 scale when referring to these improvements.

Here’s a quick look at the improvements and timeline:

Improvements Timeline

Improving App Performance

While app performance consists of different aspects – such as battery consumption rate, network performance, app responsiveness, etc. – the first thing users notice is the time it takes for an app to start. Apps that take too long to load frustrate users, leading to bad reviews and uninstalls.

We focused our efforts on the app’s time to interactive (TTI), which consists of two main operations:

  • Starting the app
  • Displaying interactive service tiles (these are the icons for the services offered on the app such as Transport, Food, Delivery, and so on)

There are many other operations that occur in the background, which we won’t cover in this article.

We prioritised optimising the app’s ability to load the service tiles (highlighted in the image below) and render them as interactive upon startup (cold start). This allows users to use the app as soon as they launch it.

Service Tiles

Instrumentation and Benchmarking

Before we could start improving the app’s performance, we needed to know where we stood and set measurable goals.

We couldn’t get a baseline from local performance testing as it did not simulate real environment conditions, where network variability and device performance are contributing factors. Thus, we needed to use real production data to get an accurate reflection of our current performance at scale. In production, we measured the performance of ~8-9 million users per day – a small subset of our overall active user base.

As a start, we measured the different components contributing to TTI, such as binary loading, library initialisations, and tiles loading. For example, if we had to measure the time taken by function A, this is how it looked in the code (illustrated here as a Kotlin-style sketch):

fun functionA() {
    val startTime = SystemClock.elapsedRealtime()   // start the timer

    // ... the actual work of function A ...

    // stop the timer, calculate the time difference and send it as an analytic event
    val durationMs = SystemClock.elapsedRealtime() - startTime
    analyticsTracker.track("function_a_duration_ms", durationMs)   // analyticsTracker is an illustrative stand-in
}

With all the numbers from the contributing components, we took the sum to calculate the full TTI (as shown in the following image).

Full TTI

When the numbers started rolling in from production, we needed specific measurements to interpret those numbers, so we started looking at TTI’s 50th, 90th, and 95th percentile. A 90th percentile (p90) of x seconds means that 90% of the users have an interactive screen in at most x seconds.

We chose to only focus on p50 and p95 as these cover the majority of our users who deal with performance issues. Improving performance for <p50 (who already have high-end devices) would not bring much value, and improving for >p95 would be very difficult as app performance improvements would be limited by device performance.

By the end of January, we got the p50, p90, and p95 numbers for the contributing components that summed up to TTI numbers for tiles, which allowed us to start identifying areas with potential improvements.

Caching and Animation Removal

While reviewing the TTI numbers, we were drawn to contributors with high time consumption rates such as tile loading and app start animation. Another evident improvement we worked on was caching data between app launches instead of waiting for a network response to load tiles at every single app launch.

Tile Caching

Based on the gathered data, the service tiles only change when a user travels between cities. This is because the available services vary in each city. Since users do not frequently change cities, the service tiles do not change very frequently either, and so caching the tiles made sense. However, we also needed to sync fresh tiles in case of any change. So, we updated the logic based on these findings, as illustrated in the following image:

Tile Caching Logic

Caching tiles brought us a huge improvement of ~3s on each platform.

Animation Removal

We came across a beautifully created animation at app start that didn’t provide any additional value in terms of information or practicality.

After detailed discussions and trade-off evaluations with the designers, we removed the animation and improved our TTI further by 1s.

In conclusion, with the caching and animation removal alone, we improved the TTI by 4s.

Welcome Static Linking and Coroutines

At this point, our users gained 4 seconds of their time back, but we didn’t want to stop with that number. So, we dug through the data to see what further enhancements we could do. When we could not find anything else that was similar to caching and animation removal, we shifted to architecture fundamentals.

We knew that this was not an easy route to take and that it would come with a cost; if we decided to choose a component related to architecture fundamentals, all the other teams working on the Grab app would be impacted. We had to evaluate our options and make decisions with trade-offs for overall improvements. And this eventually led to static linking on iOS and coroutines on Android.

Binary Loading

Binary loading is one of the first steps in both mobile platforms when an app is launched. It primarily contributes to pre-main and dex-loading, on iOS and Android respectively.

The pre-main time on iOS was about 7.9s. It is well known in the iOS development world that each framework (binary) can be either dynamically or statically linked. While static linking helps with a faster app start, it brings complexity in building frameworks that are elaborate or contain resource bundles. Building a lot of libraries statically also impacts build times negatively. After evaluating the trade-offs, we decided to take the route of enabling more static linking.

Apple recommends a maximum of half a dozen dynamic frameworks for optimal performance. Guess what? Our passenger app had 107 dynamically linked frameworks, many of them internal.

The task looked daunting at first, since it affected all parts of the app, but we were ready to tackle the challenge head on. Deciding to take this on was the easy part; the actual work entailed lots of tricky coordination and collaboration with multiple teams.

We created an RFC (Request For Comments) doc to propose the static linking of frameworks, wherever applicable, and coordinated with teams on agreed timelines to execute this change.

While collaborating with teams, we learned that we could remove 12 frameworks entirely that were no longer required. This exercise highlighted the importance of regular cleanup and deprecation in our codebase, and was added into our standard process.

And so, we were left with 95 frameworks, 75 of which were statically linked successfully, resulting in our p90 pre-main dropping by 41%.

As Grabbers, it’s in our DNA to push ourselves a little more. With the remaining 20 frameworks, our pre-main was still considerably high. Out of the 20 frameworks, 10 could not be statically linked without issues. As a workaround, we merged multiple dynamic frameworks into one. One of our outstanding engineers even created a plug-in for this, called Cocoapod Merge. With this plug-in, we were able to merge 10 dynamically linked frameworks into 2. We’ve made this plug-in open source: https://github.com/grab/cocoapods-pod-merge.

With all of the above steps, we were finally left with 12 dynamic frameworks – a huge 88% reduction.

The following image illustrates the complex numbers mentioned above:

Static Linking

Using Cocoapod Merge further helped us gain ~0.8s of improvement.

Coroutines

While we were executing the static linking initiative on iOS, we also started refactoring the application initialisation for modular and clean code on Android. This resulted in creating an ApplicationInitialiser class, which handles the entire application initialisation process with maximum parallelism using coroutines.

Now all the libraries are initialised in parallel via coroutines, enabling better utilisation of computing resources and a faster TTI.

This refactoring and background initialisation of libraries on Android helped us gain ~0.4s of improvement.

Changing the Basics – Visualisation Setup

By the end of H1 2019, we observed a 50% improvement in TTI, and now it was time to set new goals for H2 2019. Until this point, we would query our database for all metric numbers, copy the numbers into a spreadsheet, and compare them across weeks and app versions.

Despite the heavy manual work and other challenges, this method still worked at the beginning, given the sizeable improvements we had to focus on.

However, in H2 2019 it became apparent that we had to reassess our methodology of reading numbers. So, we started thinking about other ways to present and visualise these numbers better. With help from our Product Analyst, we took advantage of Metabase’s advanced capabilities and presented our goals and metrics in a clear and easy-to-understand format.

For example, here is a graph that shows the top contributing metrics for Android:

Android Metrics

Looking at it, we could clearly tell which metric needed to be prioritised for improvements.

We did this not only for our metrics, but also for our main goals, which allowed us to easily see our progress and track our improvements on a daily basis.

Visualisation

The color bars in the above image depict the status of our numbers against our goals and also show the actual numbers at p50, p90, and p95.

As our tracking progressed, we started including more granular and precise measurements, to help guide the team and achieve more impactful improvements of around ~0.3-0.4s.

Fortunately, we were deprecating a third-party library for analytics and experimentation, which happened to be one of the highest contributing metrics for both platforms due to a high number of operations on the main thread. We started using our own in-house experimentation platform where we had better control over performance. We removed this third-party dependency, and it helped us with huge improvements of ~2.5s on Android and ~0.5-0.7s on iOS.

You might be wondering why there is such a big difference between the iOS and Android improvement numbers for this dependency. This was due to the set-user-attributes operations that ran only in the Android codebase; they were performed on the main thread and took a huge amount of time. Moments like these made us realise that we should focus more on consistency across both platforms, identify the third-party library APIs that are used, and assess whether they are absolutely necessary.

*Tip*: It may be time for you to eliminate such inconsistencies in your own codebase as well, if there are any.

Ok, there goes our third quarter with ~3s of improvement on Android and ~1.3s on iOS.

Performance Regression Detection

Entering Q4 brought us many challenges as we were running out of improvements to make. Even finding an improvement worth ~0.05s was really difficult! We were also strongly challenged by regressions (increases in TTI numbers) because of continuous feature releases and code additions to the app start process.

So, maintaining the TTI numbers became our primary task for this period. We started looking into setting up processes to block regressions from being merged to the master, or at least get notified before they hit production.

To begin with, we identified the main sources of regressions: static linking breakage on iOS and library initialisation in the app startup process on Android.

We took the following measures to cover these cases:

Linters

We built linters on the Continuous Integration (CI) pipeline to detect potential changes in static linking on iOS and the ApplicationInitialiser class on Android. The linters block the changelist and enforce a special review process for such changes.

Library Integration Process

The team also focused on setting up a process for library integrations, where each library (internal or third party) will first be evaluated for performance impact before it is integrated into the codebase.

While regression guarding was in progress, we simultaneously tried to bring in more TTI improvements. We enabled the Link Time Optimisation (LTO) flag on iOS to improve overall app performance. We also experimented with order files on iOS and Anko layouts on Android, but these were ruled out due to known issues.

On Android, we hit a wall, as there were minimal improvements left to make. Fortunately, it was a different story on iOS. We managed to get improvements worth ~0.6s by opting for lazy loading, optimising I/O operations, and deferring more operations until after app start (where applicable).

Next Steps

We will be looking at the different aspects of performance such as network, battery, and storage, while maintaining our current numbers for TTI.

  • Network performance – Track the turnaround time for network requests then move on to optimisations.
  • Battery performance – Focus on profiling the app for CPU and energy intensive operations, which drain the battery, and then move to optimisations.
  • Storage performance – Review our caching and storage mechanisms, and then look for ways to optimise them.

In addition to these, we are also focusing on bringing performance initiatives for all the teams at Grab. We believe that performance is a collaborative approach, and we would like to improve the app performance in all aspects.

We defined different metrics to track performance e.g. Time to Interactive, Time to feedback (the time taken to get the feedback for a user action), UI smoothness indicators, storage, and network metrics.

We are enabling all teams to benchmark their performance numbers based on defined metrics and move on to a path of improvement.

Conclusion

Overall, we improved TTI by 60%, and this calls for a big celebration! Woohoo! The bigger celebration came from knowing that we’ve improved our customers’ experience in using our app.

This graph represents our performance improvement journey for the entire 2019, in terms of TTI.

Performance Graph

Based on the graph, looking at our p95 improvements and converting them to hours saved per day gives us ~21,388 hours on iOS and ~38,194 hours on Android.

Hey, did you know that it takes approximately 80-85 hours to watch all the episodes of Friends? Just saying. 🙂

We will continue to serve our customers for a better and faster experience in the upcoming years.

Marionette – Enabling E2E user-scenario simulation

Post Syndicated from Grab Tech original https://engineering.grab.com/marionette-enabling-e2e-user-scenario-simulation

Introduction

A plethora of interconnected microservices is what powers the Grab app. The microservices work behind the scenes to delight millions of our customers in Southeast Asia. It is a no-brainer that we emphasize strong testing tools, so our app performs flawlessly to continuously meet our customers’ needs.

Background

We have a microservices-based architecture, in which microservices are interconnected to numerous other microservices. Each passing day sees teams within Grab updating their microservices, which in turn enhances the overall app. If any of the microservices fail after changes are rolled out, it may lead to the whole app getting into an unstable state or worse. This is a major risk and that’s why we stress on conducting “end-to-end (E2E) testing” as an integral part of our software test life-cycle.

E2E tests are done for all crucial workflows in the app, but not for every detail. For that we have conventional tests such as unit tests, component tests, functional tests, etc. Consider E2E testing as the final approval in the quality assurance of the app.

Writing E2E tests in the microservices world is not a trivial task. We are not testing just a single monolithic application. To test a workflow on the app from a user’s perspective, we need to traverse multiple microservices, which communicate through different protocols such as HTTP/HTTPS and TCP. E2E testing gets even more challenging with the continuous addition of microservices. Over the years, we have grown tremendously with hundreds of microservices working in the background to power our app.

Some major challenges in writing E2E tests for the microservices-based apps are:

  • Availability

    Getting all microservices together for E2E testing is tough. Each development team works independently and is responsible only for its microservices. Teams use different programming languages, data stores, etc for each microservice. It’s hard to construct all pieces in a common test environment as a complete app for E2E testing each time.

  • Data or resource set up

    E2E testing requires comprehensive data set up. Otherwise, testing results are affected because of data constraints, and not due to any recent changes to underlying microservices. For example, we need to create real-life equivalent driver accounts, passenger accounts, etc., and to have those, we depend on a few other internal systems which manage user accounts. Further, data and booking generation should be robust enough to replicate real-world scenarios as far as possible.

  • Access and authentication

    Usually, the test cases require sequential execution in E2E testing. In a microservices architecture, it is difficult to test a workflow which requires access and permissions to several resources or data that should remain available throughout the test execution.

  • Resource and time intensive

    It is expensive and time consuming to run E2E tests; significant time is involved in deploying new changes, configuring all the necessary test data, etc.

Though there are several challenges, we had to find a way to overcome them and test workflows from the beginning to the end in our app.

Our approach to overcome challenges

We knew what our challenges were and what we wanted to achieve from E2E testing, so we started thinking about how to develop a platform for E2E tests. To begin with, we determined that the scope of E2E testing that we’re going to primarily focus on is Grab’s transport domain — the microservices powering the driver and passenger apps.

One approach is to “simulate” user scenarios through a single platform before any new versions of these microservices are released. Ideally, the platform should also have the capabilities to set up the data required for these simulations. For example, ride booking requires data set up such as driver accounts, passenger accounts, location coordinates, geofencing, etc.

We wanted to create a single platform that multiple teams could use to set up their test data and run E2E user-scenario simulations easily. We put ourselves to work on that idea, which resulted in the creation of an internal platform called “Marionette”. It simulates actions performed by Grab’s passenger and driver apps as they are expected to behave in the real world. The objective is to ensure that all standard user workflows are tested before deploying new app versions.

Introducing Marionette

Marionette enables Grabbers (developers and QAs) to run E2E user-scenario simulations without depending on the actual passenger and driver apps. Grabbers can set up and configure data such as drivers, passengers, and taxi types to mimic real-world behavior.

Let’s look at the overall architecture to understand Marionette better:

Overall Architecture

Grabbers can interact with Marionette through three channels: UI, SDK, and through RESTful API endpoints in their test scripts. All requests are routed through a load balancer to the Marionette platform. The Marionette platform in turn talks to the required microservices to create test data and to run the simulations.

The benefits

With Marionette, Grabbers now have the ability to:

  • Simulate the whole booking flow including customer and driver behavior as well as transition through the booking life cycle including pick-up, drop-off, cancellation, etc. For example, developers can make passenger booking from the UI and configure pick-up points, drop-off points, taxi types, and other parameters easily. They can define passenger behaviour such as “make bookings after a specified time interval”, “cancel each booking”, etc. They can also set driver locations, define driver behaviour such as “always accept booking manually”, “decline received bookings”, etc.
  • Simulate bookings in all cities where Grab operates. Further, developers can run simulations for multiple Grab taxi types such as JustGrab, GrabShare, etc.
  • Visualize passengers, drivers, and ride transitions on the UI, which lets them easily test their workflows.
  • Save efforts and time spent on installing third-party android or iOS emulators, troubleshooting or debugging .apk installation files, etc before testing workflows.
  • Conduct E2E testing without real mobile devices and installed apps.
  • Run automatic simulations, in which a particular set of scenarios are run continuously, thus helping developers with exploratory testing.

How we isolated simulations among users

It is important to have independent simulations for each user. Otherwise, simulations don’t yield correct results. This was one of the challenges we faced when we first started running simulations on Marionette.

To resolve this issue, we came up with the idea of “cohorts”. A cohort is a logical group of passengers and drivers who are located in a particular city. Each simulation on Marionette is run using a “cohort” containing the number of drivers and passengers required for that simulation. When a passenger/driver needs to interact with other passengers/drivers (such as for ride bookings), Marionette ensures that the interaction is constrained to resources within the cohort. This ensures that drivers and passengers are not shared in different test cases/simulations, resulting in more consistent test runs.

How to interact with Marionette

Let’s take a look at how to interact with Marionette starting with its user interface first.

User Interface

The Marionette UI is designed to provide the same level of granularity as available on the real passenger and driver apps.

Generally, the UI is used in the following scenarios:

  • To test common user scenarios/workflows after deploying a change on staging.
  • To test the end-to-end booking flow right from the point where a passenger makes a booking till drop-off at the destination.
  • To simulate functionality of other teams within Grab – the passenger app developers can simulate the driver app for their testing and vice versa. Usually, teams work independently and the ability to simulate the dependent app for testing allows developers to work independently.
  • To perform E2E testing (such as by QA teams) without writing any test scripts.

The Marionette UI also allows Grabbers to create and set up data. All that needs to be done is to specify the necessary resources such as number of drivers, number of passengers, city to run the simulation, etc. Running E2E simulations involves just the click of a button after data set up. Reports generated at the end of running simulations provide a graphical visualization of the results. Visual reports save developers’ time, which otherwise is spent on browsing through logs to ascertain errors.

SDK

Marionette also provides an SDK, written in the Go programming language.

It lets developers:

  • Create resources such as passengers, drivers, and cohorts for simulating booking flows.
  • Create booking simulations in both staging and production.
  • Set bookings to specific states as needed for simulation through customizable driver and passenger behaviour.
  • Make HTTP requests and receive responses that matter in tests.
  • Run load tests by scaling up booking requests to match the required workload (QPS).

Let’s look at a high-level booking test case example to understand the simulation workflow.

Assume we want to run an E2E booking test with this driver behavior type — “accepts passenger bookings and transits between booking states according to defined behavior parameters”. This is just one of the driver behavior types in Marionette; other behavior types are also supported. Similarly, passengers also have behaviour types.

To write the E2E test for this example case, we first define the driver behavior in a function like this:

Driver behaviour definition (code snippet)

Then, we handle the booking request for the driver like this:

Booking request handling for the driver (code snippet)

The SDK client makes the handling of passengers, drivers, and bookings very easy as developers don’t need to worry about hitting multiple services and multiple APIs to set up their required driver and passenger actions. Instead, teams can focus on testing their use cases.

To ensure that passengers and drivers are isolated in our test, we need to group them together in a cohort before running the E2E test.

Grouping the driver and passenger into a cohort (code snippet)

In summary, we have defined the driver’s behavior, created the booking request, created the SDK client and associated the driver and passenger to a cohort. Now, we just have to trigger the E2E test from our IDE. It’s just that simple and easy!
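
To give a sense of the shape of such a test, here is a hypothetical, self-contained sketch in Go; none of these types or methods are the actual Marionette SDK API, they are stand-ins for illustration only.

// Hypothetical stand-ins that mirror the shape of an E2E booking test; not the real SDK.
package marionettesketch

import "fmt"

type DriverBehaviour struct {
    AcceptBookings bool
    Transitions    []string // e.g. "DRIVER_ARRIVED" -> "PICKED_UP" -> "DROPPED_OFF"
}

type Client struct{ baseURL string }

func NewClient(baseURL string) *Client { return &Client{baseURL: baseURL} }

// CreateCohort groups the drivers and passengers so the simulation stays isolated.
func (c *Client) CreateCohort(city string, drivers, passengers int) string {
    return fmt.Sprintf("cohort-%s-%d-%d", city, drivers, passengers)
}

// RunBooking creates a passenger booking in the cohort and drives it through
// the lifecycle according to the driver behaviour.
func (c *Client) RunBooking(cohortID string, b DriverBehaviour) error {
    return nil // a real test would call Marionette's RESTful endpoints here
}

func ExampleBookingE2E() {
    client := NewClient("https://marionette.example.internal") // hypothetical endpoint
    cohort := client.CreateCohort("Singapore", 1, 1)
    behaviour := DriverBehaviour{
        AcceptBookings: true,
        Transitions:    []string{"DRIVER_ARRIVED", "PICKED_UP", "DROPPED_OFF"},
    }
    if err := client.RunBooking(cohort, behaviour); err != nil {
        fmt.Println("booking simulation failed:", err)
    }
}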

Previously, developers had to write boilerplate code to make HTTP requests and parse returned HTTP responses. With the Marionette SDK in place, developers don’t have to write any boilerplate code saving significant time and effort in E2E testing.

RESTful APIs in test scripts

Marionette provides several RESTful API endpoints that cover different simulation areas such as resource or data creation APIs, driver APIs, passenger APIs, etc. APIs are particularly suitable for scripted testing. Developers can directly call these APIs in their test scripts to facilitate their own tests such as load tests, integration tests, E2E tests, etc.

Developers use these APIs with their preferred programming languages to run simulations. They don’t need to worry about any underlying complexities when using the APIs. For example, developers in Grab have created custom libraries using Marionette APIs in Python, Java, and Bash to run simulations.

What’s next

Currently, we cover E2E tests for our transport domain (microservices for the passenger and driver apps) through Marionette. The next phase is to expand into a full-fledged platform that can test microservices in other Grab domains such as Food, Payments, and so on. Going forward, we are also looking to further simplify the writing of E2E tests and running them as a part of the CD pipeline for seamless testing before deployment.

In conclusion

We had an idea of creating a simulation platform that can run and facilitate E2E testing of microservices. With Marionette, we have achieved this objective. Marionette has helped us understand how end users use our apps, allowing us to make improvements to our services. Further, Marionette ensures there are no breaking changes and provides additional visibility into potential bugs that might be introduced as a result of any changes to microservices.

If you have any comments or questions about Marionette, please leave a comment below.

Behind the scenes: GitHub security alerts

Post Syndicated from Justin Hutchings original https://github.blog/2019-12-11-behind-the-scenes-github-vulnerability-alerts/

If you have code on GitHub, chances are that you’ve had a security vulnerability alert at some point. Since the feature launched, GitHub has sent more than 62 million security alerts for vulnerable dependencies.

How does it work?

Vulnerability alerts rely on two pieces of data: an inventory of all the software that your code depends on, and a curated list of known vulnerabilities in open-source code. 

Any time you push a change to a dependency manifest file, GitHub has a job that parses those manifest files, and stores your dependency on those packages in the dependency graph. If you’re dependent on something that hasn’t been seen before, a background task runs to get more information about the package from the package registries themselves and adds it. We use the information from the package registries to establish the canonical repository that the package came from, and to help populate metadata like readmes, known versions, and the published licenses. 
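
GitHub’s actual parsers are not shown here, but as a minimal illustration of what extracting dependencies from a manifest involves, the following sketch reads the dependency sections of a package.json file.

// A minimal illustration of pulling dependencies out of a package.json manifest.
package main

import (
    "encoding/json"
    "fmt"
)

type packageJSON struct {
    Dependencies    map[string]string `json:"dependencies"`
    DevDependencies map[string]string `json:"devDependencies"`
}

func main() {
    manifest := []byte(`{
        "dependencies":    {"lodash": "^4.17.15"},
        "devDependencies": {"jest": "^24.9.0"}
    }`)

    var pkg packageJSON
    if err := json.Unmarshal(manifest, &pkg); err != nil {
        panic(err)
    }
    for name, version := range pkg.Dependencies {
        fmt.Printf("runtime dependency: %s %s\n", name, version)
    }
    for name, version := range pkg.DevDependencies {
        fmt.Printf("dev dependency: %s %s\n", name, version)
    }
}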

On GitHub Enterprise Server, this process works identically, except we don’t get any information from the public package registries in order to protect the privacy of the server and its code. 

The dependency graph supports manifests for JavaScript (npm, Yarn), .NET (NuGet), Java (Maven), PHP (Composer), Python (PyPI), and Ruby (RubyGems). This data powers our vulnerability alerts, but also dependency insights, the “used by” badge, and the community contributors experience.

Beyond the dependency graph, we aggregate and curate data from a number of sources to bring you actionable security alerts. GitHub brings in security vulnerability data from the National Vulnerability Database (a service of the United States National Institute of Standards and Technology), maintainer security advisories from open-source maintainers, community datasources, and our partner WhiteSource.

Once we learn about a vulnerability, it passes through an advanced machine learning model that’s trained to recognize vulnerabilities which impact developers. This model rejects anything that isn’t related to an open-source toolchain. If the model accepts the vulnerability, a bot creates a pull request in a GitHub private repository for our  team of curation experts to manually review.

GitHub curates vulnerabilities because CVEs (Common Vulnerabilities and Exposures entries) are often ambiguous about which open-source projects are impacted. This can be particularly challenging when multiple libraries with similar names exist, or when they’re a part of a larger toolkit. Depending on the kind of vulnerability, our curation team may follow up with outside security researchers or maintainers about the impact assessment. This follow-up helps to confirm that an alert is warranted and to identify the exact packages that are impacted.

Once the curation team completes the mappings, we merge the pull request and it starts a background job that notifies users about any affected repositories. Depending on the vulnerability, this can cause a lot of alerts. In a recent incident, more than two million repositories were alerted about a vulnerable version of lodash, a popular JavaScript utility library.

GitHub Enterprise Server customers get a slightly different experience. If an admin has enabled security vulnerability alerts through GitHub Connect, the server will download the latest curated list of vulnerabilities from GitHub.com over the private GitHub Connect channel on its next scheduled sync (about once per hour). If a new vulnerability exists, the server determines the impacted users and repositories before generating alerts directly. 

Security vulnerabilities are a matter of public good. High-profile breaches impact the trustworthiness of the entire tech industry, so we publish a curated set of vulnerabilities on our GraphQL APIs for community projects and enterprise tools to use in custom workflows as necessary. Users can also browse the known vulnerabilities from public sources on the GitHub Advisory Database.
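
For example, a tool could query the curated data through the securityVulnerabilities connection on the GraphQL API. The sketch below is an assumption-laden example: the field names reflect the public schema as documented and should be checked against the current schema, and a GITHUB_TOKEN environment variable is assumed.

// A sketch of querying curated vulnerability data over GitHub's GraphQL API.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "os"
)

const query = `{
  securityVulnerabilities(ecosystem: NPM, package: "lodash", first: 5) {
    nodes {
      advisory { summary severity }
      vulnerableVersionRange
      firstPatchedVersion { identifier }
    }
  }
}`

func main() {
    body, _ := json.Marshal(map[string]string{"query": query})
    req, _ := http.NewRequest("POST", "https://api.github.com/graphql", bytes.NewReader(body))
    req.Header.Set("Authorization", "bearer "+os.Getenv("GITHUB_TOKEN"))
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out)) // raw JSON response with the matching advisories
}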

Engineers behind the feature

Despite advanced technology, security alerting is a human process driven by dedicated GitHubbers. Meet Rob (@rschultheis), one of the core members of our security team, and learn about his experiences at GitHub through a friendly Q&A:

Humphrey Dogart (German Shepherd) and Rob Schultheis (Software Engineer on the GitHub Security team)

How long have you been with GitHub? 

Two years

How did you get into software security? 

I’ve worked with open source software for most of my 20 year career in tech, and honestly for much of that time I didn’t pay much attention to security. When I started at GitHub I was given the opportunity to work on the first iteration of security alerts. It quickly became clear that having a high quality, open dataset was going to be a critical factor in the success of the feature. I dove into the task of curating that advisory dataset and found a whole side to the industry that was open for exploration, and I’ve stayed with it ever since!

What are the trickiest parts of vulnerability curation? 

The hardest problem is probably confirming that our advisory data correctly identifies which version(s) of a package are vulnerable to a given advisory, and which version(s) first address it.

What was the most difficult security vulnerability you’ve had to publish? 

One memorable vulnerability was CVE-2015-9284. This one was tough in several ways because it was a part of a popular library, it was also unpatched when it became fully public, and finally, it was published four years after the initial disclosure to maintainers. Even worse, all attempts to fix it had stalled.

We ended up proceeding to publish it and the community quickly responded and finally got the security issue patched.

What’s your favorite feel-good moment working in security? 

Seeing tweets and other feedback thanking us is always wonderful. We do read them! And the same goes for feedback that is critical of the feature or the way certain advisories were disclosed or published. Please keep them coming—they’re really valuable to us as we keep evolving our security offerings.

Since you work at home, can you introduce us to your furry officemate? 

I live with a seven month old shepherd named Humphrey Dogart. His primary responsibilities are making sure I don’t spend all day on the computer, and he does a great job of that. I think we make a great team!


Learn more about GitHub security alerts

The post Behind the scenes: GitHub security alerts appeared first on The GitHub Blog.

Debugging network stalls on Kubernetes

Post Syndicated from Theo Julienne original https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/

We’ve talked about Kubernetes before, and over the last couple of years it’s become the standard deployment pattern at GitHub. We now run a large portion of both internal and public-facing services on Kubernetes. As our Kubernetes clusters have grown, and our targets on the latency of our services have become more stringent, we began to notice that certain services running on Kubernetes in our environment were experiencing sporadic latency that couldn’t be attributed to the performance characteristics of the application itself.

Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries. Services were expected to be able to respond to requests in well under 100ms, which wasn't feasible when the connection itself was taking so long. Separately, we observed that MySQL queries we expected to complete in a few milliseconds (and that MySQL itself reported as taking only milliseconds) were taking 100ms or more from the perspective of the querying application.

The problem was initially narrowed down to communications that involved a Kubernetes node, even if the other side of a connection was outside Kubernetes. The simplest reproduction we had was a Vegeta benchmark that could be run from any internal host, targeting a Kubernetes service running on a node port, and would observe the sporadically high latency. In this post, we'll walk through how we tracked down the underlying issue.

Removing complexity to find the path at fault

Using an example reproduction, we wanted to narrow down the problem and remove layers of complexity. Initially, there were too many moving parts in the flow between Vegeta and pods running on Kubernetes to determine if this was a deeper network problem, so we needed to rule some out.

vegeta to kube pod via NodePort

The client, Vegeta, creates a TCP connection to any kube-node in the cluster. Kubernetes runs in our data centers as an overlay network (a network that runs on top of our existing datacenter network) that uses IPIP (which encapsulates the overlay network’s IP packet inside the datacenter’s IP packet). When a connection is made to that first kube-node, it performs stateful Network Address Translation (NAT) to convert the kube-node’s IP and port to an IP and port on the overlay network (specifically, of the pod running the application). On return, it undoes each of these steps. This is a complex system with a lot of state, and a lot of moving parts that are constantly updating and changing as services deploy and move around.

As part of running a tcpdump on the original Vegeta benchmark, we observed the latency during a TCP handshake (between SYN and SYN-ACK). To simplify some of the complexity of HTTP and Vegeta, we can use hping3 to just “ping” with a SYN packet and see if we observe the latency in the response packet—then throw away the connection. We can filter it to only include packets over 100ms and get a simpler reproduction case than a full Layer 7 Vegeta benchmark or attack against the service. The following “pings” a kube-node using TCP SYN/SYN-ACK on the “node port” for the service (30927) with an interval of 10ms, filtered for slow responses:

$ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

Our first new observation from the sequence numbers and timings is that this isn’t a one-off, but is often grouped, like a backlog that eventually gets processed.

Next up, we want to narrow down which component(s) were potentially at fault. Is it the kube-proxy iptables NAT rules that are hundreds of rules long? Is it the IPIP tunnel and something on the network handling them poorly? One way to validate this is to test each step of the system. What happens if we remove the NAT and firewall logic and only use the IPIP part:

hping3 on the IPIP tunnel alone

Linux thankfully lets you just talk directly to an overlay IP when you’re on a machine that’s part of the same network, so that’s pretty easy to do:

$ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Based on our results, the problem still remains! That rules out iptables and NAT. Is it TCP that’s the problem? Let’s see what happens when we perform a normal ICMP ping:

$ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms

len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms

len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms

len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

Our results show that the problem still exists. Is it the IPIP tunnel that’s causing the problem? Let’s simplify things further:

hping3 directly between hosts

Is it possible that it’s every packet between these two hosts?

$ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms

len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

Behind the complexity, it's as simple as two kube-node hosts sending any packet, even ICMP pings, to each other. They'll still see the latency if the target host is a "bad" one (some are worse than others).

Now there’s one last thing to question: we clearly don’t observe this everywhere, so why is it just on kube-node servers? And does it occur when the kube-node is the sender or the receiver? Luckily, this is also pretty easy to narrow down by using a host outside Kubernetes as a sender, but with the same “known bad” target host (from a staff shell host to the same kube-node). We can observe this is still an issue in that direction:

$ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

And then perform the same from the previous source kube-node to a staff shell host (which rules out the source host, since a ping has both an RX and TX component):

$ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

Looking into the packet captures of the latency we observed, we get some more information. Specifically, that the “sender” host (bottom) observes this timeout while the “receiver” host (top) does not—see the Delta column (in seconds):

packet captures of the latency

Additionally, by looking at the difference between the ordering of the packets (based on the sequence numbers) on the receiver side of the TCP and ICMP results above, we can observe that ICMP packets always arrive in the same sequence they were sent, but with uneven timing, while TCP packets are sometimes interleaved, but a subset of them stall. Notably, we observe that if you count the ports of the SYN packets, the ports are not in order on the receiver side, while they’re in order on the sender side.

There is a subtle difference between how modern server NICs—like we have in our data centers—handle packets containing TCP vs ICMP. When a packet arrives, the NIC hashes the packet “per connection” and tries to divvy up the connections across receive queues, each (approximately) delegated to a given CPU core. For TCP, this hash includes both source and destination IP and port. In other words, each connection is hashed (potentially) differently. For ICMP, just the IP source and destination are hashed, since there are no ports.
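
To make that difference concrete, here is a toy sketch in Python (real NICs typically use a Toeplitz hash with an indirection table; this only illustrates which header fields feed the hash, and the queue count is an assumption):

# Toy illustration of receive-side hashing -- not the actual NIC algorithm.
NUM_RX_QUEUES = 16

def rx_queue_for(src_ip, dst_ip, proto, src_port=None, dst_port=None):
    if proto == "tcp":
        # TCP: the full 4-tuple is hashed, so connections from different
        # source ports can land on different RX queues (and CPU cores).
        key = (src_ip, dst_ip, src_port, dst_port)
    else:
        # ICMP: no ports, so only the IP pair is hashed -- every ping
        # between the same two hosts lands on the same RX queue.
        key = (src_ip, dst_ip)
    return hash(key) % NUM_RX_QUEUES

# All ICMP between this host pair shares one queue:
print(rx_queue_for("172.16.33.44", "172.16.47.27", "icmp"))
# TCP SYNs from different ephemeral ports can spread across queues:
print(rx_queue_for("172.16.33.44", "172.16.47.27", "tcp", 54321, 30927))
print(rx_queue_for("172.16.33.44", "172.16.47.27", "tcp", 54322, 30927))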

Another new observation, from the ICMP vs TCP sequence numbers, is that during these periods ICMP sees stalls on all traffic between the two hosts, while only a subset of TCP connections stall. This tells us that the RX queue hashing is likely in play, almost certainly indicating the stall is in processing received packets, not in sending responses.

This rules out kube-node transmits, so we now know that it’s a stall in processing packets, and that it’s on the receive side on some kube-node servers.

Deep dive into Linux kernel packet processing

To understand why the problem could be on the receiving side on some kube-node servers, let’s take a look at how the Linux kernel processes packets.

Going back to the simplest traditional implementation, the network card receives a packet and sends an interrupt to the Linux kernel stating that there’s a packet that should be handled. The kernel stops other work, switches context to the interrupt handler, processes the packet, then switches back to what it was doing.

traditional interrupt-driven approach

This context switching is slow. It may have been fine on a 10Mbit NIC in the 90s, but on modern servers with a 10G NIC that can bring in around 15 million packets per second at maximum line rate, a smaller server with eight cores could see the kernel interrupted millions of times per second per core.

Instead of constantly handling interrupts, many years ago Linux added NAPI, the networking API that modern drivers use for improved performance at high packet rates. At low rates, the kernel still accepts interrupts from the NIC in the method we mentioned. Once enough packets arrive and cross a threshold, it disables interrupts and instead begins polling the NIC and pulling off packets in batches. This processing is done in a “softirq”, or software interrupt context. This happens at the end of syscalls and hardware interrupts, which are times that the kernel (as opposed to userspace) is already running.

NAPI polling in softirq at end of syscalls

This is much faster, but brings up another problem. What happens if we have so many packets to process that we spend all our time processing packets from the NIC, but we never have time to let the userspace processes actually drain those queues (read from TCP connections, etc.)? Eventually the queues would fill up, and we’d start dropping packets. To try and make this fair, the kernel limits the amount of packets processed in a given softirq context to a certain budget. Once this budget is exceeded, it wakes up a separate thread called ksoftirqd (you’ll see one of these in ps for each core) which processes these softirqs outside of the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which already tries to be fair.

NAPI polling crossing threshold in schedule ksoftirqd
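
A rough, non-kernel pseudocode sketch of that budgeting behaviour (the names and the budget value here are illustrative, not real kernel symbols):

# Illustrative pseudocode of softirq budgeting -- not actual kernel code.
NET_RX_BUDGET = 300  # assumed budget value for the sketch

def handle_packet(pkt):
    pass  # stand-in for the real protocol processing

def wake_up_ksoftirqd():
    print("budget exhausted: deferring remaining packets to ksoftirqd")

def net_rx_softirq(rx_queue):
    processed = 0
    while rx_queue and processed < NET_RX_BUDGET:
        handle_packet(rx_queue.pop(0))
        processed += 1
    if rx_queue:
        # Leftover work is handed to the per-core ksoftirqd thread, which is
        # scheduled like any other process -- this is where a busy or blocked
        # core can leave packets sitting in the NIC RX queue.
        wake_up_ksoftirqd()

net_rx_softirq(list(range(1000)))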

With an overview of the way the kernel is processing packets, we can see there is definitely opportunity for this processing to become stalled. If the time between softirq processing calls grows, packets could sit in the NIC RX queue for a while before being processed. This could be something deadlocking the CPU core, or it could be something slow preventing the kernel from running softirqs.

Narrow down processing to a core or method

At this point, it makes sense that this could happen, and we know we’re observing something that looks a lot like it. The next step is to confirm this theory, and if we do, understand what’s causing it.

Let’s revisit the slow round trip packets we saw before:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms

len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms

len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms

len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed previously, these ICMP packets are hashed to a single NIC RX queue and processed by a single CPU core. If we want to understand what the kernel is doing, it’s helpful to know where (cpu core) and how (softirq, ksoftirqd) it’s processing these packets so we can catch them in action.

Now it's time to use the tools that allow live tracing of a running Linux kernel—bcc is what was used here. It allows you to write small C programs that hook arbitrary functions in the kernel, and buffer events back to a userspace Python program which can summarize and return them to you. Hooking arbitrary kernel functions sounds risky, but bcc goes out of its way to be as safe as possible, because it's designed for tracing exactly this type of production issue that you can't simply reproduce in a testing or dev environment.

The plan here is simple: we know the kernel is processing those ICMP ping packets, so let’s hook the kernel function icmp_echo which takes an incoming ICMP “echo request” packet and initiates sending the ICMP “echo response” reply. We can identify the packet using the incrementing icmp_seq shown by hping3 above.
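
As a rough sketch (this is not the exact script used in the investigation, and details such as the header offset handling are simplified), a bcc program hooking icmp_echo might look something like this:

from bcc import BPF

bpf_text = r"""
#include <uapi/linux/ptrace.h>
#include <linux/skbuff.h>
#include <linux/icmp.h>

struct event_t {
    u32 tgid;
    u32 pid;
    u16 seq;
    char comm[16];
};
BPF_PERF_OUTPUT(events);

// kprobe on icmp_echo(struct sk_buff *skb): fires for each incoming echo request.
int kprobe__icmp_echo(struct pt_regs *ctx, struct sk_buff *skb)
{
    struct event_t ev = {};
    struct icmphdr ih = {};

    // The ICMP header lives at skb->head + skb->transport_header at this point.
    unsigned char *head = skb->head;
    u16 off = skb->transport_header;
    bpf_probe_read(&ih, sizeof(ih), head + off);

    ev.seq = bpf_ntohs(ih.un.echo.sequence);
    u64 id = bpf_get_current_pid_tgid();
    ev.tgid = id >> 32;
    ev.pid = (u32)id;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
"""

b = BPF(text=bpf_text)

def print_event(cpu, data, size):
    ev = b["events"].event(data)
    print("%-7d %-7d %-15s %d" % (ev.tgid, ev.pid, ev.comm.decode("utf-8", "replace"), ev.seq))

print("TGID    PID     PROCESS NAME    ICMP_SEQ")
b["events"].open_perf_buffer(print_event)
while True:
    b.perf_buffer_poll()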

The full bcc script looks a little more complex, but breaking it down it's not as scary as it sounds. The icmp_echo function is passed a struct sk_buff *skb, which is the packet containing the ICMP echo request. We can delve into this live and pull out the echo.sequence (which maps to the icmp_seq shown by hping3 above), and send that back to userspace. Conveniently, we can also grab the current process name and ID as well. This gives us results like the following, live as the kernel processes these packets:

TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11      777
0       0       swapper/11      778
4512    4542   spokes-report-s  779

One thing to note about this process name is that in a post-syscall softirq context, you see the process that made the syscall shown as the "process", even though really it's the kernel processing the packet safely within the kernel context.

With that running, we can now correlate back from the stalled packets observed with hping3 to the process handling them. A simple grep on that capture for the icmp_seq values, with some context, shows what happened before these packets were processed. The packets that line up with the hping3 icmp_seq values above have been marked, along with the RTTs we observed (and the RTTs we'd have expected if responses under 50ms weren't filtered out):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)

The results tell us a few things. First, these packets are being processed by ksoftirqd/11, which conveniently tells us that this particular pair of machines has its ICMP packets hashed to core 11 on the receiving side. We can also see that every time there is a stall, some packets are processed in cadvisor's syscall softirq context, followed by ksoftirqd taking over and processing the backlog, exactly the number of packets we'd expect to work through the backlog.

The fact that cadvisor is always running just prior to the stall also immediately implicates it in the problem. Ironically, cadvisor "analyzes resource usage and performance characteristics of running containers", yet it's triggering this performance problem. As with many things related to containers, it's all relatively bleeding-edge tooling, which can result in somewhat expected corner cases of bad performance.

What is cadvisor doing to stall things?

With an understanding of how the stall can happen, the process causing it, and the CPU core it's happening on, we now have a pretty good idea of what this looks like. For the kernel to hard block and not schedule ksoftirqd earlier, and given that we see packets processed under cadvisor's softirq context, it's likely that cadvisor is running a slow syscall that only ends once the rest of the packets have been processed:

slow syscall causing stalled packet processing on NIC RX queue

That's a theory, but how do we validate that this is actually happening? One thing we can do is trace what's running on the CPU core throughout this process, catch the point where the packets overflow the budget and are processed by ksoftirqd, then look back a bit to see what was running on the CPU core. Think of it like taking an X-ray of the CPU every few milliseconds. It would look something like this:

tracing of cpu to catch bad syscall and preceding work

Conveniently, this is something that’s already mostly supported. The perf record tool samples a given CPU core at a certain frequency and can generate a call graph of the live system, including both userspace and the kernel. Taking that recording and manipulating it using a quick fork of a tool from Brendan Gregg’s FlameGraph that retained stack trace ordering, we can get a one-line stack trace for each 1ms sample, then get a sample of the 100ms before ksoftirqd is in the trace:

# record 999 times a second, or every 1ms, with some offset so as not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

This results in the following:
(hundreds of traces that look similar)

cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There’s a lot there, but looking through it you can see it’s the cadvisor-then-ksoftirqd pattern we saw from the ICMP tracer above. What does it mean?

Each line is a trace of the CPU at a point in time. Each call down the stack is separated by ; on that line. Looking at the middle of the lines we can see the syscall being called is read(): .... ;do_syscall_64;sys_read; ... So cadvisor is spending a lot of time in a read() syscall relating to mem_cgroup_* functions (the top of the call stack / end of line).

The call stack trace doesn't make it easy to see what's being read, so let's use strace to see what cadvisor is doing and find the syscalls taking 100ms or longer:

$ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0\.[1-9]'
[pid 10436] <... futex resumed> )       = 0 <0.156784>
[pid 10432] <... futex resumed> )       = 0 <0.258285>
[pid 10137] <... futex resumed> )       = 0 <0.678382>
[pid 10384] <... futex resumed> )       = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> )       = 0 <0.104614>
[pid 10436] <... futex resumed> )       = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> )   = 0 <0.118113>
[pid 10382] <... pselect6 resumed> )    = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> )       = 0 <0.917495>
[pid 10436] <... futex resumed> )       = 0 <0.208172>
[pid 10417] <... futex resumed> )       = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

Sure enough, we see the slow read() calls. From the content being read and mem_cgroup context above, these read() calls are to a memory.stat file which shows the memory usage and limits of a cgroup (the resource isolation technology used by Docker). cadvisor is polling this file to get resource utilization details for the containers. Let’s see if it’s the kernel or cadvisor that’s doing something unexpected by attempting the read ourselves:

$ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real    0m0.153s
user    0m0.000s
sys    0m0.152s
$

Since we could reproduce the slow read ourselves with a simple cat, it indicates that it's the kernel hitting a pathologically bad case, rather than cadvisor doing anything unexpected.

What causes this read to be so slow?

At this point, it's much simpler to find similar issues reported by others. As it turns out, this had been reported to cadvisor as an excessive CPU usage problem; it just hadn't been observed that latency was also being introduced to the network stack at random. In fact, some folks internally had noticed cadvisor was consuming more CPU than expected, but it didn't seem to be causing an issue since our servers had plenty of CPU capacity, and so the CPU usage hadn't yet been investigated.

The overview of the issue is that the memory cgroup accounts for memory usage inside a namespace (container). When all processes in that cgroup exit, the memory cgroup is released by Docker. However, "memory" isn't just process memory; although the process memory itself is gone, it turns out the kernel also charges cached content like dentries and inodes (directory and file metadata) to the memory cgroup. From that issue:

“zombie” cgroups: cgroups that have no processes and have been deleted but still have memory charged to them (in my case, from the dentry cache, but it could also be from page cache or tmpfs).

Rather than having the kernel iterate over every page in the cache at cgroup release time, which could be very slow, the kernel waits lazily for those pages to be reclaimed as memory is needed, and only cleans up the cgroup once all of them have been reclaimed. In the meantime, the cgroup still needs to be counted during stats collection.

From a performance perspective, this trades the cost of a slow release for amortizing it over the reclamation of each page, making the initial cleanup fast in return for leaving some cached memory around. That's fine: when the kernel reclaims the last of the cached memory, the cgroup eventually gets cleaned up, so it's not really a "leak". Unfortunately, the search that memory.stat performs, the way it's implemented on the kernel version (4.9) we're running on some servers, combined with the huge amount of memory on our servers, means it can take a significantly long time for the last of the cached data to be reclaimed and for the zombie cgroup to be cleaned up.

It turns out we had nodes that had such a large number of zombie cgroups that some had reads/stalls of over a second.

The workaround on that cadvisor issue, freeing the dentry/inode cache system-wide, immediately stopped the read latency and also the network latency stalls on the host, since dropping the cache included the cached pages in the "zombie" cgroups, which were freed as well. This isn't a solution, but it does validate the cause of the issue.

As it turns out, newer kernel releases (4.19+) have improved the performance of the memory.stat call, so this is no longer a problem after moving to that kernel. In the interim, we used existing tooling to detect nodes in our Kubernetes clusters with high enough latency to cause issues and gracefully drain and reboot them. This gave us breathing room while OS and kernel upgrades were rolled out to the remainder of the fleet.

Wrapping up

Since this problem manifested as NIC RX queues not being processed for hundreds of milliseconds, it was responsible for both high latency on short connections and latency observed mid-connection such as between MySQL query and response packets. Understanding and maintaining performance of our most foundational systems like Kubernetes is critical to the reliability and speed of all services that build on top of them. As we invest in and improve on this performance, every system we run benefits from those improvements.

The post Debugging network stalls on Kubernetes appeared first on The GitHub Blog.

How we implemented domain-driven development in Golang

Post Syndicated from Grab Tech original https://engineering.grab.com/domain-driven-development-in-golang

Partnerships have always been core to Grab's super app strategy. We believe in collaborating with partners who are the best at what they do – combining their expertise with what we're good at so that we can bring high-quality new services to our customers, while at the same time creating new opportunities for the merchant- and driver-partners in our ecosystem.

That's why we launched GrabPlatform last year: to make it easier for partners to either integrate Grab into their services, or integrate their services into Grab.

In view of that, part of the GrabPlatform team's mission is to make it easy for partners to integrate with Grab services. These partners are external companies that would like to offer Grab's services, such as ride-booking, through their own websites or applications. To do that, we decided to build a website that serves as a one-stop shop allowing them to self-serve these integrations.

The challenges we faced with the conventional approach

In the process of building this website, our team noticed that the majority of the functions and responsibilities were added to files without proper segregation. A single file would contain more than 500 lines of code. Each of these files was imported across different parts of the source code, resulting in an unstructured codebase. Any change to the existing functions risked breaking existing functionality; we realized then that we needed to proactively plan for the future. Hence, we decided to use the principles of Domain-Driven Design (DDD) and idiomatic Go. This blog aims to demonstrate how we leveraged those concepts to design a modern application.

How we implemented DDD in our codebase

Here’s how we went about solving our unstructured codebase using DDD principles.

Step 1: Gather domain (business) knowledge

We collaborated closely with our domain experts (in our case, this was our product team) to identify functionality and flow. From them, we discovered the following key points:

  • After creating a project, developers are added to the project.
  • The domain experts wanted an ability to add other products (e.g. Pricing service, ETA service, GrabPay service) to their projects.
  • They wanted the ability to create multiple authentication clients to access the above products.

Step 2: Break down domain knowledge into bounded context

Now that we had gathered the required domain knowledge (i.e. what our code needed to reflect to our partners), it was time to use the DDD strategic tool Bounded Context to break down problems into subcontexts. Here is a graphical representation of how we converted the problem into smaller units.

Bounded Context

We identified several dependencies on each of the units involved in the project. Take some of these examples:

  • The project domain overlapped with the product and developer domains.
  • Our RideBooking project can only exist if it has some products like Ridebooking APIs and not the other way around.

What this means is a product can exist independent of the project, but a project will have no significance without any product. In the same way, a project is dependent on the developers, but developers can exist whether or not they belong to a project.

Step 3: Identify value objects or entity (lowest layer)

Looking at the above bounded contexts, we figured out the building blocks (i.e. value objects or entities) to break down the above functionality and flow.

// ProjectDAO ...
type ProjectDAO struct {
  ID            int64
  UUID          string
  Status        ProjectStatus
  CreatedAt     time.Time
}

// DeveloperDAO ...
type DeveloperDAO struct {
  ID            int64
  UUID          string
  PhoneHash     *string
  Status        Status
  CreatedAt     time.Time
}

// ProductDAO ...
type ProductDAO struct {
  ID            int64
  UUID          string
  Name          string
  Description   *string
  Status        ProductStatus
  CreatedAt     time.Time
}

// DeveloperProjectDAO to map developer's to a project
type DeveloperProjectDAO struct {
  ID            int64
  DeveloperID   int64
  ProjectID     int64
  Status        DeveloperProjectStatus
}

// ProductProjectDAO to map product's to a project
type ProductProjectDAO struct {
  ID            int64
  ProjectID     int64
  ProductID     int64
  Status        ProjectProductStatus
}

All the objects shown above have ID as a field and are identifiable, hence they are entities and not value objects. But if we apply domain knowledge, DeveloperProjectDAO and ProductProjectDAO are actually not independent entities. The Project object is the aggregate root, since it must exist before its children, DeveloperProjectDAO and ProductProjectDAO, can exist.

Step 4: Create the repositories

As stated above, we created an interface to abstract the working logic of a particular domain (i.e. Repository). Here is an example of how we designed the repositories:

// ProductRepositoryImpl responsible for product functionality
type ProductRepositoryImpl struct {
  productDao storage.IProductDao // private field
}

type ProductRepository interface {
  GetProductsByIDs(ctx context.Context, ids []int64) ([]IProduct, error)
}

// DeveloperRepositoryImpl
type DeveloperRepositoryImpl struct {
  developerDAO storage.IDeveloperDao // private field
}

type DeveloperRepository interface {
  FindActiveAllowedByDeveloperIDs(ctx context.Context, developerIDs []interface{}) ([]*Developer, error)
  GetDeveloperDetailByProfile(ctx context.Context, developerProfile *appdto.DeveloperProfile) (IDeveloper, error)
}

Here is a look at how we designed our repository for aggregate root project:

// Unexported Struct
type productProjectRepositoryImpl struct {
  productProjectDAO storage.IProjectProductDao // private field
}

type ProductProjectRepository interface {
  GetAllProjectProductByProjectID(ctx context.Context, projectID int64) ([]*ProjectProduct, error)
}

// Unexported Struct
type developerProjectRepositoryImpl struct {
  developerProjectDAO storage.IDeveloperProjectDao // private field
}

type DeveloperProjectRepository interface {
  GetDevelopersByProjectIDs(ctx context.Context, projectIDs []interface{}) ([]*DeveloperProject, error)
  UpdateMappingWithRole(ctx context.Context, developer IDeveloper, project IProject, role string) (*DeveloperProject, error)
}

// Unexported Struct
type projectRepositoryImpl struct {
  projectDao storage.IProjectDao // private field
}

type ProjectRepository interface {
  GetProjectsByIDs(ctx context.Context, projectIDs []interface{}) ([]*Project, error)
  GetActiveProjectByUUID(ctx context.Context, uuid string) (IProject, error)
  GetProjectByUUID(ctx context.Context, uuid string) (*Project, error)
}

type ProjectAggregatorImpl struct {
  projectRepositoryImpl           // private field
  developerProjectRepositoryImpl  // private field
  productProjectRepositoryImpl    // private field
}

type ProjectAggregator interface {
  GetProjects(ctx context.Context) ([]*dto.Project, error)
  AddDeveloper(ctx context.Context, request *appdto.AddDeveloperRequest) (*appdto.AddDeveloperResponse, error)
  GetProjectWithProducts(ctx context.Context, uuid string) (IProject, error)
}

Step 5: Identify Domain Events

The functions described in Step 4 only return the IDs of the developer and product, which convey no information to the users. In order to provide developer and product information, we use the domain-event technique to return the actual product and developer attributes.

A domain event is something that happened in a bounded context that you want another bounded context of the domain to be aware of. For example, if there are new updates to the developer domain, it's important to convey these updates to the project domain. This propagation technique is termed a domain event. Domain events enable independence between different classes.

One way to implement it is seen here:

// file: project_aggregator.go
func (p *ProjectAggregatorImpl) GetProjects(ctx context.Context) ([]*dto.Project, error) {
  ....
  ....
  developers := p.EventHandler.Handle(DomainEvent.FindDeveloperByDeveloperIDs{DeveloperIDs})
  ....
}

// file: event_type.go
type FindDeveloperByDeveloperIDs struct{ developerIDs []interface{} }

// file: event_handler.go
func (e *EventHandler) Handle(event interface{}) interface{} {
  switch op := event.(type) {
      case FindDeveloperByDeveloperIDs:
            developers, _ := e.developerRepository.FindDeveloperByDeveloperIDs(op.developerIDs)
            return developers
      case ....
      ....
    }
}
Domain Event

Some common mistakes to avoid when implementing DDD in your codebase:

  • Not engaging with domain experts. Not interacting with domain experts is a common mistake when using DDD. Talking to domain experts to get an understanding of the problem domain from their perspective is at the core of DDD. Starting with schemas or data modelling instead of talking to domain experts may create code based on a relational model instead of one built around a domain model.
  • Ignoring the language of the domain experts. Creating a ubiquitous language shared with domain experts is also a core DDD practice. This common language must be used in all discussions as well as in the code, e.g. in class and method names.
  • Not identifying bounded contexts. A common approach to solving a complex problem is breaking it down into smaller parts. Creating bounded contexts is breaking down a large domain into smaller ones, each handling one cohesive part of the domain.
  • Using an anaemic domain model. This is a common sign that a team is not doing DDD and often a symptom of a failure in the modelling process. At first, an anaemic domain model often looks like a real domain model with correct names, but the classes lack functionalities. They contain only the Get and Set methods.

How the DDD model improved our software development

Thanks to this clean-up, we achieved the following:

  • Core functionalities are evenly distributed to the overall codebase and not limited to just a few files.
  • The developers are aware of what each folder is responsible for by simply looking at the file naming and folder structure.
  • The risk of breaking major functionalities by merely making small changes is greatly reduced. Changing a feature is now more efficient.

The team now finds the code well structured and we require less hand-holding for onboarders, thanks to the simplicity of the structure.

Finally, and most importantly, we now have a system oriented towards our business needs. Everyone ends up using the same language and terms. Developers communicate better with the business team. The work is more efficient when it comes to establishing solutions for models that reflect how the business operates, instead of how the software operates.

Lessons Learnt

  • Use DDD to collaborate among all project disciplines (product, business, partner, and so on) and clearly understand the business requirements.
  • Establish a ubiquitous language to discuss domain-related concepts.
  • Use bounded contexts to break down complex domains into manageable parts.
  • Implement a layered architecture (i.e. DDD building blocks) to focus on particular aspects of the application.
  • To simplify your dependencies, use domain events to communicate between bounded contexts.

Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse, or passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • keep fraudulent users away from our app or services,
  • ensure our customers' safety, and
  • manage user identities so that users can securely log in to the Grab app.

The TIS team's scope covers not just transport, but also our food, delivery and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are like boolean conditions that determine whether the result is true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them in our backend service code and deployed our service frequently to use the latest rules.

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. It was very lightweight to trigger the rule deployment and enforce the rules.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined (but paid off) more than twice in the last 3 months

THEN we would still not allow this booking

The system scans through the rules one by one, and if it determines that any rule is tripped, it checks the other rules as well. In the example above, if a credit card has been declined more than twice in the last 3 months, the passenger will not be allowed to book even though they have a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules, as shown in the example below. We first check if the device used by the passenger is a high-risk one, e.g. an emulator used for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined booking from the credit card), and then decide whether this booking should be precharged or not. If the passenger is using a low-risk device but is in a risky location where we traditionally see a lot of fraud bookings, we then run further checks on the passenger's booking history to decide if a precharge is also needed.
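
Hard-coded, that kind of flow quickly becomes a tangle of nested conditionals. Here is a hypothetical sketch in Python (helper names and thresholds are invented for illustration, not Grab's actual code):

# Hypothetical sketch of hard-coded rule logic -- not Grab's actual implementation.
def is_high_risk_device(device):
    return device.get("is_emulator", False)

def was_card_declined_today(card):
    return card.get("declined_today", False)

def has_good_booking_history(passenger):
    return passenger.get("finished_bookings", 0) > 6

def is_risky_location(location):
    return location in {"hotspot-a", "hotspot-b"}

def evaluate_booking(booking):
    if is_high_risk_device(booking["device"]):
        return "DECLINE"
    if was_card_declined_today(booking["card"]):
        if has_good_booking_history(booking["passenger"]):
            return "PRECHARGE"   # Rule 2: good history softens the decision
        return "DECLINE"         # Rule 1: otherwise the card cannot be used
    if is_risky_location(booking["pickup"]) and not has_good_booking_history(booking["passenger"]):
        return "PRECHARGE"
    return "ALLOW"

Every new rule means another branch in this kind of logic and another deployment, which is exactly the maintenance burden described here.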

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data, and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion based booking should be limited to 5 bookings and total finished bookings should be greater than 6, otherwise unallocate current ride)
  3. Developers implemented these rules and deployed the changes to production

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementation (e.g. for “X should be limited to 5”, should the implementation be X < 5 or  X <= 5?)

Once a new rule was deployed, we monitored its performance. For example:

  • How often does the rule fire (after minutes, hours, or daily)?
  • Is it over-firing?
  • Does it conflict with other rules?

Based on the implementation, each rule had dependencies on other rules. For example, if Rule 1 fired, we should not continue with Rule 2 and Rule 3.

As a result, we couldn't keep each rule evaluation independent. We had no way to observe the performance of a rule without other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger rate would drop significantly. This means we would have unstable performance metrics for Rules 2 and 3 even though their logic does not change. It is very hard for a rule owner to monitor the performance of Rules 2 and 3.

When it comes to A/B testing a new rule, data scientists need to put a lot of effort into cleaning up noise from other rules, but most of the time, it is mission impossible.

After several misfiring events (wrong implementations of rules) and ever-longer rule development times (a week or more), we realized: "No one can handle this manually."

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data: e.g. "what is the credit card risk score", or "how many food bookings has this user ordered in the last 7 days", and transforming this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules that can automate the review of rules without manual intervention.

Griffin now predicts billions of events every day, serving 100K+ queries per second (QPS) at peak time (on only 6 regular EC2 instances).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our own tool, we did some research on common business rule engines available in the market, such as Drools, and checked whether we should use them. In that process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas our major users are from a Python background).
  2. It offers limited expressive power (https://en.wikipedia.org/wiki/Expressive_power_(computer_science)).
  3. It has limited support for some common math functions (e.g. factorial or greatest common divisor).
  4. The nature of our business needs dynamic datasets for predictions (for example, a rule may need only passenger booking history on Day 1, but it may use passenger booking history, passenger credit balance, and passenger favorite places on Day 2). On the other hand, Drools usually works well with a static list of datasets rather than dynamic ones.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal, and save the rules in the database.
  2. Griffin Rule Engine reloads rules immediately as long as it detects any rule changes.
  3. Data Orchestrator sends all dataset (features) needed for a prediction (e.g. whether to block a ride based on passenger past ride pattern, credit card risk) to the Rule Engine
  4. Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input:JSON => Result:Boolean

We allow users (analysts, data scientists) to write Python-based rules on the WebUI to accommodate some very complicated rules like:

len(list(filter(lambda x: x > 7, map(lambda x: math.factorial(x), [1, 2, 3, 4, 5, 6])))) > 2

This significantly optimizes the expressive power of rules.
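
As a simplified sketch of that Input:JSON => Result:Boolean contract (illustrative only; Griffin's actual evaluator does far more sandboxing and validation, and the feature names here are invented):

import json
import math

# A rule is a Python expression evaluated against the feature payload.
rule_expression = "features['declined_count_7d'] > 2 and features['finished_bookings'] < 5"

def evaluate(rule_expr, payload_json):
    features = json.loads(payload_json)
    # The expression only sees the features dict and a whitelisted math module.
    scope = {"__builtins__": {}, "math": math}
    return bool(eval(rule_expr, scope, {"features": features}))

payload = json.dumps({"declined_count_7d": 3, "finished_bookings": 1})
print(evaluate(rule_expression, payload))  # True -> the rule's treatments apply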

To match and evaluate a rule more efficiently, we also have other key components associated:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification (the example below shows the treatments for one safety-related rule). The one below it is for a checkpoint to detect credit card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=[MY,PH] and verticals=[GrabBus, GrabCar].
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

 

When a rule is hit, users often want more than just treatments returned; they also want some dynamic values, e.g. the maximum ride distance allowed if we believe the booking is medium risk.
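
Putting scenarios, treatments, segments, and values together, a rule hit can be pictured as returning something like the following (a hypothetical shape; the field names are ours, not Griffin's):

from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical shape of a rule hit.
@dataclass
class RuleResult:
    rule_name: str
    checkpoint: str                  # e.g. "PaxPreRide"
    treatments: List[str]            # e.g. ["AuthCapture", "SendNotification"]
    values: Dict[str, float] = field(default_factory=dict)

result = RuleResult(
    rule_name="medium_risk_booking",
    checkpoint="PaxPreRide",
    treatments=["Hold"],
    values={"max_distance_km": 15.0},  # dynamic value returned alongside the treatment
)
print(result)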

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The graph below shows the P99 latency of prediction requests from the load balancer side (the real latency for each prediction is < 6ms; the metric peaks at 30ms because some batch requests contain 50 predictions in a single call).

Prediction Request Latency P99

What we did to achieve this:

  • The key idea is to do all computations in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from the database for each prediction. Instead, we keep a record called dirty_key, which holds the latest rule update timestamp. The rule engine actively checks this timestamp and triggers a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time (see the sketch after the process snapshot below).
  • Rule engine would not fetch any additional new data, instead, all data should be from Data Orchestrator.
  • So the whole prediction flow is only between CPU & memory (and if the data size is small, it could be on CPU cache only).
  • The Python GIL essentially limits a process to one active thread running at a time, no matter how many cores a CPU has. We wrap our service with Gunicorn, so on the production machine we have (2 x $num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers). The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.

The screenshot below is the process snapshot on a C5.large machine with 2 vCPUs. Note that only the green processes are active.

Process snapshot on C5.large machine
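
A minimal sketch of the dirty_key reload check mentioned above (class and field names are assumptions for illustration, not Griffin's actual code):

import time

class RuleStore:
    """Stand-in for the rules table plus the dirty_key timestamp."""
    def __init__(self):
        self.rules = ["rule-1", "rule-2"]
        self.dirty_key = time.time()

    def update_rule(self, rule):
        self.rules.append(rule)
        self.dirty_key = time.time()   # bumped on every rule change

class RuleEngine:
    def __init__(self, store):
        self.store = store
        self.loaded_rules = []
        self.last_reload_at = 0.0

    def rules(self):
        # Reload only when the dirty_key is newer than our last reload,
        # so the hot prediction path stays in CPU and memory.
        if self.store.dirty_key > self.last_reload_at:
            self.loaded_rules = list(self.store.rules)
            self.last_reload_at = time.time()
        return self.loaded_rules

store = RuleStore()
engine = RuleEngine(store)
print(engine.rules())
store.update_rule("rule-3")
print(engine.rules())   # picks up the change on the next check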

A lot of trial and error performance tuning:

  • We used to use python-jsonpath-rw for JSONPath queries, but the performance was not strong enough. We switched to jmespath and observed about a 10ms latency reduction.
  • We use SQLAlchemy for DB queries and ORM. We enabled caching for some use cases, but it turned out to be over-optimized and served stale data. We ended up turning off some caching points to ensure data consistency.
  • For new dict/list creation, we prefer native call (e.g. {}/[]) instead of function call (see the comparison below).
Native call and Function call
  • Use built-in functions (https://docs.python.org/3/library/functions.html). They are written in C; nothing beats them.
  • Add randomness to rule reload so that not all machines run at the same time causing latency spikes.
  • Caching atomic feature units as they are used so that we don't have to requery them each time a checkpoint uses them.
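
If you want to see the native-call difference for yourself, a quick timeit comparison works (results vary by machine and Python version, so no numbers are claimed here):

# Compare creating empty dicts/lists via native syntax vs. function calls.
import timeit

print("dict():", timeit.timeit("dict()", number=1_000_000))
print("{}    :", timeit.timeit("{}", number=1_000_000))
print("list():", timeit.timeit("list()", number=1_000_000))
print("[]    :", timeit.timeit("[]", number=1_000_000))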

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens a door for non-developers to make production changes in real time, which significantly boosts organisational productivity. In the past, a rule change needed a week for code change, testing, and deployment; now it takes just a minute.

But this also introduces extra risks. Anyone can turn the whole checkpoint down, whether unintentionally or maliciously.

Hence we implemented Shadow Mode and Percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify the performance without any production impact, and if needed, rollout of a rule can be from 1% all the way to 100%.

We implemented version control for every rule change, and in case anything unexpected happened, we could rollback to the previous version quickly.

Version control
Rollback button

We also built an RBAC-based permission system, along with a change approval flow, to make sure any production change involves at least two people (and the approver role has higher permissions).

Closing thoughts

Griffin evolved from a fraud-based rule engine to a generic rule engine that can apply to any rule at Grab. For example, Grab just launched appeal automation several days ago to cut by 50% the human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are very excited about it.

This was possible because, from the very beginning, we designed Griffin with minimal business context so that it could be generic enough.

After the launch, we observed an amazing adoption rate across various fraud, safety, and identity use cases. More interestingly, people now treat Griffin as an automation point for various integrations.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to an ACFE survey, companies lost USD 6.3 billion due to fraud, and organizations lose 5% of their revenue annually to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDK for engineering integration, and builds rules engines and consoles for fraud detection.

One example of fraudulent behavior is an individual who masquerades as both driver and passenger and makes cashless payments to get promotions, for example, a one dollar rebate on the next transaction. In our system, we analyze real-time booking and payment signals, compare them with the historical data of the driver and passenger pair, and create rules using the rule engine. We count the occurrences of a driver and passenger pair within a given time frame, and this counter is provided as an input to the rule. If the counter value exceeds a predefined threshold, the rule evaluates the transaction as fraud, and we send this verdict back to the booking service.
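As a rough illustration of that flow, here is a minimal Python sketch of a threshold rule backed by a pair counter. The client interface, counter ID, and threshold are assumptions for illustration, not the actual Trust Platform API.

```python
# Hypothetical counter client and threshold; the real rule is configured in the rule engine.
PAIR_BOOKING_THRESHOLD = 5

def is_fraudulent_pair(counter_client, driver_id: str, passenger_id: str) -> bool:
    # Counter value: cashless bookings for this driver-passenger pair in the last 24 hours.
    count = counter_client.query(
        counter_id="fraud:counter:cashless_pair_bookings",   # assumed counter identifier
        keys={"driver_id": driver_id, "passenger_id": passenger_id},
        window_hours=24,
    )
    # The rule evaluates to fraud when the counter exceeds the predefined threshold;
    # the verdict is then sent back to the booking service.
    return count > PAIR_BOOKING_THRESHOLD
```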

The conventional method

Fraud detection is a job that requires cross-functional teams like data scientists, data analysts, data engineers, and backend engineers to work together. Usually, data scientists or data analysts come up with an idea offline and then want to apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions between data scientists and data analysts. In the conventional method, the rule then needs to be communicated to engineers, who implement and deploy it.

Automated solution using the Counter service

To overcome the challenges in the conventional method, the Trust Platform team decided to build the Counter service, a self-service platform, which provides management tools for users and a computing engine for integrating with backend services. This service provides an interface, such as a UI-based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also decided to provide different data contracts, APIs, and SDKs to engineers so that the business verticals can use it quickly and easily.

The major engineering challenges faced in designing the Counter service

Millions of transactions happen at Grab every day, which means we need to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many cashless payments happened for a given driver and passenger pair. Given the scale of Grab’s business, the number of potential driver and passenger combinations is enormous. And this is only one use case; there could be hundreds of counters for different use cases. Hence it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, a single counter can potentially hold a huge number of driver and passenger combinations, so storing, reading, and querying counters in real time is a major challenge. With billions of counter keys accumulated over a long period of time, the Trust team had to find a scalable way to write and fetch keys efficiently while meeting the client’s SLA.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time data scientists need a new type of counter, developers need to manually make code changes, such as adding a new stream, capturing related data sets for the counter, and storing it on the fraud service, then doing a deployment to make the counter ready. It usually takes two or more weeks for the whole iteration, and if there are any changes from the data analysts’ side, which happens often, the loop starts again. The team needed to break this long loop of manual tasks by building a self-service interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what was written in the counters. That’s because the conventional counter data were stored in a Redis database to satisfy the query SLA. They could not track the correctness of a counter’s value or its history. With the new solution, stakeholders can get a real-time picture of what is stored in the counters using the data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data Consistency Challenge/Issue

Most machine learning workflows need dedicated input data. However, when an anti-fraud model is trained using offline data from the data lake, it is difficult to use the same model in real time, because the model lacks a data contract and consistency with the online data source. In this case, the Counter service acts as a data source by writing counter values to the file system.

ML featuring

Counters are important features for ML models. Imagine data scientists invent a new counter that they need to evaluate: a historical dataset is required for it to be useful. The Counter service provides a counter replay feature, which allows data scientists to simulate counters against historical payloads.

In general, the Counter service is a bridge between online and offline datasets, data scientists, and engineers. There was technical debt with regard to data consistency and automation in the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We designed the Counter service around the principles of asynchronous data ingestion and synchronous transactions.

The diagram shows how the counters are generated and saved to database.

How the counters are generated and saved to the database

Counter creation workflow

  1. The user opens the Counter Creation UI and creates a new key, “fraud:counter:counter_name”.
  2. The user configures the required fields.
  3. The Counter service detects the new counter, puts it into load script storage, and starts processing new counter events (see the counter write workflow below).

Counter write workflow

  1. The Counter service monitors multiple streams and assembles extra data from online data services (e.g. Common Data Service (CDS), the passenger service, the hydra service, etc.), so that a rich dataset is also available to editors on each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool.

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Grab-Stats runs on top of ScyllaDB, a distributed NoSQL data store. We chose ScyllaDB for its good in-memory aggregation performance on time series datasets. Compared with in-memory storage like AWS ElastiCache, it is roughly 10 times cheaper while being just as reliable. The p99 read latency from ScyllaDB is less than 150 ms, which satisfies our SLA.

How we improved the Counter service performance

We used the multi-buckets strategy to improve the Counter service performance.

Background

Queries come with different time windows. Some counters are time sensitive and need to know what happened in the last 30 or 60 minutes. Other counters focus on the long term and need to know about events over the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve both short-range and long-term queries well at the same time: the higher the accuracy required and the longer the time range, the more aggregation the database has to do. That means we either miss the SLA or block other processes and degrade the service.

Solution for improving the query

We resolved this problem by using tables of different granularities: we pre-aggregate the signals into time buckets of 15 minutes, 1 hour, and 1 day.

When a request comes in, its time range is divided across the buckets and the partial results are merged. For example, for a request covering 9/10 23:15:20 to 9/12 17:20:18, the handler queries 15-minute buckets for the partial hours, hourly buckets for the partial days, and daily buckets for the full days in between. This way we avoid heavy aggregations while keeping 15-minute accuracy and a scalable response time.
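The divide-and-merge idea can be sketched as follows. This is a simplified illustration, not the actual Counter service code, and it assumes request boundaries are already rounded to the 15-minute grid.

```python
from datetime import datetime, timedelta

# Greedily consume the largest pre-aggregated bucket that fits at the current cursor,
# so a query only ever touches pre-aggregated rows.
BUCKETS = [timedelta(days=1), timedelta(hours=1), timedelta(minutes=15)]

def split_into_buckets(start: datetime, end: datetime):
    plan, cursor = [], start
    while cursor < end:
        for size in BUCKETS:
            # Usable if the cursor sits on this bucket's boundary and the bucket fits fully.
            aligned = (cursor - datetime.min) % size == timedelta(0)
            if aligned and cursor + size <= end:
                plan.append((size, cursor, cursor + size))
                cursor += size
                break
        else:
            # Remainder smaller than 15 minutes: cover it with the finest bucket table.
            plan.append((BUCKETS[-1], cursor, end))
            cursor = end
    return plan

# Roughly the request from the text, rounded to the grid: 9/10 23:15 to 9/12 17:15.
for size, lo, hi in split_into_buckets(datetime(2019, 9, 10, 23, 15),
                                       datetime(2019, 9, 12, 17, 15)):
    print(size, lo, hi)
```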

Counter service UI

We allow data analysts and data scientists to onboard counters by themselves from a dedicated web portal. After a counter is submitted, the Counter service takes care of the integration and parses the configured logic at runtime.

Counter service UI

Backend integration

We provide an SDK for quicker and better integration. Engineers only need to supply the counter identifier (shown in the UI) and the time duration of the query. Under the hood we use gRPC to communicate across services, divide the query time window into smaller granularities, fetch from the different time series tables, and merge the results. We also provide a short-TTL cache layer to absorb uncommon client traffic such as network retries or throttled requests. The service is designed to handle 100K QPS.
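A rough client-side sketch of that behaviour might look like the following. The client interface is hypothetical, since the actual SDK and gRPC proto are internal to Grab.

```python
import time

_CACHE: dict = {}
_TTL_SECONDS = 5   # short TTL, just enough to absorb retries and throttled bursts

def get_counter(client, counter_id: str, start_ts: int, end_ts: int) -> int:
    """Return a counter value, serving repeated identical queries from a short-lived cache."""
    key = (counter_id, start_ts, end_ts)
    cached = _CACHE.get(key)
    if cached and time.time() - cached[1] < _TTL_SECONDS:
        return cached[0]
    # Under the hood the real SDK issues a gRPC call; the server splits the time window
    # across the 15-minute/hourly/daily tables and merges the partial results.
    value = client.query(counter_id, start_ts, end_ts)   # hypothetical call
    _CACHE[key] = (value, time.time())
    return value
```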

Monitoring the Counter service

The Counter service dashboard helps to track human errors while counters are being edited in real time. The Counter service sends alerts to a Slack channel to notify users of any errors.

Counter service dashboard

We set up Datadog to monitor multiple system metrics. The figure below shows a portion of stream processing and counter writing. In this example, the total stream QPS reaches 5k at peak hour, and about 4k counter writes per second reach the storage tier. These numbers will keep climbing as more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the resource payload of the Counter service to the offline data lake, in order to complete the counter replay function for data scientists. We are working on a project called “time traveler”. As the name indicates, it will not only serve online transactional data but also support historical data analytics, giving more flexibility for counter inventions and experiments.

We also plan more automation, such as adding a replay button to the web portal and hooking it up with the offline big data engine to trigger analytics jobs. Performance metrics will be collected and displayed on the web portal, so a single platform can manage both online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are sometimes an input to a particular rule, and one rule usually needs many counters to work together. We need to provide better integration with Griffin on the backend and minimize the engineering effort currently required to use counters in Griffin. A counter then becomes an automated input variable in Griffin, which any user can configure on the web portal.

Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2

Post Syndicated from Amanda Pinsker original https://github.blog/2019-10-02-get-started-easier-with-github-desktop-2-2/

Anyone who uses Git knows that it has a steep learning curve. We’ve learned from developers that most people tend to learn from a buddy, whether that’s a coworker, a professor, a friend, or even a YouTube video. In GitHub Desktop 2.2, we’re releasing the first version of an interactive Git and GitHub tutorial that can be your buddy and help you get started. If you’re new to Desktop, you can download and try out the tutorial at desktop.github.com.

Get set up

To get set up, we help you through two major pieces: creating a repository and connecting an editor. When you first open Desktop, a welcome page appears with a new option to “Create a Tutorial Repository”. Starting with this option creates a tutorial repository that guides you through the core concepts of working with Git using GitHub Desktop.

There are a lot of tools you need to get started with Git and GitHub. The most important of these is your code editor. In the first step of the tutorial, you’re prompted to install an editor if you don’t have one already.

Learn the GitHub flow

Next, we guide you through how to use GitHub Desktop to make changes to code locally and get your work on GitHub. You’ll create a new branch, make a change to a file, commit it, push it to GitHub, and open your first pull request.

We’ve also heard that new users initially experience confusion between Git, GitHub, and GitHub Desktop. We cover these differences in the tutorial and make sure to reinforce the explanations.

Keep going with your own project

In GitHub Desktop 1.6, we introduced suggested next steps based on the state of your repository. Now when you complete the tutorial, we similarly suggest next steps: exploring projects on GitHub that you might want to contribute to, creating a new project, or adding an existing project to Desktop. We always want GitHub Desktop to be the tool that makes your next steps clear, whether you’re in the flow of your work, or you’re a new developer just getting started.

What’s next?

With GitHub Desktop 2.2, we’re making the product our users love more approachable to newcomers. We’ll be iterating on the tutorial based on your feedback, and we’ll continue to build on the connection between GitHub and your local machine. If you want to start building something but don’t know how, think of GitHub Desktop as your buddy to help you get started.

Learn more about GitHub Desktop

The post Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2 appeared first on The GitHub Blog.

New workflow editor for GitHub Actions

Post Syndicated from Chris Patterson original https://github.blog/2019-10-01-new-workflow-editor-for-github-actions/

It’s now even easier to create and edit a GitHub Actions workflow with the updated editor. We’ve provided inline auto-complete and lint as you type so you can say goodbye to YAML indentation issues and explore the full workflow syntax without going to the docs.

Get through code faster with auto-complete

Auto-complete can be triggered with Ctrl+Space almost anywhere, and in some cases, automatically. It suggests keys or values depending on the current position of the cursor, and displays brief contextual documentation so you can discover and understand all the available options without losing focus. Auto-complete works even inside expressions.

Explore context options without losing focus

Additionally, the editor will insert a new line with the right indentation when you press enter. It will also either suggest the next key to insert or insert code snippets that you can navigate through using the tab key.

Snippets also work when using functions inside expressions, allowing you to easily write and navigate through the required arguments.

Minimize mistakes

There are many updates to help you write workflow files, but if you make a mistake, we’ve got you covered too.

The editor now highlights structural errors in your file, unexpected values, or even conflicting values, such as an invalid shell value for the chosen operating system.

Understand cron expressions quickly

Finally, the editor will also help you edit your scheduled jobs by describing the cron expressions you have used with natural language:

Tell us what you think

These are just a few features we’ve made to help you edit workflows with fewer errors. We’d love to hear any feedback you have—share your questions and comments with us in the GitHub Actions Community Forum.

Learn more about GitHub Actions

The post New workflow editor for GitHub Actions appeared first on The GitHub Blog.

Introducing the CodeSearchNet challenge

Post Syndicated from Janey Jack original https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/

Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, search engines for code are often frustrating and never fully understand what we want, unlike regular web search engines. We started using modern machine learning techniques to improve code search but quickly realized that we were unable to measure our progress. Unlike natural language processing, which has the GLUE benchmark, there is no standard dataset suitable for code search evaluation.

With our partners from Weights & Biases, today we’re announcing the CodeSearchNet Challenge evaluation environment and leaderboard. We’re also releasing a large dataset to help data scientists build models for this task, as well as several baseline models showing the current state of the art. Our leaderboard uses an annotated dataset of queries to evaluate the quality of code search tools.

Learn more from our technical report 

The CodeSearchNet Corpus and models

We collected a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. We used our TreeSitter infrastructure for this effort, and we’re also releasing our data preprocessing pipeline for others to use as a starting point in applying machine learning to code. While this data is not directly related to code search, its pairing of code with related natural language description is suitable to train models for this task. Its substantial size also makes it possible to apply high-capacity models based on modern Transformer architectures.

Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, including:

  • Six million methods overall
  • Two million of which have associated documentation (docstrings, JavaDoc, and more)
  • Metadata that indicates the original location (repository or line number, for example) where the data was found

Building on our earlier efforts in semantic code search, we’re also releasing a collection of baseline models leveraging modern techniques in learning from sequences (including a BERT-like self-attentional model) to help data scientists get started on code search. 

The CodeSearchNet Challenge

To evaluate code search models, we collected an initial set of code search queries and had programmers annotate the relevance of potential results. We started by collecting common search queries from Bing that had high click-through rates to code and combined these with queries from StaQC, yielding 99 queries for concepts related to code (i.e., we removed everything that was just an API documentation lookup).

We then used a standard Elasticsearch installation and our baseline models to obtain 10 likely results per query from our CodeSearchNet Corpus. Finally, we asked programmers, data scientists, and machine learning researchers to annotate the proposed results for relevance to the query on a scale from zero (“totally irrelevant”) to three (“exact match”). See our technical report for an in-depth explanation of the annotation process and data.

We want to expand our evaluation dataset to include more languages, queries, and annotations in the future. As we continue adding more over the next few months, we aim to include an extended dataset for the next version of CodeSearchNet Challenge in the future.

Other use cases

We anticipate other use cases for this dataset beyond code search and are presenting code search as one possible task that leverages learned representations of natural language and code. We’re excited to see what the community builds next.

Special thanks

The CodeSearchNet Challenge would not be possible without the Microsoft Research Team and core contributors from GitHub, including Marc Brockschmidt, Miltos Allamanis, Ho-Hsiang Wu, Hamel Husain, and Tiferet Gazit.

We’re also thankful for all of the contributors from the community who helped put this project together:

@nbardy, @raubitsj, @staceysv, @cvphelps, @tejaskannan, @s-zanella, @AntonioND, @goutham7r, @campoy, @cal58, @febuiles, @letmaik, @sebastiandziadzio, @panthap2, @CoderPat.


Learn more about the CodeSearchNet Challenge

The post Introducing the CodeSearchNet challenge appeared first on The GitHub Blog.

About being a Principal Engineer at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/about-being-a-principal-engineer-at-grab

Over the past few years Grab has grown from a small startup to one of the largest technology companies in South-East Asia. Along with the company’s growth, the number of microservices, features and teams also grew substantially. At the time of writing this blog, we have around 350 microservices powering our super-app.

A great engineering team is a critical component of our success. As an engineer you have two career paths in front of you: an individual contributor role, or a management role. While a management role is generally better understood, this article clarifies what it means to be a principal engineer at Grab, which is one of the highest levels of our engineering career ladder.

Engineering Career Ladder

Improving the Quality

“You set the standard for engineering excellence in your technical family. Your architectures are exemplary in terms of efficiency, stability, extensibility, testability and the ability to evolve over time. Your software is robust in the presence of failures, scalable, and cost-effective. Your coding practices are exemplary in terms of code organization, clarity, simplicity, error handling, and documentation. You tackle intrinsically hard problems, acquiring expertise as needed. You decompose complex problems into straightforward solutions.” – Grab’s Engineering Career Ladder

 

So, what does a principal engineer do? As your career progresses from junior to senior to lead engineer, you take on more and more responsibility and manage larger and larger systems. For example, a junior engineer might manage a specific component of a microservice, a senior engineer would be tasked with designing and operating an entire microservice or product, while a lead engineer would typically be concerned with architecture at the team level.

The principal engineer level is akin to a senior manager: instead of indirectly managing people (as a manager of managers), you take care of the architecture of an entire sub-organisation, known as a Tech Family or Platform. These Tech Families usually have more than 50 engineers spread across multiple teams and function like a tiny company with their own business owners, designers, product managers, and so on.

Challenging Projects

“You take engineering ownership of projects that may require the work of several teams to implement; you divide responsibilities so that each team can work independently and have the system come together into an integrated whole. Your projects often cross team, tech family, platform, and even R&D center boundaries. You solicit differing views and keep an open mind. You are adept at building consensus.” – Grab’s Engineering Career Ladder

 

As a principal engineer, your job is to solve larger problems and translate somewhat vague problems into a set of actionable items. You might be faced with a large problem such as “improve efficiency and interoperability of Grab’s transportation system.” You will need to understand the problem and its business impact, and see how it can be improved. It might require you to design new systems, change existing systems, understand the costs involved, and get the right people together to make it happen.

Solving such a problem all by yourself is pretty much impossible. You have to work with other managers and engineers together as a team to make it happen. Help your lead and senior engineers design the right system by giving them a clear objective, but let them take care of the system-level architecture.

You will also need to work with managers, advise them to get things done, and get the right things prioritised by the team. While you don’t need to be well-versed in project management and agile methodologies, you do need to be able to plan ahead with your teams and have an understanding of how much time a project or migration will take.

A Tech Family can easily have 20 or more micro-services. You need to have a good understanding of their functional requirements and interactions. This is challenging as learning new things is always “uncomfortable” and takes time. You must reach out to engineers, product managers, and data scientists, ideally face-to-face to build empathy. Keep asking questions and try to understand how things work. You will also need to read the existing documentation and their code.

Technical Ownership

“You are the origin of significant technical contributions to our architecture and infrastructure.  You take technical ownership of the design and quality of the security, performance, availability, and operational aspects of the software built by one or more teams. You identify where your time is needed, transitioning between coding, design, and architecture based on project and team needs. You deliver software in ways that empower teams to self-service, providing clear adoption/migration paths.” – Grab’s Engineering Career Ladder

 

As a principal engineer you work together with the Head of Engineering and managers within the Tech Family and improve the quality of systems across the board. Typically, no-one tells you what needs to be done. You need to identify gaps, raise them and keep improving the systems.

You also need to learn how to manage your own time better so you can prioritise effectively. This boils down to knowing your strengths and weaknesses. For example, if you are really good at building distributed systems but have no clue about the latest design thinking in information security, get the right InfoSec engineers into that meeting and consider skipping it yourself. Avoid trying to do everything at once and being in every single meeting you get invited to – you still have to review code and designs and stay focused, so plan accordingly.

You will also need to understand the business impact of your decisions. For example, if you contribute to product features, know how impactful this feature is going to be to the organisation. If you don’t know it – ask the Product Manager responsible for it. If you work on a platform feature, for example improving the build system, know how it will help: saving 30 minutes of build time for every engineer each day is a huge achievement.

More often than not, you will have to drive migrations. This is akin to code refactoring but at a system level, and it involves a lot of collaboration with people. Understand what technical debt is and how it can be mitigated – a good architecture minimises technical debt, which in turn accelerates time-to-market and helps the business flourish.

Technical Leadership

“You amplify your impact by leading design reviews for complex software and/or critical features. You probe assumptions, illuminate pitfalls, and foster shared understanding. You align teams toward coherent architectural strategies.” – Grab’s Engineering Career Ladder

 

In Grab we have a process known as RFC (Request For Comments) which allows engineers to submit designs and ideas for a larger audience to debate. This is especially important given that our organisation is spread across several continents, with research and development offices in Southeast Asia, the US, India and China. While any engineer is welcome to comment on these RFCs, it is the duty of lead and principal engineers to review them on a regular basis. This will help you expand your knowledge of existing systems and help others improve their designs.

Communication is a key skill that you need to keep improving, and it is often the Achilles’ heel of many engineers who would rather work in their corner without talking to anyone else. This is perfectly fine for a junior (or even some senior engineers), but it is critical for a principal engineer to communicate. Let’s break this down into a set of specific skills you need to sharpen.

You need to be able to write effectively in order to convey your ideas to others. This includes knowing your audience and wording it in such a way that readers can understand. A technical design document whose audience are engineers is not written the same way as a design proposal whose audience are product and business managers.

You need to be able to publicly present and talk about various projects that you are working on. This includes creation of slide decks with good visuals and distilling down months of work to just a couple of slides. The best way of learning this is to get out there and keep presenting your work – you will get better over time.

You also need to be able to drive meetings and discussions without wasting anyone’s time. As a technical leader, one of your key responsibilities is to get people moving in the same direction and driving consensus during meetings.

Teaching and Learning

“You educate other engineers, both at an individual level and at scale: keeping the engineering community up to date on advanced technical issues, technologies, and trends. Examples include onboarding bootcamps for new hires, interns, specific skill-gap training development, and sharing specialized knowledge to raise the technical bar for other engineers/teams/dev centers.”

 

A principal engineer is a technical leader, and as a leader you have a responsibility to mentor and coach fellow engineers, regardless of their level. In addition to code reviews, you can organise office hours in your team and knowledge-sharing sessions where everyone can present something. You can also help with bootcamps and getting new hires up to speed.

Most importantly, you will also need to keep learning in whichever way works for you – reading journals, papers, and blog posts, watching video-recorded talks, attending conferences and browsing through a variety of open-source projects. You will also learn from other Grabbers; even a junior engineer can teach you something, as we all have our strengths and weaknesses. Keep improving and working on yourself!

Track your work easily with the latest changes to project boards

Post Syndicated from Lauren Brose original https://github.blog/2019-09-25-project-board-improvements/

We want to make it easy for you to track your work with GitHub Projects, but we noticed a few issues getting in the way of a seamless experience. Project automation helps keep your issues in sync with your development process, but it disrupted column prioritization by placing moved issues at the top of the list. The issue sidebar showed where an item was in your project board flow, but required you to navigate away to make any changes.

Issue sidebar updates

The projects section of the issue sidebar was updated to better convey related project information. Closed projects are now collapsed in the sidebar by default so you can focus on the most relevant information. Additionally, you can change the issue’s project column directly from the issue page without needing to navigate away.

Automation placement

Now, any issues moved via automation or the inline sidebar will move to the bottom of the destination column. However, any “Done” automation triggers will continue to move issues to the top of the column so that your team can always see what was recently finished.

Let us know what you think

We hope this change makes your workflow smoother, but we can always make changes to improve. Let us know how the issue sidebar changes are working for you, and if there’s anything we can do to help.

Learn more about project boards

The post Track your work easily with the latest changes to project boards appeared first on The GitHub Blog.

Accelerating the GitHub Sponsors beta with Stripe Connect

Post Syndicated from Katie Delfin original https://github.blog/2019-09-10-accelerating-the-github-sponsors-beta/

Update: GitHub Sponsors is available in every country where GitHub does business, not just the 30 countries supported by Stripe Connect. We’ll continue to use our existing manual payout system for anyone outside of Connect’s list of currently supported countries.


Back in May, we announced GitHub Sponsors, a new way to support the developers who build and maintain the open source software you use every day. We launched in beta as early as possible so we could work closely with the community to address feedback and understand their needs before adding more developers to the program. Today, we’re taking the next step in accelerating the availability of GitHub Sponsors through a new streamlined onboarding and payment experience with Stripe Connect.

The early days of manual payments

Within hours of announcing GitHub Sponsors, thousands of people signed up for the waitlist from all over the world. It was awesome to see so much excitement for the program, but we knew this meant we had a lot of work ahead of us. 

We started the beta with a small group of sponsored developers, and every few weeks, we invited more into the program. Everything was a manual process—setting up account information, verifying identity, waiting for approval, running reports, and processing payouts across multiple departments. Due to the manual nature of the process, we were limited to a small number of people in the program until we could automate and streamline our operations.

Onboarding more developers, faster

At GitHub, we do business with companies of all sizes, from Fortune 50 companies to early stage startups from all over the world. Collecting revenue globally is something we’ve done for years, but with GitHub Sponsors, it’s not just about receiving money—it’s also about paying it out. This means providing a seamless and secure experience for sponsored developers to verify their identities, enter banking information, receive funds, and manage payouts. We also know that, for many maintainers, this will be the first time they’ve been financially rewarded for their contributions to open source. We want to ensure the process is easy for developers to complete, so they can spend more time doing what they love, like contributing to open source.

Stripe Connect Express offers a simple, low overhead onboarding experience that complies with web accessibility guidelines, coupled with the ability to localize the experience to simplify payouts in countries with unique tax laws and compliance regulations. And by not building our own onboarding solution, we’re able to save months of engineering and maintenance time, and start onboarding more sponsored developers today.

Localization

One of Stripe Connect’s features that’s helped us deliver a great experience to our users is their localization support. Stripe just released international support for 30 countries (with more to come) for Connect Express. With this expanded support, Stripe takes care of adjusting for localized rules and regulations for supported countries—no small feat. We couldn’t be more thrilled to be one of the first to offer support across all 30 countries to serve our global community.

Our maintainer onboarding process can now start to keep up with the growing demand of the program. With Stripe, users can onboard using Connect Express, reducing onboarding time from a week-long process to under five minutes.  

And there’s much more to come. While we’re just under four months into the program, we’re focused on making GitHub Sponsors available to even more developers around the world.

Sign up to join the beta—you’ll hear from us soon.

Learn more about GitHub Sponsors

The post Accelerating the GitHub Sponsors beta with Stripe Connect appeared first on The GitHub Blog.

Running GitHub on Rails 6.0

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2019-09-09-running-github-on-rails-6-0/

On August 26, 2019, the GitHub application was deployed to production with 100 percent of traffic on the newest Rails version: 6.0. This change came just 1.5 weeks after the final release of Rails 6.0. Rails upgrades aren’t always something companies announce, but looking back at GitHub’s history of being on a custom fork of Rails 3.2, this upgrade is a big deal. It represents how far we’ve come in the last few years, and the hard work and dedication from our upgrade team made it smoother, easier, and faster than any of our previous upgrades.

At GitHub, we have a lot to celebrate with the release of Rails 6.0 and the subsequent production deploy. First, we were more involved in this release than we have been in any previous release of Rails. GitHub engineers sent over 100 pull requests to Rails 6.0 to improve documentation, fix bugs, add features, and speed up performance. For many GitHub contributors, this was the first time sending changes to the Rails framework, demonstrating that upgrading Rails not only helps GitHub internally, but also improves our developer community as well.

Second, we deployed Rails 6.0 to production without any negative impact to customers—we had only one Rails 6.0 exception occur during testing, and it was hit by a bot! We were able to achieve this level of stability for the upgrade because we were heavily involved with its development. As soon as we finished the Rails 5.2 upgrade last year, we started upgrading our application to Rails 6.0.

Instead of waiting for the final release, we’d upgrade every week by pulling in the latest changes from Rails master and running all of our tests against that new version. This allowed us to find regressions quickly and early—often finding regressions in Rails master just hours after they were introduced. Upgrading weekly made it easy to find where these regressions were introduced since we were bisecting Rails with only a week’s worth of commits instead of more than a year of commits. Once our build for Rails 6.0 was green, we’d merge the pull request to master, and all new code that went into GitHub would need to pass in Rails 5.2 and the newest master build of Rails. Upgrading every week worked so well that we’ll continue using this process for upgrading from 6.0 to 6.1.

In addition to ensuring that Rails 6.0 was stable, we also contributed to the new features of the framework like parallel testing and multiple databases. The code for these tools is used in our production application every day—it’s well-tested, GitHub-approved code in a public, open source framework. By upstreaming this tooling, we’re able to reduce complexity in our code base and set a standard where so many companies once had to implement this functionality on their own.

There are so many wins to staying upgraded that go beyond more security, faster performance, and new features. By staying current with Rails master, we’re influencing the future of the framework to meet our needs and giving back to the open source community in big ways. This process means that the GitHub code base evolves alongside Rails instead of in response to Rails. Investing in our application by staying up to date with the Rails framework has had a tremendous positive effect on our code base and engineering teams. Staying current allows us to invest in our community, invest in our tools for the long term, and improve the experience of working with the GitHub code base for our engineers.

Keep a look out for all the great stuff we’ll be contributing to Rails 6.1 and beyond—this is just the beginning.

The post Running GitHub on Rails 6.0 appeared first on The GitHub Blog.

Data first, SLA always

Post Syndicated from Grab Tech original https://engineering.grab.com/data-first-sla-always

Introducing Trailblazer, the Data Engineering team’s solution for implementing change data capture (CDC) across all upstream databases. In this article, we explain why we needed to move away from periodic batch ingestion towards a real-time solution and show how we achieved this with an end-to-end streaming pipeline.

Context

Our mission as Grab’s Data Engineering team is to fulfill 100% of SLAs for data availability to our downstream users. Our 40-person team is responsible for providing accurate and reliable data to data analysts and data scientists so that they can produce actionable reports that will help Grab’s leadership team make data-driven decisions. We maintain data for a variety of business intelligence tools such as Tableau, Presto and Holistics as well as predictive algorithms for all of Grab.

We ingest data from multiple upstream sources, such as relational databases, Kafka, or third party applications such as Salesforce or Zendesk. The majority of this source data exists in MySQL, and we run ETL pipelines to mirror any updates into our data lake. These pipelines are triggered on an hourly or daily basis and are powered by an in-house Loader application which performs Spark batch ingestion and loading of data from source to sink.

Problems with the Loader application started to surface when Grab’s data exceeded the petabyte threshold. As such, for larger tables, the most practical method to ingest data was to perform ETL only on rows that were updated within a specified timeframe. This is akin to issuing the query

SELECT * FROM table WHERE updated >= [start_time] AND updated < [end_time]

Now imagine two situations. One, firing this query at a huge table without an updated field. Two, firing the same query at the huge table, this time without indexes on the updated field. In the first scenario, the query will never work and we can never perform incremental ingestion on the table based on a timed window. The second scenario risks creating high CPU load on the database replica we are querying. Neither has an ideal outcome.

One other problem that we identified was the unpredictability of growth in data volume. Tables smaller than one gigabyte were ingested by fully scanning the table and overwriting the data in the data lake. This worked out well for us until the table size increased exponentially, at which point our Spark jobs failed due to JDBC timeouts. If we were only dealing with a handful of tables, this issue could have been addressed by switching our data ingestion strategy from full scan to a timed window.

When assessing the issue, we discovered that there were hundreds of tables running under the full scan strategy, all of them potentially crashing our data system, all time bombs silently waiting to explode.

The team urgently needed a new approach to ETL. Our Loader application was highly coupled to upstream table characteristics. We needed to find solutions that were truly scalable, which meant decoupling our pipelines from the upstream.

Change data capture (CDC)

Much like event sourcing, any logged change to the database is captured and streamed out for downstream applications to consume. This process is lightweight since any row-level update to the table is instantly captured by a real-time processor, avoiding the need for large chunked queries on the table. In addition, CDC works regardless of the upstream table definition, so we do not need to worry about a missing updated column impacting our data migration process.

Binary Logs (binlogs) are the CDC agents of MySQL. All updates, insertions or deletions performed on the table are captured as a series of logged events containing the past state of the row and its newly modified state. Check out the binlogs reference to find out more.

In order to persist all binlogs generated upstream, our team created a Spark Structured Streaming application called Trailblazer. Trailblazer streams all MySQL binlogs to our data lake. These binlogs serve as a foundation for us to build Presto tables for data auditing and help to remove the direct dependency of our batch ETL jobs to the source MySQL.

Trailblazer is an amalgamation of various data streaming stacks. Binlogs are captured by Debezium, which runs on Kafka Connect clusters. All binlogs are sent to our Kafka cluster, which is managed by the Data Engineering Infrastructure team, and are streamed out to a real-time bucket via a Spark Structured Streaming application. Hourly or daily ETL compaction jobs ingest the change logs from the real-time bucket to materialize tables for downstream users to consume.

CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services

 

Some statistics

To date, we are streaming hundreds of tables across 60 Spark streaming jobs, and with the constant increase in Grab’s database instances, the numbers are expected to keep growing.

Designing Trailblazer streams

We built our streaming application using Spark structured streaming 2.3. Structured streaming was designed to remove the technical aspects of provisioning streams. Developers can focus on perfecting business logic without worrying about fundamentals such as checkpoint management or reading and writing to data sources.

Key architecture for Trailblazer streaming

 

In the design phase, we made sure to follow several key principles that helped in managing our streams.

Checkpoints have to be externally managed

Structured streaming manages checkpoints both in a local directory and in a ‘_metadata’ directory on S3 buckets, such that the state of the stream can be restored in the event of failure and restart.

This is all well and good, with two exceptions. First, changing the starting point of data ingestion meant ssh-ing into the machine and manipulating metadata, which could be extremely dangerous. Second, we could not assume a cluster would live forever, since clusters can die and be recreated with data erased from their local disks or the distributed file system.

Our solution was a workaround at the application level. All checkpoints are stored in temporary directories with the current timestamp appended to the path (e.g. /tmp/checkpoint/job_A/1560697200/… ). A monotonically increasing timestamp guarantees that the same directory will never be reused by new instances of the stream. This is why we never restore the stream’s state from local disk; instead, we store all checkpoints in a highly available Redis cluster, with the Kafka topic as the key and a JSON of partition : offset as the value.

Key

debz-schema-A.schema_A.table_B

Value

{"11":19183566,"12":19295602,"13":18992606[[a]](#cmnt1)[[b]](#cmnt2)[[c]](#cmnt3)[[d]](#cmnt4)[[e]](#cmnt5)[[f]](#cmnt6),"14":19269499,"15":19197199,"16":19060873,"17":19237853,"18":19107959,"19":19188181,"0":19193976,"1":19072585,"2":19205764,"3":19122454,"4":19231068,"5":19301523,"6":19287447,"7":19418871,"8":19152003,"9":19112431,"10":19151479}
Example of how offsets are stored in Redis as Key : Value pairs

 

Fortunately, Structured Streaming provides the StreamingQueryListener class, which we can use to record checkpoints after the completion of each microbatch.
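To make the restore path concrete, here is a minimal PySpark sketch that reads the saved partition : offset JSON from Redis and hands it to the Kafka source as startingOffsets. The Redis key and topic follow the example above; the broker and Redis addresses are assumptions, and the write-back side (via the query listener) is omitted.

```python
import json
import redis
from pyspark.sql import SparkSession

REDIS_KEY = "debz-schema-A.schema_A.table_B"   # key convention from the example above
TOPIC = "debz-schema-A.schema_A.table_B"

r = redis.Redis(host="checkpoint-redis", port=6379)   # assumed Redis endpoint
spark = SparkSession.builder.appName("trailblazer-sketch").getOrCreate()

# Restore the last committed offsets saved by the previous instance of the stream.
saved = r.get(REDIS_KEY)
starting_offsets = (
    json.dumps({TOPIC: {p: int(o) for p, o in json.loads(saved).items()}})
    if saved else "earliest"
)

binlogs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
    .option("subscribe", TOPIC)
    .option("startingOffsets", starting_offsets)       # resume exactly where we left off
    .load()
)
```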

Streams must handle 0, 1, or 1 million records

Scalability is at the heart of all well-designed applications. Spark streaming jobs are built for scalability in the face of varying data volumes.

In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing.

 

There are a few settings we can configure to influence the degree of scalability of a streaming app; a minimal sketch of how they fit together follows the list.

  • spark.dynamicAllocation.enabled=true gives Spark autonomy to provision/revoke executors to suit the workload
  • spark.dynamicAllocation.maxExecutors controls the maximum job parallelism
  • maxOffsetsPerTrigger controls the maximum number of messages ingested from Kafka per microbatch
  • trigger controls the duration between microbatches and is a property of the DataStreamWriter class
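Here is a minimal PySpark sketch of how these knobs are typically wired together; the values, topic, and paths are illustrative, not our production settings.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("trailblazer-stream")
    .config("spark.dynamicAllocation.enabled", "true")    # let Spark provision/revoke executors
    .config("spark.dynamicAllocation.maxExecutors", "20") # cap job parallelism
    .getOrCreate()
)

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")      # assumed broker address
    .option("subscribe", "debz-schema-A.schema_A.table_B")
    .option("maxOffsetsPerTrigger", 100000)               # cap messages per microbatch
    .load()
)

query = (
    stream.writeStream.format("parquet")
    .option("path", "s3a://realtime-bucket/table_B/")     # assumed real-time bucket path
    .option("checkpointLocation", "/tmp/checkpoint/job_A/1560697200/")
    .trigger(processingTime="30 seconds")                 # duration between microbatches
    .start()
)
```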

Data as key health indicator

Scaling the number of streaming jobs without prior collection of performance metrics is a bad idea. There is a high chance that you will discover a dead stream when checking your stream hours after initialization. I’ll cite Murphy’s law as proof.

Thus we vigilantly monitor our data streams. We use tools such as Datadog for metric monitoring, Slack for oncall issue reporting, PagerDuty for urgent cases, and our in-house data auditor as a service (DASH) for reporting count discrepancies between streamed and source data. More details on monitoring are discussed later in this post.

Streams are ephemeral

Streams may die for a hundred and one reasons, so don’t blame yourself or your programming insecurities. Issues with upstream dependencies, such as a node in your Kafka cluster running out of disk space, can lead to partition unavailability and crash the application. On one occasion, our streaming application was unable to resolve DNS when writing to AWS S3 storage, which caused multiple failures within our Spark job and eventually terminated the stream.

In such cases, allow the stream to shut down gracefully, send out alerts, and have a mechanism in place to retry the failed stream. We run all streaming jobs on Airflow, so any stream failure is automatically retried through a new task issued by the scheduler.

If you have had experience with large scale management of streams, please leave a comment so we can continue this discussion!

Monitoring data streams

Here are some key features that were set up to monitor our streams.

Running : Active jobs ratio

The number of streaming jobs will keep increasing, making it a challenge for the oncall team to track all the jobs that are supposed to be up and running.

One proposal is to track the number of jobs in production against the number of jobs that are actually running. By querying MySQL tables, we can filter out all the jobs that are meant to be active. Since Trailblazer streams are spark-submit jobs managed by YARN, we can query YARN’s ResourceManager REST API to retrieve all the jobs that are running. We then construct a ratio of running : active jobs and report it to Datadog. If the ratio is not 1 for an extended duration, an alert is issued for the oncall to take action.

If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert
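A simplified Python sketch of this check might look like the following; the YARN ResourceManager endpoint is standard, while the set of active job names is assumed to come from the MySQL query described above.

```python
import requests

def running_active_ratio(rm_host: str, active_job_names: set) -> float:
    """Ratio of Trailblazer jobs actually RUNNING on YARN to jobs that should be active."""
    resp = requests.get(
        f"http://{rm_host}:8088/ws/v1/cluster/apps", params={"states": "RUNNING"}
    )
    apps = (resp.json().get("apps") or {}).get("app", [])
    running = {app["name"] for app in apps}
    if not active_job_names:
        return 1.0
    return len(running & active_job_names) / len(active_job_names)

# The resulting ratio is reported to Datadog; an alert fires if it stays below 1 for too long.
```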

 

Microbatch runtime

We define a 30-second window for each microbatch and track the actual runtime using metrics reported by the query listener. A runtime that exceeds the designated window is a potential indicator that the streaming job is deprived of resources and needs to be scaled up.

Job liveliness

Each job reports its health by emitting a heartbeat count of 1 at the end of every microbatch via the query listener. This is useful for detecting stale jobs (jobs registered as RUNNING in YARN but actually hung).

Kafka offset divergence

In order to ensure that the rate of consumption keeps up with the rate of message production, we sum up all currently ingested topic-partition offsets and compare that value to the sum of all topic-partition end offsets in Kafka. We then add alerting logic on top of these metrics to inform the oncall team if the difference between the two values grows too large.

It is important to track the offset divergence because streams can lag. Should the rate of consumption fall below the rate of message production, we would risk falling behind Kafka’s retention window, leading to data loss.
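A sketch of this comparison using the kafka-python client (our choice for illustration; the actual check uses the offsets already tracked by the streaming jobs):

```python
from kafka import KafkaConsumer, TopicPartition

def offset_divergence(bootstrap_servers: str, topic: str, consumed_offsets: dict) -> int:
    """Difference between Kafka end offsets and the offsets our stream has ingested."""
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)
    total_end = sum(end_offsets.values())
    total_consumed = sum(int(o) for o in consumed_offsets.values())  # e.g. the JSON kept in Redis
    return total_end - total_consumed   # alert the oncall team if this keeps growing
```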

Hourly data checks

DASH runs hourly and serves as our first line of defence to detect any data quality issues within the streams. We issue queries to the source database and our streaming layer to confirm that the ID counts of data created within the last hour match.

DASH helps in the early detection of upstream issues. We have noticed cases where our Debezium connectors failed and our checker reported less data than expected since there were no incoming messages to Kafka.

DASH matches and mismatches reported to Slack

 

Materializing tables through compaction

Having CDC data in our data lake does not conclude our responsibilities. Batched compaction applies all captured CDC events and makes them available as Presto tables for downstream consumption. The job triggers hourly and processes all changes to the database within the past hour. Changes to a record are visible in real time, but the latest state of the record is not reflected in the materialized table until the next batch job runs. We addressed several issues during this phase.

Deduplication of data

Trailblazer was not built to deliver exactly-once guarantees, so we ensure that duplicated CDC events are dealt with during compaction.

Availability of all data until certain hour

We want to make sure that downstream pipelines use the output data of the hourly batch job only when the pipeline has all records for that hour. If an event is processed late by streaming, the pipeline waits until the data is complete. Here, we consciously choose consistency over availability for our downstream users. For example, missing a few booking insert records at peak hours due to consumer processing delay can generate wrong downstream results and miscalculated revenue. We start downstream processes only when the data for the hour or day is complete.

Need for latest state of each event

Our compaction job performs upserts on the data to ensure that our downstream users always consume records in their latest state.
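A simplified PySpark sketch of the dedup-and-latest-state step is shown below. The column names (primary_key, binlog_ts, binlog_pos) and paths are assumptions about the change-log schema, not the actual compaction job.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

# Hourly batch of change logs landed in the real-time bucket (assumed path).
cdc_events = spark.read.parquet("s3a://realtime-bucket/table_B/")

w = Window.partitionBy("primary_key").orderBy(
    F.col("binlog_ts").desc(), F.col("binlog_pos").desc()
)

latest_state = (
    cdc_events
    .dropDuplicates(["primary_key", "binlog_ts", "binlog_pos"])  # drop duplicated CDC events
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)                                    # keep only the latest state
    .drop("rn")
)

latest_state.write.mode("overwrite").parquet("s3a://datalake/table_B/")  # assumed sink
```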

Future applications

Trailblazer is a milestone for the Data Engineering team as it represents our commitment to large-scale data streaming that reduces latency for our end users. Moving ahead, our team will explore how to further optimize streaming jobs by analysing data trends over time, and will build applications such as snapshot tables on top of the CDC data streamed into our data lake.

How We Built A Logging Stack at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/how-built-logging-stack

And Solved Our Inhouse Logging Problem

Problem:

Let me take you back a year at Grab, when we lacked any visualizations or metrics for our service logs, and when querying for a string from the last three days was something you only kicked off before going for a beverage.

When a service stops responding, Grab’s core problems were and are:

  • We need to know it happened before the customer does.
  • We need to know why it happened.
  • We need to solve our customers’ problems fast.

We had a hodgepodge of log-based solutions for developers when they needed to figure out the above, or why a driver never showed up, or why a customer wasn’t receiving our promised promotions. These included logs in a cloud-based storage service (which could take hours to retrieve), a SaaS provider constantly timing out on our queries, or even asking our SREs to fetch logs from the likely machines for the service engineer, a rather laborious process.

Here’s what we did with our logs to solve these problems.

Issues:

Our current size and growth rate ruled out several available logging systems. By size, we mean a LOT of data and a LOT of users who search through hundreds of billions of logs to generate reports. Or who track down that one user who managed to find that pesky corner case in our code.

When we started this project, we generated 25TB of logging data a day. Our first thought was “Do we really need all of these logs?”. To this day our feeling is “probably not”.

However, we can’t always define what another developer can and cannot do. Besides, this gave us an amazing opportunity to build something to allow for all that data!

Some of our SREs had used the ELK Stack (Elasticsearch / Logstash / Kibana). They thought it could handle our data and access loads, so it was our starting point.

How We Built a Multi-Petabyte Cluster:

Information Gathering:

It started with gathering numbers. How much data did we produce each day? How many days were retained? What’s a reasonable response time to wait for?

Before starting a project, understand your parameters. This helps you spec out your cluster, get buy-in from higher ups, and increase your success rate when rolling out a product used by the entire engineering organization. Remember, if it’s not better than what they have now, why will they switch?

A good starting point was opening the floor to our users. What features did they want? If we offered a visualization suite so they could see ERROR event spikes, would they use it? How about alerting them about SEGFAULTs? Hands down the most requested feature was speed; “I want an easy web UI that shows me the user ID when I search for it, and gives me all the results in <5 seconds!”

Getting Our Feet Wet:

New concerns always pop up during a project. We’re sure someone has correlated time spent in R&D with the number of problems encountered. We had a constantly moving target: as our proof of concept began, our daily logging volume kept increasing.

Thankfully, using Elasticsearch as our data store meant we could fully utilize horizontal scaling. This let us start with a simple 5 node cluster as we built out our proof-of-concept (POC). Once we were ready to onboard more services, we could move into a larger footprint.

The specs at the time called for about 80 nodes to handle all our data. But if we designed our system correctly, we’d only need to increase the number of Elasticsearch nodes as we enrolled more customers. Our key operating metrics were CPU utilization, heap memory needed for the JVM, and total disk space.

Initial Design:

First, we set up tooling to use Ansible both to launch a machine and to install and configure Elasticsearch. Then we were ready to scale.

Our initial goal was to keep the design as simple as possible, opting to allow each node in our cluster to perform all responsibilities. In this setup, each node would behave as all four of the available types:

  • Ingest: Used for transforming and enriching documents before sending them to data nodes for indexing.
  • Coordinator: Proxy node for directing search and indexing requests.
  • Master: Used to control cluster-wide operations such as cluster state; a quorum of master-eligible nodes is required to elect one.
  • Data: Nodes that hold the indexed data.

These were all design decisions made to move our proof of concept along, but in hindsight they might have created more headaches down the road with troubleshooting, indexing speed, and general stability. Remember to do your homework when spec’ing out your cluster.

It’s challenging to figure out why you are losing master nodes because someone filled up the field data cache performing a search. Separating your nodes can be a huge help in tracking down your problem.

We also decided to further reduce complexity by going with ingest nodes over Logstash. But at the time the documentation wasn’t great, so we had a lot of trial and error figuring out how they work, particularly compared to something more battle-tested like Logstash.

If you’re unfamiliar with ingest node design, they are lightweight proxies to your data nodes that accept a bulk payload, perform post-processing on documents, and then send the documents to be indexed by your data nodes. In theory, this helps keep your entire pipeline simple. And in Elasticsearch’s defense, ingest nodes have made massive improvements since we began.
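
To make the pipeline idea concrete, here is a minimal sketch using the elasticsearch Python client; the pipeline name, grok pattern, index name, and log file are illustrative, not our production configuration.

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["http://localhost:9200"])

    # A hypothetical pipeline: parse the raw log line and stamp the ingest time.
    es.ingest.put_pipeline(
        id="service-logs",
        body={
            "description": "Parse raw service log lines",
            "processors": [
                {"grok": {
                    "field": "message",
                    "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}"],
                }},
                {"set": {"field": "ingested_at", "value": "{{_ingest.timestamp}}"}},
            ],
        },
    )

    # Bulk-send documents through the pipeline; ingest nodes run the processors
    # before handing the documents to the data nodes for indexing.
    actions = (
        {"_index": "service-logs-2019.07.01", "_source": {"message": line.rstrip()}}
        for line in open("service.log")
    )
    helpers.bulk(es, actions, pipeline="service-logs")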

But adding more ingest nodes means ADDING MORE NODES! This can create a lot of chatter in your cluster and add complexity when troubleshooting problems. We’ve seen an ingest node failing in an odd way cause larger cluster concerns than just a failed bulk send request.

Monitoring:

This isn’t anything new, but we can’t overstate the usefulness of monitoring. Thankfully, we already had a robust tool called Datadog with an additional integration for Elasticsearch. Seeing your heap utilization over time, then breaking it into smaller graphs to display the field data cache or segment memory, has been a lifesaver. There’s nothing worse than a node falling over due to an OOM with no explanation and just hoping it doesn’t happen again.

At this point, we’ve built out several dashboards which visualize a wide range of metrics from query rates to index latency. They tell us if we sharply drop on log ingestion or if circuit breakers are tripping. And yes, Kibana has some nice monitoring pages for some cluster stats. But to know each node’s JVM memory utilization on a 400+ node cluster, you need a robust metric system.

Pitfalls:

Common Problems:

There are many blogs about the common problems encountered when creating an Elasticsearch cluster and Elastic does a good job of keeping blog posts up to date. We strongly encourage you to read them. Of course, we ran into classic problems like ensuring our Java objects were compressed (Hints: Don’t exceed 31GB of heap for your JVM and always confirm you’ve enabled compression).

But we also ran into some interesting problems that were less common. Let’s look at some major concerns you have to deal with at this scale.

Grab’s Problems:

Field Data Cache:

So, things are going well, all your logs are indexing smoothly, and suddenly you’re getting Out Of Memory (OOMs) events on your data nodes. You rush to find out what’s happening, as more nodes crash.

A visual representation of your JVM heap’s memory usage is very helpful here. You can always hit the Elasticsearch API, but once you have more than 5 nodes in your cluster this kind of breaks down. Also, you don’t just want to know what’s going on while a node is down; you want to know what happened before it died.

Using our graphs, we determined the field data cache went from using virtually zero heap memory to 20GB! This forced us to read up on how this value is set, and, as of this writing, the default limit is still 100% of the parent heap memory. Basically, this breaks down to allowing up to 70% of your total heap to be allocated to a single search in the form of field data.

Now, this should be a rare case and it’s very helpful to keep the field names and values in memory for quick lookup. But, if, like us, you have several trillion documents, you might want to watch out.

From our logs, we tracked down a user who was sorting by the _id field. We believe this is a design decision in how Kibana interacts with Elasticsearch. A good counter argument would be a user wants a quick memory lookup if they search for a document using the _id. But for us, this meant a user could load into memory every ID in the indices over a 14 day period.

The consequences? 20+GB of data loaded into the heap before the circuit breaker tripped. It then only took 2 queries at a time to knock a node over.

You can’t disable indexing that field, and you probably don’t want to. But you can prevent users from stumbling into this by disabling the _id field in the Kibana advanced settings. And make sure you re-evaluate your circuit breakers: we drastically lowered the available field data cache and haven’t had any further issues.
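
As a rough illustration of that re-evaluation (the 15% limit below is illustrative, not our production value), the field data breaker is a dynamic cluster setting, and per-node breaker usage can be checked from the nodes stats API:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Tighten the field data circuit breaker so one expensive search cannot
    # consume most of the heap (the 15% figure is illustrative only).
    es.cluster.put_settings(body={
        "persistent": {
            "indices.breaker.fielddata.limit": "15%",
        }
    })

    # Inspect current breaker usage per node to confirm the new limit is in effect.
    stats = es.nodes.stats(metric="breaker")
    for node_id, node in stats["nodes"].items():
        fielddata = node["breakers"]["fielddata"]
        print(node_id, fielddata["limit_size"], fielddata["estimated_size"])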

Translog Compression:

At first glance, compression seems an obvious choice for shipping shards between nodes. Especially if you have the free clock cycles, why not minimize the bandwidth between nodes?

However, we found compression between nodes can drastically slow down shard transfers. By disabling compression, shipping time for a 50GB shard went from 1h to 20m. This is because Lucene segments are already compressed, an issue we ran into full force and are actively working with the community to fix. It’s also a configuration to watch out for in your own setup, especially if you want fast shard recovery.

Segment Memory:

Most of our issues involved the heap memory being exhausted. We can’t stress enough the importance of having visualizations around how the JVM is used. We learned this lesson the hard way around segment memory.

This is a prime example of why you need to understand your data when building a cluster. We were hitting a lot of OOMs and couldn’t figure out why. We had fixed the field cache issue, but what was using all our RAM?

There is a reason why having a 16TB data node might be a poorly spec’d machine. Digging into it, we realized we simply allocated too many shards to our nodes. Looking up the total segment memory used per index should give a good idea of how many shards you can put on a node before you start running out of heap space. We calculated on average our 2TB indices used about 5GB of segment memory spread over 30 nodes.

The numbers have since changed and our layout has been tweaked, but our calculations showed we could allocate about 8TB of shards to a node with 32GB of heap memory before running into issues. That’s if you really want to push it; we use this metric to keep segment memory per node at around 50% of heap, which leaves enough memory to run queries without knocking out your data nodes. Naturally this led us to ask “What is using all this segment memory per node?”
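
Before digging into per-field usage, here is a sketch of the per-node and per-index totals we were looking at, assuming a version of Elasticsearch that still reports segments memory_in_bytes in its stats (older versions do):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Segment memory held on each data node: a quick way to see how close you are
    # to the heap budget before adding more shards to a node.
    node_stats = es.nodes.stats(metric="indices", index_metric="segments")
    for node_id, node in node_stats["nodes"].items():
        seg_bytes = node["indices"]["segments"]["memory_in_bytes"]
        print(f"{node['name']}: {seg_bytes / 1024 ** 3:.1f} GB of segment memory")

    # Segment memory per index: useful for estimating how many shards fit on a node.
    index_stats = es.indices.stats(metric="segments")
    for index_name, index in index_stats["indices"].items():
        seg_bytes = index["total"]["segments"]["memory_in_bytes"]
        print(f"{index_name}: {seg_bytes / 1024 ** 2:.0f} MB of segment memory")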

Index Mapping and Field Types:

Could we lower how much segment memory our indices used to cut our cluster operation costs? Using the segments data found in the ES cluster and some simple Python loops, we tracked down the total memory used per field in our index.

We used a lot of segment memory for the _id field (but can’t do much about that). It also gave us a good breakdown of our other fields. And we realized we indexed fields in completely unnecessary ways. A few fields should have been integers but were keyword fields. We had fields no one would ever search against and which could be dropped from index memory.

Most importantly, this began our learning process of how tokens and analyzers work in Elasticsearch/Lucene.

Picking the Wrong Analyzer:

By default, we use Elasticsearch’s Standard Analyzer on all analyzed fields. It’s great, offering a very close approximation to how users search and it doesn’t explode your index memory like an N-gram tokenizer would.

But it does a few things we thought unnecessary, so we figured we could save a significant amount of heap memory. For starters, it stores every token it produces: the Standard Analyzer would break IDXVB56KLM into the tokens IDXVB, 56, and KLM. This usually works well, but it really hurts if you have a lot of alphanumeric strings.

We never have a user search for a user ID as a partial value; it’s more useful to only return an exact match of the entire alphanumeric string. This has the added benefit of storing only a single token in our index memory. This modification alone stripped a whole 1GB off our index memory, which at our scale meant we could eliminate 8 nodes.

We can’t stress enough how cautious you need to be when changing analyzers on a production system. Throughout this process, end users were confused about why searches were returning no results or returning weird ones. There is a nice Kibana plugin that shows you how your tokens look with a different analyzer, or you can use the built-in ES tools to get the same understanding.
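
For example, the _analyze API lets you compare tokenization before committing to a mapping change. A small sketch (the keyword analyzer here is just one way to keep an alphanumeric ID as a single token, not necessarily the analyzer we settled on):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    sample = "IDXVB56KLM"

    # How the standard analyzer tokenizes the value...
    standard = es.indices.analyze(body={"analyzer": "standard", "text": sample})
    print([t["token"] for t in standard["tokens"]])

    # ...versus an analyzer that emits the whole value as a single token.
    keyword = es.indices.analyze(body={"analyzer": "keyword", "text": sample})
    print([t["token"] for t in keyword["tokens"]])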

Be Careful with Cloud Maintainers:

We realized that running a cluster at this scale is expensive. The hardware alone sets you back a lot, but our hidden bigger cost was cross traffic between availability zones.

Most cloud providers offer different “zones” for your machines to entice you to achieve a High-Availability environment. That’s a very useful thing to have, but you need to do a cost/risk analysis. If you migrate shards from HOT to WARM to COLD nodes constantly, you can really rack up a bill. This alone was about 30% of our total cluster cost, which wasn’t cheap at our scale.

We reworked how our indices sat in the cluster: we created a different index for each zone and pinned logging data so it never left the zone it was generated in. One small tweak to how we stored data cut our costs dramatically. It also narrowed the scope for troubleshooting: we’d know a zone was misbehaving and could focus there instead of looking at everything.
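
One way to express that kind of pinning (a sketch under the assumption that each data node is started with a custom attribute such as node.attr.zone; the index name and zone are illustrative) is index-level shard allocation filtering:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Pin an index to a single zone so its shards, and the recovery traffic
    # between them, never cross zone boundaries.
    es.indices.create(
        index="service-logs-ap-southeast-1a-2019.07.01",
        body={
            "settings": {
                "number_of_shards": 5,
                "number_of_replicas": 1,
                "index.routing.allocation.require.zone": "ap-southeast-1a",
            }
        },
    )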

Conclusion:

Running our own logging stack started as a challenge. We roughly knew the scale we were aiming for; it wasn’t going to be trivial or easy. A year later, we’ve gone from pipe-dream to production and immensely grown the team’s ELK stack knowledge.

We could probably fill 30 more pages with odd things we ran into, hacks we implemented, or times we wanted to pull our hair out. But we made it through and now provide our engineers with a superior, stable logging platform at a significantly lower cost.

There are many different ways we could have started knowing what we do now. For example, using Logstash over Ingest nodes, changing default circuit breakers, and properly using heap space to prevent node failures. But hindsight is 20/20 and it’s rare for projects to not change.

We suggest anyone wanting to revamp their centralized logging system look at the ELK solutions. There is a learning curve, but the scalability is outstanding and having subsecond lookup time for assisting a customer is phenomenal. But, before you begin, do your homework to save yourself weeks of troubleshooting down the road. In the end though, we’ve received nothing but praise from Grab engineers about their experiences with our new logging system.

Catwalk: Serving Machine Learning Models at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-serving-machine-learning-models-at-scale

Introduction

Grab’s unwavering ambition is to be the best Super App in Southeast Asia that adds value to the everyday for our customers. In order to achieve that, the customer experience must be flawless for each and every Grab service. Let’s take our frequently used ride-hailing service as an example. We want fair pricing for both drivers and passengers, accurate estimation of ETAs, effective detection of fraudulent activities, and ensured ride safety for our customers. The key to perfecting these customer journeys is artificial intelligence (AI).

Grab has a tremendous amount of data that we can leverage to solve complex problems such as fraudulent user activity, and to provide our customers personalized experiences on our products. One of the tools we are using to make sense of this data is machine learning (ML).

As Grab made giant strides towards increasingly using machine learning across the organization, more and more teams were organically building model serving solutions for their own use cases. Unfortunately, these model serving solutions required data scientists to understand the infrastructure underlying them. Moreover, there was a lot of overlap in the effort it took to build these model serving solutions.

That’s why we came up with Catwalk: an easy-to-use, self-serve, machine learning model serving platform for everyone at Grab.

Goals

To determine what we wanted Catwalk to do, we first looked at the typical workflow of our target audience – data scientists at Grab:

  • Build a trained model to solve a problem.
  • Deploy the model to their project’s particular serving solution. If this involves writing to a database, then the data scientists need to programmatically obtain the outputs, and write them to the database. If this involves running the model on a server, the data scientists require a deep understanding of how the server scales and works internally to ensure that the model behaves as expected.
  • Use the deployed model to serve users, and obtain feedback such as user interaction data. Retrain the model using this data to make it more accurate.
  • Deploy the retrained model as a new version.
  • Use monitoring and logging to check the performance of the new version. If the new version is misbehaving, revert back to the old version so that production traffic is not affected. Otherwise run an AB test between the new version and the previous one.

We discovered an obvious pain point – the process of deploying models requires additional effort and attention, which results in data scientists being distracted from their problem at hand. Apart from that, having many data scientists build and maintain their own serving solutions meant there was a lot of duplicated effort. With Grab increasingly adopting machine learning, this was a state of affairs that could not be allowed to continue.

To address the problems, we came up with Catwalk with goals to:

  1. Abstract away the complexities and expose a minimal interface for data scientists
  2. Prevent duplication of effort by creating an ML model serving platform for everyone in Grab
  3. Create a highly performant, highly available, model versioning supported ML model serving platform and integrate it with existing monitoring systems at Grab
  4. Shorten time to market by making model deployment self-service

What is Catwalk?

In a nutshell, Catwalk is a platform where we run Tensorflow Serving containers on a Kubernetes cluster integrated with the observability stack used at Grab.

In the next sections, we are going to explain the two main components in Catwalk – Tensorflow Serving and Kubernetes, and how they help us obtain our outlined goals.

What is Tensorflow Serving?

Tensorflow Serving is an open-source ML model serving project by Google. In Google’s own words, “Tensorflow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Tensorflow Serving provides out-of-the-box integration with Tensorflow models, but can be easily extended to serve other types of models and data.”

Why Tensorflow Serving?

There are a number of ML model serving platforms in the market right now. We chose Tensorflow Serving because of these three reasons, ordered by priority:

  1. Highly performant. It has proven performance handling tens of millions of inferences per second at Google according to their website.
  2. Highly available. It has a model versioning system to make sure there is always a healthy version being served while loading a new version into its memory
  3. Actively maintained by the developer community and backed by Google

Even though, by default, Tensorflow Serving only supports models built with Tensorflow, this is not a constraint for us because Grab is actively moving toward using Tensorflow.

How are we using Tensorflow Serving?

In this section, we will explain how we are using Tensorflow Serving and how it helps abstract away complexities for data scientists.

Here are the steps showing how we are using Tensorflow Serving to serve a trained model:

  1. Data scientists export the model using the tf.saved_model API and drop it into an S3 models bucket. The exported model is a folder containing model files that can be loaded into Tensorflow Serving.
  2. Data scientists are granted permission to manage their folder.
  3. We run Tensorflow Serving and point it to load the model files directly from the S3 models bucket. Tensorflow Serving supports loading models directly from S3 out of the box. The model is served!
  4. Data scientists come up with a retrained model. They export and upload it to their model folder.
  5. As Tensorflow Serving keeps watching the S3 models bucket for new models, it automatically loads the retrained model and serves. Depending on the model configuration, it can either gracefully replace the running model version with a newer version or serve multiple versions at the same time.
Tensorflow Serving Diagram

The only interface to data scientists is a path to their model folder in the S3 models bucket. To update their model, they upload exported models to their folder and the models will automatically be served. The complexities are gone. We’ve achieved one of the goals!
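
To give a feel for how small that interface is, here is a sketch of how a data scientist might export and upload a model; the bucket name, folder layout, and the toy model itself are hypothetical, and the export call uses the TF 2.x tf.saved_model.save API:

    import os
    import boto3
    import tensorflow as tf

    # Train (or load) a model, then export it with the SavedModel API.
    # Tensorflow Serving expects a numbered version folder, e.g. pricing/1/.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    export_dir = "/tmp/pricing/1"
    tf.saved_model.save(model, export_dir)

    # Upload the exported folder to the team's folder in the S3 models bucket.
    s3 = boto3.client("s3")
    for root, _, files in os.walk(export_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = os.path.relpath(local_path, "/tmp")  # e.g. pricing/1/saved_model.pb
            s3.upload_file(local_path, "models-bucket", key)  # hypothetical bucket name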

Well, not really…

Imagine you are going to run Tensorflow Serving to serve one model in a cloud provider, which means you need a compute resource from that provider to run it. Running it on one box doesn’t provide high availability, so you need another box running the same model. Auto scaling is also needed in order to scale out based on traffic. On top of these many boxes lies a load balancer. The load balancer evenly spreads incoming traffic across all the boxes, ensuring a single point of entry for clients that abstracts away the horizontal scaling. The load balancer also exposes an HTTP endpoint to external users. As a result, we form a Tensorflow Serving cluster that is ready to serve.

Next, imagine you have more models to deploy. You have three options:

  1. Load the models into the existing cluster – having one cluster serve all models.
  2. Spin up a new cluster to serve each model – having multiple clusters, one cluster serves one model.
  3. Combination of 1 and 2 – having multiple clusters, one cluster serves a few models.

The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.

The second option will definitely work, but it doesn’t sound like an efficient process, as you need to create a new set of resources every time you have a new model to deploy. Additionally, how do you optimize resource usage? There might be unutilized resources in some clusters that could be shared by the rest.

The third option looks promising: you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is you have to manually manage it. Managing 100 models across 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause problems, as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other models won’t be able to serve anymore.

Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that is exactly what Kubernetes is meant to do!

So what is Kubernetes?

Kubernetes abstracts a cluster of physical/virtual hosts (such as EC2) into a cluster of logical hosts (pods in Kubernetes terms). It provides a container-centric management environment. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads.

Let’s look at some of the definitions of Kubernetes resources

Tensorflow Serving Diagram
  • Cluster – a cluster of nodes running Kubernetes.
  • Node – a node inside a cluster.
  • Deployment – a configuration that tells Kubernetes the desired state of an application. It also takes care of rolling out updates (canary, percentage rollout, etc.), rolling back, and horizontal scaling.
  • Pod – a single processing unit. In our case, Tensorflow Serving runs as a container in a pod. Pods can have CPU/memory limits defined.
  • Service – an abstraction layer that abstracts out a group of pods and exposes the application to clients.
  • Ingress – a collection of routing rules that govern how external users access services running in a cluster.
  • Ingress Controller – a controller responsible for reading the ingress information and processing that data accordingly such as creating a cloud-provider load balancer or spinning up a new pod as a load balancer using the rules defined in the ingress resource.

Essentially, we deploy resources that tell Kubernetes the desired state of our application, and Kubernetes makes sure that is always the case.

How are we using Kubernetes?

In this section, we will walk you through how we deploy Tensorflow Serving in a Kubernetes cluster and how it makes managing model deployments very convenient.

We used a managed Kubernetes service to create a Kubernetes cluster and manually provisioned compute resources as nodes. As a result, we have a Kubernetes cluster with nodes ready to run applications.

An application to serve one model consists of

  1. Two or more Tensorflow Serving pods that serve a model, with an autoscaler to scale pods based on resource consumption
  2. A load balancer to evenly spread incoming traffic to pods
  3. An exposed HTTP endpoint to external users

In order to deploy the application, we need to

  1. Deploy a deployment resource specifying:
     • Number of Tensorflow Serving pods
     • An S3 url for Tensorflow Serving to load model files
  2. Deploy a service resource to expose it
  3. Deploy an ingress resource to define an HTTP endpoint url

Kubernetes then allocates Tensorflow Serving pods to the cluster, with the number of pods set by the value defined in the deployment resource. Pods can be allocated to any node inside the cluster; Kubernetes makes sure that the node it allocates a pod to has sufficient resources for that pod. If no node has sufficient resources, we can easily scale out the cluster by adding new nodes.
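
As an illustration of the deployment resource described above, here is a sketch using the Kubernetes Python client; the application name, image arguments, S3 path, namespace, and resource limits are all hypothetical:

    from kubernetes import client, config

    config.load_kube_config()

    labels = {"app": "pricing-tf-serving"}  # hypothetical application name

    container = client.V1Container(
        name="tensorflow-serving",
        image="tensorflow/serving",
        args=[
            "--model_name=pricing",
            "--model_base_path=s3://models-bucket/pricing",  # hypothetical S3 path
            "--rest_api_port=8501",
        ],
        ports=[client.V1ContainerPort(container_port=8501)],
        resources=client.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "2Gi"},
            limits={"cpu": "2", "memory": "4Gi"},
        ),
    )

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="pricing-tf-serving", labels=labels),
        spec=client.V1DeploymentSpec(
            replicas=2,
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(containers=[container]),
            ),
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="catwalk", body=deployment)

The service and ingress resources are created in the same way, and Tensorflow Serving typically also needs AWS credentials and S3 endpoint configuration in its environment to load models from S3.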

In order for the rules defined in the ingress resource to work, the cluster must have an ingress controller running, which is what guided our choice of load balancer. What an ingress controller does is simple: it keeps checking the ingress resource, creates a load balancer, and defines rules based on the rules in the ingress resource. Once the load balancer is configured, it will be able to redirect incoming requests to the Tensorflow Serving pods.

That’s it! We have a scalable Tensorflow Serving application that serves a model through a load balancer! In order to serve another model, all we need to do is to deploy the same set of resources but with the model’s S3 url and HTTP endpoint.

To illustrate what is running inside the cluster, let’s see what it looks like when we deploy two applications: one serving a pricing model and another serving a fraud-check model. Each application is configured to have two Tensorflow Serving pods and is exposed at /v1/models/model

Tensorflow Serving Diagram

There are two Tensorflow Serving pods that serve the fraud-check model, exposed through a load balancer. The same goes for the pricing model; the only differences are the model being served and the exposed HTTP endpoint url. The load balancer rules for the pricing and fraud-check models look like this:

  If                                Then forward to
  Path is /v1/models/pricing        pricing pod ip-1, pricing pod ip-2
  Path is /v1/models/fraud-check    fraud-check pod ip-1, fraud-check pod ip-2

Stats and Logs

The last piece is how stats and logs work. Before getting to that, we need to introduce DaemonSets. According to the documentation, a DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created.

We deployed datadog-agent and filebeat as DaemonSets. As a result, we always have one datadog-agent pod and one filebeat pod on every node, and they are accessible from the Tensorflow Serving pods on the same node. Tensorflow Serving pods emit a stats event for every request to the datadog-agent pod on the node they are running on.

Here is a sample of DataDog stats:

DataDog stats

And logs that we put in place:

Logs

Benefits Gained from Catwalk

Catwalk has become the go-to, centralized system to serve machine learning models. Data scientists are not required to take care of the serving infrastructure, so they can focus on what matters most: coming up with models to solve customer problems. They only need to provide exported model files and an estimate of expected traffic so that sufficient resources can be prepared to run their model. In return, they are presented with an endpoint to make inference calls to their model, along with all the necessary tools for monitoring and debugging. Updating the model version is self-service, and the model improvement cycle is much shorter than before: we used to count in days, we now count in minutes.

Future Plans

Improvement on Automation

Currently, the first deployment of any model still needs some manual work from the platform team. We aim to automate this process entirely. We’ll work with our awesome CI/CD team, who is making the best use of Spinnaker.

Model serving on mobile devices

As a platform, we are looking at setting standards for model serving across Grab. This includes model serving on mobile devices as well. Tensorflow Serving also provides a Lite version to be used on mobile devices. It is a whole new paradigm with vastly different tradeoffs for machine learning practitioners. We are quite excited to set some best practices in this area.

gRPC support

Catwalk currently supports HTTP/1.1. We’ll hook Grab’s service discovery mechanism to open gRPC traffic, which TFS already supports.

If you are interested in building pipelines for machine learning related topics, and you share our vision of driving South East Asia forward, come join us!

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Post Syndicated from Kavita Ganesan original https://github.blog/2019-07-02-c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, only known to very small communities.

JavaScript is the top programming language on GitHub, followed by Java and HTML
Figure 1: Top 10 programming languages hosted by GitHub by repository count

One of the necessary challenges that GitHub faces is to be able to recognize these different languages. When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.

Despite the appearance, language recognition isn’t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” is commonly used to indicate many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance. 

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python, Keras with TensorFlow backend—and is built to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language. 

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories internally crowdsourced. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, the coverage across languages was not uniform: some languages had only a handful of files, and some files were just too sparsely populated. Augmenting our training set with some additional sources was therefore necessary, and it substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet minimum qualifying criteria, such as having a minimum number of forks, covering the target language, and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.

Features: leveraging prior knowledge

Traditionally, for text classification problems with Neural Networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short Term Memory (LSTM) networks are often employed. However, given that programming languages differ in vocabulary, commenting style, file extensions, structure, library import style, and other minor ways, we opted for a simpler approach that leverages all this information by extracting relevant features in tabular form to feed to our classifier (a rough sketch of this extraction follows the list). The features currently extracted are as follows:

  1. Top five special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons
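
As a rough sketch of what such feature extraction might look like (the tokenization rules, character set, and thresholds below are illustrative, not OctoLingua’s actual preprocessing):

    import os
    import re
    from collections import Counter

    SPECIAL_CHARS = set("{}();:#$%&*+-/<>=[]^|~@!?.,'\"\\")

    def extract_features(path, top_chars=5, top_tokens=20):
        """Turn one source file into a small tabular feature record."""
        with open(path, errors="ignore") as f:
            text = f.read()

        char_counts = Counter(c for c in text if c in SPECIAL_CHARS)
        token_counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

        return {
            "top_special_chars": [c for c, _ in char_counts.most_common(top_chars)],
            "top_tokens": [t for t, _ in token_counts.most_common(top_tokens)],
            "extension": os.path.splitext(path)[1].lstrip("."),
            "has_semicolon": ";" in text,
            "has_curly_braces": "{" in text and "}" in text,
            "has_colon": ":" in text,
        }

    print(extract_features("example.py"))  # hypothetical input file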

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with Tensorflow backend. 

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

image
Figure 2: The ANN Structure of our initial model (50 languages + 1 for “other”)

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data at the training step, to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.
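
A minimal Keras sketch of such a network (layer widths, dropout rates, and the feature-vector size are illustrative, not the actual OctoLingua hyperparameters):

    import tensorflow as tf

    NUM_FEATURES = 128   # illustrative size of the tabular feature vector
    NUM_CLASSES = 51     # top 50 languages + 1 "other" bucket

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

    # model.fit(x_train, y_train, validation_split=0.1, epochs=8)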

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 Score (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source). 

Here we show three tests. The first test is with the test set untouched in any way. The second test uses the same set of test files with file extension information removed, and the third test also uses the same set of files, but this time with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file may have a “.java” extension).

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains a good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e. file extension), whereas Linguist fails as soon as the information on file extensions is altered.

image
Figure 3: Performance of OctoLingua vs. Linguist on the same test set

 

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time. 

image
Figure 4: Performance of OctoLingua with different percentage of file extensions removed on our three test variations

Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features. 

Supporting a new language

Adding a new language to OctoLingua is fairly straightforward. It starts with obtaining a bulk of files in the new language (we can do this programmatically, as described in Data sources). These files are split into a training and a test set and then run through our preprocessor and feature extractor. The new training and test sets are added to our existing pool of training and testing data, and the new test set allows us to verify that the accuracy of our model remains acceptable.

image
Figure 5: Adding a new language with OctoLingua

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but does not yet support all coding languages on our platform. Aside from broadening language support—which would be rather straightforward—we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn’t be too far fetched to take the model to the stage where it can reliably detect and classify embedded languages. 

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering—all of this aimed at supporting developers in their day to day development work in addition to helping them write quality code.  If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!

Authors

The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.