Unsupervised graph anomaly detection – Catching new fraudulent behaviours

Post Syndicated from Grab Tech original https://engineering.grab.com/graph-anomaly-model

Earlier in this series, we covered the importance of graph networks, graph concepts, graph visualisation, and graph-based fraud detection methods. In this article, we will discuss how to automatically detect new types of fraudulent behaviour and swiftly take action on them.

One of the challenges in fraud detection is that fraudsters are incentivised to continually innovate how they conduct fraud, i.e., their modus operandi (MO). Machine learning models trained on historical data may not pick up new MOs, as these are patterns that do not exist in the training data. To enhance Grab's existing security defences and protect our users from these new MOs, we needed a machine learning model that can detect them quickly without any label supervision, i.e., an unsupervised learning model rather than a regular supervised learning model.

To address this, we developed an in-house machine learning model for detecting anomalous patterns in graphs, which has led to the discovery of new fraud MOs. Our focus was initially on GrabFood and GrabMart verticals, where we monitored the interactions between consumers and merchants. We modelled these interactions as a bipartite graph (a type of graph for modelling interactions between two groups) and then performed anomaly detection on the graph. Our in-house anomaly detection model was also presented at the International Joint Conference on Neural Networks (IJCNN) 2023, a premier academic conference in the area of neural networks, machine learning, and artificial intelligence.

In this blog, we discuss the model and its application within Grab. For readers who want the full details of our model, the paper is available here. Note that even though we implemented our model for anomaly detection in GrabFood and GrabMart, the model is designed for general purposes and is applicable to interaction graphs between any two groups.

Interaction-Focused Anomaly Detection on Bipartite Node-and-Edge-Attributed Graphs
By Rizal Fathony, Jenn Ng, Jia Chen
Presented at International Joint Conference on Neural Networks (IJCNN) 2023

Before we dive into how our model works, it is important to understand the process of graph construction in our application as the model assumes the availability of the graphs in a standardised format.

Graph construction 

We modelled the interactions between consumers and merchants on the GrabFood and GrabMart platforms as bipartite graphs (G), where the first group of nodes (U) represents the consumers, the second group of nodes (V) represents the merchants, and the edges (E) connecting them indicate that a consumer has placed food/mart orders with a merchant. The graph is also supplied with rich transactional information about the consumers and the merchants in the form of node features (Xu and Xv), as well as order information in the form of edge features (Xe).
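For illustration, a minimal sketch of this structure in Go might look as follows (all names and types are ours, not Grab's actual implementation):

package graph

// Node is a consumer or a merchant.
type Node struct{ ID string }

// Edge links a consumer (index into U) to a merchant (index into V).
type Edge struct {
	U int
	V int
}

// BipartiteGraph mirrors G = (U, V, E) together with the node and edge
// features described above.
type BipartiteGraph struct {
	U  []Node      // consumers
	V  []Node      // merchants
	E  []Edge      // one edge per order interaction
	Xu [][]float64 // consumer features (one vector per node in U)
	Xv [][]float64 // merchant features (one vector per node in V)
	Xe [][]float64 // order features (one vector per edge in E)
}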

Fig 1. Graph construction process

The goal of our anomaly model is to detect anomalous and suspicious behaviours from the consumers or merchants (node-level anomaly detection), as well as anomalous order interactions (edge-level anomaly detection). As mentioned, this detection needs to be done without any label supervision.

Model architecture

We designed our graph anomaly model as a type of autoencoder, with an encoder and two decoders – a feature decoder and a structure decoder. The key feature of our model is that it accepts a bipartite graph with both node and edge attributes as the input. This is important as both node and edge attributes encode essential information for determining if certain behaviours are suspicious. Many previous works on graph anomaly detection only support node attributes. In addition, our model can produce both node and edge level anomaly scores, unlike most of the previous works that produce node-level scores only. We named our model GraphBEAN, which is short for Bipartite Node-and-Edge-Attributed Networks.

The encoder processes the attributed bipartite graph through a series of graph convolution layers to produce latent representations for both node groups. Our graph convolution layers produce new representations for each node in both node groups (U and V), as well as for each edge in the graph. Note that the last convolution layer in the encoder only produces latent representations for nodes, without producing edge representations. The reason for this design is that we only need latent representations for the active actors (the nodes representing consumers and merchants), not for their interactions.

Fig 2. GraphBEAN architecture

From the nodes' latent representations, the feature decoder is tasked with reconstructing the original graph, with both node and edge attributes, via a series of graph convolution layers. As the graph structure is given to the feature decoder, we task the structure decoder with learning the graph structure itself, by predicting whether an edge exists between two nodes. This edge prediction, as well as the graph reconstructed by the feature decoder, are then compared to the original input graph via a reconstruction loss function.

The model is then trained using the bipartite graph constructed from GrabFood and GrabMart transactions. We use a reconstruction-based loss function as the training objective of the model. After the training is completed, we compute the anomaly score of each node and edge in the graph using the trained model.
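As a rough sketch (the exact formulation is in our paper linked above), a reconstruction-based objective of this kind combines the attribute reconstruction errors with the structure prediction error:

loss = ‖Xu − X̂u‖² + ‖Xv − X̂v‖² + ‖Xe − X̂e‖² + CE(Â, A)

where X̂ denotes reconstructed features, Â the edge probabilities predicted by the structure decoder, A the actual adjacency structure, and CE a classification loss such as cross-entropy.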

Anomaly score computation

Our anomaly scores are reconstruction-based. The score design assumes that normal behaviours are common in the dataset and can therefore be easily reconstructed by the model. On the other hand, anomalous behaviours are rare, so the model will have a hard time reconstructing them, producing high errors.

Fig 3. Edge-level and node-level anomaly scores computation

The model produces two types of anomaly scores. First, edge-level anomaly scores, calculated from the edge reconstruction error. Second, node-level anomaly scores, calculated from the node reconstruction error plus an aggregate (mean or max) of the scores of the edges connected to the node.
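As an illustrative sketch, reusing the Edge type from the earlier snippet and assuming per-node and per-edge reconstruction errors have already been computed, the consumer-side node scores with a mean aggregate could be derived like this:

// NodeAnomalyScores combines each consumer node's reconstruction error
// with the mean of the scores of its incident edges. The edge-level
// score is simply the edge reconstruction error itself.
func NodeAnomalyScores(nodeErr, edgeErr []float64, edges []Edge) []float64 {
	sum := make([]float64, len(nodeErr))
	cnt := make([]float64, len(nodeErr))
	for i, e := range edges {
		sum[e.U] += edgeErr[i]
		cnt[e.U]++
	}
	scores := make([]float64, len(nodeErr))
	for i, err := range nodeErr {
		mean := 0.0
		if cnt[i] > 0 {
			mean = sum[i] / cnt[i]
		}
		scores[i] = err + mean // a max aggregate is an alternative
	}
	return scores
}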

Actioning system

In our implementation of GraphBEAN within Grab, we designed a full pipeline of anomaly detection and actioning systems. It is a fully-automated system that constructs a bipartite graph from GrabFood and GrabMart transactions, trains a GraphBEAN model on the graph, and computes anomaly scores. After computing anomaly scores for all consumers and merchants (node-level), as well as all of their interactions (edge-level), it passes the scores to our actioning system. Before that, the scores also pass through a system we call the fraud type tagger. This is a fully-automated, heuristic-based system that tags detected anomalies with fraud tags. The purpose of this tagging is to provide general context, such as the type of detected anomaly. Some examples of these tags are promo abuse or possible collusion.

Fig 4. Pipeline in our actioning system

Both the anomaly scores and the fraud type tags are then forwarded to our actioning system. The system consists of two subsystems:

  • Human expert actioning system: Our fraud experts analyse the detected anomalies and take action on them, such as suspending certain transaction features for suspicious merchants.
  • Automatic actioning system: Combines the anomaly scores and fraud type tags with other external signals to automatically act on the detected anomalies, such as preventing promos from being used by fraudsters or blocking fraudulent transactions before they occur. These actions vary depending on the type of fraud and the scores.

What’s next?

The GraphBEAN model enables the detection of suspicious behaviour on graph data without the need for label supervision. By implementing the model on the GrabFood and GrabMart platforms, we learnt that having such a system enables us to quickly identify new types of fraudulent behaviours and swiftly take action on them. This also allows us to strengthen Grab's defence against fraudulent activity and actively protect our users.

We are currently working on extending the model into more generic heterogeneous (multi-entity) graphs. In addition, we are also working on implementing it to more use cases within Grab.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Zero traffic cost for Kafka consumers

Post Syndicated from Grab Tech original https://engineering.grab.com/zero-traffic-cost

Introduction

Coban, Grab’s real-time data streaming platform team, has been building an ecosystem around Kafka, serving all Grab verticals. Along with stability and performance, one of our priorities is also cost efficiency.

In this article, we explain how the Coban team has substantially reduced Grab’s annual cost for data streaming by enabling Kafka consumers to fetch from the closest replica.

Problem statement

The Grab platform is primarily hosted on the AWS cloud, in a single region spanning three Availability Zones (AZs). When it comes to data streaming, both the Kafka brokers and Kafka clients run across these three AZs.

Figure 1 – Initial design, consumers fetching from the partition leader

Figure 1 shows the initial design of our data streaming platform. To ensure high availability and resilience, we configured each Kafka partition to have three replicas. We have also set up our Kafka clusters to be rack-aware (i.e. 1 “rack” = 1 AZ) so that all three replicas reside in three different AZs.

The problem with this design is that it generates staggering cross-AZ network traffic. This is because, by default, Kafka clients communicate only with the partition leader, and with replicas spread across three AZs, the leader has a two-in-three (roughly 67%) chance of residing in a different AZ from any given client.

This is a concern as we are charged for cross-AZ traffic as per AWS’s network traffic pricing model. With this design, our cross-AZ traffic amounted to half of the total cost of our Kafka platform.

The Kafka cross-AZ traffic for this design can be broken down into three components as shown in Figure 1:

  • Producing (step 1): Typically, a single service produces data to a given Kafka topic. Cross-AZ traffic occurs when the producer does not reside in the same AZ as the partition leader it is producing data to. This cross-AZ traffic cost is minimal, because the data is transferred to a different AZ at most once (excluding retries).
  • Replicating (step 2): The ingested data is replicated from the partition leader to the two partition followers, which reside in two other AZs. The cost of this is also relatively small, because the data is only transferred to a different AZ twice.
  • Consuming (step 3): Most of the cross-AZ traffic occurs here because there are many consumers for a single Kafka topic. Similar to the producers, the consumers incur cross-AZ traffic when they do not reside in the same AZ as the partition leader. However, on the consuming side, cross-AZ traffic can occur as many times as there are consumers (on average, two-thirds of the number of consumers). The solution described in this article addresses this particular component of the cross-AZ traffic in the initial design.

Solution

Kafka 2.3 introduced the ability for consumers to fetch from partition replicas. This opens the door to a more cost-efficient design.

Figure 2 – Target design, consumers fetching from the closest replica

Step 3 of Figure 2 shows how consumers can now consume data from the replica that resides in their own AZ. Implementing this feature requires rack-awareness and extra configurations for both the Kafka brokers and consumers. We will describe this in the following sections.

The Coban journey

Kafka upgrade

Our journey started with the upgrade of our legacy Kafka clusters. We decided to upgrade them directly to version 3.1 rather than 2.3, to capture the bug fixes and optimisations delivered in between. This was a safe move as version 3.1 had been deemed stable for almost a year, and we projected no additional operational cost for the upgrade.

To perform an online upgrade with no disruptions for our users, we broke down the process into three stages.

  • Stage 1: Upgrading Zookeeper. All versions of Kafka are tested by the community with a specific version of Zookeeper. To ensure stability, we followed this same process. The upgraded Zookeeper would be backward compatible with the pre-upgrade version of Kafka which was still in use at this early stage of the operation.
  • Stage 2: Rolling out the upgrade of Kafka to version 3.1 with an explicit backward-compatible inter-broker protocol version (inter.broker.protocol.version). During this progressive rollout, the Kafka cluster is temporarily composed of brokers with heterogeneous Kafka versions, but they can communicate with one another because they are explicitly set up to use the same inter-broker protocol version. At this stage, we also upgraded Cruise Control to a compatible version, and we configured Kafka to import the updated cruise-control-metrics-reporter JAR file on startup.
  • Stage 3: Upgrading the inter-broker protocol version. This last stage makes all brokers use the most recent version of the inter-broker protocol. During the progressive rollout of this change, brokers with the new protocol version can still communicate with brokers on the old protocol version.

Configuration

Enabling Kafka consumers to fetch from the closest replica requires a configuration change on both Kafka brokers and Kafka consumers. They also need to be aware of their AZ, which is done by leveraging Kafka rack-awareness (1 “rack” = 1 AZ).

Brokers

In our Kafka brokers’ configuration, we already had broker.rack set up to distribute the replicas across different AZs for resiliency. Our Ansible role for Kafka automatically sets it with the AZ ID that is dynamically retrieved from the EC2 instance’s metadata at deployment time.

- name: Get availability zone ID
  uri:
    url: http://169.254.169.254/latest/meta-data/placement/availability-zone-id
    method: GET
    return_content: yes
  register: ec2_instance_az_id

Note that we use AWS AZ IDs (suffixed az1, az2, az3) instead of the typical AWS AZ names (suffixed 1a, 1b, 1c) because the latter’s mapping is not consistent across AWS accounts.

Also, we added the new replica.selector.class parameter, set with value org.apache.kafka.common.replica.RackAwareReplicaSelector, to enable the new feature on the server side.
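Putting both settings together, the relevant portion of a broker's server.properties looks like this (the rack value is illustrative):

# server.properties (illustrative rack value)
broker.rack=use1-az1
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector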

Consumers

On the Kafka consumer side, we mostly rely on Coban’s internal Kafka SDK in Golang, which streamlines how service teams across all Grab verticals utilise Coban Kafka clusters. We have updated the SDK to support fetching from the closest replica.

Our users only have to export an environment variable to enable this new feature. The SDK then dynamically retrieves the underlying host’s AZ ID from the host’s metadata on startup, and sets a new client.rack parameter with that information. This is similar to what the Kafka brokers do at deployment time.
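Grab's SDK is internal, but as a rough sketch of the same logic using the open-source confluent-kafka-go client (broker list, group name, and the plain IMDS call are illustrative; hosts enforcing IMDSv2 would also need a session token):

package consumer

import (
	"io"
	"net/http"
	"strings"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// azID retrieves the host's AZ ID from the EC2 instance metadata
// service, mirroring what our Ansible role does for brokers.
func azID() (string, error) {
	resp, err := http.Get("http://169.254.169.254/latest/meta-data/placement/availability-zone-id")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return strings.TrimSpace(string(b)), err
}

// newConsumer builds a consumer whose client.rack matches the brokers'
// broker.rack values, so the RackAwareReplicaSelector can pick the
// closest replica for it.
func newConsumer(brokers, group string) (*kafka.Consumer, error) {
	az, err := azID()
	if err != nil {
		return nil, err
	}
	return kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": brokers,
		"group.id":          group,
		"client.rack":       az,
	})
}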

We have also implemented the same logic for our non-SDK consumers, namely Flink pipelines and Kafka Connect connectors.

Impact

We rolled out fetching from the closest replica at the turn of the year, and the feature has since been progressively enabled on more and more Kafka consumers.

Figure 3 – Variation of our cross-AZ traffic before and after enabling fetching from the closest replica

Figure 3 shows the relative impact of this change on our cross-AZ traffic, as reported by AWS Cost Explorer. AWS charges cross-AZ traffic on both ends of the data transfer, thus the two data series. On the Kafka brokers’ side, less cross-AZ traffic is sent out, thereby causing the steep drop in the dark green line. On the Kafka consumers’ side, less cross-AZ traffic is received, causing the steep drop in the light green line. Hence, both ends benefit by fetching from the closest replica.

Throughout the observation period, we maintained a relatively stable volume of data consumption. After three months, we observed a substantial 25% drop in our cross-AZ traffic compared to December's average. This reduction translated directly into cost savings, as our cross-AZ cost scales linearly with cross-AZ traffic volume.

Caveats

Increased end-to-end latency

After enabling fetching from the closest replica, we observed an increase of up to 500 ms in end-to-end latency from producer to consumers. Though this is expected by design, it makes the feature unsuitable for Grab's most latency-sensitive use cases. For these use cases, we retained the traditional design whereby consumers fetch directly from the partition leaders, even when they reside in different AZs.

Figure 4 – End-to-end latency (99th percentile) of one of our streams, before and after enabling fetching from the closest replica

Inability to gracefully isolate a broker

We have also verified the behaviour of Kafka clients during a broker rotation, a common maintenance operation for Kafka. One of the early steps in our corresponding runbook is to demote the broker that is to be rotated, so that all of its partition leaders are drained and moved to other brokers.

In the traditional architecture design, Kafka clients only communicate with the partition leaders, so demoting a broker gracefully isolates it from all of the Kafka clients. This ensures that the maintenance is seamless for them. However, by fetching from the closest replica, Kafka consumers still consume from the demoted broker, as it keeps serving partition followers. When the broker effectively goes down for maintenance, those consumers are suddenly disconnected. To work around this, they must handle connection errors properly and implement a retry mechanism.

Potentially skewed load

Another caveat we have observed is that the load on the brokers is directly determined by the location of the consumers. If they are not well balanced across all of the three AZs, then the load on the brokers is similarly skewed. At times, new brokers can be added to support an increasing load on an AZ. However, it is undesirable to remove any brokers from the less loaded AZs as more consumers can suddenly relocate there at any time. Having these additional brokers and underutilisation of existing brokers on other AZs can also impact cost efficiency.

Figure 5 – Average CPU utilisation by AZ of one of our critical Kafka clusters

Figure 5 shows the CPU utilisation by AZ for one of our critical Kafka clusters. The skew is visible after 01/03/2023. To better manage this skew in load across AZs, we have updated our SDK to expose the AZ as a new metric. This allows us to monitor the skewness of the consumers and take measures proactively, for example, moving some of them to different AZs.

What’s next?

We have implemented the feature to fetch from the closest replica on all our Kafka clusters and all Kafka consumers that we control. This includes internal Coban pipelines as well as the managed pipelines that our users can self-serve as part of our data streaming offering.

We are now evangelising and advocating for more of our users to adopt this feature.

Beyond Coban, other teams at Grab are also working to reduce their cross-AZ traffic, notably, Sentry, the team that is in charge of Grab’s service mesh.


Go module proxy at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/go-module-proxy

At Grab, we rely heavily on a large Go monorepo for backend development, which offers benefits like code reusability and discoverability. However, as we continue to grow, managing a large monorepo brings about its own set of unique challenges.

As an example, using Go commands such as go get and go list can be incredibly slow when fetching Go modules residing in a large multi-module repository. This sluggishness takes a toll on developer productivity, burdens our Continuous Integration (CI) systems, and strains our Version Control System host (VCS), GitLab.

In this blog post, we look at how Athens, a Go module proxy, helps to improve the overall developer experience of engineers working with a large Go monorepo at Grab.

Key highlights

  • We reduced the time of executing the go get command from ~18 minutes to ~12 seconds when fetching monorepo Go modules.
  • We scaled in and scaled down our entire Athens cluster by 70% by utilising Athens' fallback network mode along with Go's GOVCS setting, resulting in cost savings and enhanced efficiency.

Problem statements and solutions

1. Painfully slow performance of Go commands

Problem summary: Running the go get command in our monorepo takes a considerable amount of time and can lead to performance degradation in our VCS.

When working with the Go programming language, go get is one of the most common commands that you’ll use every day. Besides developers, this command is also used by CI systems.

What does go get do?

The go get command is used to download and install packages and their dependencies in Go. Note that it operates differently depending on whether it is run in legacy GOPATH mode or module-aware mode. In Grab, we’re using the module-aware mode in a multi-module repository setup.

Every time go get is run, it uses Git commands, like git ls-remote, git tag, git fetch, etc, to search and download the entire worktree. The excessive use of these Git commands on our monorepo contributes to the long processing time and can be strenuous to our VCS.

How big is our monorepo?

To fully grasp the challenges faced by our engineering teams, it’s crucial to understand the vast scale of the monorepo that we work with daily. For this, we use git-sizer to analyse our monorepo.

Here’s what we found:

  • Overall repository size: The monorepo has a total uncompressed size of 69.3 GiB, a fairly substantial figure. To put things into perspective, the Linux kernel repository, known for its vastness, currently stands at 55.8 GiB.
  • Trees: The repository contains 3.21M trees and 99.8M tree entries, consuming 3.65 GiB. This may cause performance issues during some Git operations.
  • References: Totalling 10.7k references.
  • Biggest checkouts: There are 64.7k directories in our monorepo. This affects operations like git status and git checkout. Moreover, our monorepo has a maximum path depth of 20. This contributes to a slow processing time on Git and negatively impacts developer experience. The number of files (354k) and the total size of files (5.08 GiB) are also concerns due to their potential impact on the repository’s performance.

To draw a comparison, refer to the git-sizer output of the Linux repository.

How slow is “slow”?

To illustrate the issue further, we will compare the time taken for various Go commands to fetch a single module in our monorepo at a 10 MBps download speed.

This is an example of how a module is structured in our monorepo:

gitlab.company.com/monorepo/go
  |-- go.mod
  |-- commons/util/gk
        |-- go.mod
| Go command | GOPROXY | Previously cached? | Description | Result (time taken) |
|---|---|---|---|---|
| go get -x gitlab.company.com/monorepo/go/commons/util/gk | proxy.golang.org,direct | Yes | Download and install the latest version of the module. This is a common scenario that developers often encounter. | 18:50.71 minutes |
| go get -x gitlab.company.com/monorepo/go/commons/util/gk | proxy.golang.org,direct | No | Download and install the latest version of the module without any module cache. | 1:11:54.56 hours |
| go list -x -m -json -versions gitlab.company.com/monorepo/go/util/gk | proxy.golang.org,direct | Yes | List information about the module. | 3.873 seconds |
| go list -x -m -json -versions gitlab.company.com/monorepo/go/util/gk | proxy.golang.org,direct | No | List information about the module without any module cache. | 3:18.58 minutes |

In this example, using go get to fetch a module took over 18 minutes to complete. If we needed to retrieve more than one module in our monorepo, it can be incredibly time-consuming.

Why is it slow in a monorepo?

In a large Go monorepo, go get commands can be slow due to several factors:

  1. Large number of files and directories: When running go get, the command needs to search and download the entire worktree. In a large multi-module monorepo, the vast number of files and directories makes this search process very expensive and time-consuming.
  2. Number of refs: A large number of refs (branches or tags) in our monorepo can affect performance. Ref advertisements (git ls-remote), which contain every ref in our monorepo, are the first phase in any remote Git operation, such as git clone or git fetch. With a large number of refs, performance takes a hit when performing these operations.
  3. Commit history traversal: Operations that need to traverse a repository’s commit history and consider each ref will be slow in a monorepo. The larger the monorepo, the more time-consuming these operations become.

The consequences: Stifled productivity and strained systems

Developers and CI

When Go command operations like go get are slow, they contribute to significant delays and inefficiencies in software development workflows. This leads to reduced productivity and demotivated developers.

Optimising Go command operations’ speed is crucial to ensure efficient software development workflows and high-quality software products.

Version Control System

It's also worth noting that overusing go get commands can lead to performance issues for the VCS. When Go packages were frequently downloaded using go get, we saw a bottleneck forming in our VCS cluster, which can lead to performance degradation or even rate-limiting queue issues.

This negatively impacts the performance of our VCS infrastructure, causing delays or sometimes unavailability for some users and CI.

Solution: Athens + fallback network mode + GOVCS + custom cache refresh

Problem summary: Speed up go get command by not fetching from our VCS

We addressed the speed issue by using Athens, a proxy server for Go modules (read more about the GOPROXY protocol).

How does Athens work?

The following sequence diagram describes the default flow of go get command with Athens.

Athens uses a storage system for Go module packages, which can be configured to use various backends such as Amazon S3 and Google Cloud Storage, among others.

By caching these module packages in storage, Athens can serve the packages directly from storage rather than requesting them from an upstream VCS while serving Go commands such as go mod download and certain go build modes. However, just using a Go module proxy didn’t fully resolve our issue since the go get and go list commands still hit our VCS through the proxy.

With this in mind, we thought “what if we could just serve the Go modules directly from Athens’ storage for go get?” This question led us to discover Athens network mode.

What is Athens network mode?

Athens' NetworkMode configures how Athens returns the results of Go commands; results can be assembled from both its own storage and the upstream VCS. As of Athens v0.12.1, the following three modes are supported:

  1. strict: merge VCS versions with storage versions, but fail if either of them fails.
  2. offline: only get storage versions, never reach out to VCS.
  3. fallback: only return storage versions if VCS fails. Fallback mode makes a best effort to give you what's available at the time the versions are requested.

Our Athens clusters were initially set to use strict network mode, but this was not ideal for us. So we explored the other network modes.

Exploring offline mode

We initially sought to explore the idea of putting Athens in offline network mode, which would allow Athens to serve Go requests only from its storage. This concept aligned with our aim of reducing VCS hits while also leading to significant performance improvement in Go workflows.

However, in practice, it's not an ideal approach. The default Athens setup (strict mode) automatically updates the module version when a user requests a new module version. Switching Athens to offline mode would disable these automatic updates, as it would never connect to the VCS.

Custom cache refresh solution

To solve this, we implemented a CI pipeline that refreshes Athens' module cache whenever a new module is released in our monorepo. Employing this with offline mode made Athens effective for the monorepo, but it resulted in the loss of automatic updates for all other repositories.

Restoring this feature would require applying our custom cache refresh solution to all other Go repositories. However, implementing this workaround is quite cumbersome and requires significant additional time and effort. We decided to look for another solution that would be easier to maintain in the long run.

A balanced approach: fallback mode and GOVCS

This approach builds upon the aforementioned custom cache refresh, which is specifically designed for the monorepo.

We came across the GOVCS environment variable, which we use in combination with the fallback network mode to effectively put only the monorepo in “offline” mode.

When GOVCS is set to gitlab.company.com/monorepo/go:off, Athens encounters an error whenever it tries to fetch modules from VCS:

gitlab.company.com/monorepo/go/commons/util/gk@<version>: unrecognized import path "gitlab.company.com/monorepo/go/commons/util/gk": GOVCS disallows using git for private gitlab.company.com/monorepo/go; see 'go help vcs'

If Athens network mode is set to strict, Athens returns 404 errors to the user. By switching to fallback mode, Athens tries to retrieve the module from its storage if a GOVCS failure occurs.

Here’s the updated Athens configuration (example default config):

GoBinaryEnvVars = ["GOPROXY=direct", 
"GOPRIVATE=gitlab.company.com", 
"GOVCS=gitlab.company.com/monorepo/go:off"]

NetworkMode = "fallback"

With the custom cache refresh solution coupled with this approach, we not only accelerate the retrieval of Go modules within the monorepo but also allow for automatic updates for non-monorepo Go modules.

Final results

This solution resulted in a significant improvement in the performance of Go commands for our developers. With Athens, the same command is completed in just ~12 seconds (down from ~18 minutes), which is impressively fast.

| Go command | GOPROXY | Previously cached? | Description | Result (time taken) |
|---|---|---|---|---|
| go get -x gitlab.company.com/monorepo/go/commons/util/gk | goproxy.company.com | Yes | Download and install the latest version of the module. This is a common scenario that developers often encounter. | 11.556 seconds |
| go get -x gitlab.company.com/monorepo/go/commons/util/gk | goproxy.company.com | No | Download and install the latest version of the module without any module cache. | 1:05.60 minutes |
| go list -x -m -json -versions gitlab.company.com/monorepo/go/util/gk | goproxy.company.com | Yes | List information about the monorepo module. | 0.592 seconds |
| go list -x -m -json -versions gitlab.company.com/monorepo/go/util/gk | goproxy.company.com | No | List information about the monorepo module without any module cache. | 1.023 seconds |

Average cluster CPU utilisation
Average cluster memory utilisation

In addition, this change to our Athens cluster also led to a substantial reduction in average cluster CPU and memory utilisation. This enabled us to scale in and scale down our entire Athens cluster by 70%, resulting in cost savings and enhanced efficiency. On top of that, we were able to effectively eliminate the VCS's rate-limiting issues while making the monorepo's command operations considerably faster.

2. Go modules in GitLab subgroups

Problem summary: Go modules are unable to work natively with private or internal repositories under GitLab subgroups.

When it comes to managing code repositories and packages, GitLab subgroups and Go modules have become an integral part of the development process at Grab. Go modules help to organise and manage dependencies, and GitLab subgroups provide an additional layer of structure to group related repositories together.

However, a common issue when using Go modules is that they do not work natively with private or internal repositories under a GitLab subgroup (see this GitHub issue).

For example, using go get to retrieve a module from gitlab.company.com/gitlab-org/subgroup/repo will result in a failure. This problem is not specific to Go modules, all repositories under the subgroup will face the same issue.

A cumbersome workaround

To overcome this issue, we had to use workarounds. One workaround is to authenticate the HTTPS calls to GitLab by adding authentication details to the .netrc file on your machine.

The following lines can be added to the .netrc file:

machine gitlab.company.com
    login <email>
    password <personal-access-token>

In our case, we are using a Personal Access Token (PAT) since we have 2FA enabled. If 2FA is not enabled, the GitLab password can be used instead. However, this approach would mean configuring the .netrc file in the CI environments as well as on the machine of every Go developer.

Solution: Athens + .netrc

A feasible solution is to set up the .netrc file on the Go proxy server. This eliminates the need for every developer to configure their own .netrc file; instead, the responsibility is delegated to the Go proxy server.

3. Sharing common libraries

Problem summary: Distributing internal common libraries within a monorepo without granting direct repository access can be challenging.

At Grab, we work with various cross-functional teams, and some could have distinct network access like different VPNs. This adds complexity to sharing our monorepo’s internal common libraries with them. To maintain the security and integrity of our monorepo, we use a Go proxy for controlled access to necessary libraries.

The key difference between granting direct access to the monorepo via VCS and using a Go proxy is that the former allows users to read everything in the repository, while the latter enables us to grant access only to the specific libraries users need within the monorepo. This approach ensures secure and efficient collaboration across diverse network configurations.

Without Go module proxy

Without Athens, we would need to create a separate repository to store the code we want to share and then use a build system to automatically mirror the code from the monorepo to the public repository.

This process can be cumbersome and lead to inconsistencies in code versions between the two repositories, ultimately making it challenging to maintain the shared libraries.

Furthermore, copying code can lead to errors and increase the risk of security breaches by exposing confidential or sensitive information.

Solution: Athens + Download Mode File

To tackle this problem statement, we utilise Athens’ download mode file feature using an allowlist approach to specify which repositories can be downloaded by users.

Here’s an example of the Athens download mode config file:

downloadURL = "https://proxy.golang.org"

mode = "sync"

download "gitlab.company.com/repo/a" {
    mode = "sync"
}

download "gitlab.company.com/repo/b" {
    mode = "sync"
}

download "gitlab.company.com/*" {
    mode = "none"
}

In the configuration file, we specify allowlist entries for each desired repo, including their respective download modes. For example, in the snippet above, repo/a and repo/b are allowed (mode = "sync"), while everything else is blocked with mode = "none".

Final results

By using Athens’ download mode feature in this case, the benefits are clear. Athens provides a secure, centralised place to store Go modules. This approach not only provides consistency but also improves maintainability, as all code versions are managed in one single location.

Additional benefits of Go proxy

As we’ve touched upon the impressive results achieved by implementing Athens Go proxy at Grab, it’s crucial to explore the supplementary advantages that accompany this powerful solution.

These unsung benefits, though possibly overlooked, play a vital role in enriching the overall developer experience at Grab and promoting more robust software development practices:

  1. Module immutability: As the software world continues to face issues around changing or disappearing libraries, Athens serves as a useful tool in mitigating build disruptions by providing immutable storage for copied VCS code. The use of a Go proxy also ensures that builds remain deterministic, improving consistency across our software.
  2. Uninterrupted development: Athens allows users to fetch dependencies even when VCS is down, ensuring continuous and seamless development workflows.
  3. Enhanced security: Athens offers access control by enabling the blocking of specific packages within Grab. This added layer of security protects our work against potential risks from malicious third-party packages.
  4. Vendor directory removal: Athens prepares us for the eventual removal of the vendor directory, fostering faster workflows in the future.

What’s next?

Since adopting Athens as a Go module proxy, we have observed considerable benefits, such as:

  1. Accelerated Go command operations
  2. Reduced infrastructure costs
  3. Mitigated VCS load issues

Moreover, its lesser-known advantages like module immutability, uninterrupted development, enhanced security, and vendor directory transition have also contributed to improved development practices and an enriched developer experience for Grab engineers.

Today, the straightforward process of exporting three environment variables has greatly improved our developers' experience at Grab.

export GOPROXY="goproxy.company.com|proxy.golang.org,direct"

export GONOSUMDB="gitlab.company.com"

export GONOPROXY="none"

At Grab, we are always looking for ways to improve and optimise the way we work, so we contribute to open-sourced projects like Athens, where we help with bug fixes. If you are interested in setting up a Go module proxy, do give Athens (github.com/gomods/athens) a try!

Special thanks to Swaminathan Venkatraman, En Wei Soh, Anuj More, Darius Tan, and Fernando Christyanto for contributing to this project and this article.


PII masking for privacy-grade machine learning

Post Syndicated from Grab Tech original https://engineering.grab.com/pii-masking

At Grab, data engineers work with large sets of data on a daily basis. They design and build advanced machine learning models that provide strategic insights using all of the data that flows through the Grab platform. This enables us to provide a better experience to our users, for example, by increasing the supply of drivers in a timely fashion in areas where our predictive models indicate a surge in demand.

Grab has a mature privacy programme that complies with applicable privacy laws and regulations and we use tools to help identify, assess, and appropriately manage our privacy risks. To ensure that our users’ data are well-protected and avoid any human-related errors, we always take extra measures to secure this data.

However, data engineers will still require access to actual production data in order to tune effective machine learning models and ensure the models work as intended in production.

In this article, we describe how Grab's data streaming team (Coban), along with the data platform and user teams, has enforced Personally Identifiable Information (PII) masking on machine learning data streaming pipelines. This ensures that we uphold a high standard and embody a privacy-by-design culture, while enabling data engineers to refine their models with sanitised production data.

PII tagging

Data streaming at Grab leverages the Protocol Buffers (protobuf) data format to structure in-transit data. When creating a new stream, developers must describe its fields in a protobuf schema that is then used for serialising the data wherever it is sent over the wire, and deserialising it wherever it is consumed.

A fictional example schema looks like this (the indexes are arbitrary, but commonly created in sequence):

message Booking {
  string bookingID = 1;
  int64 creationTime = 2;
  int64 passengerID = 3;
  string passengerName = 4;
  ... truncated output ...
}

Here, the fourth field, passengerName, contains PII, and the data pertaining to that field should never be accessible to any data engineer. Therefore, developers owning the stream must tag that field with a PII label like this:

import "streams/coban/options/v1/pii.proto";

message Booking {
  string bookingID = 1;
  int64 creationTime = 2;
  int64 passengerID = 3;
  string passengerName = 4 [(streams.coban.options.v1.pii_type) = PII_TYPE_NAME];
  ... truncated output ...
}

The imported pii.proto library defines the tags for all possible types of PII. In the example above, the passengerName field has not only been flagged as PII, but is also marked as PII_TYPE_NAME – a specific type of PII that conveys the names of individuals. This high-level typing enables more flexible PII masking methods, which we will explain later.
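We cannot reproduce the actual pii.proto here, but libraries like this are conventionally built on protobuf custom options, extending google.protobuf.FieldOptions. A plausible sketch, with illustrative enum values and field number:

// streams/coban/options/v1/pii.proto (illustrative sketch)
syntax = "proto3";

package streams.coban.options.v1;

import "google/protobuf/descriptor.proto";

enum PiiType {
  PII_TYPE_UNSPECIFIED = 0;
  PII_TYPE_NAME = 1;         // names of individuals
  PII_TYPE_PHONE_NUMBER = 2;
  PII_TYPE_EMAIL = 3;
}

extend google.protobuf.FieldOptions {
  // The field number is illustrative; custom options use numbers from
  // the range reserved for extensions of FieldOptions.
  PiiType pii_type = 50000;
}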

Once the PII fields have been properly identified and tagged, developers need to publish the schema of their new stream into Coban’s Git repository. A Continuous Integration (CI) pipeline described below ensures that all fields describing PII are correctly tagged.

The following diagram shows this CI pipeline in action.

Fig. 1 CI pipeline failure due to untagged PII fields

When a developer creates a Merge Request (MR) or pushes a new commit to create or update a schema (step 1), the CI pipeline is triggered. It runs an in-house Python script that scans each variable name of the committed schema and tests it against an extensive list of PII keywords that is regularly updated, such as name, address, email, phone, etc (step 2). If there is a match and the variable is not tagged with the expected PII label, the pipeline fails (step 3) with an explicit error message in the CI pipeline’s output, similar to this:

Field name [Booking.passengerName] should have been marked with type streams.coban.options.v1.pii_type = PII_TYPE_NAME
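The actual checker is an in-house Python script; the following is only a rough sketch of its core logic, written in Go for consistency with the other examples in this post, and with an abbreviated keyword list:

package schemacheck

import "strings"

// piiKeywords is a small excerpt; the real list is extensive and
// regularly updated.
var piiKeywords = []string{"name", "address", "email", "phone"}

// findUntaggedPII returns the field names that match a PII keyword but
// carry no PII tag and are not whitelisted. fields maps each field name
// in the committed schema to whether it is tagged with a PII label.
func findUntaggedPII(fields, whitelist map[string]bool) []string {
	var violations []string
	for field, tagged := range fields {
		if tagged || whitelist[field] {
			continue
		}
		lower := strings.ToLower(field)
		for _, kw := range piiKeywords {
			if strings.Contains(lower, kw) {
				violations = append(violations, field)
				break
			}
		}
	}
	return violations
}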

There are cases where a variable name in the schema partially matches a PII keyword but is legitimately not PII. For example, carModelName partially matches name but does not contain PII data. In this case, the developer can add it to a whitelist to pass the CI.

However, modifying the whitelist requires approval from the Coban team for verification purposes. Apart from this particular case, the requesting team can autonomously approve their MR in a self-service fashion.

Now let us look at an example of a successful CI pipeline execution.

Fig. 2 CI pipeline success and schema publishing

In Fig. 2, the committed schema (step 1) is properly tagged so our in-house Python script is unable to find any untagged PII fields (step 2). The MR is approved by a code owner (step 3), then merged to the master branch of the repository (step 4).

Upon merging, another CI pipeline is triggered to package the protobuf schema in a Java Archive (JAR) of Scala classes (step 5), which in turn is stored into a package registry (step 6). We will explain the reason for this in a later section.

Production environment

With the schemas published and all of their PII fields properly tagged, we can now take a look at the data streaming pipelines.

Fig. 3 PII flow in the production environment

In this example, the user generates data by interacting with the Grab superapp and making a booking (step 1). The booking service, compiled with the stream’s schema definition, generates and produces Kafka records for other services to consume (step 2). Among those consuming services are the production machine learning pipelines that are of interest to this article (step 3).

PII is not masked in this process because it is actually required by the consuming services. For example, the driver app needs to display the passenger’s actual name, so the driver can confirm their identity easily.

At this stage of the process, this is not much of a concern because access to the sacrosanct production environment is highly restricted and monitored by Grab.

PII masking

To ensure the security, stability, and privacy of our users, data engineers who need to tune their new machine learning models based on production data are not granted access to the production environment. Instead, they have access to the staging environment, where production data is mirrored and PII is masked.

Fig. 4 PII masking pipeline from the production environment to the staging environment

The actual PII masking is performed by an in-house Flink application that resides in the production environment. Flink is a reference framework for data streaming that we use extensively. It is also fault tolerant, with the ability to restart from a checkpoint.

The Flink application is compiled along with the JAR containing the schema as Scala classes previously mentioned. Therefore, it is able to consume the original data as a regular Kafka consumer (step 1). It then dynamically masks the PII of the consumed data stream, based on the PII tags of the schema (step 2). Ultimately, it produces the sanitised data to the Kafka cluster in the staging environment as a normal Kafka producer (step 3).

Depending on the kind of PII, there are several methods of masking such as:

  • Names and strings of characters: These are replaced by a consistent HMAC (hash-based message authentication code). An HMAC is a digest produced by a one-way cryptographic hash function that takes a secret key as a parameter. Leveraging a secret key here is a defence against chosen plaintext attacks, i.e. computing the digest of a particular plaintext, like a targeted individual's name.
  • Numbers and dates: Similarly, these are transformed in a consistent manner, by leveraging a random generator that takes the unmasked value as a seed, so that the same PII input consistently produces the same masked output. A minimal sketch of both methods follows this list.
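Below is a sketch of both methods in Go, for illustration only: the actual masking is performed by the Flink application described above, and secret key management is omitted.

package masking

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"math/rand"
)

// MaskString replaces a string PII value with a keyed digest. The same
// input and secret key always yield the same masked output.
func MaskString(secretKey []byte, value string) string {
	mac := hmac.New(sha256.New, secretKey)
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))
}

// MaskInt transforms a numeric PII value deterministically by seeding a
// pseudo-random generator with the unmasked value.
func MaskInt(value int64) int64 {
	return rand.New(rand.NewSource(value)).Int63()
}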

Note that consistency is a recurring pattern. This is because it is a key requirement for certain machine learning models.

This sanitised data produced to the Kafka cluster in the staging environment is then consumed by the staging machine learning pipelines (step 4). There, it is used by data engineers to tune their models effectively with near real-time production data (step 5).

The Kafka cluster in the staging environment is secured with authorisation and authentication (see Zero Trust with Kafka). This is an extra layer of security in case some PII data inadvertently fall through the cracks of PII tagging, following the defence in depth principle.

Finally, whenever a new PII-tagged field is added to a schema, the PII masking Flink application needs to be compiled and deployed again. If the schema is not updated, the Flink pipeline is unable to decode this new field when deserialising the stream. Thus, the added field is just dropped and the new PII data does not make it to the staging environment.

What’s next?

For the immediate next steps, we are going to enhance this design with an in-house product based on AWS Macie to automatically detect the PII that would have fallen through the cracks. Caspian, Grab’s data lake team and one of Coban’s sister teams, has built a service that is already able to detect PII data in relational databases and data lake tables. It is currently being adapted for data streaming.

In the longer run, we are committed to taking our privacy by design posture to the next level. Indeed, the PII masking described in this article does not prevent a bad actor from retrieving the consistent hash of a particular individual based on their non-PII data. For example, the target might be identifiable by a signature in the masked data set, such as unique food or transportation habits.

A possible counter-measure could be one or a combination of the following techniques, ordered by difficulty of implementation:

  • Data minimisation: Non-essential fields in the data stream should not be mirrored at all. E.g. fields of the data stream that are not required by the data engineers to tune their models. We can introduce a dedicated tag in the schema to flag those fields and instruct the mirroring pipeline to drop them. This is the most straightforward approach.
  • Differential privacy: The mirroring pipeline could introduce some noise in the mirrored data, in a way that would obfuscate the signatures of particular individuals while still preserving the essential statistical properties of the dataset required for machine learning. It happens that Flink is a suitable framework to do so, as it can split a stream into multiple windows and apply computation over those windows. Designing and generalising a logic that meets the objective is challenging though.
  • PII encryption at source: PII could be encrypted by the producing services (like the booking service), and dynamically decrypted where plaintext values are required. However, key management and performance are two tremendous challenges of this approach.

We will explore these techniques further to find the solution that works best for Grab and ensures the highest level of privacy for our users.


Performance bottlenecks of Go application on Kubernetes with non-integer (floating) CPU allocation

Post Syndicated from Grab Tech original https://engineering.grab.com/performance-bottlenecks-go-apps

Grab’s real-time data platform team, Coban, has been running its stream processing framework on Kubernetes, as detailed in Plumbing at scale. We’ve also written another article (Scaling Kafka consumers) about vertical pod autoscaling (VPA) and the benefits of using it.

In this article, we cover the performance bottlenecks and other issues we came across for Go applications on Kubernetes.

Background

We noticed CPU throttling issues on some pipelines, leading to consumption lag, i.e. a delay between data production and consumption. This was an issue because by the time the data gets consumed, it might no longer be relevant or accurate, leading to incorrect data-driven conclusions, costly mistakes, and more.

While debugging this issue, we focused primarily on the SinktoS3 pipeline, which sinks data from Kafka topics to AWS S3. Data sinking is primarily done for archival purposes, though the archived data can also be used for analytics.

Investigation

After conducting a thorough investigation, we found two main issues:

  • Resource throttling
  • Issue with VPA

Resource throttling

We redesigned our SinktoS3 pipeline architecture to perform the most CPU-intensive operations concurrently, using parallel goroutines (workers). This improved performance and considerably reduced consumer lag.

But this high-performance architecture needed a more intensive resource configuration. As mentioned in Scaling Kafka consumers, VPA helps remove manual resource configuration, so we decided to let the SinktoS3 pipeline run on VPA. However, this exposed a new set of problems.

We tested our hypothesis on one of the highest-traffic pipelines that uses parallel goroutines (workers). When the pipeline was left running on VPA, it tried to optimise resources by slowly reducing the allocation from 2.5 cores to 2.05 cores, and then to 1.94 cores.

CPU requests dropped from 2.05 cores to 1.94 cores, since the maximum performance can be seen at ~1.7 cores.

As you can see from the image above, CPU usage and performance dropped significantly after VPA brought the CPU allocation below 2 cores. The pipeline ended up with a huge backlog to clear and, although each pod still had resources (around 1.94 cores), it did not process any faster. Instead, it slowed down significantly, resulting in throttling.

From the image above, we can see that after VPA scaled the limits of CPU down to 1.94 cores per pod, there was a sudden drop in CPU usage in each of the pods.

Stream production rate

You can see that at 21:00, CPU usage reached a maximum of 80%. This value dropped to around 50% between 10:00 and 12:00, the next peak production period.

Significant drop in consumption rate from Day_Before
Consumer lag in terms of records pending to be consumed and in terms of minutes

In the image above, we compared this data with the trend from the previous day (the purple line). We noticed a significant drop in consumption rate compared to the day before, which resulted in consumer lag. This drop was surprising since we had not tweaked the application configuration. The only change was made by VPA, which brought the CPU request and limit down to less than 2 cores.

To revert this change, we redeployed the pipeline, retaining the same application settings but setting the minimum VPA limit to 2 cores. This prevents VPA from bringing the CPU allocation below 2 cores. With this simple change, performance and CPU utilisation improved almost instantly.
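In VPA terms, this is expressed as a resource policy on the pipeline's VPA object, along these lines (object names are illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: sink-to-s3
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sink-to-s3
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: "2"   # prevent VPA from going below 2 cores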

CPU usage percentage jumped back up to ~95%
Pipeline consumption rate compared to Day_Before

In the image above, we compared the data with the trend from the day before (indicated in purple), when the pipeline was lagging and had a large backlog. You can see that the improved consumption rate was even better than the day before, with the application consuming even more records as it caught up on the backlog from the previous consumer lag.

Deep dive into the root cause

This significant improvement from merely increasing the CPU allocation from 1.94 to 2 cores was unexpected, as we had AUTO-GOMAXPROCS enabled in our stream processing framework (SPF) pipelines, and GOMAXPROCS only takes integer values for CPU.

Upon further investigation, we found that GOMAXPROCS is useful for controlling the CPU that the Go runtime uses on a Kubernetes node, where the Kubernetes cgroup masks the node's actual CPU cores. GOMAXPROCS should reflect only the pod's requested resources, so configuring this value correctly helps the runtime pre-allocate the correct CPU resources.

Without configuring GOMAXPROCS, the Go runtime assumes the node's entire CPU capacity is available for its execution, which is sub-optimal when running a Go application on Kubernetes. Thus, it is important to configure GOMAXPROCS correctly so your application pre-allocates the right number of threads based on its CPU resources. More details can be found in this article.
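The behaviour described here matches what the open-source go.uber.org/automaxprocs package provides: at startup, it sets GOMAXPROCS from the container's CPU quota, rounded down, with a minimum of 1. A minimal sketch:

package main

import (
	"fmt"
	"runtime"

	// The blank import adjusts GOMAXPROCS to the pod's CPU limit
	// at process start.
	_ "go.uber.org/automaxprocs"
)

func main() {
	// On a pod limited to 2 cores this prints 2, regardless of how
	// many cores the underlying node has.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}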

Let’s look at how Kubernetes resources relate to GOMAXPROCS value in the following table:

| Kubernetes resources | GOMAXPROCS value | Remarks |
|---|---|---|
| 2.5 cores | 2 | The Go runtime will take and utilise 2 cores efficiently. |
| 2 cores | 2 | The Go runtime will take and utilise the maximum CPU of the pod efficiently if the workload requires it. |
| 1.5 cores | 1 | AUTO-GOMAXPROCS will set the value to 1 since it rounds the non-integer CPU value down to an integer. Hence, performance will be the same as with 1 CPU core. |
| 0.5 cores | 1 | AUTO-GOMAXPROCS will set the value to 1, the minimum allowed value for GOMAXPROCS. Here we will see some throttling: Kubernetes only gives 0.5 cores, but the runtime configures itself as if it had 1, so it will starve for some CPU cycles. |

Issue with VPA

The vertical pod autoscaler enables you to easily scale pods vertically so you don't have to make manual adjustments. It automatically allocates resources based on usage and allows proper scheduling so that appropriate resources are available for each pod. However, in our case, the throttling and CPU starvation issue arose because VPA brought resources down to less than 2 cores.

To better visualise the issue, let's use an example. Assume that this application needs roughly 1.7 cores to perform all its operations without any resource throttling. Let's see what the VPA journey looks like in this scenario and where it fails to scale correctly.

| Timeline | VPA recommendation | CPU utilisation | AUTO-GOMAXPROCS | Remarks |
|---|---|---|---|---|
| T0 | 0.5 core | >90% | 1 | Throttled by the Kubernetes cgroup as it only gives 0.5 core. |
| T1 | 1 core | >90% | 1 | CPU utilisation remains >90% as the application's GOMAXPROCS setting stays the same. In reality, it needs even more. |
| T2 | 1.2 cores | <85% | 1 | Since the application actually needs more resources, VPA sets a non-integer value, but GOMAXPROCS never utilises the extra fraction and throttling continues. VPA now computes that the CPU is underutilised and won't scale further. |
| T3 | 2 cores (manual override) | 80-90% | 2 | With enough resources, the application performs optimally without throttling and achieves maximum throughput. |

Solution

During our investigation, we saw that AUTO-GOMAXPROCS sets an integer value (minimum 1). To avoid CPU throttling, we need VPA to propose integer values while scaling.

In v0.13 of VPA, this feature is available but only for Kubernetes versions ≥1.25 – see #5313 in the image below.
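
Based on our reading of that change, the integer CPU recommendation is opted into per pod through an annotation along the following lines. Treat the exact annotation key as an assumption on our part and verify it against the VPA documentation for your version:

metadata:
  annotations:
    # Assumed key from VPA v0.13's integer CPU post-processor;
    # "app" stands for the container name.
    vpa-post-processor.kubernetes.io/app_integerCPU: "true"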

We acknowledge that if we define a default minimum integer CPU value of 1 core for Coban's stream processing pipelines, it might be excessive for those that require less than 1 core. So we propose to enable this default setting only for pipelines with heavy resource requirements that need more than 1 core.

That said, you should make this decision by evaluating your application's needs. For example, some Coban pipelines still run on VPA with less than one core and do not experience any lag. As mentioned earlier, AUTO-GOMAXPROCS is configured to 1 in this case, yet these pipelines can still keep up with message production rates. Technically, they are throttled and do not perform optimally, but they have no consumer lag.

As we move from single to concurrent goroutine processing, we need more intensive CPU allocation. The following table considers some scenarios for pipelines with heavy workloads that are unable to keep up with the production rate.

| Actual CPU requirement | VPA recommendation (after upgrade to v0.13) | GOMAXPROCS value | Remarks |
|---|---|---|---|
| 0.8 cores | 1 core | 1 | Optimal setting for this pipeline. It should not lag and should utilise CPU resources optimally via concurrent goroutines. |
| 1.2 cores | 2 cores | 2 | No CPU throttling and no lag, but not very cost efficient. |
| 1.8 cores | 2 cores | 2 | Optimal performance with no lag, and cost efficient. |

Learnings/Conclusion

From this experience, we learnt several things:

  • Incorrect GOMAXPROCS configuration can lead to significant throttling and CPU starvation issues.
  • Autoscaling solutions are important, but can only take you so far. Depending on your application needs, manual intervention might still be needed to ensure optimal performance.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How we improved our iOS CI infrastructure with observability tools

Post Syndicated from Grab Tech original https://engineering.grab.com/iOS-CI-infrastructure-with-observability-tools

Note: Timestamps used in this article are in UTC+8 Singapore time, unless stated otherwise.

Background

When we upgraded to Xcode 13.1 in April 2022, we noticed a few issues, such as unstable CI tests and other problems related to the switch to Xcode 13.1.

After taking a step back, we investigated this issue by integrating some observability tools into our iOS CI development process. This gave us a comprehensive perspective of the entire process, from the beginning to the end of the UITest job. In this article, we share the improvements we made, the insights we gathered, and the impact of these improvements on the overall process and resource utilisation.

Solution

In the following sections, we elaborate on the various steps we took to investigate the issues, like unstable CI tests and high CPU utilisation, and the improvements we made to make our iOS CI infrastructure more reliable.

Analyse Xcode 13.1 CPU utilisation

If you are an iOS developer, you have likely also experienced the Spotlight process-related CPU usage problems with Xcode 13.1, which have since been resolved in Xcode 13.2. After investigating, we found that these CPU usage issues were one of the root causes of UITest's instability and something we needed to fix urgently. We decided not to wait for Apple's update as it would cost us more time to perform another round of migration.

Before we started UITest, we moved the spotlight.app into a new folder. When the test was complete, we restored the application to its original location. This significantly decreased CPU utilisation by more than 50%.

This section helps you better visualise how the different versions of Xcode affected CPU utilisation.

Xcode 12.1
Xcode 13.1 before fix
Xcode 13.1 after fix

As Grab is a superapp, there are countless scenarios that need to be thoroughly tested before a feature is released to production. One of these is deep link testing.

More than 10% of our total tests are deep link tests. Typically, it is advised to mock dependencies throughout a test to ensure that it runs quickly and reliably. However, deep link tests introduce a reliance on iOS Safari.

As a result, we created a mock browser in UITest. We pass the URL to the mock browser as a launch argument, and the same URL is then called back. This method resulted in a 20% reduction in CI time and more stable tests.

Boot the iOS simulator with permission

It is always a good idea to reset the simulator before running UITest so that there are no residual presets or simulated data from a different test. Additionally, using any of the simulator’s services (location, ATT, contacts, etc.) will prompt the simulator to request permission, which slows down execution. We used UIInterruptionHandler (a handler block for managing alerts and other dialogues) to manage asynchronous UI interruptions during the test.

We wanted to reduce test execution time, and we knew the tests involved many permissions. Therefore, to speed up execution, we boot the simulator with permissions already granted. This removes the need to handle permission prompts during UITest, which speeds up performance by 5%.

Monitor HTTP traffic during the UITest

Developers make changes to code daily, and UITests are essential for ensuring that these modifications do not adversely affect existing functionality. When writing these tests, it is important to mock all resources, as this enables us to focus on the code being tested and not on how external dependencies interact or respond. However, with a large team working concurrently, it can be challenging to ensure that nothing is actually downloaded from the internet. In fact, we discovered that a significant number of resources were being downloaded each time we ran the tests, which was highly inefficient.

To tackle this issue, we devised a custom tool that tracks all URLs accessed throughout the UITest. This enabled us to identify resources being downloaded from the internet during the testing process. By using the tool to analyse network traffic, we ensured that no resources were downloaded during testing and relied on mocked dependencies instead, resulting in reduced testing times and improved stability.

GitLab load runner analysis

At Grab, we have many teams of developers who maintain the app, make code changes, and raise merge requests (MRs) on a daily basis. To make sure that new changes don’t conflict with existing code, these MRs are integrated with CI.

Additionally, to manage the number of MRs, we maintain a list of clusters that run test runners concurrently for better resource utilisation and performance. We frequently run these tests to determine how many parallel processors are required for stable results.

Return HTTP responses to the local mock server

We have a tool that we use to mock API requests, which we improved to also support HTML responses. This increases the scope of testing and ensures the HTML response sequences work properly.

Use explicit waiting commands

When running multiple tests, timing issues are inevitable and they cause tests to occasionally pass and fail. To mitigate this, most developers prefer to add a sleep command so there is time for an element to render properly before we verify it, but this slows down execution. To improve CI execution, we introduced a check that allows us to track sleep function usage and suggest that developers use waitForExistence wrappers in UI tests.

Track each failure state

With large codebases, it is quite common to see flakiness in UITests, where tests occasionally succeed and fail without any code changes. This means that test results can be inconsistent and in some cases, faulty. Faulty testing can be frustrating, and quite expensive. This is because engineers need to re-trigger entire builds, which ends up consuming more time.

Initially, we used an internal tool that required all tests to pass on the first run before merging was allowed. However, we realised that this significantly increased engineers' manual retry time, so we modified the rules to allow merging as long as a subsequent retry passes the tests. This minor change improved our engineers' overall CI experience and did not result in more flaky tests.

Learnings/Conclusion

Our journey to improve iOS CI infrastructure is still ongoing, but from this experience, we learnt several things:

  • Focus on the feature being tested by ensuring all external responses are mocked.
  • A certain degree of test flakiness is expected, but you should monitor past trends. If flakiness increases, there's probably a deeper issue within your code.
  • Regularly monitor resource utilisation and performance – detecting a sudden spike early could save you a lot of time and money.

2.3x faster using the Go plugin to replace Lua virtual machine

Post Syndicated from Grab Tech original https://engineering.grab.com/faster-using-the-go-plugin-to-replace-Lua-VM

Abstract

We’re excited to share with you the latest update on our open-source project Talaria. In our efforts to improve performance and overcome infrastructure limitations, we’ve made significant strides by implementing the Go plugin to replace Lua VM.

Our team has found that the Go plugin is roughly 2.3x faster and uses 2.3x less memory than the Lua VM. This significant performance boost has helped us improve overall functionality, scalability, and speed.

For those who aren’t familiar, Talaria is a distributed, highly available, and low-latency time-series database that’s designed for Big Data systems. Originally developed and implemented at Grab, Talaria is a critical component in processing millions and millions of transactions and connections every day, which demands scalable, data-driven decision-making.

Background

One of the methods we previously used for processing ingested data was Lua script. This method allowed users to customise the ingestion process, providing a high degree of flexibility.

The config below is an example of using Lua script to JSON encode the row as a data column:

computed:
  - name: data
    type: json
    func: |
      local json = require("json")
      function main(row)
        return json.encode(row)
      end     

Problem

We found that loading a Lua script required launching a Lua virtual machine (VM) to execute the script, which had a significant impact on performance, especially when ingesting large amounts of events.

This performance issue led us to reevaluate our approach to processing ingested data and make changes to improve Talaria’s performance.

For benchmarking, this is the code we used to run a trim on the Lua VM, removing the keys "key1", "key2", "key3", "key4", and "key5" from the ingested data:

import "github.com/kelindar/lua"

func luaTrim() string {
    s, err := lua.FromString("test.lua", `
    local json = require("json")
    local keys = {
    "key1", "key2", "key3", "key4", "key5",
    }
    function main(input)
        local data = json.decode(input)
        for i, key in ipairs(keys) do
            data[key] = nil
        end
        return json.encode(data)
    end
`)
    if err != nil {
        panic(err)
    }
    result, err := s.Run(context.Background(), jsonstr)
    if err != nil {
        panic(err)
    }
    return result.String()
}

Here is the benchmark. Using the Lua VM is roughly 1000 times slower and uses several thousand times more memory than Golang's native function on a Trim operation:

BenchmarkTrim-12 543541 2258 ns/op 848 B/op 12 allocs/op
BenchmarkLuaTrim-12 553 2105506 ns/op 5006319 B/op 10335 allocs/op

But anything can be improved by adding a cache. What if we cache the Lua VMs and reuse them? Here is the new, improved benchmark:

BenchmarkTrim-8 232105 4995 ns/op 2192 B/op 53 allocs/op
BenchmarkLuaTrim-8 97536 12108 ns/op 4573 B/op 121 allocs/op

So we can conclude that Golang's native function is roughly 2.3x faster and uses 2.3x less memory than the cached Lua VM.

Use the Go plugin as Lua VM to execute custom code

We came up with the idea of using a Linux shared library to execute the custom function, instead of using a Lua VM to run a custom script. You may be more familiar with files with the .so suffix: they are shared libraries designed to package related functionality in a single unit, shared with other developers so that they can call the functions without writing them again.

In Golang, a similar idea is called the Go plugin, which allows you to build Golang code as a shared library (Golang calls it a plugin), open this file at runtime, and call the Go functions inside the plugin.

How to use the Go plugin

Let's say you have a function F that you want to call via the plugin.

package main
import "fmt"
func F() { fmt.Printf("Hello, world") }

After writing the function F, you can compile it as a Go plugin file f_plugin.so via go build -buildmode=plugin -o f_plugin.so. Then you can open the file and use the function F like this:

package main

import "plugin"

func main() {
    p, err := plugin.Open("f_plugin.so")
    if err != nil {
        panic(err)
    }
    f, err := p.Lookup("F")
    if err != nil {
        panic(err)
    }
    f.(func())() // prints "Hello, world"
}

Go plugin benchmark

Here are the results comparing a Golang native function call and a Golang plugin call:

  • Golang native function: 2.3x faster and uses 2.3x less memory than the Lua VM.
  • Golang plugin call: almost the same performance as the Golang native function.

BenchmarkNativeFunctionCall-12 2917465 401.7 ns/op 200 B/op 6 allocs/op
BenchmarkPluginCall-12 2778988 447.1 ns/op 200 B/op 6 allocs/op

Integrated into Talaria

This is the MR in which we integrated the Go plugin into Talaria: https://github.com/talariadb/talaria/pull/87, adding it as a loader alongside the LuaLoader.

Both loaders implement the Handler interface.

type Handler interface {
    Load(uriOrCode string) (Handler, error)
    String() string
    Value(map[string]interface{}) (interface{}, error)
}

The implementation of this interface is listed here:

For Lua loader

Load: Load the Lua code or Lua script file path (local file path or s3 path) as the loader.

String: Return “lua” so that we can call it to get what the loader is.

Value: Run the Lua script, and take the arg as input.

For Go plugin loader

Load: Read the plugin file path (local file path or s3 path), open it as a plugin, look up the function name defined by the user, and save the function for later use.

String: Return “plugin” so that we can call it to get what the loader is.

Value: Run the saved function, take the arg as input.
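
To make this concrete, here is a simplified sketch of what the Go plugin loader could look like. The real Talaria loader also handles s3 downloads and richer error handling, and the function name "Main" here is just an assumption:

package loader

import "plugin"

// Handler is the shared interface, repeated here for completeness.
type Handler interface {
    Load(uriOrCode string) (Handler, error)
    String() string
    Value(map[string]interface{}) (interface{}, error)
}

// GoPluginLoader executes a user-provided function from a compiled Go plugin.
type GoPluginLoader struct {
    fn func(map[string]interface{}) (interface{}, error)
}

func (l *GoPluginLoader) Load(path string) (Handler, error) {
    p, err := plugin.Open(path) // open the .so file
    if err != nil {
        return nil, err
    }
    sym, err := p.Lookup("Main") // assumed user-defined function name
    if err != nil {
        return nil, err
    }
    l.fn = sym.(func(map[string]interface{}) (interface{}, error))
    return l, nil
}

func (l *GoPluginLoader) String() string { return "plugin" }

func (l *GoPluginLoader) Value(row map[string]interface{}) (interface{}, error) {
    return l.fn(row) // run the saved function, taking the row as input
}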

Things you need to notice

The Go version used to build the Golang plugin must be the same as the one used to build the service that loads it. We use Docker to build the service so that we can ensure the Go versions are the same.

Reference (benchmark of the Go plugin and Lua)

https://github.com/atlas-comstock/talaria_benchmark/tree/master/benchmark_plugin_and_lua

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Safer deployment of streaming applications

Post Syndicated from Grab Tech original https://engineering.grab.com/safer-flink-deployments

The Flink framework has gained popularity as a real-time stateful stream processing solution for distributed stream and batch data processing. Flink also provides data distribution, communication, and fault tolerance for distributed computations over data streams. To fully leverage Flink’s features, Coban, Grab’s real-time data platform team, has adopted Flink as part of our service offerings.

In this article, we explore how we ensure that deploying Flink applications remain safe as we incorporate the lessons learned through our journey to continuous delivery.

Background

Figure 1. Flink platform architecture within Coban

Users interact with our systems to develop and deploy Flink applications in three different ways.

Firstly, users create a Merge Request (MR) to develop their Flink applications on our Flink Scala repository, according to business requirements. After the MR is merged, GitOps Continuous Integration/Continuous Deployment (CI/CD) automatically runs and dockerises the application, allowing the containerised applications to be deployed easily.

Secondly, users create another MR to our infrastructure as a code repository. The GitOps CI/CD that is integrated with Terraform runs and configures the created Spinnaker application. This process configures the Flink application that will be deployed.

Finally, users trigger the actual deployment of the Flink applications on Spinnaker, which orchestrates the deployment of the Docker image onto our Kubernetes cluster. Flink applications are deployed as standalone clusters in Grab to ensure resource isolation.

Problem

The main issue we noticed with streaming pipelines like these is that they are often interconnected, where application A depends on application B's output. This makes it hard to find a solution that perfectly includes integration tests and ensures that propagated changes do not affect downstream applications.

However, this problem statement is too large to solve with a single solution. As such, we are narrowing the problem statement to focus on ensuring safety of our applications, where engineers can deploy Flink applications that will be rolled back if they fail health checks. In our case, the definition of a Flink application’s health is limited to the uptime of the Flink application itself.

It is worth noting that Flink applications are designed to be stateful streaming applications, meaning a “state” is shared between events (stream entities) and thus, past events can influence the way current events are processed. This also implies that traditional deployment strategies do not apply to the deployment of Flink applications.

Current strategy

Figure 2. Current deployment stages

In Figure 2, our current deployment stages are split into three parts:

  1. Delete current deployment: Remove current configurations (if applicable) to allow applications to pick up the new configurations.
  2. Bake (Manifest): Bake the Helm charts with the provided configurations.
  3. Deploy (Manifest): Deploy the charts onto Kubernetes.

Over time, we learnt that this strategy can be risky. Part 2 can result in a loss of Flink application states due to how internal CI/CD processes are set up. There is also no easy way to rollback if an issue arises. Engineers will need to revert all config changes and rollback the deployment manually by re-deploying the older Docker image – which results in slower operation recovery.

Lastly, there are no in-built monitoring mechanisms that perform regular health probes. Engineers need to manually monitor their applications to see if their deployment was successful or if they need to perform a rollback.

With all these issues, deploying Flink applications is often stressful for engineers and fraught with uncertainty. Common mitigation strategies are canary and blue-green deployments, which we cover in the next section.

Canary deployments

Figure 3. Canary deployment

In canary deployments, you roll out the new version of the application in parallel with the production version, serving a percentage of total traffic before gradually promoting it.

This does not work for Flink deployments due to the nature of stream processing. Applications are frequently required to do streaming operations like stream joining, which involves matching related events in different Kafka topics. So, if a Flink application is only receiving a portion of the total traffic, the data generated will be considered inaccurate due to incomplete data inputs.

Blue-green deployments

Figure 4. Blue-green deployment

Blue-green deployments work by running two versions of the application, with a load balancer acting as a traffic switch that determines which version traffic is directed to.

This method might work for Flink applications if we only allow one version of the application to consume Kafka messages at any point in time. However, we noticed some issues when switching traffic to another version. For example, the states of the two versions will be inconsistent because of the different data traffic each version receives, which complicates the process of switching Kafka consumption traffic.

So if there's a failure and we need to roll back from the Green to the Blue deployment, or vice versa, we need to take an extra step to ensure that, before the failure, the data traffic received was exactly the same for both deployments.

Solution

As previously mentioned, it is crucial for streaming applications to ensure that at any point in time, only one application is receiving data traffic to ensure data completeness and accuracy. Although employing blue-green deployments can technically fulfil this requirement, the process must be modified to handle state consistency such that both versions have the same starting internal state and receive the same data traffic as each other, if a rollback is needed.

Figure 5. Visualised deployment flow

This deployment flow will operate in the following way:

  1. Collect metadata regarding current application
  2. Take savepoint and stop the current application
  3. Clear up high availability configurations
  4. Bake and deploy the new application
  5. Monitor application and rollback if the health check fails

Let’s elaborate on the key changes implemented in this new process.

Savepointing

Flink’s savepointing feature helps address the issue of state consistency and ensures safer deployments.

A savepoint in Flink is a snapshot of a Flink application’s state at the point in time. This savepoint allows us to pause the Flink application and restore the application to this snapshot state, if there’s an issue.

Before deploying a Flink application, we take a savepoint via the Flink API before killing the current application. This enables us to save the current state of the Flink application and roll back if our deployment fails, just like how you would do a quick save before attempting a difficult level when playing games. This mechanism ensures that both deployment versions have the same internal state during deployment, as they both start from the same savepoint.

Additionally, this feature allows us to easily handle Kafka offsets, since the consumed offsets are stored as part of the savepoint. As Flink manages its own state, it doesn't need to rely on Kafka's consumer offset management. With this savepoint feature, we can ensure that the application receives the same data traffic post rollback and that no messages are lost due to processing on the failed version.
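
As a sketch, taking a savepoint and stopping the job can be done through Flink's REST API. The job manager URL, job ID, and savepoint directory below are placeholders, and production code would also poll the returned trigger ID for completion:

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// triggerSavepoint asks the Flink JobManager to take a savepoint and
// cancel the job in one step, via Flink's public REST API.
func triggerSavepoint(jobManagerURL, jobID, targetDir string) error {
    body, _ := json.Marshal(map[string]interface{}{
        "target-directory": targetDir,
        "cancel-job":       true, // stop the current application after the snapshot
    })
    url := fmt.Sprintf("%s/jobs/%s/savepoints", jobManagerURL, jobID)
    resp, err := http.Post(url, "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusAccepted {
        return fmt.Errorf("savepoint request failed: %s", resp.Status)
    }
    return nil
}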

Monitoring

To consistently monitor Flink applications, we can conduct health probes to the respective API endpoints to check if the application is stuck in a restart state or if it is running healthily.

We also configured our monitoring jobs to wait for a few minutes for the deployment to stabilise before probing it over a defined duration, to ensure that the application is in a stable running state.

Rollback

If the health checks fail, we then perform an automatic rollback. Typically, Flink applications are deployed as a standalone cluster and a rollback involves changes in one of the following:

  • Application and Flink configurations
  • Taskmanager or Jobmanager resource provision

For configuration changes, we leverage the fact that Spinnaker performs versioned deployment of configmap resources. In this case, a rollback simply involves mounting the old configmap back onto the Kubernetes deployment.

To retrieve the old version of the configmap mount, we can utilise Kubernetes' rollback mechanisms. Kubernetes updates a deployment by creating a new replicaset with an incremental version, attaching it to the current deployment, and scaling the previous replicaset to 0. To retrieve previous deployment specs, we just need to list all replicasets related to the deployment, find the previously deployed version, and update the current deployment to mimic the previous template specifications.
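
A minimal sketch of that lookup with the Kubernetes Go client, assuming the deployment's label selector is known; error handling and edge cases are simplified:

package main

import (
    "context"
    "sort"
    "strconv"

    appsv1 "k8s.io/api/apps/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// previousReplicaSet returns the second-newest ReplicaSet owned by a
// deployment, ordered by the revision annotation Kubernetes maintains.
func previousReplicaSet(ctx context.Context, cs kubernetes.Interface, ns, selector string) (*appsv1.ReplicaSet, error) {
    list, err := cs.AppsV1().ReplicaSets(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
    if err != nil {
        return nil, err
    }
    rss := list.Items
    sort.Slice(rss, func(i, j int) bool { // newest revision first
        ri, _ := strconv.Atoi(rss[i].Annotations["deployment.kubernetes.io/revision"])
        rj, _ := strconv.Atoi(rss[j].Annotations["deployment.kubernetes.io/revision"])
        return ri > rj
    })
    if len(rss) < 2 {
        return nil, nil // no previous version to roll back to
    }
    return &rss[1], nil
}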

However, this deployment spec does not contain the number of replicas previously configured for the task managers. Kubernetes does not register the number of replicas as part of the deployment configuration, as this is a dynamic value that might change during processing due to autoscaling operations.

Our Flink applications are deployed as standalone clusters and do not use native or YARN resource providers. Coupled with the fact that Flink has strict resource provisioning, we realised that we did not have enough information to perform rollbacks without the exact number of replicas previously created.

Taskmanager or Jobmanager resource provision changes

To gather information about resource provision changes, we simply include the previously configured number of replicas as part of our metadata annotations. This allows us to retrieve it during future rollbacks.

Making this change involves an additional metadata retrieval step, which fetches previous deployment states and stores them as annotations on the new deployment.
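
For instance, the previous task manager replica count can be carried over as an annotation on the new deployment; the annotation key below is hypothetical:

metadata:
  annotations:
    # Hypothetical key: records the task manager replica count of the
    # previous deployment so an automated rollback can restore it.
    coban.grab.com/previous-taskmanager-replicas: "8"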

Impact

With this solution, the deployment flow on Spinnaker looks like this:

Figure 6. New deployment flow on Spinnaker

Engineers no longer need to monitor the deployment pipeline as closely, as they are notified of their application's deployment status via Slack. They only need to interact or take action when notified that the different stages of the deployment pipeline are completed.

Figure 7. Slack notifications on deployment status

It is also easier to deploy Flink applications since failures and rollbacks are handled automatically. Furthermore, application state management is also automated, which reduces the amount of uncertainties.

What’s next?

As we work to further improve our deployment pipeline, we will look into extending the capabilities at our monitoring stage to allow engineers to define and configure their own health probes, allowing our deployment configurations to be more extendable.

Another interesting improvement will be to make this deployment flow seamless, ensuring as little downtime as possible by minimising cold start duration.

Coban also looks forward to pushing more features on our Flink platform to enable our engineers to explore more use cases that utilise real-time data, allowing our operations to become auto-adaptive and make data-driven decisions.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Message Center – Redesigning the messaging experience on the Grab superapp

Post Syndicated from Grab Tech original https://engineering.grab.com/message-center

Since 2016, Grab has been using GrabChat, a built-in messaging feature to connect our users with delivery-partners or driver-partners. However, as the Grab superapp grew to include more features, the limitations of the old system became apparent. GrabChat could only handle two-party chats because that’s what it was designed to do. To make our messaging feature more extensible for future features, we decided to redesign the messaging experience, which is now called Message Center.

Migrating from the old GrabChat to the new Message Center

To some, building our own chat function might not seem ideal, especially with open-source alternatives like Signal. However, Grab's business requirements introduce a level of complexity that required us to develop our own solution.

Some of these requirements include, but are not limited to:

  • Handle multiple user types (passengers, driver-partners, consumers, delivery-partners, customer support agents, merchant-partners, etc.) with custom user interface (UI) rendering logic and behaviour.
  • Enable other Grab backend services to send system-generated messages (e.g. "your driver is arriving") and customise push notifications.
  • Persist message state even if users uninstall and reinstall their apps. Users should be able to receive undelivered messages even if they were offline for hours.
  • Provide translation options for non-native speakers.
  • Filter profanities in the chat.
  • Allow users to handle group chats. This feature might come in handy in future if there needs to be communication between passengers, driver-partners, and delivery-partners.

Solution architecture

Message Center architecture

The new Message Center was designed to have two components:

  1. Message-center backend: Message processor service that handles logical and database operations.
  2. Message-center postman: Message delivery service that can scale independently from the backend service.

This architecture allows the services to be sufficiently decoupled and scale independently. For example, if you have a group chat with N participants and each message sent results in N messages being delivered, this architecture would enable message-center postman to scale accordingly to handle the higher load.

As Grab delivers millions of events a day via the Message Center service, we need to ensure that our system can handle high throughput. As such, we are using Apache Kafka as the low-latency high-throughput event stream connecting both services and Amazon SQS as a redundant delay queue that attempts a retry 10 seconds later.

Another important aspect for this service is the ability to support low-latency and bi-directional communications from the client to the server. That’s why we chose Transmission Control Protocol (TCP) as the main protocol for client-server communication. Mobile and web clients connect to Hermes, Grab’s TCP gateway service, which then digests the TCP packets and proxies the payloads to Message Center via gRPC. If both recipients and senders are online, the message is successfully delivered in a matter of milliseconds.

Unlike HTTP, individual TCP packets do not require a response, so there is an inherent uncertainty in whether messages were successfully delivered. Message delivery can fail for several reasons, such as the client terminating the connection while the server's connection remains established. This is why we built a system of acknowledgements (ACKs) between the client and server, which ensures that every event is received by the receiving party.

The following diagram shows the high-level sequence of events when sending a message.

Events involved in sending a message on Message Center

Following the sequence of events involved in sending a message and updating its status for the sender, from sending to sent to delivered to read, the process can get complicated very quickly. For example, the sender will retry the 1302 TCP new message until it receives a server ACK. Similarly, the server will keep attempting to send the 1402 TCP message receipt or 1303 TCP message until it receives a client ACK. With this in mind, we knew we had to give special attention to the ACK implementation to prevent infinite retries on the client and server, which can quickly cascade into a system-wide failure.
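
As an illustrative sketch of this principle, a capped, backed-off retry loop is one way to stop a missing ACK from becoming an infinite retry; the helper functions here are stand-ins for the real TCP layer, and the timings are illustrative:

package main

import (
    "errors"
    "fmt"
    "time"
)

// sendWithAck retries an event until the peer acknowledges it, with
// exponential backoff and a hard cap on attempts so that a missing ACK
// cannot cascade into an infinite retry storm.
func sendWithAck(send func() error, waitAck func(timeout time.Duration) bool) error {
    backoff := 200 * time.Millisecond
    for attempt := 1; attempt <= 5; attempt++ {
        if err := send(); err == nil && waitAck(backoff) {
            return nil // ACK received
        }
        time.Sleep(backoff)
        backoff *= 2 // back off before the next retry
    }
    return errors.New("no ACK after max attempts; fall back to the push notification path")
}

func main() {
    err := sendWithAck(
        func() error { fmt.Println("1302 TCP new message"); return nil },
        func(time.Duration) bool { return true }, // pretend the ACK arrived
    )
    fmt.Println("result:", err)
}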

Lastly, we also had to consider dropped TCP connections on mobile devices, which happens quite frequently. What happens then? Message Center relies on Hedwig, another in-house notification service, to send push notifications to the mobile device when it receives a failed response from Hermes. Message Center also maintains a user-events DynamoDB database, which updates the state of every pending event of the client to delivered whenever a client ACK is received.

Every time the mobile client reconnects to Hermes, it also sends a special TCP message to notify Message Center that the client is back online, and then the server retries sending all the pending events to the client.

Learnings/Conclusion

With large-scale features like Message Center, it’s important to:

  • Decouple services so that each microservice can function and scale as needed.
  • Understand our feature requirements well so that we can make the best choices and design for extensibility.
  • Implement safeguards to prevent system timeouts, infinite loops, or other failures from cascading to the entire system, i.e. rate limiting, message batching, and idempotent eventIDs.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Evolution of quality at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/evolution-of-quality

To achieve our vision of becoming the leading superapp in Southeast Asia, we constantly need to balance development velocity with maintaining the high quality of the Grab app. Like most tech companies, we started out with the traditional software development lifecycle (SDLC) but as our app evolved, we soon noticed several challenges like high feature bug counts and production issues.

In this article, we dive deeper into our quality improvement journey that officially began in 2019, the challenges we faced along the way, and where we stand as of 2022.

Background

Figure 1 – Software development life cycle (SDLC) sample

When Grab first started in 2012, we were using the Agile SDLC (Figure 1) across all teams and features. This meant that every new feature went through the entire process and was only released to app distribution platforms (PlayStore or AppStore) after the quality assurance (QA) team manually tested and signed off on it.

Over time, we discovered that feature testing took longer, with more bugs being reported and more impact areas needing to be tested. The same applied to regression testing, as QA engineers had to manually test each feature in the app before a release. Despite the best efforts of our QA teams, there were still many major and critical production issues reported on our app – the highest numbers were in 2019 (Figure 2).

Figure 2 – Critical open production issue (OPI) trend

This surge in production issues and feature bugs was directly impacting our users’ experience on our app. To directly address the high production issues and slow testing process, we changed our testing strategy and adopted shift-left testing.

Solution

Shift-left testing is an approach that brings testing forward to the early phases of software development. This means testing can start as early as the planning and design phases.

Figure 3 – Shift-left testing

By adopting shift-left testing, engineering teams at Grab are able to proactively prevent possible defect leakage in the early stages of testing, directly addressing our users’ concerns without delaying delivery times.

With shift-left testing, we made three significant changes to our SDLC:

  • Software engineers conduct acceptance testing
  • Incorporate Definition of Ready (DoR) and Definition of Done (DoD)
  • Balanced testing strategy

Let’s dive deeper into how we implemented each change, the challenges, and learnings we gained along the way.

Software engineers conduct acceptance testing

Acceptance testing determines whether a feature satisfies the defined acceptance criteria, which helps the team evaluate if the feature fulfils our consumers' needs. Typically, acceptance testing is done after development, but our QA engineers were still discovering many bugs at this stage, and fixing bugs this late is more expensive and time-consuming. We also realised that the most common root causes of bugs were insufficient requirements, vague details, or missing test cases.

With shift-left testing, QA engineers start writing test cases before development starts and these acceptance tests will be executed by the software engineers during development. Writing acceptance tests early helps identify potential gaps in the requirements before development begins. It also prevents possible bugs and streamlines the testing process as engineers can find and fix bugs even before the testing phase. This is because they can execute the test cases directly during the development stage.

On top of that, QA and Product managers also made Given/When/Then (GWT) the standard for acceptance criteria and test cases, making them easier for all stakeholders to understand.

Step-by-step style:

  1. Open the Grab app
  2. Navigate to home feed
  3. Tap on merchant entry point card
  4. Check that merchant landing page is shown

GWT format:

Given user opens the app
And user navigates to the home feed
When the user taps on the merchant entry point card
Then the user should see the merchant's landing page

By enabling software engineers to conduct acceptance testing, we minimised back-and-forth discussions within the team regarding bug fixes and also, influenced a significant shift in perspective – quality is everyone’s responsibility.

Another key aspect of shift-left testing is for teams to agree on a standard of quality in earlier stages of the SDLC. To do that, we started incorporating Definition of Ready (DoR) and Definition of Done (DoD) in our tasks.

Incorporate Definition of Ready (DoR) and Definition of Done (DoD)

As mentioned, quality checks can be done before development even begins and can start as early as backlog grooming and sprint planning. The team needs to agree on a standard for work products such as requirements, design, engineering solutions, and test cases. Having this alignment helps reduce the possibility of unclear requirements or misunderstandings that may lead to re-work or a low-quality feature.

To enforce consistent quality of work products, everyone in the team should have access to these products and should follow DoRs and DoDs as standards in completing their tasks.

  • DoR: Explicit criteria that an epic, user story, or task must meet before it can be accepted into an upcoming sprint. 
  • DoD: List of criteria to fulfill before we can mark the epic, user story, or task complete, or the entry or exit criteria for each story state transitions. 

Including DoRs and DoDs has proven to improve delivery pace and quality. One of the first teams to adopt this observed significant improvements in their delivery speed and app quality – consistently delivering over 90% of task commitments, minimising technical debt, and reducing manual testing time.

Unfortunately, these two changes alone were not sufficient – testing was still manually intensive and time-consuming. To ease the load on our QA engineers, we needed a balanced testing strategy.

Balanced testing strategy

Figure 4 – Test automation strategy

Our initial automation strategy only included unit testing, but we have since enhanced our testing strategy to be more balanced.

  • Unit testing
  • UI component testing
  • Backend integration testing
  • End-to-End (E2E) testing

Simply having good coverage in one layer does not guarantee good quality of an app or new feature. It is important for teams to test vigorously with different types of testing to ensure that we cover all possible scenarios before a release.

As you already know, unit tests are written and executed by software engineers during the development phases. Let’s look at what the remaining three layers mean.

UI component testing

This type of testing focuses on individual components within the application and is useful for testing specific use cases of a service or feature. To reduce manual effort from QA engineers, teams started exploring automation and introduced a mobile testing framework for component testing.

This UI component testing framework used mocked API responses to test screens and interactions on the elements. These UI component tests were automatically executed whenever the pipeline was run, which helped to reduce manual regression efforts. With shift-left testing, we also revised the DoD for new features to include at least 70% coverage of UI component tests.

Backend integration testing

Backend integration testing is especially important if your application regularly interacts with backend services, much like the Grab app. This means we need to ensure the quality and stability of these backend services. Since Grab started its journey toward becoming a superapp, more teams started performing backend integration tests like API integration tests.

Our backend integration tests also cover positive and negative test cases to exercise both the happy and unhappy paths. At the moment, the majority of Grab teams have complete test coverage for happy path use cases and are continuously improving coverage for other cases.
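
As an illustration, a backend integration test in Go typically exercises both paths against a test server. The /v1/orders endpoint and handler below are hypothetical stand-ins:

package integration

import (
    "net/http"
    "net/http/httptest"
    "testing"
)

// A minimal sketch of happy/unhappy path API integration tests.
func TestCreateOrder(t *testing.T) {
    // Test double standing in for a backend service that requires auth.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.Header.Get("Authorization") == "" {
            w.WriteHeader(http.StatusUnauthorized) // unhappy path
            return
        }
        w.WriteHeader(http.StatusCreated) // happy path
    }))
    defer srv.Close()

    // Happy path: an authorised request succeeds.
    req, _ := http.NewRequest(http.MethodPost, srv.URL+"/v1/orders", nil)
    req.Header.Set("Authorization", "Bearer test-token")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        t.Fatal(err)
    }
    if resp.StatusCode != http.StatusCreated {
        t.Fatalf("expected 201, got %d", resp.StatusCode)
    }

    // Unhappy path: a request without credentials is rejected.
    resp, err = http.Post(srv.URL+"/v1/orders", "application/json", nil)
    if err != nil {
        t.Fatal(err)
    }
    if resp.StatusCode != http.StatusUnauthorized {
        t.Fatalf("expected 401, got %d", resp.StatusCode)
    }
}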

End-to-End (E2E) testing

E2E tests are important because they simulate the entire user experience from start to end, ensuring that the system works as expected. We started exploring E2E testing frameworks, from as early as 2015, to automate tests for critical services like logging in and booking a ride.

But as Grab introduced more services, off-the-shelf solutions were no longer viable, as we noticed issues like automation limitations and increased test flakiness. We needed a framework that was compatible with existing processes, stable enough to reduce flakiness, scalable, and easy to learn.

With these criteria in mind, our QA engineering teams built an internal E2E framework that can make API calls, test different account-based scenarios, and provide many other features. Multiple pilot teams have started implementing tests with the E2E framework, which has helped to reduce regression efforts. We are continuously improving the framework by adding new capabilities to cover more test scenarios.

Now that we’ve covered all the changes we implemented with shift-left testing, let’s take a look at how this changed our SDLC.

Impact

Figure 5 – Updated SDLC process

Since the implementation of shift-left testing, we have improved our app quality without compromising our project delivery pace. Compared to 2019, we observed the following improvements within the Grab superapp in 2022:

  • Production issues with “Major and Critical” severity bugs found in production were reduced by 60%
  • Bugs found in development phase with “Major and Critical” severity were reduced by 40%

What’s next?

Through this journey, we recognise that there’s no such thing as a bug-free app – no matter how much we test, production issues still happen occasionally. To minimise the occurrence of bugs, we’re regularly conducting root cause analyses and writing postmortem reports for production incidents. These allow us to retrospect with other teams and come up with corrective actions and prevention plans. Through these continuous learnings and improvements, we can continue to shape the future of the Grab superapp.

Special thanks to Sori Han for designing the images in this article.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Determine the best technology stack for your web-based projects

Post Syndicated from Grab Tech original https://engineering.grab.com/determining-tech-stack

In the current technology landscape, startups are developing rapidly. This usually leads to an increase in the number of engineers in teams, with the goal of increasing the speed of product development and delivery frequency. However, this growth often leads to a diverse selection of technology stacks being used by different teams within the same organisation.

Having different technology stacks within a team could lead to bigger problems in the future, especially if documentation is not well-maintained. The best course of action is to pick just one technology stack for your projects, which raises the question: "How do I choose the best technology stack for my projects?"

One such example is OVO, an Indonesian payments, rewards, and financial services platform within Grab. In this article, we share the process and analysis we used to determine the technology stack that best complies with our standards. By the end, you may also learn how to choose the best technology stack for your needs.

Background

In recent years, we have seen massive growth in modern web technologies, such as React, Angular, Vue, Svelte, Django, TypeScript, and many more. Each technology has its benefits. However, having so many choices can be confusing when you must determine which technologies are best for your projects. To narrow down the choices, a few aspects, such as scalability, stability, and usage in the market, must be considered.

That’s the problem that we used to face. Most of our legacy services were not standardised and were written in different languages like PHP, React, and Vue. Also, the documentation for these legacy services is not well-structured or regularly updated.

Current technology stack usage in OVO

We realised that we had two main problems:

  • Various technology stacks (PHP, Vue, React, Nuxt, and Go) maintained simultaneously, with incomplete documentation, can make understanding the code very time-consuming, especially for engineers unfamiliar with the frameworks, or for new hires.
  • Context switching when reviewing code makes it hard to review other teammates’ merge requests on complex projects and quickly offer better code suggestions.

To prevent these problems from recurring, teams must use one primary technology stack.

After detailed comparisons, we narrowed our choices to two options – React and Vue – because we have developed projects in both technologies and already have the user interface (UI) library in each technology stack.

Taken from ulam.io

Next, we conducted more detailed research and exploration of each technology. The main goals were to find the unique features, scalability, ease of migration, and UI library compatibility of React and Vue. To test the compatibility of each UI library, we also took a sample UI from one of our upcoming projects and sliced it.

Here’s a quick summary of our exploration:

| Metrics | Vue | React |
|---|---|---|
| UI library compatibility | Doesn't require much component development | Doesn't require much component development |
| Scalability | Easier to upgrade, slower in releasing major updates, clear migration guide | Quicker release of major versions, supports gradual updates |
| Others | Composition API, strong community (Vue Community) | Latest version (v18) supports gradual updates, doesn't support IE |

From this table, we found that the differences between these frameworks are minuscule, making it tough for us to determine which to use. Ultimately, we decided to step back and look at the Big Why.

Solution

The Big Why here was “Why do we need to standardise our technology stack?”. We wanted to ease the onboarding process for new hires and reduce the complexity, like context switching, during code reviews, which ultimately saves time.

As Kleppmann (2017) states, “The majority of the cost of software is in its ongoing maintenance”. In this case, the biggest cost was time. Increasing the ease of maintenance would reduce the cost, so we decided to use maintainability as our north star metric.

Kleppmann (2017) also highlighted three design principles in any software system:

  • Operability: Make it easy to keep the system running.
  • Simplicity: Easy for new engineers to understand the system by minimising complexity.
  • Evolvability: Make it easy for engineers to make changes to the system in the future.

Keeping these design principles in mind, we defined three metrics that our selected tech stack must achieve:

  • Scalability
    • Keeping software and platforms up to date
    • Anticipating possible future problems
  • Stability of the library and documentation
    • Establishing good practices and tools for development
  • Usage in the market
    • The popularity of the library or framework and variety of coding best practices

| Metrics | Vue | React |
|---|---|---|
| Scalability | Framework. Operability: easier to update because there aren't many approaches to writing Vue. Evolvability: since Vue is a framework, it needs fewer steps to upgrade. | Library. Supports gradual updates, but there will be many different approaches when upgrading React on our services. |
| Stability of the library and documentation | Has standardised documentation | Has many versions of documentation |
| Usage in the market | Smaller market share. Simplicity: we can reduce complexity for new hires, as the Vue standard in OVO remains consistent with standards in other companies. | Larger market share. Many React variants are currently in the market, so different companies may have different folder structures/conventions. |

Screenshot taken from https://www.statista.com/ on 2022-10-13

After conducting a detailed comparison between Vue and React, we decided to use Vue as our primary tech stack as it best aligns with Kleppmann’s three design principles and our north star metric of maintainability. Even though we noticed a few disadvantages to using Vue, such as smaller market share, we found that Vue is still the better option as it complies with all our metrics.

Moving forward, we will only use one tech stack across our projects, but we decided not to migrate existing projects to the new technology. This allows us to continue exploring and learning about developments in other technologies. One thing we need to do is ensure that our current projects are kept up to date.

Implementation

After deciding on the primary technology stack, we had to do the following:

  • Define a boilerplate for future Vue projects, which will include items like a general library or dependencies, implementation for unit testing, and folder structure, to align with our north star metric.
  • Update our existing UI library with new components and the latest Vue version.
  • Perform periodic upgrades to existing React services and create a standardised code structure with proper documentation.

With these practices in place, we can ensure that future projects will be standardised, making them easier for engineers to maintain.

Impact

There are a few key benefits of standardising our technology stack.

  • Scalability and maintainability: It’s much easier to scale and maintain projects using the same technology stack. For example, when implementing security patches on all projects due to certain vulnerabilities in the system or libraries, we will need one patch for each technology. With only one stack, we only need to implement one patch across all projects, saving a lot of time.
  • Faster onboarding process: The onboarding process is simplified for new hires because we have standardisation between all services, which will minimise the amount of context switching and lower the learning curve.
  • Faster deliveries: When it’s easier to implement a change, there’s a compounding impact where the delivery process is shortened and release to production is quicker. Ultimately, faster deliveries of a new product or feature will help increase revenue.

Learnings/Conclusion

For every big decision, it is important to take a step back and understand the Big Why or the main motivation behind it, in order to remain objective. That’s why after we identified maintainability as our north star metric, it was easier to narrow down the choices and make detailed comparisons.

The north star metric, or deciding factor, might differ vastly, but it depends on the problems you are trying to solve.

References

  • Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly Media.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Migrating from Role to Attribute-based Access Control

Post Syndicated from Grab Tech original https://engineering.grab.com/migrating-to-abac

Grab has always regarded security as one of our top priorities; this is especially important for data platform teams. We need to control access to data and resources in order to protect our consumers and ensure compliance with various, continuously evolving security standards.

Additionally, we want to keep the process convenient, simple, and easily scalable for teams. However, as Grab continues to grow, we have more services and resources to manage and it becomes increasingly difficult to keep the process frictionless. That’s why we decided to move from Role-Based Access Control (RBAC) to Attribute-Based Access Control (ABAC) for our Kafka Control Plane (KCP).

In this article, you will learn how Grab’s streaming data platform team (Coban) eliminated the manual management of hundreds of roles, permissions, and resources, and reduced the operational overhead of requesting and approving permissions to zero by moving from RBAC to ABAC.

Introduction

Kafka is widely used across Grab teams as a streaming platform. For decentralised Kafka resource (e.g. topic) management, teams have the right to create, update, or delete based on their needs. As the data platform team, we implemented a KCP to ensure that these operations are only performed by authorised parties, especially on multi-tenant Kafka clusters.

For internal access management, Grab uses its own Identity and Access Management (IAM) service, based on RBAC, to support authentication and authorisation processes:

  • Authentication verifies the identity of a user or service, for example, if the provided token is valid or expired.
  • Authorisation determines their access rights, for example, whether users can only update and/or delete their own Kafka topics.

In RBAC, roles, permissions, actions, resources, and the relationships between them need to be defined in the IAM service. They are used to determine whether a user can access a certain resource.

In the following example, we can see how IAM concepts come together. The Coban engineer role belongs to the Engineering-coban group and has permission to update the retention topic. Any engineer added to the Engineering-coban group will also be able to update the topic retention.
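
To make the bookkeeping burden concrete, the artifacts behind this one example could be represented as follows (an illustrative Python sketch; all names are hypothetical, and every new team or resource multiplies such entries):

# Illustrative RBAC artifacts for a single team (all names hypothetical).
roles = {"coban-engineer": {"group": "Engineering-coban"}}

permissions = {
    "update-topic-retention": {
        "action": "kafka:UpdateTopic",
        "resource": "kafka/topic/coban-orders",
    }
}

# Role-to-permission mappings must be maintained by hand.
role_permissions = {"coban-engineer": ["update-topic-retention"]}

def can(user_groups, permission_id):
    # A user is granted a permission if any of their groups maps to a role
    # that holds the permission.
    return any(
        permission_id in role_permissions.get(role, [])
        for role, meta in roles.items()
        if meta["group"] in user_groups
    )

print(can({"Engineering-coban"}, "update-topic-retention"))  # True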

Following the same concept, each team using the KCP has its own roles, permissions, and resources created in the system. However, there are some disadvantages to this approach:

  • It leads to significant growth in the number of access control artifacts that both platform and user teams need to manage, and increases the time and effort needed to debug access control issues. Debugging starts with finding which groups an engineer belongs to, locating the group used for KCP, and then tracing through roles to permissions.
  • All group membership requests from new joiners need to be reviewed and approved by their direct managers. This creates a backlog, as new joiners might have multiple groups to join and managers might not be able to review requests in a timely manner. In some cases, roles need to be re-applied for or renewed every 90 days, which further adds to the delay.
  • Group memberships are not updated to reflect the active members of a team, leaving some engineers with access they don’t need and others without access they should have.

Solution

With ABAC, access management becomes a lot easier. Any new joiner to a specific team gets the same access rights as everyone on that team – no need for manual approval from a manager. However, for ABAC to work, we need these components in place:

  • User attributes: Who is the subject (actor) of a request?
  • Resource attributes: Which object (resource) does the actor want to deal with?
  • Evaluation engine: How do we decide if the actor is allowed to perform the action on the resource?

User attributes

All users have certain attributes depending on the department or team they belong to. These attributes are stored and automatically synced from the human resource management system (HRMS) tool, which acts as the source of truth for Grab-wide data, every time a user switches teams or roles, or leaves the company.

Resource attributes

Resource provisioning is an authenticated operation, which means KCP knows who sent each request and what the request or action is about. Resource attributes can therefore be derived from their creators: for new resource provisioning, we capture the resource tags and store them after authentication. For existing resources, a major challenge was backfilling the tagging while ensuring a seamless transition from the user’s perspective. In the past, all resource provisioning operations were done by a centralised platform team, so most of the existing resource attributes were still under the platform team’s ownership.

Evaluation engine

We chose to use Open Policy Agent (OPA) as our policy evaluation engine mainly for its wide community support, applicable feature set, and extensibility to other tools and platforms in our system. This is also currently used by our team for Kafka authorisation. The policies are written in Rego, the default language supported by OPA.

Architecture and implementation

With ABAC, the access control process looks like this:

User attributes

Authentication is handled by the IAM service. In the /generate_token call, a user requests an authentication token from KCP before calling an authenticated endpoint. KCP then calls IAM to generate a token and returns it to the user.

In the /create_topic call, the user includes the generated token in the request header. KCP takes the token and verifies the token validity with IAM. User attributes are then extracted from the token payload for later use in request authorisation.

Some of the common attributes we use for our policy are user identifier, department code, and team code, which provide details like a user’s department and work scope.
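
As a rough illustration, the extraction step after token verification might look like the following (a minimal Python sketch using PyJWT; the claim names are assumptions, not KCP’s actual token schema):

import jwt  # PyJWT

def extract_user_attributes(token: str, public_key: str) -> dict:
    # Verify the token signature first; an invalid or expired token never
    # reaches the authorisation step.
    payload = jwt.decode(token, public_key, algorithms=["RS256"])
    # Hypothetical claim names carrying the attributes described above.
    return {
        "user_id": payload["sub"],
        "department": payload["department_code"],
        "team": payload["team_code"],
    }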

One of the major challenges for the data governance, central platform, and identity teams was standardising the set of attributes used for ABAC policies, so that policies are clear and consistent across platforms and attribute lifecycles and changes can be governed. This was an important shift in the mental model for attribute management compared to the RBAC model.

Resource attributes

For newly created resources, attributes will be derived from user attributes that are captured during the authentication process.

Previously with RBAC, existing resources did not have the required attributes. During the migration to ABAC, we started tagging newly created resources and ensuring that their attributes were up to standard, while IAM continued to perform the actual authorisation using RBAC.

It is also important to note that we collaborated with data governance teams to backfill Kafka resource ownership. Having accurate ownership of resources like data lake or Kafka topics enabled us to move toward a self-service model and remove bottlenecks from centralised platform teams.

After identifying most of the resource ownership, we started switching over to ABAC. The transition was smooth and had no impact on user experience. The remaining unidentified resources were tagged to lost-and-found and could be reclaimed by service teams when they needed permission to manage them.

Open Policy Agent

The most common question when implementing the policy is “how do you define ownership by attributes?”. With respect to the principle of least privilege, each policy must be sufficiently strict to limit access to only the relevant parties. In the end, we aligned as an organisation on defining ownership by department and team.

We created a simple example below to demonstrate how to define a policy:

package authz

import future.keywords

# Deny everything by default; access must be explicitly allowed.
default allow = false

# Allow topic updates only when the caller owns the resource.
allow if {
        input.endpoint == "updateTopic"
        is_owner(input.resource_attributes)
}

# A caller owns a resource when their department and team attributes
# match the resource's attributes.
is_owner(md) if {
        md.department == input.user_attributes.department
        md.team == input.user_attributes.team
}

In this example, we start with denying access to everyone. If the updateTopic endpoint is called and the department and team attributes between user and resource are matched, access is allowed.

Under RBAC, the same scenario would require one role, one action, one resource, and one mapping (a.k.a. permission) between the action and the resource, and we would need to keep adding resources and permissions as new resources are created. With the policy above, no such changes are required.

With ABAC, there are no further setup or access requests needed when a user changes teams. The user will be tagged to different attributes, automatically granted access to the new team’s resources, and excluded from the previous team’s resources.

Another consideration we had was making sure that the policy is well-written and transparent in terms of change history. We decided to include this as part of our application code so every change is accounted for in the unit test and review process.

Authorisation

The last part of the ABAC process is authorisation logic. We added the logic to the middleware so that we could make a call to OPA for authorisation.

After verifying token validity, KCP extracts user attributes from the token payload and fetches resource attributes from the resource store. It combines request metadata, such as method and endpoint, with the user and resource attributes into an OPA request. OPA then evaluates the request against the predefined policy above and returns a response.
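
As an illustration, the middleware’s call could go through OPA’s standard REST data API, with the input document mirroring what the policy above expects (a sketch; the service URL and attribute values are placeholders):

import requests

OPA_URL = "http://localhost:8181/v1/data/authz/allow"  # OPA's standard data API

def authorise(endpoint: str, user_attrs: dict, resource_attrs: dict) -> bool:
    # The input document mirrors what the Rego policy reads:
    # input.endpoint, input.user_attributes, input.resource_attributes.
    opa_input = {
        "input": {
            "endpoint": endpoint,
            "user_attributes": user_attrs,
            "resource_attributes": resource_attrs,
        }
    }
    resp = requests.post(OPA_URL, json=opa_input, timeout=2)
    resp.raise_for_status()
    # OPA returns {"result": true} when the allow rule evaluates to true;
    # a missing result means no rule matched, so default-deny applies.
    return resp.json().get("result", False)

# A request with matching department and team attributes is allowed.
allowed = authorise(
    "updateTopic",
    {"department": "data", "team": "coban"},
    {"department": "data", "team": "coban"},
)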

Auditability

For ABAC authorisation, there are two key areas of consideration:

  • Who made changes to the policy, who deployed, and when the change was made
  • Who accessed what resource and when

We manage policies in a dedicated GitLab repository and changes are submitted via merge requests. Based on the commit history, we can easily tell who made changes, reviewed, approved, and deployed the policy. 

For resource access, OPA produces a decision log containing user attributes, resource attributes, and the authorisation decision for every call it serves. The log is kept for five days in Kibana for debugging purposes, then moved to S3 where it is kept for 28 days.

Impact

The move to ABAC authorisation has improved our controls as compared to the previous RBAC model, with the biggest impact being fewer resources to manage. Some other benefits include:

  • Optimised resource allocation: Discarded over 200 roles, 200 permissions, and almost 3000 unused resources from IAM services, simplifying our debugging process. Now, we can simply check the user and resource attributes as needed.
  • Simplified resource management: In the three months we have been using ABAC, about 600 resources have been added without any increase in authorisation complexity, which is significantly less overhead than under the RBAC model.
  • Reduction in delays and waiting time: Engineers no longer have to wait for approval for KCP access.
  • Better governance over resource ownership and costs: ABAC allowed us to have a standardised and accurate tagging system of almost 3000 resources.

Learnings

Although ABAC does provide significant improvements over RBAC, it comes with its own caveats:

  • It needs a reliable and comprehensive attribute tagging system to function properly. This only became possible after roughly three months of identifying and tagging the ownership of existing resources by both automated and manual methods.
  • Tags must be kept up to date as the company grows; teams could lose access to their resources if they are wrongly tagged. A mechanism is needed to keep up with changes, or people will unexpectedly lose access when user or resource attributes change.

What’s next?

  • To keep up with organisational growth, KCP needs to start listening to the IAM stream, which is where all IAM changes are published. This will allow KCP to regularly update user attributes and refresh resource attributes when restructuring occurs, allowing authorisation to be done with the right data.
  • Constant collaboration with HR to ensure that we maintain a sufficient, clean set of user attributes (with no extra unused information) so that ABAC works as expected.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Securing GitOps pipelines

Post Syndicated from Grab Tech original https://engineering.grab.com/securing-gitops-pipeline

Introduction

Grab’s real-time data platform team, Coban, has been managing infrastructure resources via Infrastructure-as-code (IaC). Through the IaC approach, Terraform is used to maintain infrastructure consistency, automation, and ease of deployment of our streaming infrastructure, notably:

With Grab’s exponential growth, there needs to be a better way to scale infrastructure automatically. Moving towards GitOps processes benefits us in many ways:

  • Versioned and immutable: With our source code being stored in Git repositories, the desired state of infrastructure is stored in an environment that enforces immutability, versioning, and retention of version history, which helps with auditing and traceability.
  • Faster deployment: By automating the process of deploying resources after code is merged, we eliminate manual steps and improve overall engineering productivity while maintaining consistency.
  • Easier rollbacks: It’s as simple as reverting a Git commit, as compared to creating a merge request (MR) and commenting Atlantis commands, which adds extra steps and contributes to a higher mean-time-to-resolve (MTTR) for incidents.

Background

Originally, Coban implemented automation on Terraform resources using Atlantis, an application that operates based on user comments on MRs.

Fig. 1 User flow with Atlantis

We have come a long way with Atlantis. It has helped us to automate our workflows and enable self-service capabilities for our engineers. However, there were a few limitations in our setup, which we wanted to improve:

  • Coarse-grained: There is no way to restrict the kinds of Terraform resources users can create, which introduces security issues. For example, if a user is one of the Code owners, they can create another IAM role with Admin privileges, with approval from their own team, anywhere in the repository.
  • Limited automation: Users are still required to make comments in their MR such as atlantis apply. This requires the learning of Atlantis commands and is prone to human errors.
  • Limited capability: Having to rely entirely on Terraform and Hashicorp Configuration Language (HCL) functions to validate user input comes with limitations. For example, the ability to validate an input variable based on the value of another has been a requested feature for a long time.
  • Not adhering to the Don’t Repeat Yourself (DRY) principle: Users need to create an entire Terraform project with boilerplate code, such as the Terraform environment, local variables, and Terraform provider configurations, to create a simple resource such as a Kafka topic.

Solution

We have developed an in-house GitOps solution named Khone. Its name was inspired by the Khone Phapheng Waterfall. We evaluated some of the best and most widely used GitOps products available, but chose not to adopt any of them, as the majority aim to support Kubernetes-native or custom resources, while we needed infrastructure provisioning beyond Kubernetes. With our approach, we have full control of the entire user flow and its implementation, and thus we benefit from:

  • Security: The ability to secure the pipeline with many customised scripts and workflows.
  • Simple user experience (UX): Simplified user flow and prevents human errors with automation.
  • DRY: Minimise boilerplate code. Users only need to create a single Terraform resource and not an entire Terraform project.

Fig. 2 User flow with Khone

With all the types of streaming infrastructure resources that we support, be it Kafka topics or Flink pipelines, we identified common properties such as namespace, environment, and cluster name (for example, the Kafka or Kubernetes cluster). Using those values as file paths helps us easily validate user input and decouple it from the resource-specific configuration properties in the HCL source code. Moreover, it removes redundant information and maintains consistency: if a piece of information is in the file path, it won’t appear elsewhere in the resource definition.

Fig. 3 Khone directory structure

With this approach, our pipeline scripts, which are written in Python, can validate resource types and resource names using regular expressions (regex) without relying on HCL functions. Furthermore, we prevent human errors and improve developer efficiency by automatically deriving other necessary configurations, such as the Kafka broker endpoint, from the cluster name and environment, further reducing boilerplate code.
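
For illustration, a simplified version of such a path check might look like this (a sketch only; the whitelisted resource types and environments here are assumptions):

import re

# Mirrors the directory structure described above:
# resources/<namespace>/<resource_type>/<env>/<cluster_name>/<resource_name>
RESOURCE_PATH = re.compile(
    r"^resources/"
    r"(?P<namespace>[^/]+)/"
    r"(?P<resource_type>kafka-topic|flink-pipeline)/"  # illustrative whitelist
    r"(?P<env>stg|prd)/"                               # illustrative environments
    r"(?P<cluster_name>[^/]+)/"
    r"(?P<resource_name>[a-z0-9-]+)$"
)

def validate_resource_path(path: str) -> dict:
    # Reject anything that does not follow the expected directory structure.
    match = RESOURCE_PATH.match(path)
    if not match:
        raise ValueError(f"unsupported or malformed resource path: {path}")
    return match.groupdict()

print(validate_resource_path("resources/orders/kafka-topic/prd/main-cluster/order-events"))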

Pipeline stages

Khone’s pipeline implementation is designed with three stages. Each stage has different duties and responsibilities in verifying user input and securely creating the resources.

Fig. 4 An example of a Khone pipeline

Initialisation stage

At this stage, we categorise the changes into Deleted, Created or Changed resources and filter out unsupported resource types. We also prevent users from creating unintended resources by validating them based on resource path and inspecting the HCL source code in their Terraform module. This stage also prepares artefacts for subsequent stages.

Fig. 5 Terraform changes detected by Khone

Terraform stage

This is a downstream pipeline that runs either the Terraform plan or Terraform apply command depending on the state of the MR, which can either be pending review or merged. Individual jobs run in parallel for each resource change, which helps with performance and reduces the overall pipeline run time.

For each individual job, we implemented multiple security checkpoints such as:

  • Code inspection: We use the python-hcl2 library to read the HCL content of Terraform resources to perform validation, restrict the types of Terraform resources users can create, and ensure that resources have the intended configurations. We also validate that the Terraform module source endpoint is whitelisted for the declared resource type. This lets us inherit the flexibility of Python as a programming language and perform validations more dynamically rather than relying on HCL functions (see the sketch after this list).
  • Resource validation: We validate configurations based on resource path to ensure users are following the correct and intended directory structure.
  • Linting and formatting: Perform HCL code linting and formatting using Terraform CLI to ensure code consistency.
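
To illustrate the code inspection step, a minimal sketch using the python-hcl2 library might look like this (the whitelist and module layout are illustrative assumptions, not Khone’s actual implementation):

import hcl2

# Illustrative whitelist: resource type -> allowed module source prefix.
ALLOWED_MODULE_SOURCES = {
    "kafka-topic": "gitlab.myteksi.net/terraform-modules/kafka-topic",
}

def inspect_terraform_file(path: str, resource_type: str) -> None:
    with open(path) as f:
        document = hcl2.load(f)  # parses HCL into plain Python dicts/lists

    # Reject raw resource blocks outright; users should only declare
    # whitelisted modules.
    if document.get("resource"):
        raise ValueError("raw Terraform resources are not allowed")

    for module in document.get("module", []):
        for name, body in module.items():
            source = body.get("source", "")
            if not source.startswith(ALLOWED_MODULE_SOURCES[resource_type]):
                raise ValueError(f"module {name} uses a non-whitelisted source: {source}")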

Furthermore, our Terraform module independently validates parameters by verifying the working directory instead of relying on user input, acting as an additional layer of defence for validation.

# Parse the working directory path and extract its components (repository,
# namespace, resource type, environment, cluster, and resource name),
# failing if the path does not match the expected directory structure.
path = one(regexall(join("/",
[
    "^*",
    "(?P<repository>khone|khone-dev)",
    "resources",
    "(?P<namespace>[^/]*)",
    "(?P<resource_type>[^/]*)",
    "(?P<env>[^/]*)",
    "(?P<cluster_name>[^/]*)",
    "(?P<resource_name>[^/]*)$"
]), path.cwd))

Metric stage

In this stage, we consolidate previous jobs’ status and publish our pipeline metrics such as success or error rate.

For our metrics, we identified actual users by omitting users from Coban. This helps us measure success metrics more consistently as we could isolate metrics from test continuous integration/continuous deployment (CI/CD) pipelines.

For the second half of 2022, we achieved a 100% uptime for Khone pipelines.

Fig. 6 Khone’s success metrics for the second half of 2022

Preventing pipeline config tampering

By default, with each repository on GitLab that has CI/CD pipelines enabled, owners or administrators would need to have a pipeline config file at the root directory of the repository with the name .gitlab-ci.yml. Other scripts may also be stored somewhere within the repository.

With this setup, whenever a user creates an MR, if the pipeline config file is modified as part of the MR, the modified version of the config file will be immediately reflected in the pipeline’s run. Users can exploit this by running arbitrary code on the privileged GitLab runner.

In order to prevent this, we utilise GitLab’s remote pipeline config functionality. We have created another private repository, khone-admin, and stored our pipeline config there.

Fig. 7 Khone’s remote pipeline config

In Fig. 7, our configuration is set to a file called khone-gitlab-ci.yml residing in the khone-admin repository under the snd group.

Preventing pipeline scripts tampering

We had scripts that ran before the MR was approved and merged, performing preliminary checks or validations. They were also used to run the Terraform plan command. Users could modify these existing scripts to perform malicious actions. For example, they could bypass all validations and directly run the Terraform apply command to create unintended resources.

This can be prevented by storing all of our scripts in the khone-admin repository and cloning them in each stage of our pipeline using the before_script clause.

default:
  before_script:
    - rm -rf khone_admin
    - git clone --depth 1 --single-branch https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git khone_admin

Even though this adds an overhead to each of our pipeline jobs and increases run time, the amount is insignificant as we have optimised the process by using shallow cloning. The Git clone command included in the above script with depth=1 and single-branch flag has reduced the time it takes to clone the scripts down to only 0.59 seconds.

Testing our pipeline

With all the security measures implemented for Khone, one question remains: how did we test the pipeline? We did this by setting up an additional repository called khone-dev.

Fig. 8 Repositories relationship

Pipeline config

Within this khone-dev repository, we have set up a remote pipeline config file following this format:

<File Name>@<Repository Ref>:<Branch Name>

Fig. 9 Khone-dev’s remote pipeline config

In Fig. 9, our configuration is set to a file called khone-gitlab-ci.yml residing in the khone-admin repository under the snd group and under a branch named ci-test. With this approach, we can test our pipeline config without having to merge it to master branch that affects the main Khone repository. As a security measure, we only allow users within a certain GitLab group to push changes to this branch.

Pipeline scripts

Following the same method for pipeline scripts, instead of always cloning from the master branch of the khone-admin repository, we implemented logic to clone from a branch matching the user’s lightweight directory access protocol (LDAP) account, if such a branch exists. We utilise the GITLAB_USER_LOGIN environment variable, injected by GitLab into each individual CI job, to get the respective LDAP account and perform this logic.

default:
  before_script:
    - rm -rf khone_admin
    - |
      if git ls-remote --exit-code --heads "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" "$GITLAB_USER_LOGIN" > /dev/null; then
        echo "Cloning khone-admin from dev branch ${GITLAB_USER_LOGIN}"
        git clone --depth 1 --branch "$GITLAB_USER_LOGIN" --single-branch "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" khone_admin
      else
        echo "Dev branch ${GITLAB_USER_LOGIN} not found, cloning from master instead"
        git clone --depth 1 --single-branch "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" khone_admin
      fi

What’s next?

With security being our main focus for our Khone GitOps pipeline, we plan to abide by the principle of least privilege and implement separate GitLab runners for different types of resources and assign them with just enough IAM roles and policies, and minimal network security group rules to access our Kafka or Kubernetes clusters.

Furthermore, we also plan to maintain high standards and stability by including unit tests in our CI scripts to ensure that every change is well-tested before being deployed.

Special thanks to Fabrice Harbulot for kicking off this project and building a strong foundation for it.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

New zoom freezing feature for Geohash plugin

Post Syndicated from Grab Tech original https://engineering.grab.com/geohash-plugin

Introduction

Geohash is an encoding system with a unique identifier for each region on the planet. Therefore, all geohash units can be associated with an individual set of digits and letters.

Geohash is a plugin built by Grab that is available in the Java OpenStreetMap Editor (JOSM) tool, which comes in handy for those who work on precise areas based on geohash units.

Background

Until recently, users of the Geohash JOSM plugin were unable to stop new geohashes from being displayed whenever they zoomed in or out. This meant that every zoom change displayed new geohashes, which became bothersome for many users when it was unneeded. The previous behaviour of the plugin when zooming in and out is depicted in the following short video:


This led to the implementation of the zoom freeze feature, which helps users toggle between Enable zoom freeze and Disable zoom freeze, based on their needs.

Solution

As you can see in the following image, a new label was created with the purpose of freezing or unfreezing the display of new geohashes with each zoom change:


By default, this label says “Enable zoom freeze”, and when zoom freezing is enabled, the label changes to “Disable zoom freeze”.

In order to see how zoom freezing works, let’s consider the following example: a user wants to zoom inside the geohash with the code w886hu, without triggering the display of smaller geohashes inside of it. For this purpose, the user will enable the zoom freezing feature by clicking on the label, and then they will proceed with the zoom. The map will look like this:


It is apparent from the image that no new geohashes were created. Now, let’s say the user has finished what they wanted to do and wants to go back to the “normal” geohash visualisation mode, which means disabling the zoom freeze option. After clicking on the label that now says “Disable zoom freeze”, new, smaller geohashes will be displayed, according to the current zoom level:


The functionality is illustrated in the following short video:


Another effect of enabling zoom freeze is that it disables the “Display larger geohashes” and “Display smaller geohashes” options, since the geohashes are now fixed. The following images show how these options work before and after disabling zoom freeze:



To conclude, we believe that the release of this new feature will benefit users by making it more comfortable for them to zoom in and out of a map. By turning off the display of new geohashes when this is unwanted, map readability is improved, and this translates to a better user experience.

Impact/Limitations

In order to start using this new feature, users need to update the Geohash JOSM plugin.

What’s next?

Grab has come a long way in map-making, from using open source map-making software and developing its own suite of map-making tools to contributing to the open-source map community and building and launching GrabMaps. To find out more, read How KartaCam powers GrabMaps and KartaCam delivers comprehensive, cost-effective mapping data.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Graph service platform

Post Syndicated from Grab Tech original https://engineering.grab.com/graph-service-platform

Introduction

In earlier articles of this series, we covered the importance of graph networks, graph concepts, how graph visualisation makes fraud investigations easier and more effective, and how graphs for fraud detection work. In this article, we elaborate on the need for a graph service platform and how it works.

In the present age, data linkages can generate significant business value. Whether we want to learn about the relationships between users in online social networks, between users and products in e-commerce, or understand credit relationships in financial networks, the capability to understand and analyse large amounts of highly interrelated data is becoming more important to businesses.

As the amount of consumer data grows, the GrabDefence team must continuously enhance fraud detection on mobile devices to proactively identify the presence of fraudulent or malicious users. Even simple financial transactions between users must be monitored for transaction loops and money laundering. To preemptively detect such scenarios, we need a graph service platform to help discover data linkages. 

Background

As mentioned in an earlier article, a graph is a model representation of the associations between entities, holding knowledge in a structured way by modelling entities and their relationships. In other words, graphs hold a natural interpretability of linked data, so graph technology plays an important role. Large tech companies started early to create their own graph technology infrastructure, which is used for things like social relationship mining, web search, and sorting and recommendation systems, with great commercial success.

As graph technology developed, the amount of data gathered from graphs grew as well, leading to a need for graph databases. Graph databases are used to store, manipulate, and access graph data on the basis of graph models. They are similar to relational databases but with Online Transactional Processing (OLTP) support, which covers transactions, persistence, and other features.

A key concept of graphs is the edge or relationship between entities. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. These relationships allow data in the store to be linked directly and retrieved with one operation.

With graph databases, relationships between data can be queried quickly because they are persistently stored in the database. Additionally, relationships can be intuitively visualised using graph databases, making them useful for heavily interconnected data. To have real-time graph search capabilities, we must leverage a graph service platform and graph databases.

Architecture details

Graph services with graph databases are Platforms as a Service (PaaS) that encapsulate the underlying implementation of graph technology and support easier discovery of data association relationships with graph technologies.

They also provide universal graph operation APIs and service management for users. This means that users do not need to build graph runtime environments independently and can explore the value of data with graph service directly.

Fig. 1 Graph service platform system architecture

As shown in Fig. 1, the system can be divided into four layers:

  1. Storage backend – Different forms of data (for example, CSV files) are stored in Amazon S3, graph data is stored in Neptune, and meta configurations are stored in DynamoDB.
  2. Driver – Contains drivers such as Gremlin, Neptune, S3, and DynamoDB.
  3. Service – Manages clusters, instances, databases, etc.; provides the management API; includes schema and data load management, graph operation logic, and other graph algorithms.
  4. RESTful APIs – Currently supports the standard and uniform formats provided by the system: the Management API, the Search API for OLTP, and the Analysis API for online analytical processing (OLAP).

How it works

Graph flow

Fig. 2 Graph flow

CSV files stored in Amazon S3 are processed by extract, transform, and load (ETL) tools to generate graph data. This data is then managed by an Amazon Neptune DB cluster, which can only be accessed by users through graph service. Graph service converts user requests into asynchronous interactions with Neptune Cluster, which returns the results to users.

When users launch data load tasks, graph service checks the entity and attribute information in the CSV file in S3 against the schema stored in DynamoDB. The data is only imported into Neptune if there are no inconsistencies.

The most important component in the system is the graph service, which provides RESTful APIs for two scenarios: graph search for real-time streams and graph analysis for batch processing. At the same time, the graph service manages clusters, databases, instances, users, tasks, and the meta configurations stored in DynamoDB, and implements service monitoring as well as offline data loading and online stream ingress.

Use case in fraud detection

In Grab’s mobility business, we have come across situations where multiple accounts use shared physical devices to maximise their earning potential. With the graph capabilities provided by the graph service platform, we can clearly see the connections between multiple accounts and shared devices.

Historical device and account data are stored in the graph service platform via offline data loading or online stream injection. If the device and account data exists in the graph service platform, we can find the adjacent account IDs or the shared device IDs by using the device ID or account ID respectively specified in the user request.
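
For illustration, such a lookup could be expressed as a Gremlin traversal against Neptune (a sketch using the gremlinpython client; the endpoint URL, vertex labels, and property keys are assumptions, not the platform’s actual schema):

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Neptune exposes a WebSocket Gremlin endpoint; this URL is a placeholder.
conn = DriverRemoteConnection('wss://my-neptune-cluster:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Given a device ID, find all account IDs that share this device.
shared_accounts = (
    g.V().has('device', 'device_id', 'D123')  # start from the device vertex
         .in_('uses')                         # traverse to accounts using it
         .values('account_id')                # project the account IDs
         .dedup()
         .toList()
)
print(shared_accounts)
conn.close()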

In our experience, fraudsters tend to share physical resources to maximise their revenue. The following image shows a device that is shared by many users. With our Graph Visualisation platform based on graph service, you can see exactly what this pattern looks like.

Fig 3. Example of a device being shared with many users

Data injection

Fig. 4 Data injection

Graph service also supports data injection features, including data load by request (a task with the type data load) and real-time stream writes via Kafka.

When connected to GrabDefence’s infrastructure, Confluent with Kafka is used as the streaming engine.  The purpose of using Kafka as a streaming write engine is two-fold: to provide primary user authentication and to relieve the pressure on Neptune.

Impact

Graph service supports data management of Labelled Property Graphs and provides the capability to add, delete, update, and get vertices, edges, and properties for some graph models. Graph traversal and searching relationships with RESTful APIs are also more convenient with graph service.

Businesses usually do not need to focus on the underlying data storage, just designing graph schemas for model definition according to their needs. With the graph service platform, platforms or systems can be built for personalised search, intelligent Q&A, financial fraud, etc.

For big organisations, extensive graph algorithms provide the power to mine various entity connectivity relationships in massive amounts of data. The growth and expansion of new businesses is driven by discovering the value of data.

What’s next?

Fig. 5 Graph-centric ecosystems

We are building an integrated graph ecosystem inside and outside Grab. The infrastructure and services, or APIs, are key components of graph-centric ecosystems; they provide the fundamental capabilities of graphs in relation to search, computing, and analysis. Besides that, we will also consider incorporating applications such as risk prediction and fraud detection to serve our current business needs.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Zero trust with Kafka

Post Syndicated from Grab Tech original https://engineering.grab.com/zero-trust-with-kafka

Introduction

Grab’s real-time data platform team, also known as Coban, has been operating large-scale Kafka clusters for all Grab verticals, with a strong focus on ensuring a best-in-class-performance and 99.99% availability.

Security has always been one of Grab’s top priorities and as fraudsters continue to evolve, there is an increased need to continue strengthening the security of our data streaming platform. One of the ways of doing this is to move from a pure network-based access control to state-of-the-art security and zero trust by default, such as:

  • Authentication: The identity of any remote systems – clients and servers – is established and ascertained first, prior to any further communications.
  • Authorisation: Access to Kafka is granted based on the principle of least privilege; no access is given by default. Kafka clients are associated with the whitelisted Kafka topics and permissions – consume or produce – they strictly need. Also, granted access is auditable.
  • Confidentiality: All in-transit traffic is encrypted.

Solution

We decided to use mutual Transport Layer Security (mTLS) for authentication and encryption. mTLS enables clients to authenticate servers, and servers to reciprocally authenticate clients.

Kafka supports other authentication mechanisms, like OAuth, or Salted Challenge Response Authentication Mechanism (SCRAM), but we chose mTLS because it is able to verify the peer’s identity offline. This verification ability means that systems do not need an active connection to an authentication server to ascertain the identity of a peer. This enables operating in disparate network environments, where all parties do not necessarily have access to such a central authority.

We opted for Hashicorp Vault and its PKI engine to dynamically generate clients and servers’ certificates. This enables us to enforce the usage of short-lived certificates for clients, which is a way to mitigate the potential impact of a client certificate being compromised or maliciously shared. We said zero trust, right?

For authorisation, we chose Policy-Based Access Control (PBAC), a more scalable solution than Role-Based Access Control (RBAC), and the Open Policy Agent (OPA) as our policy engine, for its wide community support.

To integrate mTLS and the OPA with Kafka, we leveraged Strimzi, the Kafka on Kubernetes operator. In a previous article, we have alluded to Strimzi and hinted at how it would help with scalability and cloud agnosticism. Built-in security is undoubtedly an additional driver of our adoption of Strimzi.

Server authentication

Figure 1 – Server authentication process for internal cluster communications

We first set up a single Root Certificate Authority (CA) for each environment (staging, production, etc.). This Root CA, in blue on the diagram, is securely managed by the Hashicorp Vault cluster. Note that the colours of the certificates, keys, signing arrows, and signatures on the diagrams are consistent throughout this article.

To secure the cluster’s internal communications, like the communications between the Kafka broker and Zookeeper pods, Strimzi sets up a Cluster CA, which is signed by the Root CA (step 1). The Cluster CA is then used to sign the individual Kafka broker and Zookeeper certificates (step 2). Lastly, the Root CA’s public certificate is imported into the truststores of both the Kafka broker and Zookeeper (step 3), so that all pods can mutually verify their certificates when authenticating with one another.

Strimzi’s embedded Cluster CA dynamically generates valid individual certificates when spinning up new Kafka and Zookeeper pods. The signing operation (step 2) is handled automatically by Strimzi.

For client access to Kafka brokers, Strimzi creates a different set of intermediate CA and server certificates, as shown in the next diagram.

Figure 2 – Server authentication process for client access to Kafka brokers

The same Root CA from Figure 1 now signs a different intermediate CA, which the Strimzi community calls the Client CA (step 1). This naming is misleading since it does not actually sign any client certificates, but only the server certificates (step 2) that are set up on the external listener of the Kafka brokers. These server certificates are for the Kafka clients to authenticate the servers. This time, the Root CA’s public certificate will be imported into the Kafka Client truststore (step 3).

Client authentication

Figure 3 – Client authentication process

For client authentication, the Kafka client first needs to authenticate to Hashicorp Vault and request an ephemeral certificate from the Vault PKI engine (step 1). Vault then issues a certificate and signs it using its Root CA (step 2). With this certificate, the client can now authenticate to Kafka brokers, who will use the Root CA’s public certificate already in their truststore, as previously described (step 3).

CA tree

Putting together the three different authentication processes we have just covered, the CA tree now looks like this. Note that this is a simplified view for a single environment, a single cluster, and two clients only.

Figure 4 – Complete certificate authority tree

As mentioned earlier, each environment (staging, production, etc.) has its own Root CA. Within an environment, each Strimzi cluster has its own pair of intermediate CAs: the Cluster CA and the Client CA. At the leaf level, the Zookeeper and Kafka broker pods each have their own individual certificates.

On the right side of the diagram, each Kafka client can get an ephemeral certificate from Hashicorp Vault whenever they need to connect to Kafka. Each team or application has a dedicated Vault PKI role in Hashicorp Vault, restricting what can be requested for its certificate (e.g., Subject, TTL, etc.).

Strimzi deployment

We heavily use Terraform to manage and provision our Kafka and Kafka-related components. This enables us to quickly and reliably spin up new clusters and perform cluster scaling operations.

Under the hood, Strimzi Kafka deployment is a Kubernetes deployment. To increase the performance and the reliability of the Kafka cluster, we create dedicated Kubernetes nodes for each Strimzi Kafka broker and each Zookeeper pod, using Kubernetes taints and tolerations. This ensures that all resources of a single node are dedicated solely to either a single Kafka broker or a single Zookeeper pod.

We also decided to go with a single Kafka cluster per Kubernetes cluster to make management easier.

Client setup

Coban provides backend microservice teams from all Grab verticals with a popular Kafka SDK in Golang, to standardise how teams utilise Coban Kafka clusters. Adding mTLS support mostly boils down to upgrading our SDK.

Our enhanced SDK provides a default mTLS configuration that works out of the box for most teams, while still allowing customisation, e.g., for teams that have their own Hashicorp Vault Infrastructure for compliance reasons. Similarly, clients can choose among various Vault auth methods such as AWS or Kubernetes to authenticate to Hashicorp Vault, or even implement their own logic for getting a valid client certificate.

To mitigate the potential risk of a user maliciously sharing their application’s certificate with other applications or users, we limit the maximum Time-To-Live (TTL) for any given certificate. This also removes the overhead of maintaining a Certificate Revocation List (CRL). Additionally, our SDK stores the certificate and its associated private key in memory only, never on disk, hence reducing the attack surface.

In our case, Hashicorp Vault is a dependency. To prevent it from reducing the overall availability of our data streaming platform, we have added two features to our SDK – a configurable retry mechanism and automatic renewal of clients’ short-lived certificates when two thirds of their TTL is reached. The upgraded SDK also produces new metrics around this certificate renewal process, enabling better monitoring and alerting.
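
The renewal logic itself is conceptually simple. Below is a minimal sketch of the idea in Python (the actual SDK is written in Go, and the Vault-fetching helper here is hypothetical):

import time

_current_cert = None

def install_in_memory(cert):
    # Hypothetical helper: hand the certificate to the TLS layer, keeping
    # it in memory only, never on disk.
    global _current_cert
    _current_cert = cert

def certificate_lifecycle(fetch_certificate, max_retries=3, backoff_s=2):
    # fetch_certificate() is assumed to call the Vault PKI engine and
    # return a (certificate, ttl_seconds) tuple.
    while True:
        for attempt in range(max_retries):
            # Configurable retry so a transient Vault outage does not
            # immediately take down the Kafka client.
            try:
                cert, ttl = fetch_certificate()
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(backoff_s * (attempt + 1))
        install_in_memory(cert)
        # Renew once two thirds of the TTL has elapsed, well before expiry.
        time.sleep(ttl * 2 / 3)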

Authorisation

Figure 5 – Authorisation process before a client can access a Kafka record

For authorisation, we set up the Open Policy Agent (OPA) as a standalone deployment in the Kubernetes cluster, and configured Strimzi to integrate the Kafka brokers with that OPA.

OPA policies – written in the Rego language – describe the authorisation logic. They are created in a GitLab repository along with the authorisation rules, called data sources (step 1). Whenever there is a change, a GitLab CI pipeline automatically creates a bundle of the policies and data sources, and pushes it to an S3 bucket (step 2). From there, it is fetched by the OPA (step 3).

When a client – identified by its TLS certificate’s Subject – attempts to consume or produce a Kafka record (step 4), the Kafka broker pod first issues an authorisation request to the OPA (step 5) before processing the client’s request. The outcome of the authorisation request is then cached by the Kafka broker pod to improve performance.

As the core component of the authorisation process, the OPA is deployed with the same high availability as the Kafka cluster itself, i.e. spread across the same number of Availability Zones. Also, we decided to go with one dedicated OPA per Kafka cluster instead of a single global OPA shared between multiple clusters. This reduces the blast radius of any OPA incident.

For monitoring and alerting around authorisation, we submitted an open-source contribution to the opa-kafka-plugin project to enable the OPA authoriser to expose metrics. Our contribution allows us to monitor various aspects of the OPA, such as the number of authorised and unauthorised requests, as well as the cache hit-and-miss rates. We also set up alerts for suspicious activity such as unauthorised requests.

Finally, as a platform team, we need to make authorisation a scalable, self-service process. Thus, we rely on the Git repository’s permissions to let Kafka topics’ owners approve the data source changes pertaining to their topics.

Teams who need their applications to access a Kafka topic would write and submit a JSON data source as simple as this:

{
 "example_topic": {
   "read": [
     "clientA.grab",
     "clientB.grab"
   ],
   "write": [
     "clientB.grab"
   ]
 }
}

GitLab CI unit tests and business logic checks are set up in the Git repository to ensure that the submitted changes are valid. After that, the change would be submitted to the topic’s owner for review and approval.
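
One such check could validate the shape of the submitted data sources, for example (a sketch of the kind of validation the CI could run; the client naming rule is an assumption):

import json
import re
import sys

CLIENT_NAME = re.compile(r"^[a-z0-9-]+\.grab$")  # illustrative naming rule

def validate_data_source(path: str) -> None:
    with open(path) as f:
        topics = json.load(f)
    for topic, acl in topics.items():
        # Only "read" and "write" permission lists are expected per topic.
        unknown = set(acl) - {"read", "write"}
        if unknown:
            raise ValueError(f"{topic}: unknown permission keys {unknown}")
        for perm in ("read", "write"):
            for client in acl.get(perm, []):
                if not CLIENT_NAME.match(client):
                    raise ValueError(f"{topic}: invalid client name {client!r}")

if __name__ == "__main__":
    validate_data_source(sys.argv[1])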

What’s next?

The performance impact of this security design is significant compared to unauthenticated, unauthorised, plaintext Kafka. We observed a drop in throughput, mostly due to the low performance of encryption and decryption in Java, and are currently benchmarking different encryption ciphers to mitigate this.

Also, on authorisation, our current PBAC design is pretty static, with a list of applications granted access for each topic. In the future, we plan to move to Attribute-Based Access Control (ABAC), creating dynamic policies based on teams and topics’ metadata. For example, teams could be granted read and write access to all of their own topics by default. Leveraging a versatile component such as the OPA as our authorisation controller enables this evolution.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How KartaCam powers GrabMaps

Post Syndicated from Grab Tech original https://engineering.grab.com/kartacam-powers-grabmaps

Introduction

The foundation for making any map is in imagery, but due to the complexity and dynamism of the real world, it is difficult for companies to collect high-quality, fresh images in an efficient yet low-cost manner. This is the case for Grab’s Geo team as well.

Traditional map-making methods rely on professional-grade cameras to collect high-resolution mapping imagery. These images are rich in content and detail, providing a good snapshot of the real world. However, we see two major challenges with this approach.

The first is high cost. Professional cameras are too expensive to use at scale, especially in an emerging region like Southeast Asia. Apart from high equipment cost, operational cost is also high as local operation teams need professional training before collecting imagery.

The other major challenge, related to the first, is that imagery will not be refreshed in a timely manner because of the high cost and operational effort required. It typically takes months or years before imagery is refreshed, which means maps get outdated easily.

Compared to traditional collection methods, there are more affordable alternatives that some emerging map providers are using, such as crowdsourced collection done with smartphones or other consumer-grade action cameras. This allows more timely imagery refresh at a much lower cost.

That said, there are several challenges with crowdsourcing imagery, such as:

  • Inconsistent quality in collected images.
  • Low operational efficiency as cameras and smartphones are not optimised for mapping.
  • Unreliable location accuracy.

In order to solve the challenges above, we started building our own artificial intelligence (AI) camera called KartaCam.

What is KartaCam?

Designed specifically for map-making, KartaCam is a lightweight camera that is easy to operate. It is everything you need for accurate and efficient image collection. KartaCam is powered by edge AI, and mainly comprises a camera module, a dual-band Global Navigation Satellite System (GNSS) module, and a built-in 4G Long-Term Evolution (LTE) module.

KartaCam

Camera module

The camera module or optical design of KartaCam focuses on several key features:

  • Wide field of vision (FOV): A wide FOV to capture as many scenes and details as possible without requiring additional trips. A single KartaCam has a wide lens FOV of >150° and when we use four KartaCams together, each facing a different direction, we increase the FOV to 360°.
  • High image quality: A combination of a high-definition optical lens and a high-resolution image sensor helps achieve better image quality. KartaCam uses a high-quality 12MP image sensor.
  • Ease of use: Portable and easy to start using for people with little to no photography training. At Grab, we can easily deploy KartaCam to our fleet of driver-partners to map our region as they regularly travel these roads while ferrying passengers or making deliveries.

Edge AI for smart capturing on edge

Each KartaCam device is also equipped with edge AI, which enables AI computations to operate closer to the actual data – in our case, imagery collection. With edge AI, we can make decisions about imagery collection (i.e. upload, delete or recapture) at the device-level.

To help with these decisions, we use a series of edge AI models and algorithms that are executed immediately after each image capture, such as the following (a sketch of how these checks chain together appears after the list):

  • Scene recognition model: For efficient map-making, we ensure that we make the right scene verdicts, meaning we only upload and process images of suitable scenes. Unqualified images, such as indoor, rainy, or cloudy ones, are deleted directly on the KartaCam device. Joint detection algorithms are deployed in some instances to improve the accuracy of scene verdicts. For example, to detect indoor recording, we look at a combination of driver moving speed, Inertial Measurement Unit (IMU) data, and edge AI image detection.

  • Image quality (IQ) checking AI model: The quality of the images collected is paramount for map-making. Only qualified images, as judged by our IQ classification algorithm, are uploaded, while blurry or low-quality images are deleted. Once an unqualified image is detected, a new image is captured (usually within the next second), improving the success rate of collection.

  • Object detection AI model: Only roadside images that contain relevant map-making content such as traffic signs, lights, and Point of Interest (POI) text are uploaded.

  • Privacy information detection: Edge AI also helps protect privacy when collecting street images for map-making. It automatically blurs private information such as pedestrians’ faces and car plate numbers before uploading, ensuring adequate privacy protection.
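
Chained together, the on-device decision flow looks roughly like the following sketch (the model interfaces are hypothetical stand-ins for the edge AI models above):

# Stubs standing in for the on-device models described above (hypothetical).
def is_indoor(image, speed, imu):    # scene recognition with joint detection
    return False

def passes_quality_check(image):     # IQ classification model
    return True

def has_map_content(image):          # object detection model
    return True

def blur_private_info(image):        # privacy information detection
    return image

def upload(image):                   # upload via the built-in 4G LTE module
    print("uploaded")

def process_capture(image, speed, imu):
    # Decide on-device whether to upload, delete, or recapture an image.
    if is_indoor(image, speed, imu):
        return "delete"              # unqualified scene, never uploaded
    if not passes_quality_check(image):
        return "recapture"           # a new image is captured within ~1 second
    if not has_map_content(image):
        return "delete"              # no map-making content worth keeping
    upload(blur_private_info(image)) # PII is blurred before upload
    return "uploaded"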


Better positioning with a dual-band GNSS module

The Global Positioning System (GPS) mainly uses two frequency bands: L1 and L5. Most traditional phone or GPS modules only support the legacy GPS L1 band, while modern GPS modules support both L1 and L5. KartaCam leverages the L5 band which provides improved signal structure, transmission capabilities, and a wider bandwidth that can reduce multipath error, interference, and noise impacts. In addition, KartaCam uses a fine-tuned high-quality ceramic antenna that, together with the dual frequency band GPS module, greatly improves positioning accuracy.

Keeping KartaCam connected

KartaCam has a built-in 4G LTE module that ensures it is always connected and can be remotely managed. Through the KartaCam management portal, camera settings like resolution and capture intervals, and even the edge AI machine learning models, can be managed remotely. This makes it easy for Grab’s map ops team and drivers to configure their cameras and upload captured images in a timely manner.

Enhancing KartaCam

KartaCam 360: Capturing a panorama view

To improve single collection trip efficiency, we group four KartaCams together to collect 360° images. The four cameras can be synchronised within milliseconds and the collected images are stitched together in a panoramic view.

With KartaCam 360, we can increase the number of images collected in a single trip. According to Grab’s benchmark testing in Singapore and Jakarta, the POI information collected by KartaCam 360 is comparable to that of professional cameras, which cost about 20x more.

KartaCam 360 & Scooter mount
Image sample from KartaCam 360

KartaCam and the image collection workflow

KartaCam, together with other GrabMaps imagery tools, provides a highly efficient, end-to-end, low-cost, and edge AI-powered smart solution to map the region. KartaCam is fully integrated as part of our map-making workflow.

Our map-making solution includes the following components:

  • Collection management tool – Platform that defines map collection tasks for our driver-partners.
  • KartaView application – Mobile application provides map collection tasks and handles crowdsourced imagery collection.
  • KartaCam – Camera device connected to KartaView via Bluetooth and equipped with automatic edge processing for capturing imagery according to the accepted task.
  • Camera management tool – Handles camera parameters and settings for all KartaCam devices and can remotely control the KartaCam.
  • Automatic processing – Collected images are processed for quality checks, stitching, and personally identifiable information (PII) blurring.
  • KartaView imagery platform – Processed images are then uploaded and the driver-partner receives payment.

In a future article, we will dive deeper into the technology behind KartaView and its role in GrabMaps.

Impact

At the moment, Grab is rolling out thousands of KartaCams to all locations across Southeast Asia where Grab operates. This saves operational costs while improving the efficiency and quality of our data collection.


Better data quality and more map attributes

Due to the excellent image quality, wide FOV coverage, accurate GPS positioning, and sensor data, the 360° images captured by KartaCam 360 also record detailed map attributes like POIs, traffic signs, and address plates. This helps us build a high-quality map with rich and accurate content.


Reducing operational costs

Based on our research, the hardware cost for KartaCam 360 is significantly lower compared to similar professional cameras in the market. This makes it a more feasible option to scale up in Southeast Asia as the preferred tool for crowdsourcing imagery collection.

With image quality checks and detection conducted at the edge, we can avoid re-collections and ensure that only qualified images are uploaded. This results in savings in time as well as operational and upload costs.

Upholding privacy standards

KartaCam automatically blurs captured images that contain PII, like faces and licence plates directly from the edge devices. This means that all sensitive information is removed at this stage and is never uploaded to Grab servers.

On-the-edge blurring example

What’s next?

Moving forward, Grab will continue to enhance KartaCam’s performance in the following aspects:

  • Further improve image quality with better image sensors, unique optical components, and a state-of-the-art Image Signal Processor (ISP).
  • Make KartaCam compatible with Light Detection And Ranging (LIDAR) for high-definition collection and indoor use cases.
  • Improve GNSS module performance with higher sampling frequency and accuracy, and integrate new technologies like Real-Time Kinematic (RTK) and Precise Point Positioning (PPP), combined with sensor fusion from IMU sensors, to further improve positioning accuracy for map-making.
  • Improve KartaCam’s usability, integration, and portability so that driver-partners can capture mapping data more easily.
  • Explore new product concepts for future passive street imagery collection.

To find out more about how KartaCam delivers comprehensive cost-effective mapping data, check out this article.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Graph for fraud detection

Post Syndicated from Grab Tech original https://engineering.grab.com/graph-for-fraud-detection

Grab has grown rapidly in the past few years. It has expanded its business from ride-hailing to food and grocery delivery, financial services, and more. Fraud detection at Grab is challenging because new fraud patterns arise whenever we introduce a new business product. We cannot afford to develop a new model whenever a new fraud pattern appears, as doing so is time consuming and introduces a cold start problem, that is, there is no protection during the early stages. We need a general fraud detection framework to better protect Grab from various unknown fraud risks.

Our key observation is that although Grab has many different business verticals, the entities within those businesses are connected to each other (Figure 1. Left), for example, two passengers may be connected by a Wi-Fi router or phone device, a merchant may be connected to a passenger by a food order, and so on. A graph provides an elegant way to capture the spatial correlation among different entities in the Grab ecosystem. A common fraud shows clear patterns on a graph, for example, a fraud syndicate tends to share physical devices, and collusion happens between a merchant and an isolated set of passengers (Figure 1. Right).

Figure 1. Left: The graph captures different correlations in the Grab ecosystem.
Right: The graph shows that common fraud has clear patterns.

We believe graphs can help us discover subtle traces and complicated fraud patterns more effectively. Graph-based solutions will be a sustainable foundation for us to fight against known and unknown fraud risks.

Why graph?

The most common fraud detection methods include the rule engine and the decision tree-based models, for example, boosted tree, random forest, and so on. Rules are a set of simple logical expressions designed by human experts to target a particular fraud problem. They are good for simple fraud detection, but they usually do not work well in complicated fraud or unknown fraud cases.

| Fraud detection methods | Utilises correlations (Higher is better) | Detects unknown fraud (Higher is better) | Requires feature engineering (Lower is better) | Depends on labels (Lower is better) |
| --- | --- | --- | --- | --- |
| Rule engine | Low | N/A | N/A | Low |
| Decision tree | Low | Low | High | High |
| Graph model | High | High | Low | Low |

Table 1. Graph vs. common fraud detection methods.

Decision tree-based models have been dominating fraud detection and Kaggle competitions for structured or tabular data in the past few years. With that said, the performance of a tree-based model is highly dependent on the quality of labels and feature engineering, which is often hard to obtain in real life. In addition, it usually does not work well in unknown fraud which has not been seen in the labels.

On the other hand, a graph-based model requires little feature engineering and is applicable to unknown fraud detection with less dependence on labels, because it utilises the structural correlations on the graph.

In particular, fraudsters tend to show strong correlations on a graph, because they have to share physical properties such as personal identities, phone devices, Wi-Fi routers, delivery addresses, and so on, to reduce cost and maximise revenue, as shown in Figure 2 (left). An example of such strong correlations is shown in Figure 2 (right), where the entities on the graph are densely connected and the known fraudsters are highlighted in red. These strong correlations are the key reason the graph-based approach is a sustainable foundation for various fraud detection tasks.

Figure 2. Fraudsters tend to share physical properties to reduce cost (left), and they are densely connected as shown on a graph (right).

Semi-supervised graph learning

Unlike traditional decision tree-based models, a graph-based machine learning model can utilise the graph’s correlations and achieve great performance even with few labels. The semi-supervised Graph Convolutional Network model has been extremely popular in recent years [1]. It has proven its success in many fraud detection tasks across industries, for example, e-commerce fraud, financial fraud, internet traffic fraud, etc.

We apply the Relational Graph Convolutional Network (RGCN) [2] for fraud detection in Grab’s ecosystem. Figure 3 shows the overall architecture of RGCN. It takes a graph as input, and the graph passes through several graph convolutional layers to produce node embeddings. The final layer outputs a fraud probability for each node. At each graph convolutional layer, information is propagated along neighbouring nodes, so that nodes that are close on the graph end up with similar embeddings.

Fig 3. A semi-supervised Relational Graph Convolutional Network model.
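
For illustration, below is a minimal sketch of the model shape described above, written with PyTorch Geometric’s RGCNConv; the choice of library, the layer sizes, and the sigmoid output head are our assumptions rather than the production implementation.

```python
# A two-layer RGCN that outputs a per-node fraud probability.
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

class FraudRGCN(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_relations: int):
        super().__init__()
        # Two convolutional layers, in line with the "fewer than three" tip below.
        self.conv1 = RGCNConv(in_dim, hidden_dim, num_relations)
        self.conv2 = RGCNConv(hidden_dim, hidden_dim, num_relations)
        self.classifier = torch.nn.Linear(hidden_dim, 1)

    def forward(self, x, edge_index, edge_type):
        # Each layer propagates information along neighbouring nodes.
        h = F.relu(self.conv1(x, edge_index, edge_type))
        h = F.relu(self.conv2(h, edge_index, edge_type))
        # Final layer outputs a fraud probability for each node.
        return torch.sigmoid(self.classifier(h)).squeeze(-1)
```

In the semi-supervised setting, the training loss (e.g., binary cross-entropy) would be computed only on the small set of labelled nodes, while message passing still draws on the whole graph.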

We train the RGCN model on a graph with millions of nodes and edges, where only a small percentage of the nodes have labels. The semi-supervised graph model has little dependency on the labels, which makes it a robust model for tackling various types of unknown fraud.

Figure 4 shows the overall performance of the RGCN model. On the left is the Receiver Operating Characteristic (ROC) curve on the labelled dataset; in particular, the Area Under the Receiver Operating Characteristic (AUROC) value is close to 1, which means the RGCN model fits the labelled data quite well. The right column shows the low-dimensional projections of the node embeddings on the labelled dataset. The embeddings of genuine passengers are clearly well separated from those of fraudulent passengers, showing that the model can distinguish between fraudulent and genuine passengers quite well.

Fig 4. Left: ROC curve of the RGCN model on the label dataset.
Right: Low dimensional projections of the graph node embeddings.

Finally, we would like to share a few tips that will make the RGCN model work well in practice.

  • Use fewer than three convolutional layers: Node features will be over-smoothed if there are too many convolutional layers, that is, all the nodes on the graph will look similar.
  • Node features are important: Domain knowledge of the node can be formulated as node features for the graph model, and rich node features are likely to boost the model performance.

Graph explainability

Unlike other deep network models, graph neural network models usually come with great explainability, that is, they can show why a user is classified as fraudulent. For example, fraudulent accounts are likely to share hardware devices and form dense clusters on the graph, and those fraud clusters can be easily spotted on a graph visualiser [3].

Figure 5 shows an example where graph visualisation helps to explain the model prediction scores. The genuine passenger with a low RGCN score does not share devices with other passengers, while the fraudulent passenger with a high RGCN score shares devices with many other passengers, that is, dense clusters.

Figure 5. Upper left: A genuine passenger with a low RGCN score has no device sharing with other passengers. Bottom right: A fraudulent user with a high RGCN score shares devices with many other passengers.

Closing thoughts

Graphs provide a sustainable foundation for combating many different types of fraud risks. Fraudsters are evolving very fast these days, and the best traditional rules or models can do is to chase after those fraudsters given that a fraud pattern has already been discovered. This is suboptimal as the damage has already been done on the platform. With the help of graph models, we can potentially detect those fraudsters before any fraudulent activity has been conducted, thus reducing the fraud cost.

The graph structural information can significantly boost the model performance without much dependence on labels, which is often hard to get and might have a large bias in fraud detection tasks. We have shown that with only a small percentage of labelled nodes on the graph, our model can already achieve great performance.

With that said, there are also many challenges to making a graph model work well in practice. We are working towards solving the following challenges we are facing.

  • Feature initialisation: Sometimes, it is hard to initialise node features; for example, a device node does not carry much semantic meaning. We have explored self-supervised pre-training [4] to help with feature initialisation, and the preliminary results are promising.
  • Real-time model prediction: Real-time graph model prediction is challenging because real-time graph updating is a heavy operation in most cases. One possible solution is to batch real-time predictions to reduce the overhead.
  • Noisy connections: Some connections on the graph are inherently noisy; for example, two users sharing the same IP address does not necessarily mean they are physically connected, as the IP might come from a mobile network. One possible solution is to use an attention mechanism in the graph convolutional kernel and control the message passing based on the connection type and node profiles.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

References

  1. Kipf, T. and Welling, M. (2017). "Semi-supervised classification with graph convolutional networks." ICLR.

  2. Schlichtkrull, M., et al. (2018). "Modeling relational data with graph convolutional networks." European Semantic Web Conference. Springer, Cham.

  3. Liu, F., Wang, S., et al. "Graph Networks – 10X investigation with Graph Visualisations." Grab Tech Blog.

  4. Wang, C., et al. (2021). "Deep Fraud Detection on Non-attributed Graph." IEEE Big Data Conference, PSBD.

Query expansion based on user behaviour

Post Syndicated from Grab Tech original https://engineering.grab.com/query-expansion-based-on-user-behaviour

Introduction

Our consumers used to face a few common pain points while searching for food with the Grab app. Sometimes, the results would include merchants that were not yet operational or locations that were out of the delivery radius. Other times, no alternatives were provided. The search system would also have difficulties handling typos, keywords in different languages, synonyms, and even word spacing issues, resulting in a suboptimal user experience.

Over the past few months, our search team has been building a query expansion framework that can solve these issues. When a user query comes in, it expands the query to a few related keywords based on semantic relevance and user intention. These expanded words are then searched with the original query to recall more results that are high-quality and diversified. Now let’s take a deeper look at how it works.

Query expansion framework

Building the query expansion corpus

We used two different approaches to produce query expansion candidates: manual annotation for top keywords and data mining based on user rewrites.

Manual annotation for top keywords

Search exhibits a pronounced fat-head phenomenon: the thousand most frequent keywords account for more than 70% of total search traffic. Therefore, handling these keywords well can significantly improve overall search quality. We manually annotated the possible expansion candidates for these common keywords to cover the most popular merchants, items, and alternatives. For instance, “McDonald’s” is annotated with {“burger”, “western”}.

Data mining based on user rewrites

We observed that sometimes users tend to rewrite their queries if they are not satisfied with the search result. As a pilot study, we checked the user rewrite records within some user search sessions and found several interesting samples:

{Ya Kun Kaya Toast,Starbucks}
{healthy,Subway}
{Muni,Muji}
{奶茶 (milk tea),koi}
{Roti,Indian}

We can see that besides spelling corrections, users’ rewrite behaviour also reveals deep semantic relations between these pairs that cannot be easily captured by lexical similarity, such as similar merchants, merchant attributes, language differences, cuisine types, and so on. We can leverage the user’s knowledge to build a query expansion corpus to improve the diversity of the search result and user experience. Furthermore, we can use the wisdom of the crowd to find some common patterns with higher confidence.

Based on this intuition, we leveraged the high volume of search click data available in Grab to generate high-quality expansion pairs at the user session level. To augment the original queries, we collected rewrite pairs that occurred for multiple users, and multiple times, within a time period. Specifically, we used the heuristic rules below to collect the rewrite pairs (a simplified sketch of these rules follows the list):

  • Select the sessions where there are at least two distinct queries (rewrite session)
  • Collect adjacent query pairs in the search session where the second query leads to a click but the first does not (effective rewrite)
  • Filter out pairs with a time interval longer than 30 seconds between them, as users are more likely to have changed their mind about what to look for (single intention)
  • Count the occurrences and filter out the low-frequency pairs (confidence management)
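
A simplified sketch of these mining heuristics is shown below, assuming a log of search events with session_id, timestamp (in seconds), query, and clicked columns; the column names and thresholds are illustrative, not the production pipeline.

```python
# Illustrative mining of rewrite pairs from session logs; assumes `logs` has
# columns: session_id, timestamp (seconds), query, clicked (bool).
from collections import Counter

import pandas as pd

def mine_rewrite_pairs(logs: pd.DataFrame,
                       max_gap_s: float = 30.0,
                       min_count: int = 10) -> Counter:
    pairs = Counter()
    for _, session in logs.sort_values("timestamp").groupby("session_id"):
        rows = session.to_dict("records")
        if len({r["query"] for r in rows}) < 2:   # keep rewrite sessions only
            continue
        for prev, curr in zip(rows, rows[1:]):    # adjacent query pairs
            effective = (not prev["clicked"]) and curr["clicked"]
            single_intent = (curr["timestamp"] - prev["timestamp"]) <= max_gap_s
            if prev["query"] != curr["query"] and effective and single_intent:
                pairs[(prev["query"], curr["query"])] += 1
    # Confidence management: drop low-frequency pairs.
    return Counter({p: c for p, c in pairs.items() if c >= min_count})
```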

After we have the mining pairs, we categorised and annotated the rewrite types to gain a deeper understanding of the user’s rewrite behaviour. A few samples mined from the Singapore area data are shown in the table below.

| Original query | Rewrite query | Frequency in a month | Distinct user count | Type |
| --- | --- | --- | --- | --- |
| playmade by 丸作 | playmade | 697 | 666 | Drop keywords |
| mcdonald’s | burger | 573 | 535 | Merchant -> Food |
| Bubble tea | koi | 293 | 287 | Food -> Merchant |
| Kfc | McDonald’s | 238 | 234 | Merchant -> Merchant |
| cake | birthday cake | 206 | 205 | Add words |
| 麦当劳 (McDonald’s) | mcdonald’s | 205 | 199 | Locale change |
| 4 fingers | 4fingers | 165 | 162 | Space correction |
| krc | kfc | 126 | 124 | Spelling correction |
| 5 guys | five guys | 120 | 120 | Number synonym |
| koi the | koi thé | 45 | 44 | Tone change |

We further computed the percentages of some categories, as shown in the figure below.

Figure 1. The donut chart illustrates the percentages of the distinct user counts for different types of rewrites.

Apart from adding words, dropping words, and spelling corrections, a significant portion of the rewrites fall into the category of Other. These are more semantically driven, such as merchant-to-merchant or merchant-to-cuisine rewrites. Such rewrites are useful for capturing deeper connections between queries and can be a powerful diversifier for query expansion.

Grouping

After all the rewrite pairs were discovered offline through data mining, we grouped the query pairs by the original query to get the expansion candidates of each query. For serving efficiency, we limited the max number of expansion candidates to three.
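
A minimal sketch of this grouping step, continuing from the mined pair counts above (names are illustrative):

```python
# Group mined pairs by the original query and keep at most three expansion
# candidates, preferring the most frequent rewrites.
from collections import defaultdict

def build_expansion_corpus(pair_counts, max_expansions: int = 3) -> dict:
    grouped = defaultdict(list)
    for (original, rewrite), count in pair_counts.items():
        grouped[original].append((count, rewrite))
    return {
        original: [rw for _, rw in sorted(cands, reverse=True)[:max_expansions]]
        for original, cands in grouped.items()
    }
```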

Query expansion serving

Expansion matching architecture

The expansion matching architecture benefits from the recent search architecture upgrade, where the system flow is changed to a query understanding, multi-recall and result fusion flow. In particular, a query goes through the query understanding module and gets augmented with additional information. In this case, the query understanding module takes in the keyword and expands it to multiple synonyms, for example, KFC will be expanded to fried chicken. The original query together with its expansions are sent together to the search engine under the multi-recall framework. After that, results from multiple recallers with different keywords are fused together.
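
A hedged sketch of this serving flow is shown below; `search_engine.recall` and `result.id` are stand-ins for the real retrieval interface, and the fusion shown is a naive deduplication that keeps original-query hits first.

```python
# Illustrative serving flow: query understanding expands the keyword, each
# keyword is sent to the engine as a separate recall, and results are fused.

def search_with_expansion(query: str, expansion_corpus: dict, search_engine) -> list:
    keywords = [query] + expansion_corpus.get(query, [])  # e.g. "KFC" -> ["KFC", "fried chicken"]
    seen, fused = set(), []
    for keyword in keywords:                              # multi-recall
        for result in search_engine.recall(keyword):
            if result.id not in seen:                     # naive result fusion:
                seen.add(result.id)                       # dedupe, original-query
                fused.append(result)                      # hits rank first
    return fused
```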

Continuous monitoring and feedback loop

It’s important to make sure the expansion pairs are relevant and up-to-date. We run the data mining pipeline periodically to capture the new user rewrite behaviours. Meanwhile, we also monitor the expansion pairs’ contribution to the search result by measuring the net contribution of recall or user interaction that the particular query brings, and eliminate the obsolete pairs in an automatic way. This reflects our effort to build an adaptive system.

Results

We conducted online A/B experiments across 6 countries in Southeast Asia to evaluate the expanded queries generated by our system. We set up 3 groups:

  • Control group, where no query is expanded.
  • Treatment group 1, where we expanded the queries based on manual annotations only.
  • Treatment group 2, where we expanded the queries using the data mining approach.

We observed decent uplift in click-through rate and conversion rate from both treatment groups. Furthermore, in treatment group 2, the data mining approach produced even better results.

Future work

Data mining enhancement

Currently, the data mining approach can only identify the pairs from the same search session by one user. This limits the number of linked pairs. Some potential enhancements include:

  • Augment expansion pairs by associating queries from different users who click on the same merchant/item, for example, using a click graph. This can capture relevant queries across user sessions.
  • Build a probabilistic model on top of the current transition pairs. Currently, all the transition pairs are equally weighted but apparently, the transitions that happen more often should carry higher probability/weights.

Ads application

Query expansion can be applied to advertising and would increase ads fill rate. With “KFC” expanded to “fried chicken”, the sponsored merchants who buy the keyword “fried chicken” would be eligible to show up when the user searches “KFC”. This would enable Grab to provide more relevant sponsored content to our users, which helps not only the consumers but also the merchants.

Special thanks to Zhengmin Xu and Daniel Ng for proofreading this article.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Using mobile sensor data to encourage safer driving

Post Syndicated from Grab Tech original https://engineering.grab.com/using-mobile-sensor-data-to-encourage-safer-driving

“Telematics”, a cross between the words telecommunications and informatics, was coined in the late 1970s to refer to the use of communication technologies to facilitate the exchange of information. In the modern day, such technologies may include cloud platforms, mobile networks, and wireless transmissions (e.g., Bluetooth). Although the term was initially intended to have a broader scope, telematics is now used specifically to refer to vehicle telematics, where details of vehicle movements are tracked for use cases such as driving safety, driver profiling, fleet optimisation, and productivity improvements.

We’ve previously published this article to share how Grab uses telematics to improve driver safety. In this blog post, we dive deeper into how telematics technology is used at Grab to encourage safer driving for our driver and delivery partners.

Background

At Grab, the safety of our users and their experience on our platform is our highest priority. By encouraging safer driving habits from our driver and delivery partners, road traffic accidents can be minimised, potentially reducing property damage, injuries, and even fatalities. Safe driving also helps ensure smoother rides and a more pleasant experience for consumers using our platform.

To encourage safer driving, we should:

  1. Have a data-driven approach to understand how our driver and delivery partners are driving.
  2. Help partners better understand how to improve their driving by summarising key driving history into a personalised Driving Safety Report.

Understanding driving behaviour

One of the most direct forms of driving assessment is consumer feedback or complaints. However, the frequency and coverage of this feedback are not very high, as it is only applicable to transport verticals like JustGrab or GrabBike, and not delivery verticals like GrabFood or GrabExpress. Moreover, most driver partners tend not to receive any driving-related feedback (whether positive or negative), even for the transport verticals.

A more comprehensive method of assessing driving behaviour is to use the driving data collected during Grab bookings. To make sense of these data, we focus on selected driving manoeuvres (e.g., braking, acceleration, cornering, speeding) and detect the number of instances where our data shows unsafe driving in each of these areas.

We acknowledge that the detected instances may be subject to errors and may not provide the complete picture of what’s happening on the ground (e.g., partners may be forced to brake suddenly due to someone swerving into their lane).

To address this, we have incorporated several fail-safe checks into our detection logic to minimise erroneous detection. Also, any assessment of driving behaviour will be based on an aggregation of these unsafe driving instances over a large amount of driving data. For example, individual harsh braking instances may be inconclusive but if a driver partner displays multiple counts consistently across many bookings, it is likely that the partner may be used to unsafe driving practices like tailgating or is distracted while driving.

Telematics for detecting unsafe driving

For Grab to consistently ensure our consumers’ safety, we need to proactively detect unsafe driving behaviour before an accident occurs. However, it is not feasible for someone to be with our driver and delivery partners all the time to observe their driving behaviour. We should leverage sensor data to monitor these driving behaviour at scale.

Traditionally, a specialised “black box” inertial measurement unit (IMU) equipped with sensors such as accelerometers, gyroscopes, and GPS needs to be installed in alignment with the vehicle to directly measure vehicular acceleration and speed. In this manner, it would be straightforward to detect unsafe driving instances using this data. Unfortunately, the cost of purchasing and installing such devices for all our partners is prohibitively high and it would be hard to scale.

Instead, we can leverage a device that all partners already have: their mobile phone. Modern smartphones already contain sensors similar to those in IMUs, and data can be collected through the telematics SDK. More details on telematics data collection can be found in a recently published Grab tech blog article [1].

It’s important to note that telematics data are collected at a sufficiently high sampling frequency (much more than 1 Hz) to minimise inaccuracies in detecting unsafe driving instances characterised by sharp acceleration impulses.

Processing mobile sensor data to detect unsafe driving

Unlike specialised IMUs installed in vehicles, mobile sensor data have added challenges to detecting unsafe driving.

Accounting for orientation: Phone vs. vehicle

The phone is usually in a different orientation compared to the vehicle. Strictly speaking, the phone accelerometer sensor measures the accelerations of the phone and not the vehicle acceleration. To infer vehicle acceleration from phone sensor data, we developed a customised processing algorithm optimised specifically for Grab’s data.

First, the orientation offset of the phone with respect to the vehicle is defined using Euler angles: roll, pitch and yaw. In data windows with no net acceleration of the vehicle (e.g., no braking, turning motion), the only acceleration measured by the accelerometer is gravitational acceleration. Roll and pitch angles can then be determined through trigonometric manipulation. The complete triaxial accelerations of the phone are then rotated to the horizontal plane and the yaw angle is determined by principal component analysis (PCA).

An assumption here is that there will be sufficient braking and acceleration manoeuvring for PCA to determine the correct forward direction. This Euler angles determination is done periodically to account for any movement of phones during the trip. Finally, the raw phone accelerations are rotated to the vehicle orientation through a matrix multiplication with the rotation matrix derived from the Euler angles (see Figure 1).

Figure 1: Inference of vehicle acceleration from the phone sensor data. Smartphone and car images modified from designs found in Freepik.com.
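
To make these steps concrete, below is a simplified sketch of the reorientation procedure, assuming NumPy arrays of triaxial accelerometer samples; the production algorithm is proprietary and adds many more robustness checks (for example, resolving the forward/backward sign ambiguity of PCA, which is omitted here).

```python
# Rotate phone-frame accelerations (N x 3) into the vehicle frame, using a
# quiet window (no net vehicle acceleration) to estimate gravity.
import numpy as np

def phone_to_vehicle(acc: np.ndarray, quiet: np.ndarray) -> np.ndarray:
    # 1. In a quiet window, the only measured acceleration is gravity.
    g = quiet.mean(axis=0)
    roll = np.arctan2(g[1], g[2])
    pitch = np.arctan2(-g[0], np.hypot(g[1], g[2]))
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    # 2. Rotate the triaxial phone accelerations to the horizontal plane.
    horizontal = acc @ (ry @ rx).T
    # 3. Yaw via PCA: the dominant horizontal direction is assumed to be the
    #    forward axis (requires sufficient braking/acceleration manoeuvres;
    #    the forward/backward sign ambiguity is not resolved in this sketch).
    xy = horizontal[:, :2] - horizontal[:, :2].mean(axis=0)
    _, _, vt = np.linalg.svd(xy, full_matrices=False)
    yaw = np.arctan2(vt[0, 1], vt[0, 0])
    cy, sy = np.cos(yaw), np.sin(yaw)
    rz = np.array([[cy, sy, 0], [-sy, cy, 0], [0, 0, 1]])
    return horizontal @ rz.T  # columns: longitudinal, lateral, vertical
```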

Handling variations in data quality

Our processing algorithm is optimised to be highly robust and to handle the large variations in data quality expected from bookings on the Grab platform. There are many reported methods for processing mobile data to reorientate telematics data for four-wheel vehicles [2, 3].

However, with the prevalent use of motorcycles on our platform, especially for delivery verticals, we observed that data collected from two wheel vehicles tend to be noisier due to differences in phone stability and vehicular vibrations. Data noise can be exacerbated if partners hold the phone in their hand or place it in their pockets while driving.

In addition, we also expect a wide variation in data quality and sensor availability from different phone models, such as older, low-end models to the newest, flagship models. A good example to illustrate the robustness of our algorithm is having different strategies to handle different degrees of data noise. For example, a simple low-pass filter is used for low noise data, while more complex variational decomposition and Kalman filter approaches are used for high noise data.
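
As an illustration of this strategy switch, the sketch below applies a lighter or heavier Butterworth low-pass filter depending on a crude noise estimate; the cutoffs, the noise proxy, and the threshold are placeholders, and the heavier filter merely stands in for the variational decomposition and Kalman filter approaches mentioned above.

```python
# Illustrative noise-adaptive smoothing of a 1-D acceleration signal.
import numpy as np
from scipy.signal import butter, filtfilt

def smooth(signal: np.ndarray, fs: float) -> np.ndarray:
    noise = np.std(np.diff(signal))          # rough noise proxy
    cutoff_hz = 5.0 if noise < 0.3 else 1.0  # heavier filtering when noisy
    b, a = butter(2, cutoff_hz, btype="low", fs=fs)
    return filtfilt(b, a, signal)            # zero-phase low-pass filter
```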

Detecting behaviour anomalies with thresholds

Once the vehicular accelerations are inferred, we can use a thresholding approach (see Figure 2) to detect unsafe driving instances.

For unsafe acceleration and braking, a peak finding algorithm is used to detect acceleration peaks beyond a threshold in the longitudinal (forward/backward) direction. For unsafe cornering, since older and lower-end phones are usually not equipped with gyroscope sensors, we look for peaks of lateral (sideways) acceleration (which constitutes the centripetal acceleration during the turn) beyond a threshold. GPS bearing data that coarsely measures the orientation of the vehicle is then used to confirm that a cornering instance, and not a lane change, is being detected. The thresholds selected are fine-tuned on Grab’s data using initial values based on published literature [4] and other sources.
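
A minimal sketch of this peak-finding step is shown below, using SciPy’s find_peaks on the inferred longitudinal acceleration; the threshold and the minimum peak spacing are illustrative values, not Grab’s tuned parameters.

```python
# Detect harsh-braking peaks in the longitudinal acceleration signal.
import numpy as np
from scipy.signal import find_peaks

def unsafe_braking_events(longitudinal_acc: np.ndarray,
                          fs: float,
                          threshold: float = 3.0) -> np.ndarray:
    # Harsh braking shows up as negative longitudinal peaks, so search the
    # negated signal; require peaks to be at least one second apart.
    peaks, _ = find_peaks(-longitudinal_acc, height=threshold, distance=int(fs))
    return peaks / fs  # event times in seconds from the start of the signal
```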

To reduce false positive detection, no unsafe driving instances will be flagged when:

  1. Large discrepancies are observed between speeds derived from integrating the longitudinal (forward/backward) acceleration and speeds directly measured by the GPS sensor.
  2. Large phone motions are detected. For example, when the phone falls to the seat from the dashboard, accelerations recorded on the phone sensor will deviate significantly from the vehicle accelerations.
  3. GPS speed is very low before and after the detected instance. This check is limited to data collected from motorcycles, which are usually used by delivery partners, and implies that the partner is walking and not in a vehicle. For example, a GrabFood delivery partner may be collecting the food from the merchant partner on foot, so no unsafe driving instances should be detected.

Figure 2: Animation showing unsafe driving detection by thresholding. Dotted lines in acceleration charts indicate selected thresholds. Map tiles by Stamen Design.

Detecting speeding instances from GPS speeds and map data

To define speeding along a stretch of road, we used a rule-based method, comparing raw speeds from GPS pings with speeding thresholds for that road. Although GPS speeds are generally accurate (subject to minimal GPS errors), we need to take extra precautions to ensure that the right speeding thresholds are determined.

These thresholds are set using known speed limits from available map data or hourly aggregated speed statistics where speed limits are not available. The coverage and accuracy of known speed limits is continuously being improved by our in-house mapping initiatives and validated comprehensively by the respective local ground teams in selected cities.

Aggregating GPS pings from Grab driver and delivery partners can be a helpful proxy to actual speed limits by defining speeding violations as outliers from socially acceptable speeds derived from partners collectively. To reliably compute aggregated speed statistics, a representative speed profile for each stretch of road must first be inferred from raw GPS pings (see Figure 3).

As ping sampling intervals are fixed, more pings tend to be recorded at slower speeds. To correct this bias in the speed profile, we reweight ping counts using speed values as weights. Furthermore, to minimise distortions in the speed profile from vehicles driving at lower-than-expected speeds due to high traffic volumes, only pings from free-flowing traffic are used when inferring the speed profile.

Free-flowing traffic is defined by speeds higher than the median speed on each defined road category (e.g., small residential roads, normal primary roads, large expressways). To ensure extremely high speeds are flagged regardless of the speed of other drivers, maximum threshold values for aggregated speeds are set for each road category using heuristics based on the maximum known speed limit of that road category.

Figure 3: Steps to infer a representative speed profile for computing aggregated speed statistics.
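
The sketch below illustrates the reweighting and free-flow filtering ideas on a single road’s pings; the bin width is arbitrary and, unlike the real pipeline, the free-flow cut here uses the road’s own median rather than the median of its road category.

```python
# Illustrative speed-profile inference from raw GPS ping speeds (km/h).
import numpy as np

def speed_profile(speeds_kmh: np.ndarray) -> tuple:
    # Keep only free-flowing traffic: faster than the (unweighted) median.
    free_flow = speeds_kmh[speeds_kmh > np.median(speeds_kmh)]
    bins = np.arange(0, 140, 5)
    # Weight each ping by its speed to undo the slow-speed oversampling bias
    # caused by the fixed sampling interval.
    hist, edges = np.histogram(free_flow, bins=bins, weights=free_flow)
    return hist / hist.sum(), edges  # normalised profile and bin edges
```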

Besides a representative speed profile, hourly aggregation should also include data from a sufficient number of unique drivers depending on speed variability. To obtain enough data, hourly aggregations are performed on the same day of the week over multiple weeks. This way, we have a comprehensive time-specific speed profile that accounts for traffic quality (e.g., peak hour traffic, traffic differences between weekdays/weekends) and driving conditions (e.g., visibility difference between day/night).

When detecting speeding violations, the GPS pings used are snapped to the road; stationary pings, pings with unrealistic speeds, and pings with low GPS accuracy (e.g., when the vehicle is in a tunnel) are excluded. A speeding violation is defined as a sequence of consecutive GPS pings that exceed the speeding threshold. The following checks were put in place to minimise erroneous flagging of speeding violations (a simplified sketch follows the list):

  1. Removal of duplicated (or stale) GPS pings.
  2. Sufficient speed buffer given to take into account GPS errors.
  3. Sustained speeding for a prolonged period of time is required to exclude transient speeding events (e.g., during lane change).
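
Below is a simplified sketch of the sustained-speeding rule under these checks, assuming deduplicated (timestamp, speed) pings; the buffer and the minimum run length are illustrative values.

```python
# Flag a violation only when consecutive pings stay above the threshold
# (plus a GPS-error buffer) for a sustained run.

def speeding_violations(pings, threshold_kmh: float,
                        buffer_kmh: float = 5.0, min_pings: int = 5):
    violations, run = [], []
    for timestamp, speed_kmh in pings:      # pings: deduplicated, time-ordered
        if speed_kmh > threshold_kmh + buffer_kmh:
            run.append(timestamp)
        else:
            if len(run) >= min_pings:       # sustained, not a transient event
                violations.append((run[0], run[-1]))
            run = []
    if len(run) >= min_pings:               # close out a run at the end
        violations.append((run[0], run[-1]))
    return violations                       # list of (start, end) timestamps
```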

Driving safety report

The driving safety report is a platform safety product that driver and delivery partners can access via their driver profile page on the Grab Driver Application (see Figure 4). It is updated daily and aims to create awareness regarding driving habits by summarising key information from the processed data into a personalised report that can be easily consumed.

Individual reports of each driving manoeuvre (e.g., braking, acceleration, cornering and speeding) are available for daily and weekly views. Partners can also get more detailed information of each individual instance such as when these unsafe driving instances were detected.

Figure 4: Driving safety report for driver and delivery partners using four wheel vehicles. a) Actionable insights feature circled by red dotted lines. b) Daily view of various unsafe driving instances where more details of each instance can be viewed by tapping on “See details”.

Actionable insights

Besides compiling the instances of unsafe driving in a report to create awareness, we are also using these data to provide some actionable recommendations for our partners to improve their driving.

With unsafe driving feedback from consumers and reported road traffic accident data from our platform, we also train machine learning models to identify patterns in the detected unsafe driving instances and estimate the likelihood of partners receiving unsafe driving feedback or getting into accidents. One use case is to compute a safe driving score that maps a four-wheel partner’s driving behaviour to a numerical value, where a higher score indicates a safer driver.

Additionally, we use Shapley additive explanation (SHAP) approaches to determine which driving manoeuvre contributes the most to increasing the likelihood of partners receiving unsafe driving feedback or getting into accidents. This information is included as an actionable insight in the driving safety report and helps partners to identify the key area to improve their driving.
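
As a sketch of how such an insight could be derived, the snippet below uses the shap package’s TreeExplainer on a single partner’s aggregated features; the post does not state the model family or feature layout, so the tree-based model and the risk-score (higher-is-riskier) assumption are ours.

```python
# Pick the manoeuvre contributing most to a partner's estimated risk.
import shap

def top_manoeuvre_to_improve(model, partner_features):
    # partner_features: a 1 x F pandas DataFrame of aggregated unsafe-driving
    # counts (e.g., harsh braking, acceleration, cornering, speeding).
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(partner_features)
    # The feature pushing the risk score up the most becomes the insight.
    riskiest = shap_values[0].argmax()
    return partner_features.columns[riskiest]
```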

What’s next?

At the moment, Grab performs telematics processing and unsafe driving detection after the trip and updates the report the next day. One of the biggest improvements would be to share this information with partners faster. We are actively working on a real-time processing algorithm that addresses this and also satisfies the robustness requirements, so that partners are made aware immediately after an unsafe driving instance is detected.

Besides detecting typical unsafe driving manoeuvres, we are also exploring other use cases for mobile sensor data in road safety such as detection of poor road conditions, counterflow driving against traffic, and phone usage leading to distracted driving.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

References

  1. Burhan, W. (2022). "How telematics helps Grab to improve safety." Grab Tech Blog. https://engineering.grab.com/telematics-at-grab

  2. Mohan, P., Padmanabhan, V.N., and Ramjee, R. (2008). "Nericell: Rich monitoring of road and traffic conditions using mobile smartphones." SenSys '08: Proceedings of the 6th ACM Conference on Embedded Network Sensor Systems, 312-336. https://doi.org/10.1145/1460412.1460444

  3. Sentiance (2016). "Driving behavior modeling using smart phone sensor data." Sentiance Blog. https://sentiance.com/2016/02/11/driving-behavior-modeling-using-smart-phone-sensor-data/

  4. Yarlagadda, J. and Pawar, D.S. (2022). "Heterogeneity in the Driver Behavior: An Exploratory Study Using Real-Time Driving Data." Journal of Advanced Transportation, vol. 2022, Article ID 4509071. https://doi.org/10.1155/2022/4509071