All posts by Grab Tech

How we built our in-house chat platform for the web

Post Syndicated from Grab Tech original https://engineering.grab.com/how-we-built-our-in-house-chat-platform-for-the-web

At Grab, we’ve built an in-house chat platform to help connect our passengers with drivers during a booking, as well as with their friends and family for social sharing purposes.

P2P chat for the Angbow campaign and GrabHitch chat

We wanted to focus on our customer support chat experience, and so we replaced the third-party live chat tool that we had used for years with our newly developed chat platform. As part of this initiative, we extended this platform to the web to integrate with our internal Customer Support portal.

Sample chat between a driver and a customer support agent

This was the first time we introduced chat on the web, and we faced a few challenges while building it. In this article, we’ll go over some of these challenges and how we solved them.

Current Architecture

A vast majority of the communication from our Grab Passenger and Driver apps happens via TCP. Our TCP gateway takes care of processing all the incoming messages, authenticating, and routing them to the respective services. Our TCP connections are unicast, which means there is only one active connection possible per user at any point in time. This served us well, as we only allow our users to log in from one device at a time.

A TL;DR version of our current system

However, this model breaks on the web since our users can have multiple tabs open at the same time, and each would establish a new socket connection. Due to the unicast nature of our TCP connections, the older tabs would get disconnected and wouldn’t receive any messages from our servers. Our Customer Support agents love their tabs and have a gazillion open at any time. This behaviour would be too disruptive for them.

The obvious answer was to change our TCP connection strategy to multicast. We took a look at this and quickly realised that it was going to be a huge undertaking and could introduce a lot of unknowns for us to deal with.

We had to consider a different approach for the web and zeroed in on a hybrid approach built on two little-known JavaScript APIs: SharedWorker and BroadcastChannel.

Understanding the basics

Before we jump in, let’s take a quick detour to review some of the terminologies that we’ll be using in this post.

If you’re familiar with how Web Workers work, feel free to skip ahead to the next section. For the uninitiated, JavaScript in the browser runs in a single-threaded environment. Workers are a mechanism to introduce background, OS-level threads in the browser. Creating a worker in JavaScript is simple. Let’s look at an example:

// main.js – instantiate a worker
const worker = new Worker("./worker.js");
worker.postMessage({ message: "ping" });
worker.onmessage = (e) => {
  console.log("Message from the worker:", e.data.message);
};

// and in worker.js
onmessage = (e) => {
  console.log(e.data.message);
  postMessage({ message: "pong" });
};

The worker API comes with a handy postMessage method which can be used to pass messages between the main thread and worker thread. Workers are a great way to add concurrency in a JavaScript application and help in speeding up an expensive process in the background.

Note: While the method looks similar, worker.postMessage is not the same as window.postMessage.

What is a SharedWorker?

SharedWorker is similar to a WebWorker and spawns an OS thread, but as the name indicates, it’s shared across browser contexts. In other words, there is only one instance of that worker running for that domain across tabs/windows. The API is similar to WebWorker but has a few subtle differences.

SharedWorkers internally use MessagePort to pass messages between the worker thread and the main thread. There are two ports — one held by the main thread (worker.port) and one handed to the worker via its connect event — and messages flow between them. Let’s explore it with an example:

// main.js
const mySharedWorker = new SharedWorker("./worker.js");
mySharedWorker.port.start();
mySharedWorker.port.postMessage({ message: "ping" });

// worker.js
const connections = [];

onconnect = (e) => {
  const port = e.ports[0];
  connections.push(port);

  // Handle messages from the main thread
  port.onmessage = handleEventFromMainThread;
};

// Message from the main thread
const handleEventFromMainThread = (e) => {
  console.log("I received", e.data, "from the main thread");
};

// Publish to every connected browser context
const sendEventToMainThread = (params) => {
  connections.forEach((c) => c.postMessage(params));
};

There is a lot to unpack here. Once a SharedWorker is created, we have to manually start the port using mySharedWorker.port.start() to establish a connection between the script running on the main thread and the worker thread. After that, messages can be passed via the port’s postMessage method. On the worker side, there is an onconnect callback which helps in setting up listeners for connections from each browser context.

Under the hood, SharedWorker spawns a single OS thread per worker script per origin. For instance, if a script named worker.js runs in the domain https://ce.grab.com, the logic inside worker.js runs exactly once for that domain, no matter how many tabs load it. The advantage of this approach is that we can run multiple worker scripts in the same origin, each managing a different part of the functionality. This was one of the key reasons why we picked SharedWorker over other solutions.
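For illustration, here is a minimal sketch of that idea; the script names are hypothetical:

// Each script gets exactly one shared instance for this origin,
// regardless of how many tabs create it (script names are illustrative)
const chatWorker = new SharedWorker('./chat-worker.js');         // owns the chat socket
const presenceWorker = new SharedWorker('./presence-worker.js'); // owns presence/typing state

chatWorker.port.start();
presenceWorker.port.start();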

What is a BroadcastChannel?

In a multi-tab environment, our users may send messages from any of the tabs and switch to another for the next message. For a seamless experience, we need to ensure that the state is in sync across all the browser contexts.

Message passing across tabs

The BroadcastChannel API creates a message bus that allows us to pass messages between multiple browser contexts within the same origin. This helps us sync the message that’s being sent on the client to all the open tabs.

Let’s explore the API with a code example:

const channel = new BroadcastChannel("chat_messages");
// Sets up an event listener to receive messages from other browser contexts
channel.onmessage = ({ data }) => {
  console.log("Received ", data);
};

const sendMessage = (message) => {
  const event = { message, type: "new_message" };
  // send() stands in for whatever forwards the event to the server
  send(event);
  // Publish the event to all browser contexts listening on the chat_messages channel
  channel.postMessage(event);
};

const off = () => {
  // Close the channel and clear its event listeners
  channel.close();
};

One thing to note here is that communication is restricted to listeners from the same origin.

How are our chat rooms powered

Now that we have a basic understanding of how SharedWorker and BroadcastChannel work, let’s take a peek at how Grab uses them.

Our Chat SDK abstracts the calls to the worker and the underlying transport mechanism. On the surface, the interface is small: methods for sending messages and read receipts, and methods for subscribing to and unsubscribing from incoming events from the server.

export interface IChatSDK {
  sendMessage: (message: ChatMessage) => string;
  sendReadReceipt: (receiptAck: MessageReceiptACK) => void;
  on: (callback: ICallBack) => void;
  off: (topic?: SDKTopics) => void;
  close: () => void;
}
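Here is a hypothetical usage sketch; the factory function, message helper, and handler names below are assumptions, not part of the actual SDK:

// Create the SDK for a Customer Support portal tab (factory name assumed)
const sdk: IChatSDK = createChatSDK({ appID: 'cs-portal', appEnv: 'staging' });

// Subscribe to incoming events from the server
sdk.on((event) => renderIncomingEvent(event));

// Send a message; sendMessage returns a client-side message ID
const messageID = sdk.sendMessage(buildChatMessage('Hello, how can I help?'));

// Detach listeners and close the connection when the tab unmounts
sdk.off();
sdk.close();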

The SDK does all the heavy lifting: managing the connection with our TCP service and keeping information in sync across tabs.

SDK flow

In our worker, we additionally maintain all the connections from browser contexts. When an incoming event arrives from the socket, we publish it to the first active connection. Our SDK listens to this event, processes it, sends out an acknowledgment to the server, and publishes it in the BroadcastChannel. Let’s look at how we’ve achieved this via a code example.

Managing connections in the worker:

let socket;
let instances = 0;
let connections = [];

let URI: string;

// Called when a new browser context connects to the worker
onconnect = (e) => {
  const port = e.ports[0];

  port.start();
  port.onmessage = handleEventFromMainThread.bind(port);
  connections.push(port);
  instances++;
};

// Publish ONLY to the first connection.
// Let the caller decide on how to sync this with other tabs
const callback = (topic, payload) => {
  connections[0].postMessage({
    topic,
    payload,
  });
};

const handleEventFromMainThread = (e) => {
  switch (e.data.topic) {
    case SocketTopics.CONNECT: {
      const config = e.data.payload;
      if (!socket) {
        // Establishes a WebSocket connection with the server
        socket = new SocketManager({ ... });
      } else {
        callback(SocketTopics.CONNECTED, '');
      }
      break;
    }
    case SocketTopics.CLOSE: {
      const index = connections.indexOf(this);
      if (index !== -1 && instances > 0) {
        connections.splice(index, 1);
        instances--;
      }
      break;
    }
    // Forward everything else to the server
    default: {
      const payload = e.data;
      socket.sendMessage(payload);
      break;
    }
  }
};

And in the ChatSDK:

// Implements IChatSDK

// Rough outline of our GrabChat implementation

class GrabChatSDK {
  constructor(config) {
    // Bind handlers so `this` refers to the SDK instance when they're invoked
    this._handleIncomingMessage = this._handleIncomingMessage.bind(this);
    this._disconnect = this._disconnect.bind(this);

    this.channel = new BroadcastChannel('incoming_events');
    this.channel.onmessage = ({ data }) => {
      switch (data.type) {
        // Handle events from other tabs
        // .....
      }
    };
    this.worker = new SharedWorker('./worker', {
      type: 'module',
      name: `${config.appID}-${config.appEnv}`,
      credentials: 'include',
    });
    this.worker.port.start();
    // Publish a connected event, so the worker manager can register this connection
    this.worker.port.postMessage({
      topic: SocketTopics.CONNECT,
      payload: config,
    });
    // Incoming event from the shared worker
    this.worker.port.onmessage = this._handleIncomingMessage;
    // Disconnect this port before the tab closes
    addEventListener('beforeunload', this._disconnect);
  }

  sendMessage(message) {
    // Attempt a delivery of the message
    this.worker.port.postMessage({
      topic: SocketTopics.NEW_MESSAGE,
      payload: getPayload(message),
    });
    // Send the message to all tabs to keep things in sync
    this.channel.postMessage(getPayload(message));
  }

  // Hit if this connection is the leader of the SharedWorker connection
  _handleIncomingMessage(event) {
    // Send an ACK to our servers confirming receipt of the message
    this.worker.port.postMessage({
      topic: SocketTopics.ACK,
      payload: event, // ACK payload derived from the incoming event
    });

    if (shouldBroadcast(event.type)) {
      this.channel.postMessage(event);
    }

    this.callback(event);
  }

  _disconnect() {
    // Deregister this connection in the worker
    this.worker.port.postMessage({ topic: SocketTopics.CLOSE });
    removeEventListener('beforeunload', this._disconnect);
  }
}

This ensures that there is only one connection between our application and the TCP service irrespective of the number of tabs the page is open in.

Some caveats

While SharedWorker is a great way to enforce singleton objects across browser contexts, the developer experience of SharedWorker leaves a lot to be desired. There aren’t many resources on the web, and it could be quite confusing if this is the first time you’re using this feature.

We faced some trouble bundling the worker code along with the rest of the application. This plugin from GoogleChromeLabs did a great job of alleviating some pain. Debugging issues with SharedWorker was also not obvious. Chrome has a dedicated page for inspecting SharedWorkers (chrome://inspect/#workers), and it took some getting used to.

The browser support for SharedWorker is far from universal. While it works great in Chrome, Firefox, and Opera, Safari and most mobile browsers lack support. This was an acceptable trade-off in our use case, as we built this for an internal portal and all our users are on Chrome.

Shared race

SharedWorker enforces uniqueness using a combination of the origin and the script name. This could potentially introduce an unintentional race condition during deploy times if we’re not careful. Say a user has one tab that was opened before the latest deployment and another opened after it; it’s possible to end up with two different versions of the same script. We built a wrapper over the SharedWorker which cedes control to the latest connection, ensuring that there is only one version of the worker active.
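One possible shape of such a guard is sketched below. This is a simplified illustration, not our exact implementation; the topic names and the way the build version reaches the worker are assumptions.

// worker.js – cede control to the most recently deployed build
let latestVersion = 0;

onconnect = (e) => {
  const port = e.ports[0];
  port.onmessage = ({ data }) => {
    if (data.topic === 'CONNECT') {
      // Each tab reports the build version it was compiled with
      latestVersion = Math.max(latestVersion, data.version);
      if (data.version < latestVersion) {
        // An older tab is talking to us; tell it to step aside (e.g. refresh)
        port.postMessage({ topic: 'STALE_TAB' });
        return;
      }
    }
    // ... normal message handling
  };
};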

Wrapping up

We’re happy to have shared our learnings from building our in-house chat platform for the web, and we hope you found this post helpful. We’ve built the web solution as a reusable SDK for our internal portals and public-facing websites for quick and easy integration, providing a powerful user experience.

We hope this post also helped you get a deeper sense of how SharedWorker and BroadcastChannels work in a production application.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Go Modules- A guide for monorepos (Part 1)

Post Syndicated from Grab Tech original https://engineering.grab.com/go-module-a-guide-for-monorepos-part-1

Go modules are a new feature in Go for versioning packages and managing dependencies. They have been almost two years in the making, and are finally production-ready in the Go 1.14 release earlier this year. Go recommends using single-module repositories by default, and warns that multi-module repositories require great care.

At Grab, we have a large monorepo and changing from our existing monorepo structure has been an interesting and humbling adventure. We faced serious obstacles to fully adopting Go modules. This series of articles describes Grab’s experience working with Go modules in a multi-module monorepo, the challenges we faced along the way, and the solutions we came up with.

To fully appreciate Grab’s journey in using Go Modules, it’s important to learn about the beginning of our vendoring process.

Native support for vendoring using the vendor folder

With Go 1.5 came the concept of the vendor folder, a new package discovery method, providing native support for vendoring in Go for the first time.

With the vendor folder, projects influenced the lookup path simply by copying packages into a vendor folder nested at the project root. Go uses these packages before traversing the GOPATH root, which allows a monorepo structure to vendor packages within the same repo as if they were 3rd-party libraries. This enabled go build to work consistently without any need for extra scripts or env var modifications.
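As a rough illustration (the module and package names here are made up), a vendored monorepo under the GOPATH looked something like this:

$GOPATH/src/grab.com/monorepo/
├── vendor/
│   └── github.com/some/library/   # resolved here first, before the GOPATH
├── service-a/
│   └── main.go
└── service-b/
    └── main.go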

Initial obstacles

There was no official command for managing the vendor folder, and manually copying files into the vendor folder was common.

At Grab, different teams took different approaches. This meant that we had multiple version manifests and lock files for our monorepo’s vendor folder. It worked fine as long as there were no conflicts. Making matters worse, at this time very few 3rd-party libraries were using proper tagging and semantic versioning, so the lock files were largely a jumble of commit hashes and timestamps.

Jumbled commit hashes and timestamps

As a result of the multiple versions and lock files, the vendor directory was not reproducible, and we couldn’t be sure what versions we had in there.

Temporary relief

We eventually settled on using Glide, and standardized our vendoring process. Glide gave us a reproducible, verifiable vendor folder for our dependencies, which worked up until we switched to Go modules.

Vendoring using Go modules

I first heard about Go modules from Russ Cox’s talk at GopherCon Singapore in 2018, and soon after we started adopting modules at Grab, beginning with using them to manage our existing vendor folder.

This allowed us to align with the official Go toolchain and familiarise ourselves with Go modules while the feature matured.

Switching to go mod

Go modules introduced a go mod vendor command for exporting all dependencies from go.mod into vendor. We didn’t plan to enable Go modules for builds at this point, so our builds continued to run exactly as before, indifferent to the fact that the vendor directory was created using go mod.

The initial task to switch to go mod vendor was relatively straightforward as listed here:

  1. Generated a go.mod file from our glide.yaml dependencies. This was scripted so it could be kept up to date without manual effort.
  2. Replaced the vendor directory.
  3. Committed the changes.
  4. Used go mod instead of glide to manage the vendor folder. (A rough sketch of this switchover follows the list.)
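A minimal sketch of what the switchover looked like on the command line; the module path and commit message are illustrative, and step 1 was scripted from glide.yaml rather than typed by hand:

$ go mod init grab.com/example   # create go.mod; its require list was generated from glide.yaml
$ go mod vendor                  # rebuild the vendor directory from go.mod
$ git add go.mod go.sum vendor
$ git commit -m "Switch vendor management from glide to go mod"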

The change was extremely large (due to differences in how glide and go mod handled the pruning of unused code), but equivalent in terms of Go code. However, there were some additional changes needed besides porting the version file.

Addressing incompatible dependencies

Some of our dependencies were not yet compatible with Go modules, so we had to use the Go modules replace directive to substitute them with a working version.
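A replace directive in go.mod looks roughly like this; the module paths and versions below are illustrative:

module grab.com/example

go 1.14

require github.com/some/dependency v1.3.0

// Substitute an incompatible dependency with a fork or a known-good version
replace github.com/some/dependency => github.com/some/dependency v1.2.5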

A more complex issue was that parts of our codebase relied on nested vendor directories, and had dependencies that were incompatible with the top level. The go mod vendor command attempts to include all code nested under the root path, whether or not they have used a sub-vendor directory, so this led to conflicts.

Problematic paths

Rather than resolving all the incompatibilities, which would’ve been a major undertaking in the monorepo, we decided to exclude these paths from Go modules instead. This was accomplished by placing an empty go.mod file in the problematic paths.

Nested modules

The empty go.mod file worked. This brought us to an important rule of Go modules, which is central to understanding many of the issues we encountered:

A module cannot contain other modules

This means that although the modules are within the same repository, Go modules treat them as though they are completely independent. When running go mod commands in the root of the monorepo, Go doesn’t even ‘see’ the other modules nested within.

Tackling maintenance issues

However, after completing the initial migration of our vendor directory to go mod vendor, we ran into a different set of problems related to maintenance.

With Glide, we could guarantee that the Glide files and vendor directory would not change unless we deliberately changed them. This was not the case after switching to Go modules; we found that the go.mod file frequently required unexpected changes to keep our vendor directory reproducible.

There are two frequent cases that cause the go.mod file to need updates: dependency inheritance and implicit updates.

Dependency inheritance

Dependency inheritance is a consequence of Go modules version selection. If one of the monorepo’s dependencies uses Go modules, then the monorepo inherits those version requirements as well.

When starting a new module, the default is to use the latest version of dependencies. This was an issue for us as some of our monorepo dependencies had not been updated for some time. When engineers created a new module that imported the monorepo, it caused go mod vendor to pull in a huge number of updates.

To solve this issue, we wrote a quick script to copy the dependency versions from one module to another.

One key learning here is to have other modules use the monorepo’s versions, and if any updates are needed then the monorepo should be updated first.

Implicit updates

Implicit updates are a more subtle problem. The typical Go modules workflow is to use standard Go commands: go build, go test, and so on, and they will automatically update the go.mod file as needed. However, this was sometimes surprising, and it wasn’t always clear why the go.mod file was being updated. Some of the reasons we found were:

  • A new import was added by mistake, causing the dependency to be added to the go.mod file
  • There is a local replace for some module B, and B changes its own go.mod. When there’s a local replace, it bypasses versioning, so the changes to B’s go.mod are immediately inherited.
  • The build imports a package from a dependency that can’t be satisfied with the current version, so Go attempts to update it.

This means that simply creating a tag in an external repository is sometimes enough to affect the go.mod file, if you already have a broken import in the codebase.

Resolving unexpected dependencies using graphs

To investigate the unexpected dependencies, the command go mod graph proved the most useful.

Running graph with good old grep was good enough, but its output is also compatible with the digraph tool for more sophisticated queries. For example, we could use the following command to trace the source of a dependency on cloud.google.com/go:

$ go mod graph | digraph somepath grab.com/example cloud.google.com/go@<version>
github.com/hashicorp/vault/<module>@<version> github.com/hashicorp/vault/<module>@<version>
github.com/hashicorp/vault/<module>@<version> google.golang.org/<module>@<version>
google.golang.org/<module>@<version> google.golang.org/<module>@<version>
google.golang.org/<module>@<version> cloud.google.com/go@<version>
Diagram generated using modgraphviz
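For quicker one-off checks, plain grep over the same output also works. Each line of go mod graph is a “requirer required” pair, so grepping for a module on the right-hand side lists everything that requires it directly (the module name here is illustrative):

$ go mod graph | grep ' cloud.google.com/go@'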

Stay tuned for more

I hope you have enjoyed this article. In our next post, we’ll cover the other solutions we have for catching unexpected changes to the go.mod file and addressing dependency issues.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Credits

The cute Go gopher logo for this blog’s cover image was inspired by Renee French’s original work.

Does Southeast Asia run on coffee?

Post Syndicated from Grab Tech original https://engineering.grab.com/does-southeast-asia-run-on-coffee

This article was originally published in the Grab Medium account on December 4, 2019. Reposting it here for your reading pleasure.

There is no surprise as to why coffee is a go-to drink in the region. For one, almost a third of coffee is produced in Asia, giving us easy access to beans. Coupled with the plethora of local cafes and stores at every corner in Southeast Asia, coffee has become an accessible and affordable drink — and one that enjoys a huge following.

For many, a morning cuppa is fuel to kick start their day. For some it’s the secret weapon to a food coma, for others, it’s fuel to keep them going throughout the day.

To get a glimpse of how our fellow Southeast Asians refuel with coffee on a daily basis, we took a look (along with our ‘kopi’) at GrabFood data, and here is what we found.

Did you know: Coffee orders have grown 1,400% on GrabFood?

How much do we actually love our coffee? It seems like we do, a lot.

Coffee orders on GrabFood have been growing pervasively throughout the major cities, and a timelapse visualisation based on data from GrabFood orders shows us the growth of orders across major cities over a 9-month period:

Timelapse visualisation

Time for coffee?

But how reliant are we on caffeine? We analysed the coffee consumption behaviour of GrabFood users from major SEA countries across a typical week.

Coffee Orders by Day of the Week: Singapore coffee orders peak on the weekends

Coffee Orders by Day of the Week - Chart

Turns out most coffee orders are placed on Wednesdays — clearly a much needed shot to overcome the dreaded hump day. And as we head into the weekend, orders begin to decline as Southeast Asians wind down from the work week.

However, the complete opposite happens for our friends in Singapore and the Philippines! Coffee orders actually spike on the weekends, and especially so on Sundays. It can only mean that Singaporeans and Filipinos surely enjoy their coffee catch-ups with friends and family.

AM- Coffee… PM- Still coffee

This begs the question — when exactly do SEA coffee drinkers summon that life-saving cup from our delivery heroes in green?

Check out this trippy visualisation that resembles jumping coffee beans:

Coffee Orders by Hour of Day — Orders peak at 10am for Thailand and 2pm in Indonesia


While other cities generally reach for the Grab app at noon for that extra boost to fight the food coma through the rest of the day, our friends in Thailand get their caffeine fix early, with most orders coming in at 10.00am, just before the lunch hour.

Interestingly, coffee orders for Singapore peak at about 4pm in the afternoon… are they working hard, or are they hardly working?

GrabFood’s love is in the air, and it smells like coffee

Curious as to what coffee flavours our SEA neighbours prefer? We spill the (coffee) beans!

Top 3 Coffee Flavours in each Country

What is a non-coffee drinker to do?

What non-coffee drinker drinks

Also known as Matcha Latte, Green Tea Latte seems to be the next big beverage fad in the region, serving as a perfect coffee alternative for non-coffee drinkers.

Matcha latte, made with concentrated shots of green tea and topped with frothy, steamed milk, is gaining popularity. While it offers the same amount of caffeine as a cup of brewed coffee, the drink is perceived to be more energising because of the slower release of caffeine.

It has consistently been one of the top 10 beverage items ordered on GrabFood, and we’ve delivered over 25 million cups of these green, frothy and creamy ‘heaven in a cup’ over the last nine months!

Southeast Asians’ love of tea-based lattes (other than green tea) is apparent in Grab’s data! Some of the unique flavours being ordered on GrabFood include the following:

Unique flavours

GrabFood Coffee is a hug in a mug

Is your blood type coffee? Whether you feel like a caramelly and chocolatey Macchiato, the fruity and floral aroma of a freshly brewed Americano, or an intense and bitter double-shot Long Black — GrabFood has got you covered! May your coffee get delivered (and kick in) before reality does!

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GrabChat Much? Talk Data to me!

Post Syndicated from Grab Tech original https://engineering.grab.com/grabchat-much-talk-data-to-me

This article was originally published in the Grab Medium account on November 20, 2019. Reposting it here for your reading pleasure.

In September 2016 GrabChat was born, a platform designed to allow seamless communication between passenger and driver-partner. Since then, Grab has continuously improved the GrabChat experience by introducing features such as instant translation, images, and audio chats, and as a result — reduced cancellation rates by up to 50%! We’ve even experimented with various features to deliver hyper-localised experiences in each country! So with all these features, how have our users responded? Let’s take a deeper look into this to uncover some interesting insights from our data in Singapore, Malaysia and Indonesia.

The Chattiest Country

Number of Chats by Country

In a previous blog post several years ago, we revealed that Indonesia was the chattiest nation in Southeast Asia. Our latest data is no different. Indonesia is still the chattiest country out of the three, with an average of 5.5 chats per booking, while Singapore is the least chatty! Furthermore, passengers in Singapore tend to be chattier than driver-partners, while the reverse is true for the other two countries.

But what do people talk about?

Common words in Indonesia
Common words in Singapore
Common words in Malaysia

As expected, most of the chats revolve around pick-up points. There are many similarities across the three countries, such as courtesies like ‘Hi’ and ‘Thank you’, and messages saying that the driver-partner or passenger is on the way. However, there are slight differences between the countries. Can you spot them all?

In Indonesia, chats are usually in Bahasa Indonesia, and tend to be mostly driver-partners thanking passengers for using Grab.

Chats in Singapore, on the other hand, tend to be in English, and contain mostly pick-up locations, such as a car park. There are quite a few unique words in the Singapore context, such as ‘rubbish chute’ and ‘block’, reflecting features of the ubiquitous HDBs (public housing) that serve as popular residential pickup points.

Malaysia seems to be a blend of the other two countries, with chats in a mix of English and Bahasa Malaysia. Many of the chats highlight pickup locations, such as a guard house, as well as the phrase all Malaysians know: being stuck in traffic.

Time Trend

Analysis of chat trends across the three countries revealed an unexpected insight: people talk more from midnight until around 4am. Perplexed but intrigued, we dug further to discover what prompted our users to talk more at such odd hours.

From midnight to 4am, shops and malls are usually closed, and pickup locations become more obscure as people wander around town late at night. Driver-partners and passengers thus tend to have more conversations to determine the pickup point. This also explains why the proportion of pick-up-location messages out of all messages is highest between 12am and 6am. On the other hand, these messages are less common in the mornings (6am–12pm) as people tend to be picked up from standard residential locations.

GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) – Image 1
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) – Image 2
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) – Image 3

The ability to send images on GrabChat was introduced in September 2018, with the aim of helping driver-partners identify the exact pickup location of passengers. Within the first few weeks of release, 22,000 images were sent in Singapore alone. The increase in uptake of the image feature for the cities of Jakarta, Singapore and Kuala Lumpur can be seen in the images above.

From analysis, we found that areas that were more remote such as Tengah in Singapore tended to have the highest percentage of images sent, indicating that images are useful for users in unfamiliar places.

Safety First

Aside from images, Grab also introduced two other features, templates and audio chats, to keep driver-partners from texting while driving.

Templates and audio features used by driver-partners, and a reduced number of typed texts by driver-partners per booking

“Templates” (pre-populated phrases) allowed driver-partners to send templated messages with just a quick tap. In our recent data analysis, we discovered that almost 50% of driver-partner texts consisted of templates.

“Audio chat”, alongside “image chat”, was introduced in September 2018, and the use of this feature has been steadily increasing, with audio making up a growing percentage of driver-partner messages.

With both features being picked up by driver-partners across all three countries, Grab has successfully seen a decrease in the overall number of driver-partner texts (non-templates) per booking within a 3 month period.

A Brief Pick-up Guide

No one likes a cancelled ride, right? Well, after analysing millions of data points, we’ve unearthed some neat tips and tricks to help you complete your ride, and we’re sharing them with you!

Completed Rides

This first tip might be a no-brainer, but replying to your driver-partner results in a higher completion rate. No one likes to be blue-ticked, do they?

Next, we discovered various things you could say that would result in higher completion rates, explained below in the graphic.

Tips for a Better Pickup Experience

Informing the driver-partner that you’re coming, giving them directions, and telling them how to identify you results in almost double the chances of completing the ride!

Last but not least, let’s not forget our manners. Grab’s data analysis revealed that saying ‘thank you’ correlated with an increase in completion rates! Also, be at the pickup point on time — remember, time is money for our driver-partners!

Conclusion

Just like in Shakespeare’s Much Ado about Nothing, ample information can be gathered from the mere whim of a message. Grab is constantly aspiring to achieve the best experience for both passengers and driver-partners, and data plays a huge role in helping us achieve this.

This is just the first page of the book. The amount of information lurking between every page is endless. So stay tuned for more interesting insights about our GrabChat platform!

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

7 Fun Facts about Grab’s Driver-Partners in Singapore

Post Syndicated from Grab Tech original https://engineering.grab.com/seven-facts-about-grab-driver-partners-in-sg

This article was originally published in the Grab Medium account on June 17, 2019. Reposting it here for your reading pleasure.

Grab’s Big Data Story

Grab is on an incredible mission to empower our driver-partners in 336 cities in 8 countries.

Curious about what Grab’s data tells us about driver-partners on the platform?

Let us share with you the most interesting data points we found among our driver-partners in Singapore!

1. It’s a Small World


Lim Chu Kang may feel like a world away from Katong, but Singapore is a small world for our driver-partners — a driver-partner has a 1 in 400 chance of having a repeat passenger amongst the 5.4 million population!

2. Saturday Night Fever


The annual average number of rides that a Grab driver-partner completes on Saturday nights is 110. But there was one special driver-partner who did 1,131 Saturday night trips in 2018! Weekend parties just wouldn’t be the same without you. Rock on!

3. Share the Love


Did you know that having more passengers in a car can yield more 5-star ratings? Our GrabShare passengers share more than just rides — they share their appreciation too! GrabShare rides have an average trip rating of 4.8!

4. The Road More Travelled


Which neighbourhoods are painting the town green? Our driver-partners picked up the most passengers from Tampines, while the Orchard and Marina Bay areas were the most popular destinations in 2018!

5. Tricks of the Trade


Ever wondered if seasoned driver-partners, who have been with us for more than 2 years, have different driving preferences and habits? They tend to start their day an hour earlier, around 6–7am, and are on auto-accept most of the time. Did you know that drivers on auto-accept spend less time idle waiting for new jobs?

6. Busy Bee


Did you know? Drivers are twice as likely to get back-to-back allocations during evening peak hours! Drivers with frequent back-to-back jobs earn about 50% more per hour.

7. Road Runner


One of our most active driver-partners covered 57,000km ferrying passengers in 2018 — that’s like driving every road, street, jalan, lorong and tanjong in Singapore more than 57 times!

How our driver-partners utilize the Grab platform to make a living (and break a few records along the way) never ceases to amaze us.

Interested to know more about the winning strategies among our driver-partners? Look out for the next data story!

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Tackling UI test execution time imbalance for Xcode parallel testing

Post Syndicated from Grab Tech original https://engineering.grab.com/tackling-ui-test-execution-time-imbalance-for-xcode-parallel-testing

Introduction

Testing is a common practice to ensure that code logic is not easily broken during development and refactoring. Having tests running as part of Continuous Integration (CI) infrastructure is essential, especially with a large codebase contributed to by many engineers. However, the more tests we add, the longer they take to execute. In the context of iOS development, the execution time of the whole test suite grows significantly as more tests are written, and running CI pre-merge pipelines against a change costs us more time. Therefore, reducing test execution time is a long-term epic we have to tackle in order to build a good CI infrastructure.

Apart from splitting tests into subsets and running each of them in a CI job, we can also make use of the Xcode parallel testing feature to achieve parallelism within one single CI job. However, due to platform-specific implementations, there are some constraints that prevent parallel testing from working efficiently. One constraint we found is that tests of the same Swift class run on the same simulator. In this post, we will discuss this constraint in detail and introduce a tip to overcome it.

Background

Xcode parallel testing

The parallel testing feature was shipped as part of the Xcode 10 release. This support enables us to easily configure test setup:

  • There is no need to care about how to split a given test suite.
  • The number of workers (i.e. parallel runners/instances) is configurable. We can pass this value in the xcodebuild CLI via the -parallel-testing-worker-count option.
  • Xcode takes care of cloning and starting simulators accordingly.

However, the distribution logic under the hood is a black-box. We do not really know how tests are assigned to each worker or simulator, and in which order.

Three simulators running tests in parallel

It is worth mentioning that even without the Xcode parallel testing support, we can still achieve similar improvements by running subsets of tests in different child processes. But it takes more effort to dispatch tests to each child process in an efficient way, and to handle the output from each test process appropriately.

Test time imbalance

Generally, a parallel execution system is at its best efficiency if each parallel task executes in roughly the same duration and ends at roughly the same time.

If the time spent on each parallel task is significantly different, it will take more time than expected to execute all tasks. For example, in the following image, it takes the system on the left 13 mins to finish 3 tasks. Whereas, the one on the right takes only 10.5 mins to finish those 3 tasks.

Bad parallelism vs. good parallelism

Assume there are N workers. The i-th worker executes its tasks in tᵢ minutes. In the left plot, t₁ = 10 mins, t₂ = 7 mins, t₃ = 13 mins.

We define the test time imbalance metric as the difference between the min and max end time:

max(tᵢ) – min(tᵢ)

For the example above, the test time imbalance is 13 mins – 7 mins = 6 mins.

Contributing factors in test time imbalance

There are several factors causing test time imbalance. The top two prominent factors are:

  1. Tests vary in execution time.
  2. Tests of the same class run on the same simulator.

An example of the first factor: in our project, around 50% of tests execute in a range of 20–40 secs. Some tests take under 15 secs to run while several take up to 2 minutes. Long execution times are sometimes inevitable since those tests usually touch many flows, which cannot be split. If such tests run last, the test time imbalance may increase.

However, this issue, in general, does not matter that much because long-time-execution tests do not always run last.

Regarding the second factor, there is no official Apple documentation that explicitly states this constraint. When Apple first introduced parallel testing support in Xcode 10, they only mentioned that test classes are distributed across runner processes:

“Test parallelization occurs by distributing the test classes in a target across multiple runner processes. Use the test log to see how your test classes were parallelized. You will see an entry in the log for each runner process that was launched, and below each runner you will see the list of classes that it executed.”

For example, we have a test class JobFlowTests that includes five tests and another test class TutorialTests that has only one single test.

final class JobFlowTests: BaseXCTestCase {
  func testHappyFlow() { ... }
  func testRecoverFlow() { ... }
  func testJobIgnoreByDax() { ... }
  func testJobIgnoreByTimer() { ... }
  func testForceClearBooking() { ... }
}
...
final class TutorialTests: BaseXCTestCase {
  func testOnboardingFlow() { ... }
}

When executing the two tests with two simulators running in parallel, the actual run is like the one shown on the left side of the following image, but ideally it should work like the one on the right side.

Tests of the same class currently run on the same simulator, but ideally they should be able to run on different simulators.

Diving deep into Xcode parallel testing

Demystifying Xcode scheduling log

As mentioned above, Xcode distributes tests to simulators/workers in a black-box manner. However, by looking at the scheduling log generated when running tests, we can understand how Xcode parallel testing works.

When running UI tests via the xcodebuild command:

$ xcodebuild -workspace Driver/Driver.xcworkspace \
    -scheme Driver \
    -configuration Debug \
    -sdk 'iphonesimulator' \
    -destination 'platform=iOS Simulator,id=EEE06943-7D7B-4E76-A3E0-B9A5C1470DBE' \
    -derivedDataPath './DerivedData' \
    -parallel-testing-enabled YES \
    -parallel-testing-worker-count 2 \
    -only-testing:DriverUITests/JobFlowTests \    # 👈👈👈👈👈
    -only-testing:DriverUITests/TutorialTests \
    test-without-building

The log can be found inside the *.xcresult folder under DerivedData/Logs/Test. For example: DerivedData/Logs/Test/Test-Driver-2019.11.04_23-31-34-+0800.xcresult/1_Test/Diagnostics/DriverUITests-144D9549-FD53-437B-BE97-8A288855E259/scheduling.log

Scheduling log under xcresult folder
2019-11-05 03:55:00 +0000: Received worker from worker provider: 0x7fe6a684c4e0 [0: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
2019-11-05 03:55:13 +0000: Worker 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)] finished bootstrapping
2019-11-05 03:55:13 +0000: Parallelization enabled; test execution driven by the IDE
2019-11-05 03:55:13 +0000: Skipping test class discovery
2019-11-05 03:55:13 +0000: Executing tests {(	# 👈👈👈👈👈
    DriverUITests/JobFlowTests,
    DriverUITests/TutorialTests
)}; skipping tests {(
)}
2019-11-05 03:55:13 +0000: Load balancer requested an additional worker
2019-11-05 03:55:13 +0000: Dispatching tests {(  # 👈👈👈👈👈
    DriverUITests/JobFlowTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
2019-11-05 03:55:13 +0000: Received worker from worker provider: 0x7fe6a1582e40 [0: Clone 2 of DaxIOS-XC10-1-iP7-1 (F640C2F1-59A7-4448-B700-7381949B5D00)]
2019-11-05 03:55:39 +0000: Dispatching tests {(  # 👈👈👈👈👈
    DriverUITests/TutorialTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
...

Looking at the log below, we know that once a test class is dispatched or distributed to a worker/simulator, all tests of that class will be executed in that simulator.

2019-11-05 03:55:39 +0000: Dispatching tests {(
    DriverUITests/TutorialTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]

Even if we customise a test suite (by swizzling some XCTestSuite class methods or variables) to split it into multiple suites, it does not work, because the made-up test suite is only initialised after tests are dispatched to a given worker.

Therefore, any hook to bypass this constraint must be done early on.

Passing the -only-testing argument to xcodebuild command

Now we pass tests (instead of test classes) to the -only-testing argument.

$ xcodebuild -workspace Driver/Driver.xcworkspace \
    # ...
    -only-testing:DriverUITests/JobFlowTests/testJobIgnoreByTimer \
    -only-testing:DriverUITests/JobFlowTests/testRecoverFlow \
    -only-testing:DriverUITests/JobFlowTests/testJobIgnoreByDax \
    -only-testing:DriverUITests/JobFlowTests/testHappyFlow \
    -only-testing:DriverUITests/JobFlowTests/testForceClearBooking \
    -only-testing:DriverUITests/TutorialTests/testOnboardingFlow \
    test-without-building

But still, the scheduling log shows that tests are grouped by test class before being dispatched to workers (see the following log for reference). This grouping is automatically done by Xcode (when ideally it should not be).

2019-11-05 04:21:42 +0000: Executing tests {(	# 👈
    DriverUITests/JobFlowTests/testJobIgnoreByTimer,
    DriverUITests/JobFlowTests/testRecoverFlow,
    DriverUITests/JobFlowTests/testJobIgnoreByDax,
    DriverUITests/TutorialTests/testOnboardingFlow,
    DriverUITests/JobFlowTests/testHappyFlow,
    DriverUITests/JobFlowTests/testForceClearBooking
)}; skipping tests {(
)}
2019-11-05 04:21:42 +0000: Load balancer requested an additional worker
2019-11-05 04:21:42 +0000: Dispatching tests {(  # 👈 ❌
    DriverUITests/JobFlowTests/testJobIgnoreByTimer,
    DriverUITests/JobFlowTests/testForceClearBooking,
    DriverUITests/JobFlowTests/testJobIgnoreByDax,
    DriverUITests/JobFlowTests/testHappyFlow,
    DriverUITests/JobFlowTests/testRecoverFlow
)} to worker: 0x7fd781261940 [6300: Clone 1 of DaxIOS-XC10-1-iP7-1 (93F0FCB6-C83F-4419-9A75-C11765F4B1CA)]
......

Overcoming grouping logic in Xcode parallel testing

Tweaking the -only-testing argument values

Based on our observation, we can imagine how Xcode runs tests in parallel. See below.

Step 1.   tests = detect_tests_to_run() # parse -only-testing arguments
Step 2.   groups_of_tests = group_tests_by_test_class(tests)
Step 3.   while groups_of_tests is not empty:
Step 3.1. 	worker = find_free_worker()
Step 3.2.     if worker is not None:
                  dispatch_tests_to_workers(groups_of_tests.pop())

In the pseudo-code above, we do not have much control over step 2 since that grouping logic is implemented by Xcode. But we have a good guess that Xcode groups tests by the first two components (the class name) only, for example, DriverUITests/JobFlowTests. In other words, tests having the same class name run together on one simulator.

The trick to break this constraint is simple. We can tweak the input (test names) so that each group contains only one test. By inserting a random token in the class name, all class names in the tests that are passed via -only-testing argument are different.

For example, instead of passing:

-only-testing:DriverUITests/JobFlowTests/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests/testRecoverFlow \

We rather use:

-only-testing:DriverUITests/JobFlowTests_AxY132z8/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests_By8MTk7l/testRecoverFlow \

Or we can use the test name itself as the token:

-only-testing:DriverUITests/JobFlowTests_testJobIgnoreByTimer/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow \

After that, looking at the scheduling log, we will see that the trick can bypass the grouping logic. Now, only one test is dispatched to a worker once ready.

2019-11-05 06:06:56 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/JobFlowTests_testJobIgnoreByDax/testJobIgnoreByDax
)} to worker: 0x7fef7952d0e0 [13857: Clone 2 of DaxIOS-XC10-1-iP7-1 (9BA030CD-C90F-4B7A-B9A7-D12F368A5A64)]
2019-11-05 06:06:58 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/TutorialTests_testOnboardingFlow/testOnboardingFlow
)} to worker: 0x7fef7e85fd70 [13719: Clone 1 of DaxIOS-XC10-1-iP7-1 (584F99FE-49C2-4536-B6AC-90B8A10F361B)]
2019-11-05 06:07:07 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow
)} to worker: 0x7fef7952d0e0 [13857: Clone 2 of DaxIOS-XC10-1-iP7-1 (9BA030CD-C90F-4B7A-B9A7-D12F368A5A64)]

Handling tweaked test names

When a worker/simulator receives a request to run a test, the app (could be the runner app or the hosting app) initializes an XCTestSuite corresponding to the test name. In order for the test suite to be properly made up, we need to remove the inserted token.

This could be done easily by swizzling the XCTestSuite.init(forTestCaseWithName:). Inside that swizzled function, we remove the token and then call the original init function.

extension XCTestSuite {
  /// For 'Selected tests' suite
  @objc dynamic class func swizzled_init(forTestCaseWithName maskedName: String) -> XCTestSuite {
    /// Recover the original test name
    /// - masked: UITestCaseA_testA1/testA1             --> recovered: UITestCaseA/testA1
    /// - masked: Driver/UITestCaseA_testA1/testA1      --> recovered: Driver/UITestCaseA/testA1
    guard let testBaseName = maskedName.split(separator: "/").last else {
      return swizzled_init(forTestCaseWithName: maskedName)
    }
    let recoveredName = maskedName.replacingOccurrences(of: "_\(testBaseName)/", with: "/") // 👈 remove the token
    return swizzled_init(forTestCaseWithName: recoveredName) // 👈 call the original init
  }
}
Swizzle function to run tests properly

Test class discovery

In order to adopt this tip, we need to know which test classes we will run in advance. Although Apple does not provide an API to obtain this list before running tests, it can be done in several ways. One approach is to generate the list of test classes using Sourcery. Another alternative is to parse the binaries inside the .xctest bundles (in build products) to look for symbols related to tests.

Conclusion

In this article, we identified some factors causing test execution time imbalance in Xcode parallel testing (particularly for UI tests).

We also looked into how Xcode distributes tests in parallel testing, and mitigated the constraint in which tests within the same class run on the same simulator. The trick not only reduces the imbalance but also gives us more confidence in adding more tests to a class without worrying about whether it affects our CI infrastructure.

Below is the test time imbalance metric recorded when running UI tests. After adopting the trick, we saw a decrease in the metric (a good sign). As of now, the metric stabilises at around 0.4 mins.

Tracking data of UI test time imbalance (in minutes) in our project, collected by multiple runs

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Returning 575 Terabytes of storage space back to our users

Post Syndicated from Grab Tech original https://engineering.grab.com/returning-storage-space-back-to-our-users

Have you ever run out of storage on your phone? Mobile phones come with limited storage and with the multiplication of apps and large video files, many of you are running out of space.

In this article, we explain how we measure and reduce the storage footprint of the Grab App on a user’s device to help you overcome this issue.

The wakeup call

Android vitals (information provided by the Google Play Console about our app performance) gives us two main pieces of information about storage footprint.

15.7% of users have less than 1GB of free storage and they tend to uninstall more than other users (1.2x).

The proportion of 30 day active devices which reported less than 1GB free storage. Calculated as a 30 days rolling average.

Active devices with <1GB free space

This is the ratio of uninstalls on active devices with less than 1GB free storage to uninstalls on all active devices. Calculated as a 30 days rolling average.

Ratio of uninstalls on active devices with less than 1GB

Instrumentation to know where we stand

First things first, we needed to know how much space the Grab app occupies on a user’s device. So we started with our personal devices; this information can be found by opening the phone settings and selecting the Grab app.

App Settings

For this device (screenshot), the application itself (Installed binary) was 186 MB and the total footprint was 322 MB. Since this information varies a lot based on the usage of the app, we needed this information directly from our users in production.

Disclaimer: We are only measuring files that are inside the internal Grab app folder (Cache/Database). We do NOT measure any file that is not inside the private Grab folder.

We decided to leverage our existing implementation of the StorageManager API to gather the following information during each session launch:

  • Application Size (Installed binary size)
  • Cache folder size
  • Total footprint
Sample code to retrieve storage information on Android

Data analysis

We began analysing this data one month after our users updated their app and found that the cache size was anomalously huge (> 1GB) for a lot of users. Intrigued, we dug deeper.

We added code to log the top largest files inside the cache folder, and we found that most of the files were inside a sub cache folder that was no longer in use. This was due to a usage of a 3rd party library that was removed from our app. We added a specific metric to track the size of this folder.

In the end, a lot of users still had this old cache data, and for some users the amount of data was up to 1GB.

Root cause analysis

The Grab app relies a lot on 3rd-party libraries. For example, Picasso was a library we used in the past for image display, which has since been replaced by Glide. Picasso uses a cache to store images and avoid making network calls again and again. After removing Picasso from the app, we didn’t delete this cache folder on the user device. We knew there would likely be more third-party libraries that had been discontinued, so we expanded our analysis to look at how other 3rd-party libraries cached their data.

Freeing up space on user’s phone

Here comes the fun part. We implemented a cleanup mechanism to remove old cache folders. When users update the Grab app, any old cache folders left behind are automatically removed. Through this, we freed up to 1GB of data in a second for some users. In total, we removed 575 terabytes of old cache data across more than 13 million devices (approximately 40MB per user on average).

Data summary

The following graph shows the total size of junk data (in Terabytes) that we can potentially remove each day, calculated by summing up the maximum size of cache when a user opens the Grab app each day.

The first half of the graph reflects the amount of junk data in relation to the latest app version before auto-clean up was activated. The second half of the graph shows a dramatic dip in junk data after auto-clean up was activated. We were deleting up to 33 Terabytes of data per day on the user’s device when we first started!

Sum of all junk data on user’s device reported per day in Terabytes

Next step

This is the first phase of our journey in reducing the storage footprint of our app on Android devices. We specifically focused on making improvements at scale i.e. deliver huge storage gains to the most number of users in the shortest time. In the next phase, we will look at more targeted improvements for specific groups of users that still have a high storage footprint. In addition, we are also reviewing iOS data to see if a round of clean up is necessary.

Concurrently, we are also reducing the maximum size of cache created by some libraries. For example, Glide by default creates a cache of 250MB but this can be configured and optimised.

We hope you found this piece insightful and please remember to update your app regularly to benefit from the improvements we’re making every day. If you find that your app is still taking a lot of space on your phone, be assured that we’re looking into it.


Grab-Posisi – SouthEast Asia’s first comprehensive GPS trajectory dataset

Post Syndicated from Grab Tech original https://engineering.grab.com/grab-posisi

Introduction        

At Grab, thousands of bookings happen daily via the Grab app. The driver phones and GPS devices enable us to collect large-scale GPS trajectories.

Apart from the time and location of the object, GPS trajectories are also characterised by other parameters such as speed, heading, and the area, distance, and time covered during travel. Thus, trajectory patterns mined from users’ GPS data are a valuable source of information for a wide range of urban applications, such as solving transportation problems, predicting traffic, and informing urban planning.

Currently, it’s a herculean task to create and maintain GPS datasets since doing so is costly and laborious. As a result, most of the GPS datasets available in the market today have poor coverage or contain outdated information. They cover only a small area of a city, have low sampling rates, and carry little contextual information about the GPS pings, such as accuracy level, bearing, and speed. Despite over a dozen mapping communities engaged in collecting GPS trajectory datasets, a significant amount of effort is still required for data cleaning and pre-processing in order to utilise them.

To overcome the shortfalls in the existing datasets, we built Grab-Posisi, the first GPS trajectory dataset of Southeast Asia. The term Posisi means “position” in Bahasa Indonesia. The data was collected from Grab drivers’ phones while in transit. By adding major arterial roads in regions where existing maps have poor coverage, and incrementally improving coverage in regions where major roads are already mapped, Posisi substantially improves mapping productivity.

What’s inside the dataset

The whole Grab-Posisi dataset contains in total 84K trajectories that consist of more than 80 million GPS pings and cover over 1 million km. The average trajectory length is 11.94 km and the average duration per trip is 21.50 minutes.

The data were collected very recently in April 2019 with a 1 second sampling rate, which is the highest amongst all the publicly available datasets. It also has richer contextual information, including the accuracy level, bearing and speed. The accuracy level is important because GPS measurements are noisy and the true location can be anywhere inside a circle centred at the reported location with a radius equal to the accuracy level. The bearing is the horizontal direction of travel, measured in degrees relative to true north. Finally, the speed is reported in meters/second over ground.

As the GPS trajectories were collected from Grab drivers’ phones while in transit, we labelled each trajectory with the phone device type, either Android or iOS. This is the first dataset to differentiate such device information. Furthermore, we also labelled the trajectories by driving mode (Car or Motorcycle).

All drivers’ personal information is encrypted and the real start/end locations are removed within the dataset.

Data format

Each trajectory is serialised in a file in Apache Parquet format. The whole dataset size is around 2 GB. Each GPS ping is associated with values for a trajectory ID, latitude, longitude, timestamp (UTC), accuracy level, bearing and speed. The GPS sampling rate is 1 second, which is the highest among all the existing open source datasets. Table 1 shows a sample of the dataset.

Table 1: Sample dataset
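
To make the schema concrete, here is an illustrative Go struct for one GPS ping; the field names are assumptions made for illustration and do not necessarily match the column names in the Parquet files:

package posisi

// GPSPing is an illustrative record for one ping in a Grab-Posisi trajectory.
// Field names are assumptions; the dataset itself is serialised as Apache Parquet.
type GPSPing struct {
	TrajectoryID string  // identifier shared by all pings of one trajectory
	Latitude     float64 // degrees
	Longitude    float64 // degrees
	Timestamp    int64   // Unix time in UTC; sampling rate is 1 second
	Accuracy     float64 // metres; the true location lies within this radius
	Bearing      float64 // degrees, relative to true north
	Speed        float64 // metres/second over ground
}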

Coverage

Figure 1a shows the spatial coverage of the dataset in Singapore. Compared with the GPS datasets available in the market that only cover a specific area of a city, the Grab-Posisi dataset encompasses almost the whole island of Singapore. Figure 1b depicts the GPS density in Singapore. Red represents high density while green represents low density. Expressways in Singapore are clearly visible because of their dense GPS pings.

Figure 1a. Spatial coverage (Singapore)
Figure 1b. GPS density (highways have more GPS)

Figure 2a illustrates that the Grab-Posisi dataset encloses not only central Jakarta but also extends to external highways. Figure 2b depicts the GPS density of cars in Jakarta. Compared with Singapore, trips in Jakarta are spread out in all different areas, not just concentrated on highways.

Figure 2a. Spatial coverage (Jakarta)
Figure 2b. GPS density (Car)

Applications of Grab-Posisi

The following are some of the applications of Grab-Posisi dataset.

On Map Inference

The traditional method used in updating road networks in maps is time-consuming and labour-intensive. That’s why maps might have important roads missing and real-time traffic conditions might be unavailable. To address this problem, we can use GPS trajectories in reconstructing road networks automatically.

A number of map generation algorithms can be applied to infer both map topology and road attributes. Figure 3b shows a snippet of the map inferred from our GPS trajectories (Figure 3a) using one of these algorithms. As you can see from the blue dots, the skeleton of the underlying map is inferred correctly, although some sections of the inferred road are disconnected, and the roundabout in the bottom right corner is not a smooth curve.

Figure 3a. Raw GPS trajectories
Figure 3b. Inferred Map

On Map Matching                                         

Map matching refers to the task of automatically determining the correct route a driver travelled on a digital map, given a sequence of raw and noisy GPS points. Correcting the raw GPS data is important for many location-based applications such as navigation, tracking, and road attribute detection, as mentioned earlier. The accuracy levels provided in the Grab-Posisi dataset can be of great use in addressing this problem.

On Traffic Detection and Forecast                         

In addition to the inference of a static digital map, the Grab-Posisi GPS dataset can also be used to perform real-time traffic forecasting, which is very important for congestion detection, flow control, route planning, and navigation. Some examples of the fundamental indicators that are mostly used to monitor the current status of traffic conditions include the average speed, volume, and density in each road segment. These variables can be computed based on drivers’ GPS trajectories and can be used to predict the future traffic conditions.

On Mode Detection                         

Transportation mode detection refers to the task of identifying the travel mode of a user (some examples of transportation mode include walk, bike, car, bus, etc.). The GPS trajectories in our dataset are associated with rich attributes including GPS accuracy, bearing, and speed in addition to the latitude and longitude of geo-coordinates, which can be used to develop mode detection models. Our dataset also provides labels for each trajectory to be collected from a car or motorcycle, which can be used to verify performance of those models.

Economics Perspective                                         

The real-world GPS trajectories of people reveal realistic travel patterns and demands, which can be of great help for city planning. As there are some realistic constraints faced by governments such as budget limitations and construction inconvenience, it is important to incorporate both the planning authorities’ requirements and the realistic travel demands mined from trajectories for intelligent city planning. For example, the trajectories of cars can provide suggestions on how to schedule highway constructions. The trajectories of motorcycles can help the government to choose the optimal locations to construct motorcycle lanes for safety concerns.

Want to access our dataset?

The Grab-Posisi dataset offers great value and is a significant resource to the community for benchmarking and revisiting existing technologies.

If you want to access our dataset for research purposes, email [email protected] with the following details:

  • Your Name and contact details
  • Your institution
  • Your potential usage of the dataset

When using Grab-Posisi dataset, please cite the following paper:

Huang, X., Yin, Y., Lim, S., Wang, G., Hu, B., Varadarajan, J., … & Zimmermann, R. (2019, November). Grab-Posisi: An Extensive Real-Life GPS Trajectory Dataset in Southeast Asia. In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Prediction of Human Mobility (pp. 1-10). DOI: https://doi.org/10.1145/3356995.3364536

Click here to download the published paper.

Click here to download the BibTex file.

Note: You cannot use Grab-Posisi dataset for commercial purposes.


Preventing App Performance Degradation Due to Sudden Ride Demand Spikes

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-app-performance-degradation-due-to-sudden-ride-demand-spikes

In Southeast Asia, when it rains, it pours. It’s a major mood dampener, especially if you are stuck outside when the rain starts; you are about to have an awful day.

In the early days of Grab, if the rains came at the wrong time, like during morning rush hour, then we engineers were also in for a terrible day.

In those days, demand for Grab’s ride services grew much faster than our ability to scale our tech system and this often meant clocking late nights just to ensure our system could handle the ever-growing demand. When there’s a massive, sudden spike in ride bookings, our system often struggled to manage the load.

There were also other contributors to demand spikes, for example when public transport services broke down or when a major event such as an international concert ends and event-goers all need a ride at the same time.

Upon reflection, we realized there were two integral aspects to these incidents.

Firstly, they were localized events. The increase in demand came from a particular geographical location; in some cases a very small area. These localized events had the potential to cause so much load on our system that it impacted the experience of other users outside the geolocation.

Secondly, the underlying problem was a lack of drivers (supply) in that particular geographical area.

At Grab, our goal has always been to get everyone a ride when and where they needed it, but in this situation, it was just not possible. We needed to find a way to ensure this localized demand spike did not affect our ability to meet the needs of other users.

Enter the Spampede Filter

The Spampede filter (a play on the words spam and stampede) was inspired by another concept you may have read about on this blog: circuit breakers.

In software, as in electronics, circuit breakers are designed to protect a system by short-circuiting in the face of adverse conditions.

Let’s break this down.

There are two key concepts here: short-circuiting and adverse conditions.

Firstly, short-circuiting: in this context, it means performing minimal processing on a particular booking and, by doing so, reducing the overall load on the system. Secondly, adverse conditions: here, we refer to a large number of unfulfilled requests for a particular service, from a small geographical area, within a short time window. With these two concepts in mind, we devised the following process.

Spampede Design

First, we needed to track unallocated requests in a location-aware manner. To do this, we convert the requested pickup location of an unallocated request using the Geohash Integer algorithm.  

After the conversion, the resulting value is an exact location. We can convert this location into a “bucket” or area by reducing the precision.

This method is by no means smart or aware of the local geography, but it is incredibly CPU efficient and requires no external resources like network API calls.

Now that we can track unallocated requests, we needed a way for the tracking to be time-aware. After all, traffic conditions, driver locations, and passenger demand are continually changing. We could have implemented something precise like a sliding window sum, but that would have introduced a lot of complexity and a significantly higher CPU and memory cost.

By using the Unix timestamp, we converted the current time to a “bucket” of time by using the straightforward formula:

timeBucket = floor(unixTimestamp / bs)

where bs is the size of the time buckets in seconds.
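
To make the bucketing concrete, here is a minimal Go sketch of both reductions. It assumes the pickup location has already been encoded as a 64-bit geohash integer by a geohash library; the encoding itself is out of scope here:

package spampede

// locationBucket reduces a full-precision 64-bit geohash integer to a coarser
// "bucket" by keeping only its top `bits` bits. This is a pure bit operation,
// so it is extremely CPU-efficient and needs no network calls.
func locationBucket(geohashInt uint64, bits uint) uint64 {
	return geohashInt >> (64 - bits)
}

// timeBucket converts a Unix timestamp (in seconds) into a time bucket,
// where bs is the size of the time buckets in seconds.
func timeBucket(unixSeconds int64, bs int64) int64 {
	return unixSeconds / bs
}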

With location and time buckets calculated, we can track the unallocated bookings using Redis. We could have used any data store, but Redis was familiar and battle-proven to us.

To do this, we first constructed the Redis key by combining the service type, the geographic location, and the time bucket. With this key, we call the INCR command, which increments the value stored in that location and returns the new value.

If the value returned is 1, this indicates that this is the first value stored for this bucket combination, and we would then make a second call, this time to EXPIRE. With this second call, we would set a time to live (TTL) on the Redis item, allowing the data to be self-cleaning.

You will notice that we are blindly calling increment and only making a second call if needed. This pattern is more efficient and resource-friendly than using a more traditional, load-check-store pattern.
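
A sketch of this tracking step in Go might look like the following. The RedisClient interface is a stand-in for whichever Redis client the service actually uses, and the key layout is an assumption made for illustration:

package spampede

import (
	"fmt"
	"time"
)

// RedisClient is the minimal subset of Redis commands this sketch needs.
// Any real Redis client can be adapted to satisfy it.
type RedisClient interface {
	Incr(key string) (int64, error)
	Expire(key string, ttl time.Duration) error
}

// spampedeKey combines the service type, location bucket, and time bucket
// into a single Redis key (illustrative layout).
func spampedeKey(serviceType string, locBucket uint64, timeBucket int64) string {
	return fmt.Sprintf("spampede:%s:%d:%d", serviceType, locBucket, timeBucket)
}

// trackUnallocated records one unallocated booking for a bucket. We blindly
// call INCR first, and only set a TTL when the returned value is 1 (i.e. this
// call created the key), so the data cleans itself up.
func trackUnallocated(r RedisClient, key string, ttl time.Duration) error {
	count, err := r.Incr(key)
	if err != nil {
		return err
	}
	if count == 1 {
		return r.Expire(key, ttl)
	}
	return nil
}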

The next step was the configuration: specifically, setting how many unallocated bookings could happen in a particular location and time bucket before the circuit opened. For this, we decided on Redis again. We could have used anything, but we were already using Redis and, as mentioned previously, quite familiar with it.

Finally, the last piece. We introduced code at the beginning of our booking processing, most importantly, before any calls to any other services and before any significant processing was done. This code compared the location, time, and requested service to the currently configured Spampede setting, along with the previously unallocated bookings. If the maximum had already been reached, then we immediately stopped processing the booking.
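
The short-circuit check at the start of booking processing could then be sketched as follows, reusing the same key layout. For simplicity, the configured maximum is passed in as a parameter here, whereas in the description above it is itself stored in Redis:

package spampede

import "strconv"

// Getter is the read-only Redis capability this check needs. Get is assumed
// to return an empty string when the key does not exist.
type Getter interface {
	Get(key string) (string, error)
}

// shouldShortCircuit returns true when the number of unallocated bookings
// already recorded for this bucket has reached the configured maximum, in
// which case the booking is rejected without any further processing.
func shouldShortCircuit(r Getter, bucketKey string, maxUnallocated int64) (bool, error) {
	val, err := r.Get(bucketKey)
	if err != nil || val == "" {
		return false, err // no data means no spampede; fail open
	}
	count, convErr := strconv.ParseInt(val, 10, 64)
	if convErr != nil {
		return false, convErr
	}
	return count >= maxUnallocated, nil
}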

This might sound harsh: immediately refusing a booking request without even trying to fulfill it. But the goal of the Spampede filter is to prevent excessive, localized demand from impacting all of the users of the system.

Conclusion

Reading about this as a programmer, it probably feels strange, intentionally dropping bookings and impacting the business this way.

After all, we want nothing more than to help people get to where they need to be. This process is a system safety mechanism to ensure that the system stays alive and able to do just that.

I would be remiss if I didn’t highlight that the critical software-engineering takeaway here is a combination of the observer effect and the underlying goals of the CAP theorem: observing a system influences the system, due to the cost of instrumentation and monitoring.

Generally, the higher the accuracy or consistency of the monitoring and limits, the higher the resource cost.

In this case, we have intentionally chosen the most resource-efficient options and traded accuracy for more throughput.

Plumbing At Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/plumbing-at-scale

When you open the Grab app and hit book, a series of events are generated that define your personalised experience with us: booking state machines kick into motion, driver partners are notified, reward points are computed, your feed is generated, etc. While it is important for you to know that a request has been received, a lot happens asynchronously in our back-end services.

As custodians and builders of the streaming platform at Grab operating at massive scale (think terabytes of data ingress each hour), the Coban team’s mission is to provide a NoOps, managed platform for seamless, secure access to event streams in real-time, for every team at Grab.

Coban Sewu Waterfall In Indonesia. (Streams, get it?)

Streaming systems are often at the heart of event-driven architectures, and what starts as a need for a simple message bus for asynchronous processing of events quickly evolves into one that requires more sophisticated stream processing paradigms.
Earlier this year, we saw common patterns of event processing emerge across our Go backend ecosystem, including:

  • Filtering and mapping stream events of one type to another
  • Aggregating events into time windows and materializing them back to the event log or to various types of transactional and analytics databases

Generally, a class of problems surfaced which could be elegantly solved through an event sourcing[1] platform with a stream processing framework built over it, similar to the Keystone platform at Netflix[2].

This article details our journey building and deploying an event sourcing platform in Go, building a stream processing framework over it, and then scaling it (reliably and efficiently) to service over 300 billion events a week.

Event Sourcing

Event sourcing is an architectural pattern where changes to an application state are stored as a sequence of events, which can be replayed, recomputed, and queried for state at any time. An implementation of the event sourcing pattern typically has three parts to it:

  • An event log
  • Processor selection logic: The logic that selects which chunk of domain logic to run based on an incoming event
  • Processor domain logic: The domain logic that mutates an application’s state
Event Sourcing
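
As a minimal illustration of the three parts, here is a toy Go sketch (not Grab’s implementation): an in-memory event log, processor selection keyed by event type, and domain logic that mutates application state.

package events

// Event is one immutable change record in the log.
type Event struct {
	Type    string
	Payload string
}

// AppState is the application state that events are folded into.
type AppState struct {
	ActiveBookings int
}

// Processor is a chunk of domain logic that mutates state for one event.
type Processor func(s *AppState, e Event)

// processors is the selection logic: it picks domain logic by event type.
var processors = map[string]Processor{
	"booking.created":   func(s *AppState, e Event) { s.ActiveBookings++ },
	"booking.completed": func(s *AppState, e Event) { s.ActiveBookings-- },
}

// Replay rebuilds state at any point by running the log through the processors.
func Replay(log []Event) AppState {
	var state AppState
	for _, e := range log {
		if p, ok := processors[e.Type]; ok {
			p(&state, e)
		}
	}
	return state
}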

Event sourcing is a building block on which architectural patterns such as Command Query Responsibility Segregation[3], serverless systems, and stream processing pipelines are built.

The Case For Stream Processing

Below are some use cases serviced by stream processing, built on event sourcing.

Asynchronous State Management

A pub-sub system allows change events from one service to be fanned out to multiple interested subscribers without letting any one subscriber block the progress of others. Abstracting the event log and centralising it democratises access to this log for all back-end services. It enables the back-end services to apply changes from this centralised log to their own state, independent of downstream services, and/or publish their state changes to it.

Time Windowed Aggregations

Time-windowed aggregates are a common requirement for machine learning models (as features) as well as analytics. For example, personalising the Grab app landing page requires counting your interaction with various widget elements in recent history, not any one event in particular. Similarly, an analyst may not be interested in the details of a singular booking in real-time, but in building demand heatmaps segmented by geohashes. For latency-sensitive lookups, especially for the personalisation example, pre-aggregations are preferred instead of post-aggregations.
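
As a concrete illustration of such a pre-aggregation, here is a small Go sketch of a tumbling-window booking counter keyed by geohash; the names are illustrative and not the platform’s actual code:

package aggregation

import "time"

// WindowKey identifies one tumbling window for one geohash cell.
type WindowKey struct {
	Geohash     string
	WindowStart int64 // Unix seconds, aligned to the window size
}

// DemandCounter keeps per-geohash booking counts in tumbling windows; the
// counts can then be materialised to a store for low-latency lookups.
type DemandCounter struct {
	windowSize time.Duration
	counts     map[WindowKey]int64
}

func NewDemandCounter(windowSize time.Duration) *DemandCounter {
	return &DemandCounter{windowSize: windowSize, counts: make(map[WindowKey]int64)}
}

// Add records one booking event into the window containing its timestamp.
func (d *DemandCounter) Add(geohash string, eventTime time.Time) {
	size := int64(d.windowSize / time.Second)
	start := eventTime.Unix() / size * size // align to the window boundary
	d.counts[WindowKey{Geohash: geohash, WindowStart: start}]++
}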

Stream Joins, Filtering, Mapping

Event logs are typically sharded by some notion of topics to logically divide events of interest around a theme (booking events, profile updates, etc.). Building bigger topics out of smaller ones, as well as smaller ones from bigger ones, is a common way to compose “substreams” of the log directed towards specific services. For example, a promo service may only be interested in listening to booking events for promotional bookings.

Realtime Business Intelligence

Outputs of stream processing workloads are also plugged into realtime Business Intelligence (BI) and stream analytics solutions upstream, as raw data for visualizations on operations dashboards.

Archival

For offline analytics, as well as reconciliation and disaster recovery, having an archive in a cold store helps for certain mission critical streams.

Platform Requirements

Any processing platform for event sourcing and stream processing has certain expectations around its functionality.

Scaling and Elasticity

Stream/Event Processing pipelines need to be elastic and responsive to changes in traffic patterns, especially considering that user activity (rides, food, deliveries, payments) varies dramatically during the course of a day or week. A spike in food orders on rainy days shouldn’t cause indefinite order processing latencies.

NoOps

For a platform team, it’s important that users can easily onboard and manage their pipeline lifecycles, at their preferred cadence. To scale effectively, the process of scaffolding, configuring, and deploying pipelines needs to be standardised, and infrastructure managed. Both the platform and users are able to leverage common standards of telemetry, configuration, and deployment strategies, and users benefit from a lack of infrastructure management overhead.

Multi-Tenancy

Our platform has quickly scaled to support hundreds of pipelines. Workload isolation, independent processing uptime guarantees, and resource allocation and cost audit are important requirements necessitating multi-tenancy, which help amortize platform overhead costs.

Resiliency

Whether latency sensitive or latency tolerant, all workloads have certain expectations on processing uptime. From a user’s perspective, there must be guarantees on pipeline uptimes and data completeness, upper bounds on processing delays, instrumentation for alerting, and self-healing properties of the platform for remediation.

Tunable Tradeoffs

Some pipelines are latency sensitive, and rely on processing completeness seconds after event ingress. Other pipelines are latency tolerant, and can tolerate disruption to processing lasting in tens of minutes. A one size fits all solution is likely to be either cost inefficient or unreliable. Having a way for users to make these tradeoffs consciously becomes important for ensuring efficient processing guarantees at a reasonable cost. Similarly, in the case of upstream failures or unavailability, being able to tune failure modes (like wait, continue, or retry) comes in handy.

Stream Processing Framework

While basic event sourcing covers simple use cases like archival, more complicated ones benefit from a common framework that shifts the mental model for processing from per event processing to stream pipeline orchestration.
Given that Go is a “paved road” for back-end development at Grab, and we have service code and bindings for streaming data in a mono-repository, we built a Go framework with a subset of capabilities provided by other streaming frameworks like Flink[4].

Logic Blocks In A Stream Processing Pipeline

Capabilities

Some capabilities built into the framework include:

  • Deduplication: Enables pipelines to idempotently reprocess data in case of rewinds/replays, and provides some processing guarantees within a time window for certain use cases including sinking to datastores.
  • Filtering and Mapping: An ability to filter a source stream data and map them onto target streams.
  • Aggregation: An ability to generate and execute aggregation logic such as sum, avg, max, and min in a window.
  • Windowing: An ability to window processing into tumbling, sliding, and session windows.
  • Join: An ability to combine two streams together with certain join keys in a window.
  • Processor Chaining: Various functionalities can be chained to build more complicated pipelines from simpler ones. For example: filter a large stream into a smaller one, aggregate it over a time window, and then map it to a new stream.
  • Rewind: The ability to rewind the processing logic by a few hours through configuration.
  • Replay: The ability to replay archived data into the same or a separate pipeline via configuration.
  • Sinks: A number of connectors to standard Grab stores are provided, with concerns of auth, telemetry, etc. managed in the runtime.
  • Error Handling: Providing an easy way to indicate whether to wait, skip, and/or retry in case of upstream failures is an important tuning parameter that users need for making sensible tradeoffs in dimensions of backpressure, latency, correctness, etc.
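
To give a feel for how these capabilities compose, here is a hypothetical Go sketch of processor chaining over a window of events; the function types are invented for illustration and are not the framework’s actual contracts:

package pipeline

// Event is a simplified stream event used only for this illustration.
type Event struct {
	Type    string
	Geohash string
	Fare    float64
}

// The function types below are assumptions made for illustration.
type FilterFn func(Event) bool
type MapFn func(Event) Event
type AggregateFn func(window []Event) float64

// chain runs a tiny filter -> map -> aggregate pipeline over one window of
// events, standing in for processor chaining over a live stream.
func chain(window []Event, keep FilterFn, transform MapFn, agg AggregateFn) float64 {
	var kept []Event
	for _, e := range window {
		if keep(e) {
			kept = append(kept, transform(e))
		}
	}
	return agg(kept)
}

// totalPromoFares sums the fares of promotional bookings in the window.
func totalPromoFares(window []Event) float64 {
	return chain(window,
		func(e Event) bool { return e.Type == "booking.promo" }, // filter
		func(e Event) Event { return e },                        // map (identity)
		func(w []Event) float64 { // aggregate
			var sum float64
			for _, e := range w {
				sum += e.Fare
			}
			return sum
		},
	)
}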

Architecture

Coban Platform

Our event log is primarily a bunch of critical Kafka clusters, which are being polled by various pipelines deployed by service teams on the platform for incoming events. Each pipeline is an isolated deployment, has an identity, and the ability to connect to various upstream sinks to materialise results into, including the event log itself.
There is also a metastore available as an intermediate store for processing pipelines, so the pipelines themselves are stateless with their lifecycle completely beholden to the whims of their owners.

Anatomy of a Processing Pipeline

Anatomy Of A Stream Processing Pod

Each stream processing pod (the smallest unit of a pipeline’s deployment) has three top level components:

  • Triggers: An interface that connects directly to the source of the data and converts it into an event channel.
  • Runtime: This is the app’s entrypoint and the orchestrator of the pod. It manages the worker pools, triggers, event channels, and lifecycle events.
  • The Pipeline plugin: The plugin is provided by the user, and conforms to a contract that the platform team publishes. It contains the domain logic for the pipeline and houses the pipeline orchestration defined by a user based on our Stream Processing Framework.
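
A hypothetical sketch of these contracts in Go could look like the following; the interface names and method signatures are assumptions for illustration, not the contract actually published by the platform team:

package plugin

import "context"

// Event is the unit of data flowing through a pipeline.
type Event struct {
	Key   string
	Value []byte
}

// Trigger connects directly to a source of data (e.g. a Kafka topic) and
// converts it into an event channel consumed by the runtime.
type Trigger interface {
	Events(ctx context.Context) (<-chan Event, error)
}

// Pipeline is what a user-provided plugin conforms to: the runtime hands
// events to Process and calls Close during lifecycle shutdown.
type Pipeline interface {
	Process(ctx context.Context, e Event) error
	Close() error
}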

Deployment Infrastructure

Our deployment infrastructure heavily leverages Kubernetes on AWS. After a (pretty high) initial cost for infrastructure set up, we’ve found scaling to hundreds of pipelines a breeze with the Kubernetes provided controls. We package our stateless pipeline workloads into Kubernetes deployments, with each pod containing a unit of a stream pipeline, with sidecars that integrate them with our monitoring systems. Other cluster wide tooling deployed (usually as DaemonSets) deal with metric collection, log ingestion, and autoscaling. We currently use the Horizontal Pod Autoscaler[5] to manage traffic elasticity, and the Cluster Autoscaler[6] to manage worker node scaling.

A Typical Kubernetes Set Up On AWS

Metastore

Some pipelines require storage for use cases ranging from deduplication to stores for materialised results of time windowed aggregations. All our pipelines have access to clusters of ScyllaDB instances (which we use as our internal store), made available to pipeline authors via interfaces in the Stream Processing Framework. Results of these aggregations are then made available to backend services via our GrabStats service, which is a thin query layer over the latest pipeline results.

Compute Isolation

A nice property of packaging pipelines as Kubernetes deployments is a good degree of compute workload isolation for pipelines. While node resources of pipeline pods are still shared (and there are potential noisy neighbour issues on matters like logging throughput), the pods of various pipelines can be scheduled and rescheduled across a wide range of nodes safely and swiftly, with minimal impact on the pods of other pipelines.

Redundancy

Stateless processing pods mean we can set up backup or redundant Kubernetes clusters in hot-hot, hot-warm, or hot-cold modes. We use this to ensure high processing availability despite limited control plane guarantees from any single cluster. (Since EKS SLAs for the Kubernetes control plane guarantee only 99.9% uptime today[7].) Transparent to our users, we make the deployment systems aware of multiple available targets for scheduling.

Availability vs Cost

As alluded to in the “Platform Requirements” section, having a way of trading off availability for cost becomes important where the requirements and criticality of each processing pipeline are very different. Given that AWS Spot instances are a lot cheaper[8] than on-demand ones, we use user-annotated Kubernetes priority classes to determine deployment targets for pipelines. We schedule latency-tolerant pipelines on Spot instances, which are routinely 40-90% cheaper than the on-demand instances on which latency-sensitive pipelines run. The caveat is that Spot instances occasionally disappear, and these workloads are disrupted until a replacement node can be found for their scheduling.

What’s Next?

  • Expand the ecosystem of triggers to support custom sources of data i.e. the “event log”, as well as push based (RPC driven) versus just pull based triggers
  • Build a control plane for API integration with pipeline lifecycle management
  • Move some workloads to use the Vertical Pod Autoscaler[9] in Kubernetes instead of horizontal scaling, as most of our workloads have a limit on parallelism (which is their partition count in Kafka topics)
  • Move from Go plugins for pipelines to plugins over RPC, like what HashiCorp does[10], to enable processing logic in non-Go languages.
  • Use either pod gossip or a service mesh with a control plane to set up quotas for shared infrastructure usage per pipeline. This is to protect upstream dependencies and the metastore from surges in event backlogs.
  • Improve availability guarantees for pipeline pods by occasionally redistributing/rescheduling pods across nodes in our Kubernetes cluster to prevent entire workloads being served out of a few large nodes.

Authored By Karan Kamath on behalf of the Coban team at Grab:
Zezhou Yu, Ryan Ooi, Hui Yang, Yuguang Xiao, Ling Zhang, Roy Kim, Matt Hino, Jump Char, Lincoln Lee, Jason Cusick, Shrinand Thakkar, Dean Barlan, Shivam Dixit, Shubham Badkur, Fahad Pervaiz, Andy Nguyen, Ravi Tandon, Ken Fishkin, and Jim Caputo.


Footnotes

Coban Sewu Waterfall Photo by Dwinanda Nurhanif Mujito on Unsplash

Cover Photo by tian kuan on Unsplash

  1. https://martinfowler.com/eaaDev/EventSourcing.html 

  2. https://medium.com/netflix-techblog/keystone-real-time-stream-processing-platform-a3ee651812a 

  3. https://martinfowler.com/bliki/CQRS.html 

  4. https://flink.apache.org 

  5. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ 

  6. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler 

  7. https://aws.amazon.com/eks/sla/ 

  8. https://aws.amazon.com/ec2/pricing/ 

  9. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler 

  10. https://github.com/hashicorp/go-plugin 

Journey to a Faster Everyday Super App Where Every Millisecond Counts

Post Syndicated from Grab Tech original https://engineering.grab.com/journey-to-a-faster-everyday-super-app

Introduction

At Grab, we are moving faster than ever. In 2019 alone, we released dozens of new features in the Grab passenger app. With our goal to delight users in Southeast Asia with a powerful everyday super app, the app’s performance became one of the most critical components in delivering that experience to our users.

This post narrates the journey of our performance improvement efforts on the Grab passenger app. It highlights how we were able to reduce the time spent starting the app by more than 60%, while preventing regressions introduced by new features. We use the p95 scale when referring to these improvements.

Here’s a quick look at the improvements and timeline:

Improvements Timeline

Improving App Performance

While app performance consists of different aspects – such as battery consumption rate, network performance, app responsiveness, etc. – the first thing users notice is the time it takes for an app to start. Apps that take too long to load frustrate users, leading to bad reviews and uninstalls.

We focused our efforts on the app’s time to interactive (TTI), which consists of two main operations:

  • Starting the app
  • Displaying interactive service tiles (these are the icons for the services offered on the app such as Transport, Food, Delivery, and so on)

There are many other operations that occur in the background, which we won’t cover in this article.

We prioritised optimising the app’s ability to load the service tiles (highlighted in the image below) and render them as interactive upon startup (cold start). This allows users to use the app as soon as they launch it.

Service Tiles

Instrumentation and Benchmarking

Before we could start improving the app’s performance, we needed to know where we stood and set measurable goals.

We couldn’t get a baseline from local performance testing as it did not simulate real environment conditions, where network variability and device performance are contributing factors. Thus, we needed to use real production data to get an accurate reflection of our current performance at scale. In production, we measured the performance of ~8-9 million users per day – a small subset of our overall active user base.

As a start, we measured the different components contributing to TTI, such as binary loading, library initialisations, and tiles loading. For example, if we had to measure the time taken by function A, this is how it looked like in the code:

functionA() {
  startTime = now()  // start the timer

  // ... the work functionA normally does ...

  elapsed = now() - startTime  // stop the timer and calculate the time difference
  sendAnalyticsEvent("functionA_duration", elapsed)  // send it as an analytic event
}

With all the numbers from the contributing components, we took the sum to calculate the full TTI (as shown in the following image).

Full TTI

When the numbers started rolling in from production, we needed specific measurements to interpret those numbers, so we started looking at TTI’s 50th, 90th, and 95th percentile. A 90th percentile (p90) of x seconds means that 90% of the users have an interactive screen in at most x seconds.

We chose to focus only on p50 and p95 as these cover the majority of our users who deal with performance issues. Improving performance for users below p50 (who already have high-end devices) would not bring much value, and improving beyond p95 would be very difficult, as improvements there are limited by device performance.
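
For illustration, here is a minimal Go sketch of a nearest-rank percentile over a batch of TTI samples; production analytics pipelines typically use streaming sketches over millions of events instead:

package tti

import (
	"sort"
	"time"
)

// percentile returns the sample below which roughly p percent of values fall,
// using a simple nearest-rank approach on an in-memory batch.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[rank]
}

// Example: p95 := percentile(ttiSamples, 95)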

By the end of January, we got the p50, p90, and p95 numbers for the contributing components that summed up to TTI numbers for tiles, which allowed us to start identifying areas with potential improvements.

Caching and Animation Removal

While reviewing the TTI numbers, we were drawn to contributors with high time consumption, such as tile loading and the app start animation. Another evident improvement was caching data between app launches instead of waiting for a network response to load tiles on every single app launch.

Tile Caching

Based on the gathered data, the service tiles only change when a user travels between cities, because the available services vary in each city. Since users do not change cities frequently, the service tiles do not change very frequently either, so caching the tiles made sense. However, we also needed to sync fresh tiles in case of any change. So, we updated the logic based on these findings, as illustrated in the following image:

Tile Caching Logic
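
The decision flow itself is simple. The following Go-style sketch captures it in a platform-agnostic way; the types and function names are placeholders, and the real implementation lives in the mobile codebases:

package tiles

import "reflect"

// Tile represents one service tile (fields are illustrative).
type Tile struct {
	ID    string
	Title string
}

// TileStore abstracts the local cache and the network source for this sketch.
type TileStore interface {
	ReadCache() ([]Tile, bool)
	Fetch() ([]Tile, error)
	WriteCache([]Tile)
}

// LoadServiceTiles renders cached tiles immediately (if any), then syncs fresh
// tiles in the background and re-renders only when they have changed.
func LoadServiceTiles(store TileStore, render func([]Tile)) {
	cached, ok := store.ReadCache() // local read: no waiting on the network
	if ok {
		render(cached)
	}

	go func() {
		fresh, err := store.Fetch() // network call off the critical path
		if err != nil {
			return // keep showing the cached tiles on failure
		}
		if !ok || !reflect.DeepEqual(cached, fresh) {
			store.WriteCache(fresh)
			render(fresh)
		}
	}()
}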

Caching tiles brought us a huge improvement of ~3s on each platform.

Animation Removal

We came across a beautifully created animation at app start that didn’t provide any additional value in terms of information or practicality.

With detailed discussions and trade-offs with designers, we removed the animation and improved our TTI further by 1s.

In conclusion, with the caching and animation removal alone, we improved the TTI by 4s.

Welcome Static Linking and Coroutines

At this point, our users gained 4 seconds of their time back, but we didn’t want to stop with that number. So, we dug through the data to see what further enhancements we could do. When we could not find anything else that was similar to caching and animation removal, we shifted to architecture fundamentals.

We knew that this was not an easy route to take and that it would come with a cost; if we decided to choose a component related to architecture fundamentals, all the other teams working on the Grab app would be impacted. We had to evaluate our options and make decisions with trade-offs for overall improvements. And this eventually led to static linking on iOS and coroutines on Android.

Binary Loading

Binary loading is one of the first steps in both mobile platforms when an app is launched. It primarily contributes to pre-main and dex-loading, on iOS and Android respectively.

The pre-main time on iOS was about 7.9s. It is well known in the iOS development world that each framework (binary) can be either dynamically or statically linked. While static linking helps achieve a faster app start, it adds complexity when building frameworks that are elaborate or contain resource bundles. Building a lot of libraries statically also impacts build times negatively. After evaluating these trade-offs, we decided to enable more static linking.

Apple recommends a maximum of half a dozen dynamic frameworks for optimal performance. Guess what? Our passenger app had 107 dynamically linked frameworks, many of them internal.

The task looked daunting at first, since it affected all parts of the app, but we were ready to tackle the challenge head on. Deciding to take this on was the easy part, the actual work entailed lots of tricky coordination and collaboration with multiple teams.

We created an RFC (Request For Comments) doc to propose the static linking of frameworks, wherever applicable, and coordinated with teams on agreed timelines to execute this change.

While collaborating with teams, we learned that we could remove 12 frameworks entirely that were no longer required. This exercise highlighted the importance of regular cleanup and deprecation in our codebase, and was added into our standard process.

And so, we were left with 95 frameworks; 75 of which were statically linked successfully, resulting in our p90 pre-main dropping by 41%.

As Grabbers, it’s in our DNA to push ourselves a little more. With the remaining 20 frameworks, our pre-main was still considerably high. Out of the 20 frameworks, 10 could not be statically linked without issues. As a workaround, we merged multiple dynamic frameworks into one. One of our outstanding engineers even created a plug-in for this, which is called the Cocoapod Merge. With this plug-in, we were able to merge 10 dynamically linked frameworks into 2. We’ve made this plug-in open source: https://github.com/grab/cocoapods-pod-merge.

With all of the above steps, we were finally left with 12 dynamic frameworks – a huge 88% reduction.

The following image illustrates the complex numbers mentioned above:

Static Linking

Using cocoapod merge further helped us with ~0.8s of improvement.

Coroutines

While we were executing the static linking initiative on iOS, we also started refactoring the application initialisation for a modular and clean code on Android. This resulted in creating an ApplicationInitialiser class, which handles the entire application initialisation process with maximum parallelism using coroutines.

Now all the libraries are initialised in parallel via coroutines, enabling better utilisation of computing resources and a faster TTI.

This refactoring and background initialisation for libraries on Android helped in gaining ~0.4s of improvements.

Changing the Basics – Visualisation Setup

By the end of H1 2019, we observed a 50% improvement in TTI, and now it was time to set new goals for H2 2019. Until this point, we would query our database for all metric numbers, copy the numbers into a spreadsheet, and compare them against weeks and app versions.

Despite the heavy manual work and other challenges, this method still worked at the beginning, given the improvements we had to focus on.

However, in H2 2019 it became apparent that we had to reassess our methodology of reading numbers. So, we started thinking about other ways to present and visualise these numbers better. With help from our Product Analyst, we took advantage of Metabase’s advanced capabilities and presented our goals and metrics in a clear and easy-to-understand format.

For example, here is a graph that shows the top contributing metrics for Android:

Android Metrics

Looking at it, we could clearly tell which metric needed to be prioritised for improvements.

We did this not only for our metrics, but also for our main goals, which allowed us to easily see our progress and track our improvements on a daily basis.

Visualisation

The color bars in the above image depict the status of our numbers against our goals and also show the actual numbers at p50, p90, and p95.

As our tracking progressed, we started including more granular and precise measurements, to help guide the team and achieve more impactful improvements of around ~0.3-0.4s.

Fortunately, we were deprecating a third-party library for analytics and experimentation, which happened to be one of the highest contributing metrics for both platforms due to a high number of operations on the main thread. We started using our own in-house experimentation platform where we had better control over performance. We removed this third-party dependency, and it helped us with huge improvements of ~2.5s on Android and ~0.5-0.7s on iOS.

You might be wondering why there is such a big difference between the iOS and Android improvement numbers for this dependency. This was due to user-attribute-setting operations that ran only in the Android codebase; they were performed on the main thread and took a huge amount of time. Moments like these made us realise that we should focus more on consistency between both platforms, identify which third-party library APIs are used, and assess whether they are absolutely necessary.

*Tip*: So, it is time for you as well to eliminate such inconsistencies, if there are any.

Ok, there goes our third quarter with ~3s of improvement on Android and ~1.3s on iOS.

Performance Regression Detection

Entering into Q4 brought us many challenges as we were running out of improvements to make. Even finding an improvement worth ~0.05s was really difficult! We were also strongly challenged by regressions (increase in TTI numbers) because of continuous feature releases and code additions to the app start process.

So, maintaining the TTI numbers became our primary task for this period. We started looking into setting up processes to block regressions from being merged to the master, or at least get notified before they hit production.

To begin with, we identified the main sources of regressions: static linking breakage on iOS and library initialisation in the app startup process on Android.

We took the following measures to cover these cases:

Linters

We built linters on the Continuous Integration (CI) pipeline to detect potential changes in static linking on iOS and the ApplicationInitialiser class on Android. The linters block the changelist and enforce a special review process for such changes.

Library Integration Process

The team also focused on setting up a process for library integrations, where each library (internal or third party) will first be evaluated for performance impact before it is integrated into the codebase.

While regression guarding was in progress, we were simultaneously trying to bring in more improvements for TTI. We enabled the Link Time Optimisation (LTO) flag on iOS to improve the overall app performance. We also experimented with order files on iOS and Anko layouts on Android, but these were ruled out due to known issues.

On Android, we hit rock bottom, as there were only minimal improvements left to make. Fortunately, it was a different story for iOS. We managed to get improvements worth ~0.6s by opting for lazy loading, optimising I/O operations, and deferring more operations to after app start (where applicable).

Next Steps

We will be looking at the different aspects of performance such as network, battery, and storage, while maintaining our current numbers for TTI.

  • Network performance – Track the turnaround time for network requests then move on to optimisations.
  • Battery performance – Focus on profiling the app for CPU and energy intensive operations, which drains the battery, and then move to optimisations.
  • Storage performance – Review our caching and storage mechanisms, and then look for ways to optimise them.

In addition to these, we are also focusing on bringing performance initiatives for all the teams at Grab. We believe that performance is a collaborative approach, and we would like to improve the app performance in all aspects.

We defined different metrics to track performance e.g. Time to Interactive, Time to feedback (the time taken to get the feedback for a user action), UI smoothness indicators, storage, and network metrics.

We are enabling all teams to benchmark their performance numbers based on defined metrics and move on to a path of improvement.

Conclusion

Overall, we improved by 60%, and this calls for a big celebration! Woohoo! The bigger celebration came from knowing that we’ve improved our customers’ experience in using our app.

This graph represents our performance improvement journey for the entire 2019, in terms of TTI.

Performance Graph

Based on the graph, looking at our p95 improvements and converting them to number of hours saved per day gives us ~21,388 hours on iOS and ~38,194 hours saved per day on Android.

Hey, did you know that it takes approximately 80-85 hours to watch all the episodes of Friends? Just saying. 🙂

We will continue to serve our customers for a better and faster experience in the upcoming years.

Marionette – Enabling E2E user-scenario simulation

Post Syndicated from Grab Tech original https://engineering.grab.com/marionette-enabling-e2e-user-scenario-simulation

Introduction

A plethora of interconnected microservices is what powers the Grab app. The microservices work behind the scenes to delight millions of our customers in Southeast Asia. It is a no-brainer that we emphasize strong testing tools, so our app performs flawlessly to continuously meet our customers’ needs.

Background

We have a microservices-based architecture, in which microservices are interconnected to numerous other microservices. Each passing day sees teams within Grab updating their microservices, which in turn enhances the overall app. If any of the microservices fail after changes are rolled out, it may lead to the whole app getting into an unstable state or worse. This is a major risk and that’s why we stress on conducting “end-to-end (E2E) testing” as an integral part of our software test life-cycle.

E2E tests are done for all crucial workflows in the app, but not for every detail. For that we have conventional tests such as unit tests, component tests, functional tests, etc. Consider E2E testing as the final approval in the quality assurance of the app.

Writing E2E tests in the microservices world is not a trivial task. We are not testing just a single monolithic application. To test a workflow on the app from a user’s perspective, we need to traverse multiple microservices, which communicate through different protocols such as HTTP/HTTPS and TCP. E2E testing gets even more challenging with the continuous addition of microservices. Over the years, we have grown tremendously with hundreds of microservices working in the background to power our app.

Some major challenges in writing E2E tests for the microservices-based apps are:

  • Availability

    Getting all microservices together for E2E testing is tough. Each development team works independently and is responsible only for its microservices. Teams use different programming languages, data stores, etc for each microservice. It’s hard to construct all pieces in a common test environment as a complete app for E2E testing each time.

  • Data or resource set up

    E2E testing requires comprehensive data set up. Otherwise, testing results are affected because of data constraints, and not due to any recent changes to underlying microservices. For example, we need to create real-life equivalent driver accounts, passenger accounts, etc and to have those, there are a few dependencies on other internal systems which manage user accounts. Further, data and booking generation should be robust enough to replicate real-world scenarios as far as possible.

  • Access and authentication

    Usually, the test cases require sequential execution in E2E testing. In a microservices architecture, it is difficult to test a workflow which requires access and permissions to several resources or data that should remain available throughout the test execution.

  • Resource and time intensive

    It is expensive and time consuming to run E2E tests; significant time is involved in deploying new changes, configuring all the necessary test data, etc.

Though there are several challenges, we had to find a way to overcome them and test workflows from the beginning to the end in our app.

Our approach to overcome challenges

We knew what our challenges were and what we wanted to achieve from E2E testing, so we started thinking about how to develop a platform for E2E tests. To begin with, we determined that the scope of E2E testing that we’re going to primarily focus on is Grab’s transport domain — the microservices powering the driver and passenger apps.

One approach is to “simulate” user scenarios through a single platform before any new versions of these microservices are released. Ideally, the platform should also have the capabilities to set up the data required for these simulations. For example, ride booking requires data set up such as driver accounts, passenger accounts, location coordinates, geofencing, etc.

We wanted to create a single platform that multiple teams could use to set up their test data and run E2E user-scenario simulations easily. We put ourselves to work on that idea, which resulted in the creation of an internal platform called “Marionette”. It simulates actions performed by Grab’s passenger and driver apps as they are expected to behave in the real world. The objective is to ensure that all standard user workflows are tested before deploying new app versions.

Introducing Marionette

Marionette enables Grabbers (developers and QAs) to run E2E user-scenario simulations without depending on the actual passenger and driver apps. Grabbers can set up and configure data such as drivers, passengers, and taxi types to mimic real-world behavior.

Let’s look at the overall architecture to understand Marionette better:

Overall Architecture

Grabbers can interact with Marionette through three channels: UI, SDK, and through RESTful API endpoints in their test scripts. All requests are routed through a load balancer to the Marionette platform. The Marionette platform in turn talks to the required microservices to create test data and to run the simulations.

The benefits

With Marionette, Grabbers now have the ability to:

  • Simulate the whole booking flow including customer and driver behavior as well as transition through the booking life cycle including pick-up, drop-off, cancellation, etc. For example, developers can make passenger booking from the UI and configure pick-up points, drop-off points, taxi types, and other parameters easily. They can define passenger behaviour such as “make bookings after a specified time interval”, “cancel each booking”, etc. They can also set driver locations, define driver behaviour such as “always accept booking manually”, “decline received bookings”, etc.
  • Simulate bookings in all cities where Grab operates. Further, developers can run simulations for multiple Grab taxi types such as JustGrab, GrabShare, etc.
  • Visualize passengers, drivers, and ride transitions on the UI, which lets them easily test their workflows.
  • Save effort and time spent on installing third-party Android or iOS emulators, troubleshooting or debugging .apk installation files, etc. before testing workflows.
  • Conduct E2E testing without real mobile devices and installed apps.
  • Run automatic simulations, in which a particular set of scenarios are run continuously, thus helping developers with exploratory testing.

How we isolated simulations among users

It is important to have independent simulations for each user. Otherwise, simulations don’t yield correct results. This was one of the challenges we faced when we first started running simulations on Marionette.

To resolve this issue, we came up with the idea of “cohorts”. A cohort is a logical group of passengers and drivers who are located in a particular city. Each simulation on Marionette is run using a “cohort” containing the number of drivers and passengers required for that simulation. When a passenger/driver needs to interact with other passengers/drivers (such as for ride bookings), Marionette ensures that the interaction is constrained to resources within the cohort. This ensures that drivers and passengers are not shared in different test cases/simulations, resulting in more consistent test runs.

How to interact with Marionette

Let’s take a look at how to interact with Marionette starting with its user interface first.

User Interface

The Marionette UI is designed to provide the same level of granularity as available on the real passenger and driver apps.

Generally, the UI is used in the following scenarios:

  • To test common user scenarios/workflows after deploying a change on staging.
  • To test the end-to-end booking flow right from the point where a passenger makes a booking till drop-off at the destination.
  • To simulate functionality of other teams within Grab – the passenger app developers can simulate the driver app for their testing and vice versa. Usually, teams work independently and the ability to simulate the dependent app for testing allows developers to work independently.
  • To perform E2E testing (such as by QA teams) without writing any test scripts.

The Marionette UI also allows Grabbers to create and set up data. All that needs to be done is to specify the necessary resources such as number of drivers, number of passengers, city to run the simulation, etc. Running E2E simulations involves just the click of a button after data set up. Reports generated at the end of running simulations provide a graphical visualization of the results. Visual reports save developers’ time, which otherwise is spent on browsing through logs to ascertain errors.

SDK

Marionette also provides an SDK, written in the Go programming language.

It lets developers:

  • Create resources such as passengers, drivers, and cohorts for simulating booking flows.
  • Create booking simulations in both staging and production.
  • Set bookings to specific states as needed for simulation through customizable driver and passenger behaviour.
  • Make HTTP requests and receive responses that matter in tests.
  • Run load tests by scaling up booking requests to match the required workload (QPS).

Let’s look at a high-level booking test case example to understand the simulation workflow.

Assume we want to run an E2E booking test with this driver behaviour type — “accepts passenger bookings and transitions between booking states according to defined behaviour parameters”. This is just one of the driver behaviour types in Marionette; other behaviour types are also supported. Similarly, passengers also have behaviour types.

To write the E2E test for this example case, we first define the driver behavior in a function like this:

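A rough Go sketch of such a behaviour definition (the type and field names below are illustrative assumptions, not the actual Marionette SDK API) might look like this:

package example

import "time"

// DriverBehaviour is a hypothetical stand-in for the SDK's driver behaviour
// configuration; the real Marionette SDK types and field names may differ.
type DriverBehaviour struct {
  AcceptBooking   bool          // accept incoming bookings instead of declining them
  TransitionDelay time.Duration // wait this long before moving to the next booking state
}

// newTestDriverBehaviour returns the behaviour used in this example: accept every
// booking and transition through the booking states automatically.
func newTestDriverBehaviour() DriverBehaviour {
  return DriverBehaviour{
    AcceptBooking:   true,
    TransitionDelay: 5 * time.Second,
  }
}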

Then, we handle the booking request for the driver like this:

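Building on the hypothetical DriverBehaviour sketch above, a minimal Go sketch of handling the booking on the driver's side (the Booking type and state names are also illustrative, not the actual SDK) could be:

package example

import (
  "context"
  "fmt"
  "time"
)

// Booking is a hypothetical stand-in for the SDK's booking object.
type Booking struct {
  Code  string
  State string
}

// handleBookingForDriver simulates a driver accepting a booking and walking it
// through the booking life cycle according to the configured behaviour.
func handleBookingForDriver(ctx context.Context, b *Booking, behaviour DriverBehaviour) error {
  if !behaviour.AcceptBooking {
    b.State = "DECLINED"
    return fmt.Errorf("driver declined booking %s", b.Code)
  }
  for _, state := range []string{"ACCEPTED", "PICKING_UP", "IN_TRANSIT", "COMPLETED"} {
    select {
    case <-ctx.Done():
      return ctx.Err()
    case <-time.After(behaviour.TransitionDelay):
      b.State = state
    }
  }
  return nil
}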

The SDK client makes the handling of passengers, drivers, and bookings very easy as developers don’t need to worry about hitting multiple services and multiple APIs to set up their required driver and passenger actions. Instead, teams can focus on testing their use cases.

To ensure that passengers and drivers are isolated in our test, we need to group them together in a cohort before running the E2E test.

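A simplified Go sketch of grouping the test's driver and passenger into a cohort (the Cohort type and constructor are hypothetical, not the real SDK) could be:

package example

// Cohort is a hypothetical grouping of the drivers and passengers used by one
// simulation, so they never interact with resources from other test runs.
type Cohort struct {
  City         string
  DriverIDs    []string
  PassengerIDs []string
}

// newBookingTestCohort groups a single driver and passenger in one city for the
// E2E booking test; the IDs are placeholders.
func newBookingTestCohort(city, driverID, passengerID string) Cohort {
  return Cohort{
    City:         city,
    DriverIDs:    []string{driverID},
    PassengerIDs: []string{passengerID},
  }
}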

In summary, we have defined the driver's behaviour, created the booking request, created the SDK client, and associated the driver and passenger with a cohort. Now, we just have to trigger the E2E test from our IDE. It’s just that simple and easy!

Previously, developers had to write boilerplate code to make HTTP requests and parse the returned HTTP responses. With the Marionette SDK in place, developers don’t have to write any boilerplate code, saving significant time and effort in E2E testing.

RESTful APIs in test scripts

Marionette provides several RESTful API endpoints that cover different simulation areas such as resource or data creation APIs, driver APIs, passenger APIs, etc. APIs are particularly suitable for scripted testing. Developers can directly call these APIs in their test scripts to facilitate their own tests such as load tests, integration tests, E2E tests, etc.

Developers use these APIs with their preferred programming languages to run simulations. They don’t need to worry about any underlying complexities when using the APIs. For example, developers in Grab have created custom libraries using Marionette APIs in Python, Java, and Bash to run simulations.
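As a rough illustration, a test script written in Go could call a resource-creation endpoint directly. The sketch below is ours: the /drivers path and the payload fields are assumptions for illustration, not the documented Marionette API.

package example

import (
  "bytes"
  "encoding/json"
  "fmt"
  "net/http"
)

// createDriver calls a hypothetical Marionette resource-creation endpoint from a
// test script; the path and payload fields are illustrative only.
func createDriver(baseURL, city string) error {
  payload, err := json.Marshal(map[string]string{"city": city, "taxiType": "JustGrab"})
  if err != nil {
    return err
  }
  resp, err := http.Post(baseURL+"/drivers", "application/json", bytes.NewReader(payload))
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("unexpected status: %s", resp.Status)
  }
  return nil
}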

What’s next

Currently, we cover E2E tests for our transport domain (microservices for the passenger and driver apps) through Marionette. The next phase is to expand into a full-fledged platform that can test microservices in other Grab domains such as Food, Payments, and so on. Going forward, we are also looking to further simplify the writing of E2E tests and running them as a part of the CD pipeline for seamless testing before deployment.

In conclusion

We had an idea of creating a simulation platform that can run and facilitate E2E testing of microservices. With Marionette, we have achieved this objective. Marionette has helped us understand how end users use our apps, allowing us to make improvements to our services. Further, Marionette ensures there are no breaking changes and provides additional visibility into potential bugs that might be introduced as a result of any changes to microservices.

If you have any comments or questions about Marionette, please leave a comment below.

How we implemented domain-driven development in Golang

Post Syndicated from Grab Tech original https://engineering.grab.com/domain-driven-development-in-golang

Partnerships have always been core to Grab’s super app strategy. We believe in collaborating with partners who are the best in what they do – combining their expertise with what we’re good at so that we can bring high-quality new services to our customers, while at the same time creating new opportunities for the merchant and driver-partners in our ecosystem.

That’s why we launched GrabPlatform last year. To make it easier for partners to either integrate Grab into their services, or integrate their services into Grab.

In view of that, part of the GrabPlatform team’s mission is to make it easy for partners to integrate with Grab services. These partners are external companies that would like to offer Grab’s services, such as ride-booking, through their own websites or applications. To do that, we decided to build a website that serves as a one-stop shop allowing them to self-service these integrations.

The challenges we faced with the conventional approach

In the process of building this website, our team noticed that the majority of the functions and responsibilities were added to files without proper segregation. A single file would contain more than 500 lines of code. Each of these files was imported from different collections of source code, resulting in an unstructured codebase. Any changes to the existing functions risked breaking existing functionality; we realized then that we needed to proactively plan for the future. Hence, we decided to use the principles of Domain-Driven Design (DDD) and idiomatic Go. This blog aims to demonstrate the process of how we leveraged those concepts to design a modern application.

How we implemented DDD in our codebase

Here’s how we went about solving our unstructured codebase using DDD principles.

Step 1: Gather domain (business) knowledge

We collaborated closely with our domain experts (in our case, this was our product team) to identify functionality and flow. From them, we discovered the following key points:

  • After creating a project, developers are added to the project.
  • The domain experts wanted an ability to add other products (e.g. Pricing service, ETA service, GrabPay service) to their projects.
  • They wanted the ability to create multiple authentication clients to access the above products.

Step 2: Break down domain knowledge into bounded context

Now that we had gathered the required domain knowledge (i.e. what our code needed to reflect to our partners), it was time to use the DDD strategic tool Bounded Context to break down problems into subcontexts. Here is a graphical representation of how we converted the problem into smaller units.

Bounded Context

We identified several dependencies on each of the units involved in the project. Take some of these examples:

  • The project domain overlapped with the product and developer domains.
  • Our RideBooking project can only exist if it has some products like Ridebooking APIs and not the other way around.

What this means is a product can exist independent of the project, but a project will have no significance without any product. In the same way, a project is dependent on the developers, but developers can exist whether or not they belong to a project.

Step 3: Identify value objects or entity (lowest layer)

Looking at the above bounded contexts, we figured out the building blocks (i.e. value objects or entity) to break down the above functionality and flow.

// ProjectDAO ...
type ProjectDAO struct {
  ID            int64
  UUID          string
  Status        ProjectStatus
  CreatedAt     time.Time
}

// DeveloperDAO ...
type DeveloperDAO struct {
  ID            int64
  UUID          string
  PhoneHash     *string
  Status        Status
  CreatedAt     time.Time
}

// ProductDAO ...
type ProductDAO struct {
  ID            int64
  UUID          string
  Name          string
  Description   *string
  Status        ProductStatus
  CreatedAt     time.Time
}

// DeveloperProjectDAO maps developers to a project
type DeveloperProjectDAO struct {
  ID            int64
  DeveloperID   int64
  ProjectID     int64
  Status        DeveloperProjectStatus
}

// ProductProjectDAO maps products to a project
type ProductProjectDAO struct {
  ID            int64
  ProjectID     int64
  ProductID     int64
  Status        ProjectProductStatus
}

All the objects shown above have ID as a field and can be identified, hence they are entities and not value objects. But if we apply domain knowledge, DeveloperProjectDAO and ProductProjectDAO are actually not independent entities. The Project object is the aggregate root, since it must exist before the child mappings, DeveloperProjectDAO and ProductProjectDAO, can exist.

Step 4: Create the repositories

As stated above, we created an interface to abstract the working logic of a particular domain (i.e. Repository). Here is an example of how we designed the repositories:

// ProductRepositoryImpl responsible for product functionality
type ProductRepositoryImpl struct {
  productDao storage.IProductDao // private field
}

type ProductRepository interface {
  GetProductsByIDs(ctx context.Context, ids []int64) ([]IProduct, error)
}

// DeveloperRepositoryImpl
type DeveloperRepositoryImpl struct {
  developerDAO storage.IDeveloperDao // private field
}

type DeveloperRepository interface {
  FindActiveAllowedByDeveloperIDs(ctx context.Context, developerIDs []interface{}) ([]*Developer, error)
  GetDeveloperDetailByProfile(ctx context.Context, developerProfile *appdto.DeveloperProfile) (IDeveloper, error)
}

Here is a look at how we designed our repository for aggregate root project:

// Unexported Struct
type productProjectRepositoryImpl struct {
  productProjectDAO storage.IProjectProductDao // private field
}

type ProductProjectRepository interface {
  GetAllProjectProductByProjectID(ctx context.Context, projectID int64) ([]*ProjectProduct, error)
}

// Unexported Struct
type developerProjectRepositoryImpl struct {
  developerProjectDAO storage.IDeveloperProjectDao // private field
}

type DeveloperProjectRepository interface {
  GetDevelopersByProjectIDs(ctx context.Context, projectIDs []interface{}) ([]*DeveloperProject, error)
  UpdateMappingWithRole(ctx context.Context, developer IDeveloper, project IProject, role string) (*DeveloperProject, error)
}

// Unexported Struct
type projectRepositoryImpl struct {
  projectDao storage.IProjectDao // private field
}

type ProjectRepository interface {
  GetProjectsByIDs(ctx context.Context, projectIDs []interface{}) ([]*Project, error)
  GetActiveProjectByUUID(ctx context.Context, uuid string) (IProject, error)
  GetProjectByUUID(ctx context.Context, uuid string) (*Project, error)
}

type ProjectAggregatorImpl struct {
  projectRepositoryImpl           // private field
  developerProjectRepositoryImpl  // private field
  productProjectRepositoryImpl    // private field
}

type ProjectAggregator interface {
  GetProjects(ctx context.Context) ([]*dto.Project, error)
  AddDeveloper(ctx context.Context, request *appdto.AddDeveloperRequest) (*appdto.AddDeveloperResponse, error)
  GetProjectWithProducts(ctx context.Context, uuid string) (IProject, error)
}
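Because the three repository implementations are embedded in ProjectAggregatorImpl, their methods are promoted onto the aggregate. The sketch below is ours (the constructor name is not from the original code) and only illustrates how the pieces compose:

// Illustrative sketch: embedding promotes the repositories' methods onto the
// aggregate, so callers only need a ProjectAggregatorImpl.
func newProjectAggregator(
  projects projectRepositoryImpl,
  developerProjects developerProjectRepositoryImpl,
  productProjects productProjectRepositoryImpl,
) *ProjectAggregatorImpl {
  return &ProjectAggregatorImpl{
    projectRepositoryImpl:          projects,
    developerProjectRepositoryImpl: developerProjects,
    productProjectRepositoryImpl:   productProjects,
  }
}

With this composition, and assuming projectRepositoryImpl implements the ProjectRepository interface, a call such as aggregator.GetProjectByUUID(ctx, uuid) resolves to the embedded project repository without any extra wiring.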

Step 5: Identify Domain Events

The functions described in Step 4 only return the IDs of the developer and product, which convey no information to the users. In order to provide developer and product information, we use the domain-event technique to return the actual product and developer attributes.

A domain event is something that happened in a bounded context that you want another context of the domain to be aware of. For example, if there are new updates to the developer domain, it’s important to convey these updates to the project domain. This propagation technique is termed a domain event. Domain events enable independence between different classes.

One way to implement it is seen here:

// file: project_aggregator.go
func (p *ProjectAggregatorImpl) GetProjects(ctx context.Context) ([]*dto.Project, error) {
  ....
  ....
  developers := p.EventHandler.Handle(DomainEvent.FindDeveloperByDeveloperIDs{DeveloperIDs})
  ....
}

// file: event_type.go
type FindDeveloperByDeveloperIDs struct{ developerIDs []interface{} }

// file: event_handler.go
func (e *EventHandler) Handle(event interface{}) interface{} {
  switch op := event.(type) {
      case FindDeveloperByDeveloperIDs:
            developers, _ := e.developerRepository.FindDeveloperByDeveloperIDs(op.developerIDs)
            return developers
      case ....
      ....
    }
}
Domain Event

Some common mistakes to avoid when implementing DDD in your codebase:

  • Not engaging with domain experts. Not interacting with domain experts is a common mistake when using DDD. Talking to domain experts to get an understanding of the problem domain from their perspective is at the core of DDD. Starting with schemas or data modelling instead of talking to domain experts may create code based on a relational model instead of one built around a domain model.
  • Ignoring the language of the domain experts. Creating a ubiquitous language shared with domain experts is also a core DDD practice. This common language must be used in all discussions as well as in the code, e.g. in class and method names.
  • Not identifying bounded contexts. A common approach to solving a complex problem is breaking it down into smaller parts. Creating bounded contexts is breaking down a large domain into smaller ones, each handling one cohesive part of the domain.
  • Using an anaemic domain model. This is a common sign that a team is not doing DDD and often a symptom of a failure in the modelling process. At first, an anaemic domain model often looks like a real domain model with correct names, but the classes lack functionalities. They contain only the Get and Set methods.

How the DDD model improved our software development

Thanks to this brand new clean up, we achieved the following:

  • Core functionalities are evenly distributed to the overall codebase and not limited to just a few files.
  • The developers are aware of what each folder is responsible for by simply looking at the file naming and folder structure.
  • The risk of breaking major functionalities by merely making small changes is greatly reduced. Changing a feature is now more efficient.

The team now finds the code well structured and we require less hand-holding for onboarders, thanks to the simplicity of the structure.

Finally, the most important thing, we now have a system oriented towards our business necessities. Everyone ends up using the same language and terms. Developers communicate better with the business team. The work is more efficient when it comes to establishing solutions for the models that reflect how the business operates, instead of how the software operates.

Lessons Learnt

  • Use DDD to collaborate among all project disciplines (product, business, partner, and so on) and clearly understand the business requirements.
  • Establish a ubiquitous language to discuss domain-related concepts.
  • Use bounded contexts to break down complex domains into manageable parts.
  • Implement a layered architecture (i.e. DDD building blocks) to focus on particular aspects of the application.
  • To simplify your dependencies, use domain events to communicate between bounded contexts.

Driving Southeast Asia forward through people-focused design

Post Syndicated from Grab Tech original https://engineering.grab.com/driving-sea-forward-through-people-focused-design

Southeast Asia is home to around 650 million people from diverse and comparatively different economic, political and social backgrounds. Many people in the region today rely on super apps like Grab to earn a daily living or get from A to B more efficiently and safely. This means that decisions made have real impact on people’s lives – so how do you know when your decisions are right or wrong?

In this post, I’ll share key customer insights that have guided my decisions and informed my design thinking over the last year whilst working as a product designer for Grab in Singapore. I’ve broken my learnings down into 3 transient areas for thinking about product development and how each one addressed our customers’ needs.

  1. Relevance – does the design solve the customer problem? For example, loss of connectivity which is common in Southeast Asia should not completely prevent a customer from accessing the content on our app.
  2. Inclusivity – does the design consider the full range of customer diversity? For example, a driver waiting in the hot sun for his passenger can still use the product. Inclusive design covers people with a range of perspectives, disabilities and environments.
  3. Engagement – does the design invoke a feeling of satisfaction? For example, building a compelling narrative around your product that solicits a higher engagement.

Under each of these areas, I’ll elaborate on how we’ve built empathy from customer insights and applied these to our design thinking.

But before jumping in, think about the lens which frames any customer experience – the mobile device. In Southeast Asia, the commonly used devices are inexpensive low-end devices made by OPPO, Xiaomi, and Samsung. Knowing which devices customers use helps us understand potential performance constraints, different screen resolutions, and custom Android UIs.

Designing for relevance  

Shopping mall in Medan, Indonesia
Shopping mall in Medan, Indonesia

Connectivity

In Southeast Asia, it’s not too hard to find public WiFi. However, the main challenge is finding a reliable network. Take this shopping mall in Medan, Indonesia. The WiFi connectivity didn’t live up to the modern infrastructure of the building. The locals knew this and used mobile data over spotty and congested connectivity. Mobile data is the norm for most people and 4G reach is high, but the power of the connections is relatively low.

Building empathy

To genuinely design for customers’ needs, designers at Grab regularly get out of the office to understand what people are doing in the real world. But how do we integrate empathy and compassion into the design process? Throughout this article, I’ll explain how the insights we gathered from around Southeast Asia can inform your decision-making process.

To simulate a loss of connectivity, switch to airplane mode to observe the current UI states and limitations. If you have the resources, create a 2G network to compare how bandwidth constraints affect page loading speeds. Network Link Conditioner for Mac and iOS, or Lighthouse in Chrome DevTools, can replicate a slower network.

Design implications

This diagram is from Scott Hurff’s book, Designing Products People Love. The book is amazing, but if you don’t have the time to read it, this article offers a quick overview.

Scott Hurff’s UI Stack
Scott Hurff’s UI Stack

An ideal state (the fully loaded experience) is primarily what a lot of designers think about when problem-solving. However, when connectivity is a common customer pain point, designers at Grab have to design for the less desirable: Blank, Loading, Partial, and Error states in tandem with all the happy paths. Latency can make or break the user experience, so buffer wait times with visual progress to cushion each millisecond. Loading skeletons, shown when you open Grab, act as momentary placeholders for content and reduce the perceived latency of loading the full experience.

A loss of connectivity shouldn’t mean the end of your product’s experience. Prepare for connectivity issues by keeping screens alive through intuitive visual cues, messaging, and cached content.

Device type and condition

In Southeast Asia, people tend to opt for low-end or hand-me-down devices that can sometimes have cracked screens or depleting batteries. These devices are usually in circulation much longer than in developed markets, and the device’s OS might not be the latest version because of the perceived effort or risk to update.  

A driver’s device taken during research in Indonesia
A driver’s device taken during research in Indonesia

Building empathy

At Grab, we often use a range of popular, in-market devices to understand compatibility during the design process. Installing mainstream apps to a device with a small screen size, 512MB internal memory, low resolution and depleting battery life will provide insights into performance.  If these apps have lite versions or Progressive Web Apps (PWA), try to understand the trade-offs in user experience compared to the parent app.

Grab’s passenger app on the left versus the driver app
Grab’s passenger app on the left versus the driver app

Design implications

Design for small screens first to reduce the chances of design debt later in the development lifecycle. For interactive elements, it’s important to think about all types of customers that will use the product and in what circumstances. For Grab’s driver-partners who may have their devices mounted to the dashboard, tap targets need to be larger and more explicit.  

Similarly, color contrast will vary depending on screen resolution and time of the day. Practical tests involve dimming the screen and standing near a window in bright sunshine (our HQ is in Singapore which helps!). To further improve accessibility, use a tool like Sketch’s Stark plugin to understand if contrast ratios are accessible to visually impaired customers. A general rule is to aim for higher contrast between essential UI components, text and interactive affordances.

Fancy transitions can look great on high-end devices but can appear choppy or delayed on older and less performant phones. Aim for simple animations to offer a more seamless experience.

Passenger verification to improve safety
Passenger verification to improve safety

Day-to-day budgeting

Many people in Southeast Asia earn a daily income, so it’s no surprise that prepaid mobile is more common over a monthly contract. This mindset to ration on a day-to-day basis also extends itself to other essentials like washing powder and nappies. Data can be an expensive necessity, and customers are selective over the types of content that will consume a daily or weekly budget. Some customers might turn off data after getting a ride, and not turn it back on until another Grab service is required.

Building empathy

Rationing data consumption daily can be achieved by not connecting to WiFi, or, in a more granular way, by turning off WiFi and using an app like Google’s Datally on Android to cap data usage. Starting low, at around 50MB per day, will increase your understanding of the data trade-offs you make and highlight the apps that require more data to perform certain actions.

Design implications

Where possible, avoid using video when SVG animations can be just as effective, scalable and lightweight. For Grab’s passenger verification flow, we decided to move away from a video tutorial and keep data consumption to a minimum through utilising SVG animations. When a video experience is required, like Grab’s feed on the home screen, disabling autoplay and clearly distinguishing the media as video allowed customers to decide on committing data.

Design for inclusivity  

Mobile-only

The expression “mobile-first” has been bounced around for the last decade, but in Southeast Asia, “mobile-only” is probably more accurate. Most customers have never owned a tablet or laptop, and a mobile number, rather than an email address, is the usual method of registration. In the region, people rely more on social media and chat apps to understand broadcast or published news reports, events, and recommendations. Customers who sign up for a new Grab account prefer phone numbers and OTP (one-time password) registration over providing an email address and password. And anecdotally, from interviews conducted at Grab, customers didn’t feel the need for email when communication can take place via SMS, WhatsApp, or other messaging apps.

Building empathy

At Grab, we apply design thinking from a mobile-only perspective for our passenger, merchant,  and driver-partner experiences by understanding our customers’ journeys online and off.  These journeys are synthesized back in the office and sometimes recreated with video and physical artifacts to simulate the customer experience. It’s always helpful to remove smartwatches, put away laptops and use an in-market device that offers a similar experience to your customers.

Design implications

When onboarding new customers, offer a relevant sign-in method for a mobile-only customer, like phone number and social account registration. Grab’s passenger sign-up experience addresses these priorities with phone number first, social accounts second.  

Grab’s sign-in screen
Grab’s sign-in screen

PC-era icons are also widely misunderstood by mobile-only customers, so avoid floppy disks to imply Save, or a folder to Change Directory as these offer little symbolic meaning. When icons are paired with text, this can often reinforce meaning and quicken recognition.  For example, a pencil icon alone can be confusing, so adding the word “Edit” will provide more clarity.  

Nightfall in Yogyakarta, Indonesia
Nightfall in Yogyakarta, Indonesia

Diversity and safety

This photo was taken in Yogyakarta, Indonesia. In the evening, women often formed groups to improve personal safety. In an online environment, women often face discrimination, harassment, blackmail, cyberstalking, and more.  Minorities in emerging markets are further marginalised due to employment, literacy, and financial issues.  

Building empathy

Southeast Asia has a very diverse population, and it’s important to understand gender, ethnic, and class demographics before you plan any research. Research recruitment at Grab involves working with local vendors to recruit diverse groups of customers for interviews and focus groups. When spending time with customers, we try to understand how diversity and safety factors contribute to the experience of the product.

If you don’t have the time and resources to arrange face-to-face interviews, I’d recommend this article for creating a survey: Respectful Collection of Demographic Data

Design for inclusivity

Allow people to control how they represent their identities through pseudonym names and avatars. But does this undermine trust on the platform? No, not really. Credit card registration or more recently, Grab’s passenger and driver selfie verification feature has closed the loop on suspect accounts whilst maintaining everyone’s privacy and safety.  

On the visual design side, our illustration and content guide incorporates diverse representations of ethnic backgrounds, clothing, physical ability, and social class. You can see examples in the app or through our Dribbble page. For user-generated content, allow people to report and flag abusive material. While data and algorithms can do so much, facts and ethics cannot be policed by machine learning.

Language

In Southeast Asia and other emerging markets, customers may set their phone to a language which they aspire to learn but may not fully comprehend. Swipe, tap, drag, pinch, and other specific terms relating to interactions might not easily translate into the local language, and English might be the preferred language regardless of comprehension. It’s surprisingly common to attend an interview with a translator but the device’s UI is set to English.  

A Grab pick-up point taken in Medan, Indonesia
A Grab pick-up point taken in Medan, Indonesia

Building empathy

If your app supports multiple languages, try setting your phone to a different language but know how to change it back again!  At Grab, we test design robustness by incorporating translated text strings into our mocks. Look for visual cues to infer meaning since some customers might be illiterate or not fully comprehend English.

Grab’s Safety Centre in different languages
Grab’s Safety Centre in different languages

Design for different languages, formats and visual cues

To reduce design debt later on, it’s a good idea to start with the smallest screen size and test the most vulnerable parts of the UI with translated text strings. Keep in mind, dates, times, addresses, and phone numbers may have different formats and require special attention. You can apply multiple visual cues to represent important UI states, such as a change in colour, shape and imagery.

Design for engagement

Sharing

From our research studies, word-of-mouth communication and consuming viral content via Instagram or Facebook was more popular than trawling through search page results. The social aspect is extended to the physical environment where devices can sometimes be shared with more than one person, or in some cases, one mobile is used concurrently with more than one user at a time. In numerous interviews, customers talk about not using biometric authentication so that family members can access their devices.

Building empathy

To understand the layers of personalisation, privacy and security on a device, it’s worth loaning a device from your research team or just borrow a friend’s phone (if they let you!).  How far do you get before you require biometric authentication or a PIN to proceed further? If you decide to wipe a personal device, what steps can you miss out from the setup, and how does that affect your experience post setup?

Offline to Online: GrabNow connecting with driver
Offline to Online: GrabNow connecting with driver

Design for sharing

If necessary, facilitate device sharing through easy switching of accounts, and enable people to remove or hide private content after use. Allow content to be easily shared for both online and offline in-person situations. Using this approach, GrabNow allows passengers to find and connect with a nearby driver without having to pre-book and wait for a driver to arrive. This offline to online interaction also saves data and battery for the customer.

Support and tutoring

In Southeast Asia, people find troubleshooting issues from inside a help page troublesome and generally prefer human assistance, like speaking to someone through a call centre. The opportunity for face-to-face tutoring on how something works is often highly desired and is much more effective than standard onboarding flows that many apps use. From the many physical phone stores, it’s not uncommon for people to go and ask for help or get apps manually installed onto their device.

Building empathy

Apart from speaking with your customers regularly, always look through the Play and App Store reviews for common issues. Understand your customers’ problems and the jargon they use to describe what happened. If you have a customer support team, the tickets created will be a key indicator of where your customers need the most support.

Help and Feedback Design implications

Make support accessible through a variety of methods: online forms, email, and if possible, allow customers to call in. With in-app or online forms, try to use drop-downs or pre-populated quick responses to reduce typing, triage the type of support, and decrease misunderstanding when a request comes in.  When a customer makes a Grab transport booking for the first time, we assist the customer through step-by-step contextual call-outs.

Local aesthetics

This photo was taken in Medan, Indonesia, on the day of an important wedding. It was impressive to see hundreds of handcrafted, colourful placards lining the streets for miles, but maybe more admirable that such an occasion was shared with the community and passers-by, and not only for the wedding guests.  

A wedding celebration flower board in Medan, Indonesia
A wedding celebration flower board in Medan, Indonesia

These types of public displays are not exclusive to weddings in Southeast Asia, vibrant colours and decorative patterns are woven into the fabric of everyday life, indicative of a jovial spirit that many people in the region possess.

Building empathy

What are some of the immediate patterns and surfaces that exist in your workspace? Looking around your immediate environment can provide an immediate assessment of visual stimuli that can influence your decisions on a day-to-day basis.

Wall space can be incredibly valuable when you can display photos from your research trip, or find online inspiration to recreate some of the visual imagery from your target markets.  When speaking with your customers, ask to see mobile wallpapers, and think about how fashion could also play a role in determining an aesthetic choice. Lastly, take time out when on a research trip to explore the streets, museums, and absorb some of the local cultures.

Design to delight and surprise customers

Capture local inspiration on research trips to incorporate into visual collections that can be a source of inspiration for colour, imagery, and textures. Find opportunities in your product to delight and engage customers through appropriate images and visuals. Grab’s marketing consent experience leverages illustrative visuals to help customers understand the different categories that require their consent.

For all our markets, we work with local teams around culturally sensitive visuals and imagery to ensure our content is not offensive or portrays the wrong connotations.

My top 5 for guerrilla field research

If you don’t have enough time, stakeholder buy-in or budget to do research, getting out of the office to do your own is sometimes the only answer. Here are my top 5 things to keep in mind.

  1. Don’t jump in. Always start with observation to capture customers’ natural behaviour.
  2. Sanity check scripts. Your time and customers’ time is valuable; streamline your script and prepare for u-turns and potential Facebook and Instagram friend requests at the end!  
  3. Ask the right people. It’s difficult to know who wants to or has time for your 10-minute intercept. Look for individuals sitting around and not groups if possible (group feedback can be influenced by the most vocal person).
  4. Focus on the user. Never multitask when speaking to the user. Jotting notes on an answer sheet is less distracting than using your mobile or laptop (and less dangerous in some places!). Ask permission to record audio if you want to avoid notetaking altogether, but this does create more work later on.
  5. Use insights to enrich understanding. Insights are not trends and should be used in conjunction with quantitative data to validate decision making.

Feel inspired by this article and want to learn more? Grab is hiring across Southeast Asia and Seattle. Connect with me on LinkedIn or Twitter @PhilipMadeley to learn more about design at Grab.

Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse, or passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • keep fraudulent users away from our app or services,
  • ensure our customers’ safety, and
  • manage user identities for secure login to the Grab app.

The TIS team’s scope covers not just transport, but also our food, delivery, and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are like boolean conditions that determine whether the result will be true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them in our backend service code and deployed our service frequently to use the latest rules.
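As a rough illustration (this is our sketch, not Grab's actual code), Rule 1 hard-coded inside a Go backend service might have looked something like this:

package example

import (
  "context"
  "time"
)

// PaymentHistory is a hypothetical lookup of recent card events; the real
// service and data model are not described in the post.
type PaymentHistory interface {
  DeclinedSince(ctx context.Context, cardID string, since time.Time) (int, error)
}

// canUseCardForBooking hard-codes Rule 1: if the card has been declined today,
// it cannot be used for a booking.
func canUseCardForBooking(ctx context.Context, history PaymentHistory, cardID string) (bool, error) {
  startOfDay := time.Now().Truncate(24 * time.Hour) // simplification: UTC day boundary
  declines, err := history.DeclinedSince(ctx, cardID, startOfDay)
  if err != nil {
    return false, err
  }
  return declines == 0, nil
}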

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. It was very lightweight to trigger the rule deployment and enforce the rules.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined(but paid off) more than twice in the last 3-months

THEN we would still not allow this booking

The system scans through the rules one by one, and even if it determines that one rule is tripped, it still checks the other rules. In the example above, if a credit card has been declined more than twice in the last 3 months, the passenger will not be allowed to book even though he has a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules as shown in the example below. We first check if the device used by the passenger is a high-risk one, e.g. an emulator used for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined booking from the credit card), and then make a decision on whether this booking should be precharged or not. If the passenger is using a low-risk device but is in a location where we traditionally see a lot of fraudulent bookings, we then run some further checks on the passenger's booking history to decide if a pre-charge is also needed.

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data, and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion-based bookings should be limited to 5 and total finished bookings should be greater than 6; otherwise, unallocate the current ride).
  3. Developers implemented these rules and deployed the changes to production

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementation (e.g. for “X should be limited to 5”, should the implementation be X < 5 or  X <= 5?)

Once a new rule is deployed, we monitored the performance of the rule. For example,

  • How often does the rule fire (after minutes, hours, or daily)?
  • Is it over-firing?
  • Does it conflict with other rules?

Based on implementation, each rule had dependency with other rules. For example, if Rule 1 is fired, we should not continue with Rule 2 and Rule 3.

As a result, we couldn’t keep each rule evaluation independent. We had no way to observe the performance of a rule without other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger-rate would drop significantly. It means we would have unstable performance metrics for Rule 2 and Rule 3 even though the logic of Rule 2 and Rule 3 does not change. It is very hard for a rule owner to monitor the performance of Rules 2 and Rule 3.

When it comes to A/B testing of a new rule, data scientists need to put a lot of effort into cleaning up the noise from other rules, but most of the time, it is mission impossible.

After several misfiring events (wrong implementation of rules) and ever longer rule development time (weekly), we realized “No one can handle this manually.“

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data:  e.g. “what is the credit card risk score”, or “how many food bookings has this user ordered in last 7 days”, and transform this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules that can automate the review of rules without manual intervention.

Griffin now predicts billions of events every day with 100K+ queries per second (QPS) at peak time (on only 6 regular EC2 instances).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our in-built tool, we did some research for common business rule engines available in the market such as Drools and checked if we should use them. In that process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas our major users are from a Python background).
  2. Limited expressive power (see https://en.wikipedia.org/wiki/Expressive_power_(computer_science)).
  3. Limited support for some common math functions (e.g. factorial/ Greatest Common Divisor).
  4. Our nature of business needed dynamic dataset for predictions (for example, a rule may need only passenger booking history on Day 1, but it may use passenger booking history, passenger credit balance, and passenger favorite places on Day 2). On the other hand, Drools usually works well with a static list of dataset instead of dynamic dataset.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal, and save the rules in the database.
  2. Griffin Rule Engine reloads rules immediately as long as it detects any rule changes.
  3. Data Orchestrator sends all dataset (features) needed for a prediction (e.g. whether to block a ride based on passenger past ride pattern, credit card risk) to the Rule Engine
  4. Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input:JSON => Result:Boolean

We allow users (analysts, data scientists) to write Python-based rules on WebUI to accommodate some very complicated rules like:

len(list(filter(lambda x: x > 7, map(lambda x: math.factorial(x), [1,2,3,4,5,6])))) > 2

This significantly optimizes the expressive power of rules.

To match and evaluate a rule more efficiently, we also have other key components associated:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification (the example below shows the treatments for one safety-related rule). The one below it is for a checkpoint to detect credit-card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=[MY,PH] and verticals=[GrabBus, GrabCar]
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

 

When a rule is hit, users often want more than just treatments; they also want some dynamic values returned, e.g. the maximum ride distance allowed if we believe this booking is medium risk.

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The graph below shows the latency P99 of prediction requests from the load balancer side (the real latency for each prediction is < 6ms; the metric peaks at 30ms because some batch requests contain 50 predictions in a single call).

Prediction Request Latency P99

What we did to achieve this?

  • The key idea is to make all computations in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from database for each prediction. Instead, we keep a record called dirty_key, which keeps the latest rule update timestamp. The rule engine would actively check this timestamp and trigger a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time.
  • Rule engine would not fetch any additional new data, instead, all data should be from Data Orchestrator.
  • So the whole prediction flow is only between CPU & memory (and if the data size is small, it could be on CPU cache only).
  • Python GIL essentially enforces a process to have up to one active thread running at a time, no matter how many cores a CPU has. We have Gunicorn to wrap our service, so on the Production machine, we have (2x$num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers). The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.

The below screenshot is the process snapshot on C5.large machine with 2 vCPU. Note only green processes are active.

Process snapshot on C5.large machine

A lot of trial and error performance tuning:

  • We used to have python-jsonpath-rw for JSONPath query, but the performance was not strong enough. We switched to jmespath and observed about 10ms latency reduction.
  • We use sqlalchemy for DB queries and ORM. We enabled caching for some use cases, but it turned out to be over-optimized and served stale data. We ended up turning off some caching points to ensure data consistency.
  • For new dict/list creation, we prefer native call (e.g. {}/[]) instead of function call (see the comparison below).
Native call and Function call
  • Use built-in functions https://docs.python.org/3/library/functions.html. It is written in C, no one can beat it.
  • Add randomness to rule reload so that not all machines run at the same time causing latency spikes.
  • Caching atomic feature units as they are used so that we don’t have to requery for them each time a checkpoint uses it.

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens a door for non-developers to make production changes in real time, which significantly boosts organisational productivity. In the past, a rule change needed one week for code change, testing, and deployment; now it takes just one minute.

But this also introduces extra risks. Anyone can turn the whole checkpoint down, whether unintentionally or maliciously.

Hence we implemented Shadow Mode and Percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify the performance without any production impact, and if needed, rollout of a rule can be from 1% all the way to 100%.

We implemented version control for every rule change, and in case anything unexpected happened, we could rollback to the previous version quickly.

Version control
Rollback button

We also built an RBAC-based permission system, along with a Change Approval flow, to make sure any production change needs at least two people (and the approver role has higher permission).

Closing thoughts

Griffin evolved from a fraud-based rule engine into a generic rule engine. It can apply to any rule at Grab. For example, Grab just launched Appeal automation several days ago to cut by 50% the human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are so excited about this.

This could happen because from the very beginning we designed Griffin with minimized business context, so that it can be generic enough.

After the launch of this, we observed an amazing adoption rate for various fraud/safety/identity use cases. More interestingly, people now treat Griffin as an automation point for various integration points.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to an ACFE survey, companies lost USD 6.3 billion due to fraud, and organizations lose 5% of their revenue annually to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDK for engineering integration, and builds rules engines and consoles for fraud detection.

One example of fraudulent behavior could be that of an individual who masquerades as both driver and passenger, and makes cashless payments to get promotions, for example, to earn a one-dollar rebate in the next transaction. In our system, we analyze real-time booking and payment signals, compare them with the historical data of the driver and passenger pair, and create rules using the rule engine. We count the number of driver and passenger pairs in a given time frame, and this counter is provided as an input to the rule. If the counter value exceeds a predefined threshold value, the rule evaluates it as a fraud transaction. We send this verdict back to the booking service.
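As a simplified sketch of that rule (the names and the verdict type below are ours, not the production rule), the counter-based check boils down to a threshold comparison:

package example

// Verdict is a hypothetical result type sent back to the booking service.
type Verdict string

const (
  VerdictPass  Verdict = "PASS"
  VerdictFraud Verdict = "FRAUD"
)

// evaluatePairRule compares the driver-passenger cashless-payment counter
// (supplied by the Counter service for the evaluated time frame) against a
// predefined threshold.
func evaluatePairRule(pairPaymentCount, threshold int) Verdict {
  if pairPaymentCount > threshold {
    return VerdictFraud
  }
  return VerdictPass
}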

The conventional method

Fraud detection is a job that requires cross-functional teams like data scientists, data analysts, data engineers, and backend engineers to work together. Usually data scientists or data analysts come up with an offline idea and apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions by data scientists and data analysts. In the conventional method, the rule needs to be communicated to engineers.

Automated solution using the Counter service

To overcome the challenges in the conventional method, the Trust platform team decided to come out with the Counter service, a self-service platform, which provides management tools for users, and a computing engine for integrating with the backend services. This service provides an interface, such as a UI based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also decided to provide different data contracts, APIs, and SDKs to engineers so that the business verticals can use it quickly and easily.

The major engineering challenges faced in designing the Counter service

There are millions of transactions happening at Grab every day, which implies we needed to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many counts of the cashless payment happened for a driver and passenger pair. Due to the scale of Grab’s business, the potential combinations of drivers and passengers could be exponential. However, this is only one use case. So imagine that there could be hundreds of counters for different use cases. Hence it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, we could potentially have an exponential number of passengers and drivers in a single counter. So it’s a great challenge to store the counters in the database, read, and query them in real-time. When there are billions of counter keys across a long period of time, the Trust team had to find a scalable way to write and fetch keys effectively and meet the client’s SLA.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time a new type of counter is needed from data scientists, developers need to manually make code changes, such as adding a new stream, capturing related data sets for the counter, and storing it on the fraud service, then doing a deployment to make the counters ready. It usually takes two or more weeks for the whole iteration, and if there are any changes from the data analysts’ side, which happens often, the situation loops again. The team had to come up with a solution to prevent the long loop of manual tasks by coming out with a self-serving interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what is written in the counters. That’s because the conventional counter data were stored in Redis database to satisfy the query SLA. They could not track the correctness of counter value, or its history. With the new solution, the stakeholders can get a real-time picture of what is stored in the counters using the data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data Consistency Challenge/Issue

Most machine learning workflows need dedicated input data. However, when an anti-fraud model is trained using offline data from the data lake, it is difficult to use the same model in real time. This is because the model lacks a data contract and consistency with the data source. In this case, the Counter service becomes a type of data source by providing the value of counters to the file system.

ML featuring

Counters are important features for the ML models. Imagine there is a new invention of counters, which data scientists need to evaluate. We need to provide a historical data set for counters to work. The Counter service provides a counter replay feature, which allows data scientists to simulate the counters via historical payload.

In general, the Counter service is a bridge between online and offline datasets, and between data scientists and engineers. There was technical debt with regard to data consistency and automation in the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We designed the Counter service around the principle of asynchronous data ingestion and synchronous transactions.

The diagram below shows how the counters are generated and saved to the database.

How the counters are generated and saved to the database

Counter creation workflow

  1. User opens the Counter Creation UI and creates a new key “fraud:counter:counter_name”.
  2. Configures required fields.
  3. The Counter service monitors for new counter creation, puts the new counter into load script storage, and starts processing new counter events (see the counter write workflow below).

Counter write workflow

  1. The Counter service monitors multiple streams and assembles extra data from online data services (e.g. Common Data Service (CDS), the passenger service, the hydra service), so that a rich dataset is available to editors for each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool (a minimal sketch of this evaluation follows this list).
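
To make the idea concrete, here is a small, hypothetical Python sketch of how a counter processor might evaluate a user-configured expression against an enriched stream event. The counter definition, field names, and the fraud:counter: key prefix from the creation workflow are illustrative only, not the actual GrabPlugin implementation.

from datetime import datetime

# Hypothetical counter definition, roughly what the UI-based editor might produce.
COUNTER = {
    "key_prefix": "fraud:counter:cashless_payment_per_pair",
    "condition": lambda e: e["payment_method"] == "cashless",
    "group_by": lambda e: f'{e["driver_id"]}:{e["passenger_id"]}',
    "increment": lambda e: 1,
}

def process_event(event: dict):
    """Evaluate one enriched stream event against the counter definition.

    Returns (counter_key, bucket, increment) destined for the time series store,
    or None when the condition does not match.
    """
    if not COUNTER["condition"](event):
        return None
    # Truncate the event time to the finest pre-aggregation bucket (15 minutes).
    ts = datetime.fromisoformat(event["event_time"])
    bucket = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
    key = f'{COUNTER["key_prefix"]}:{COUNTER["group_by"](event)}'
    return key, bucket.isoformat(), COUNTER["increment"](event)

# Example event assembled from the booking stream plus online data services.
print(process_event({
    "driver_id": "d-42",
    "passenger_id": "p-7",
    "payment_method": "cashless",
    "event_time": "2019-09-10T23:18:42",
}))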

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Grab-Stats runs on top of ScyllaDB, which is a distributed NoSQL data store. We chose ScyllaDB because of its good in-memory aggregation performance on time series data. Compared with in-memory storage like AWS ElastiCache, it is 10 times cheaper and just as reliable in terms of stability. The p99 read latency from ScyllaDB is less than 150 ms, which satisfies our SLA.

How we improved the Counter service performance

We used a multi-bucket strategy to improve the Counter service performance.

Background

Queries come with different time windows. Some counters are time sensitive and need to know what happened in the last 30 or 60 minutes. Others focus on the long term and need the events from the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve short-range and long-range queries efficiently at the same time: the finer the accuracy required and the longer the time range, the more aggregation has to happen in the database. That means we would not be able to satisfy the SLA, or we would have to block other processes, which degrades the service.

Solution for improving the query

We resolved this problem by using tables of different granularities. We pre-aggregated the signals into different time buckets, such as 15 minutes, 1 hour, and 1 day.

When a request comes in, its time range is divided across the buckets and the partial results are merged. For example, for a request from 9/10 23:15:20 to 9/12 17:20:18, the handler queries 15-minute buckets within the partial hours, hourly buckets within the partial days, and daily buckets for the remaining full days. This way we avoid heavy aggregations while still keeping 15-minute accuracy within a scalable response time. A small sketch of this splitting logic is shown below.
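
As an illustration, here is a minimal Python sketch of how a request window could be split greedily across 15-minute, hourly, and daily buckets before the partial aggregates are merged. The bucket sizes mirror the ones above, but the function names are ours, not the actual service code.

from datetime import datetime, timedelta

BUCKETS = [timedelta(days=1), timedelta(hours=1), timedelta(minutes=15)]

def split_into_buckets(start: datetime, end: datetime):
    """Greedily cover [start, end) with daily, hourly, and 15-minute buckets.

    The start is first rounded down to the 15-minute grid, matching the
    finest pre-aggregated granularity.
    """
    cursor = start - timedelta(minutes=start.minute % 15,
                               seconds=start.second,
                               microseconds=start.microsecond)
    plan = []
    while cursor < end:
        for size in BUCKETS:
            aligned = (cursor - datetime.min) % size == timedelta(0)
            if aligned and cursor + size <= end:
                plan.append((size, cursor))
                cursor += size
                break
        else:
            # No full bucket fits; finish with one final 15-minute bucket.
            plan.append((timedelta(minutes=15), cursor))
            cursor += timedelta(minutes=15)
    return plan

# Request from 9/10 23:15:20 to 9/12 17:20:18 (the example from the text).
for size, ts in split_into_buckets(datetime(2019, 9, 10, 23, 15, 20),
                                   datetime(2019, 9, 12, 17, 20, 18)):
    print(size, ts)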

Counter service UI

We allow data analysts and data scientists to onboard counters by themselves from a dedicated web portal. After a counter is submitted, the Counter service takes care of the integration and parses the logic at runtime.

Counter service UI

Backend integration

We provide an SDK for quicker and easier integration. Engineers only need to provide the counter identifier (shown in the UI) and the time window for the query. Under the hood, we use gRPC to communicate across services, divide the query time window into smaller granularities, fetch from the different time series tables, and merge the results. We also provide a short-TTL cache layer to absorb irregular client traffic such as network retries or traffic throttling. The service is designed to handle 100K QPS. A simplified sketch of the cache layer is shown below.
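
For illustration only, here is a minimal Python sketch of how a short-TTL cache could sit in front of the counter read path. The function names, TTL value, and counter key are assumptions we made for the example; the real fetch would go over gRPC to the Counter service and merge the per-granularity tables as described above.

import time
from functools import lru_cache

CACHE_TTL_SECONDS = 5  # short TTL, just enough to absorb retries and throttled bursts

def fetch_from_counter_service(counter_id: str, start: int, end: int) -> float:
    """Placeholder for the real gRPC call, which splits [start, end)
    across the pre-aggregated tables and merges the result."""
    return 42.0  # dummy value for the sketch

@lru_cache(maxsize=10_000)
def _cached_read(counter_id: str, window_seconds: int, ttl_bucket: int) -> float:
    # ttl_bucket changes every CACHE_TTL_SECONDS, which is what expires cache entries.
    now = ttl_bucket * CACHE_TTL_SECONDS
    return fetch_from_counter_service(counter_id, now - window_seconds, now)

def read_counter(counter_id: str, window_seconds: int) -> float:
    ttl_bucket = int(time.time()) // CACHE_TTL_SECONDS
    return _cached_read(counter_id, window_seconds, ttl_bucket)

# e.g. cashless payments for a driver-passenger pair over the trailing 24 hours
print(read_counter("fraud:counter:cashless_payment_per_pair:d-42:p-7", 24 * 3600))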

Monitoring the Counter service

The Counter service dashboard helps track human errors while counters are edited in real time. The Counter service sends alerts to a Slack channel to notify users if there is any error.

Counter service dashboard

We set up Datadog to monitor multiple system metrics. The figure below shows a portion of the stream processing and counter writing metrics. In this example, the total stream QPS reaches 5k at peak hour, and about 4k counter writes per second reach the storage tier. These numbers keep climbing, without an upper limit, as more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the resource payload of the Counter service to the offline data lake in order to complete the counter replay function for data scientists. We are working on a project called “time traveler”. As the name indicates, it not only serves online transactional data, but also supports historical data analytics and provides more flexibility for counter inventions and experiments.

There are more automation steps we plan to take, such as adding a replay button on the web portal and hooking into the offline big data engine to trigger analytics jobs. Performance metrics will be collected and displayed on the web portal, so that a single platform can manage both online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are sometimes an input to a particular rule, and one rule usually needs many counters to work together. We need to provide better backend integration with Griffin and minimise the engineering effort currently required to use counters in Griffin. A counter then becomes an automated input variable in Griffin, which any user can configure on the web portal.

About being a Principal Engineer at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/about-being-a-principal-engineer-at-grab

Over the past few years Grab has grown from a small startup to one of the largest technology companies in South-East Asia. Along with the company’s growth, the number of microservices, features and teams also grew substantially. At the time of writing this blog, we have around 350 microservices powering our super-app.

A great engineering team is a critical component of our success. As an engineer you have two career paths in front of you: an individual contributor role, or a management role. While a management role is generally better understood, this article clarifies what it means to be a principal engineer at Grab, which is one of the highest levels of our engineering career ladder.

Engineering Career Ladder

Improving the Quality

“You set the standard for engineering excellence in your technical family. Your architectures are exemplary in terms of efficiency, stability, extensibility, testability and the ability to evolve over time. Your software is robust in the presence of failures, scalable, and cost-effective. Your coding practices are exemplary in terms of code organization, clarity, simplicity, error handling, and documentation. You tackle intrinsically hard problems, acquiring expertise as needed. You decompose complex problems into straightforward solutions.” – Grab’s Engineering Career Ladder

 

So, what does a principal engineer do? As your career progresses from junior to senior to lead engineer, you take on more and more responsibility and manage larger and larger systems. For example, a junior engineer might manage a specific component of a microservice, a senior engineer would design and operate an entire microservice or product, while a lead engineer would typically be concerned with architecture at the team level.

The principal engineer level is akin to a senior manager: instead of indirectly managing people (a manager of managers), you take care of the architecture of an entire sub-organisation, known as a Tech Family or Platform. These Tech Families usually have more than 50 engineers spread across multiple teams and function as a tiny company with their own business owners, designers, product managers, etc.

Challenging Projects

“You take engineering ownership of projects that may require the work of several teams to implement; you divide responsibilities so that each team can work independently and have the system come together into an integrated whole. Your projects often cross team, tech family, platform, and even R&D center boundaries. You solicit differing views and keep an open mind. You are adept at building consensus.” – Grab’s Engineering Career Ladder

 

As a principal engineer, your job is to solve larger problems and translate somewhat vague problems into a set of actionable items. You might be faced with a large problem such as “improve efficiency and interoperability of Grab’s transportation system.” You will need to understand the problem and the business impact, and see how it can be improved. It might require you to design new systems, change existing systems, understand the costs involved, and get the right people together to make it happen.

Solving such a problem all by yourself is pretty much impossible. You have to work with other managers and engineers as a team to make it happen. Help your lead and senior engineers design the right system by giving them a clear objective, but let them take care of the system-level architecture.

You will also need to work with managers, advise them to get things done, and get the right things prioritised by the team. While you don’t need to be well-versed in project management and agile methodologies, you do need to be able to plan ahead with your teams and have an understanding of how much time a project or migration will take.

A Tech Family can easily have 20 or more microservices. You need to have a good understanding of their functional requirements and interactions. This is challenging, as learning new things is always “uncomfortable” and takes time. You must reach out to engineers, product managers, and data scientists, ideally face-to-face, to build empathy. Keep asking questions and try to understand how things work. You will also need to read the existing documentation and their code.

Technical Ownership

“You are the origin of significant technical contributions to our architecture and infrastructure.  You take technical ownership of the design and quality of the security, performance, availability, and operational aspects of the software built by one or more teams. You identify where your time is needed, transitioning between coding, design, and architecture based on project and team needs. You deliver software in ways that empower teams to self-service, providing clear adoption/migration paths.” – Grab’s Engineering Career Ladder

 

As a principal engineer you work together with the Head of Engineering and managers within the Tech Family and improve the quality of systems across the board. Typically, no-one tells you what needs to be done. You need to identify gaps, raise them and keep improving the systems.

You also need to learn how to manage your own time better so you can prioritise effectively. This boils down to knowing your strengths and weaknesses. For example, if you are really good at building distributed systems but have no clue about the latest and greatest designs in information security, get the right InfoSec engineers into that meeting and consider skipping it yourself. Avoid trying to do everything at once or being in every single meeting you get invited to; you still have to review code and designs and stay focused, so plan accordingly.

You will also need to understand the business impact of your decisions. For example, if you contribute to product features, know how impactful this feature is going to be to the organisation. If you don’t know it – ask the Product Manager responsible for it. If you work on a platform feature, for example improving the build system, know how it will help: saving 30 minutes of build time for every engineer each day is a huge achievement.

More often than not, you will have to drive migrations; this is akin to code refactoring, but at a system level, and involves a lot of collaboration with people. Understand what technical debt is and how it can be mitigated: a good architecture minimises technical debt, which in turn accelerates time-to-market and helps the business flourish.

Technical Leadership

“You amplify your impact by leading design reviews for complex software and/or critical features. You probe assumptions, illuminate pitfalls, and foster shared understanding. You align teams toward coherent architectural strategies.” – Grab’s Engineering Career Ladder

 

In Grab we have a process known as RFC (Request For Comments) which allows engineers to submit designs and ideas for a larger audience to debate. This is especially important given that our organisation is spread across several continents, with research and development offices in Southeast Asia, the US, India and China. While any engineer is welcome to comment on these RFCs, it is the duty of lead and principal engineers to review them on a regular basis. This will help you expand your knowledge of existing systems and help others improve their designs.

Communication is a key skill that you need to keep improving, and it is often the Achilles’ heel of many engineers who would rather be doing work in their corner without talking to anyone else. This is perfectly fine for junior (or even some senior) engineers, but it is critical for a principal engineer to communicate. Let’s break this down into a set of specific skills that you’d need to sharpen.

You need to be able to write effectively in order to convey your ideas to others. This includes knowing your audience and wording it in such a way that readers can understand. A technical design document whose audience are engineers is not written the same way as a design proposal whose audience are product and business managers.

You need to be able to publicly present and talk about the various projects that you are working on. This includes creating slide decks with good visuals and distilling months of work down to just a couple of slides. The best way of learning this is to get out there and keep presenting your work; you will get better over time.

You also need to be able to drive meetings and discussions without wasting anyone’s time. As a technical leader, one of your key responsibilities is to get people moving in the same direction and driving consensus during meetings.

Teaching and Learning

“You educate other engineers, both at an individual level and at scale: keeping the engineering community up to date on advanced technical issues, technologies, and trends. Examples include onboarding bootcamps for new hires, interns, specific skill-gap training development, and sharing specialized knowledge to raise the technical bar for other engineers/teams/dev centers.”

 

A principal engineer is a technical leader, and as a leader you have the responsibility to mentor and coach fellow engineers, regardless of their level. In addition to code reviews, you can organise office hours in your team and knowledge sharing sessions where everyone can present something. You could also help with bootcamps and help new hires get up to speed.

Most importantly, you will also need to keep learning in whichever way works for you: reading journals and papers, blog posts, watching video-recorded talks, attending conferences and browsing through a variety of open-source projects. You will also learn from other Grabbers, as even a junior engineer can teach you something; we all have our strengths and weaknesses. Keep improving and working on yourself!

Data first, SLA always

Post Syndicated from Grab Tech original https://engineering.grab.com/data-first-sla-always

Introducing Trailblazer, the Data Engineering team’s solution for implementing change data capture across all upstream databases. In this article, we explain why we needed to move away from periodic batch ingestion towards a real-time solution and show how we achieved this through an end-to-end streaming pipeline.

Context

Our mission as Grab’s Data Engineering team is to fulfill 100% of SLAs for data availability to our downstream users. Our 40 person team is responsible for providing accurate and reliable data to data analysts and data scientists so that they can produce actionable reports that will help Grab’s leadership team make data-driven decisions. We maintain data for a variety of business intelligence tools such as Tableau, Presto and Holistics as well as predictive algorithms for all of Grab.

We ingest data from multiple upstream sources, such as relational databases, Kafka, or third party applications such as Salesforce or Zendesk. The majority of this source data exists in MySQL, and we run ETL pipelines to mirror any updates into our data lake. These pipelines are triggered on an hourly or daily basis and are powered by an in-house Loader application which performs Spark batch ingestion and loading of data from source to sink.

Problems with the Loader application started to surface when Grab’s data exceeded the petabyte threshold. As such, for larger tables, the most practical method to ingest data was to perform ETL only on rows that were updated within a specified timeframe. This is akin to issuing the query:

SELECT * FROM table WHERE updated >= [start_time] AND updated < [end_time]

Now imagine two situations. One, firing this query at a huge table without an updated field. Two, firing the same query at the huge table, this time without indexes on the updated field. In the first scenario, the query will never work and we can never perform incremental ingestion on the table based on a timed window. The second scenario carries the danger of creating high CPU load on the database replica that we are querying from. Neither has an ideal outcome.

One other problem that we identified was the unpredictability of growth in data volume. Tables smaller than one gigabyte were ingested by fully scanning the table and overwriting the data in the data lake. This worked out well for us until the table size increased exponentially, at which point our Spark jobs failed due to JDBC timeouts. If we were only dealing with a handful of tables, this issue could have been addressed by switching our data ingestion strategy from full scan to a timed window.

When assessing the issue, we discovered that there were hundreds of tables running under the full scan strategy, all of them potentially crashing our data system, all time bombs silently waiting to explode.

The team urgently needed a new approach to ETL. Our Loader application was highly coupled to upstream table characteristics. We needed to find solutions that were truly scalable, which meant decoupling our pipelines from the upstream.

Change data capture (CDC)

Much like event sourcing, any logged change to the database is captured and streamed out for downstream applications to consume. This process is lightweight, since any row-level update to the table is instantly captured by a real-time processor, avoiding the need for large chunked queries on the table. In addition, CDC works regardless of the upstream table definition, so we do not need to worry about missing updated columns impacting our data migration process.

Binary Logs (binlogs) are the CDC agents of MySQL. All updates, insertions or deletions performed on the table are captured as a series of logged events containing the past state of the row and its newly modified state. Check out the binlogs reference to find out more.

In order to persist all binlogs generated upstream, our team created a Spark Structured Streaming application called Trailblazer. Trailblazer streams all MySQL binlogs to our data lake. These binlogs serve as a foundation for us to build Presto tables for data auditing and help to remove the direct dependency of our batch ETL jobs to the source MySQL.

Trailblazer is an amalgamation of various data streaming stacks. Binlogs are captured by Debezium, which runs on Kafka Connect clusters. All binlogs are sent to our Kafka cluster, which is managed by the Data Engineering Infrastructure team, and are streamed out to a real-time bucket via a Spark Structured Streaming application. Hourly or daily ETL compaction jobs ingest the change logs from the real-time bucket to materialise tables for downstream users to consume.

CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services

 

Some statistics

To date, we are streaming hundreds of tables across 60 Spark streaming jobs, and with the constant increase in Grab’s database instances, the numbers are expected to keep growing.

Designing Trailblazer streams

We built our streaming application using Spark structured streaming 2.3. Structured streaming was designed to remove the technical aspects of provisioning streams. Developers can focus on perfecting business logic without worrying about fundamentals such as checkpoint management or reading and writing to data sources.

Key architecture for Trailblazer streaming

 

In the design phase, we made sure to follow several key principles that helped in managing our streams.

Checkpoints have to be externally managed

Structured streaming manages checkpoints both in a local directory and in a ‘_metadata’ directory on S3 buckets, such that the state of the stream can be restored in the event of failure and restart.

This is all well and good, with two exceptions. First, changing the starting point of data ingestion meant ssh-ing into the machine and manipulating metadata, which could be extremely dangerous. Second, we could not assume cluster permanence, since clusters can die and be recreated with data erased from their local disks or the distributed file system.

Our solution was a workaround at the application level. All checkpoints are stored in temporary directories with the current timestamp appended to the path (e.g. /tmp/checkpoint/job_A/1560697200/… ). A monotonically increasing timestamp guarantees that the same directory is never reused by new instances of the stream. As a result, we never restore state from the local disk; instead, we store all checkpoints in a highly available Redis cluster, with the Kafka topic as the key and a JSON of partition : offset as the value.

Key

debz-schema-A.schema_A.table_B

Value

{"11":19183566,"12":19295602,"13":18992606[[a]](#cmnt1)[[b]](#cmnt2)[[c]](#cmnt3)[[d]](#cmnt4)[[e]](#cmnt5)[[f]](#cmnt6),"14":19269499,"15":19197199,"16":19060873,"17":19237853,"18":19107959,"19":19188181,"0":19193976,"1":19072585,"2":19205764,"3":19122454,"4":19231068,"5":19301523,"6":19287447,"7":19418871,"8":19152003,"9":19112431,"10":19151479}
Example of how offsets are stored in Redis as Key : Value pairs

 

Fortunately, Structured Streaming provides the StreamingQueryListener class, which we can use to register checkpoints after the completion of each microbatch. A minimal sketch of the Redis bookkeeping is shown below.
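
The following Python sketch illustrates only the save/restore side of this scheme using the redis-py client; the production hook runs on the JVM via StreamingQueryListener, and the hostnames and helper names here are assumptions for the example. The startingOffsets JSON shape is the one the Kafka source accepts (-2 meaning “earliest”).

import json

import redis

r = redis.Redis(host="checkpoint-redis", port=6379)

def save_offsets(topic: str, partition_offsets: dict) -> None:
    """Persist the latest committed offsets after each microbatch,
    keyed by Kafka topic, e.g. {"0": 19193976, "1": 19072585, ...}."""
    r.set(topic, json.dumps(partition_offsets))

def starting_offsets(topic: str) -> str:
    """Build the startingOffsets JSON for the Kafka source, so a new stream
    instance resumes from Redis rather than from a local checkpoint."""
    stored = r.get(topic)
    if not stored:
        return json.dumps({topic: {"0": -2}})  # -2 = earliest for unseen partitions
    offsets = {p: int(o) for p, o in json.loads(stored).items()}
    return json.dumps({topic: offsets})

# df = (spark.readStream.format("kafka")
#       .option("kafka.bootstrap.servers", "broker:9092")
#       .option("subscribe", "debz-schema-A.schema_A.table_B")
#       .option("startingOffsets", starting_offsets("debz-schema-A.schema_A.table_B"))
#       .load())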

Streams must handle 0, 1, or 1 million records

Scalability is at the heart of all well-designed applications. Spark streaming jobs are built for scalability in the face of varying data volumes.

In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing

 

There are a few settings that we can configure to influence the degree of scalability of a streaming app; a minimal configuration sketch follows the list below.

  • spark.dynamicAllocation.enabled=true gives spark autonomy to provision / revoke executors to suit the workload
  • spark.dynamicAllocation.maxExecutors controls the maximum job parallelism
  • maxOffsetsPerTrigger controls the maximum number of messages ingested from Kafka per microbatch
  • trigger controls the duration between microbatches and is a property of the DataStreamWriter class
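
Put together, a job might wire these up roughly as in the PySpark sketch below. The broker address, topic, paths, and numeric values are illustrative, not our production settings.

from pyspark.sql import SparkSession

# Illustrative values only; tune per stream based on observed load.
spark = (SparkSession.builder
         .appName("trailblazer-stream")
         .config("spark.dynamicAllocation.enabled", "true")    # let Spark add/remove executors
         .config("spark.dynamicAllocation.maxExecutors", "20")  # cap job parallelism
         .getOrCreate())

binlogs = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "debz-schema-A.schema_A.table_B")
           .option("maxOffsetsPerTrigger", "100000")  # cap messages per microbatch
           .load())

query = (binlogs.writeStream
         .format("parquet")
         .option("path", "s3://realtime-bucket/table_B/")
         .option("checkpointLocation", "/tmp/checkpoint/job_A/1560697200/")
         .trigger(processingTime="30 seconds")  # duration between microbatches
         .start())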

Data as key health indicator

Scaling the number of streaming jobs without prior collection of performance metrics is a bad idea. There is a high chance that you will discover a dead stream when checking your stream hours after initialization. I’ll cite Murphy’s law as proof.

Thus we vigilantly monitored our data streams. We used tools such as Datadog for metric monitoring, Slack for oncall issue reporting, PagerDuty for urgent cases, and our in-house data auditor as a service (DASH) for reporting count discrepancies between streamed and source data. More details on monitoring are discussed later in this post.

Streams are ephemeral

Streams may die for a hundred and one reasons, so don’t blame yourself or your programming insecurities. Issues with upstream dependencies, such as a node within your Kafka cluster running out of disk space, could lead to partition unavailability which would crash the application. On one occasion, our streaming application was unable to resolve DNS when writing to AWS S3 storage. This amounted to multiple failures within our Spark job that eventually culminated in the termination of the stream.

In this case, allow the stream to shut down gracefully, send out your alerts, and have a mechanism in place to retry the failed stream. We run all streaming jobs on Airflow, and any failure of the stream is automatically retried through a new task issued by the scheduler.

If you have had experience with large scale management of streams, please leave a comment so we can continue this discussion!

Monitoring data streams

Here are some key features that were set up to monitor our streams.

Running : Active jobs ratio

The number of streaming jobs could increase in the future, making it a challenge for the oncall team to track all the jobs that are supposed to be up and running.

One proposal is to track the number of jobs in production against the number of jobs that are actually running. By querying MySQL tables, we can filter out all the jobs that are meant to be active. Since Trailblazer streams are spark-submit jobs managed by YARN, we can query YARN’s resource manager REST API to retrieve all the jobs that are running. We then construct a ratio of running : active jobs and report it to Datadog. If the ratio is not 1 for an extended duration, an alert is issued for the oncall to take action. A simplified sketch of this check follows.
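
As a rough illustration, the check could look like the Python sketch below. The ResourceManager endpoint is YARN’s standard /ws/v1/cluster/apps API; the host, job-name prefix, active-job source, and the commented Datadog gauge are assumptions for the example.

import requests

def running_job_count(rm_host: str) -> int:
    """Count Trailblazer spark-submit apps currently RUNNING in YARN."""
    resp = requests.get(f"http://{rm_host}:8088/ws/v1/cluster/apps",
                        params={"states": "RUNNING"})
    apps = (resp.json().get("apps") or {}).get("app", []) or []
    return sum(1 for app in apps if app["name"].startswith("trailblazer-"))

def active_job_count() -> int:
    """Jobs that are *meant* to be running, e.g. read from a MySQL config table.
    Hard-coded here for the sketch."""
    return 60

ratio = running_job_count("yarn-rm.internal") / active_job_count()
# statsd.gauge("trailblazer.running_active_ratio", ratio)  # reported to Datadog
print(ratio)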

If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert

 

Microbatch runtime

We define a 30 second window for each microbatch and track the actual runtime using metrics reported by the query listener. A runtime that exceeds the designated window is a potential indicator that the streaming job is deprived of resources and needs to be scaled up.

Job liveness

Each job reports its health by emitting a heartbeat count of 1. This heartbeat is created at the end of every microbatch via a query listener. This is useful for detecting stale jobs (jobs that are registered as RUNNING in YARN but are actually hung).

Kafka offset divergence

In order to ensure that the message output rate to the consumer exceeds the message input rate from the producer, we sum up all presently ingested topic-partition offsets and compare that value to the sum of all topic-partition end offsets in Kafka. We then add an alerting logic on top of these metrics to inform the oncall team if the difference between the two values grows too big.

It is important to track the offset divergence because streams can lag. Should the rate of consumption fall below the rate of message production, we would run the risk of falling outside Kafka’s retention window, leading to data loss. A simplified sketch of this check is shown below.
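
For illustration, here is a small Python sketch that compares Kafka end offsets with the offsets already ingested (read back from the Redis checkpoints described earlier). The kafka-python client, hostnames, and commented metric name are assumptions for the example, not the production implementation.

import json

import redis
from kafka import KafkaConsumer, TopicPartition

r = redis.Redis(host="checkpoint-redis", port=6379)
consumer = KafkaConsumer(bootstrap_servers="broker:9092")

def offset_divergence(topic: str) -> int:
    """Sum of Kafka end offsets minus sum of offsets we have already ingested."""
    partitions = [TopicPartition(topic, p)
                  for p in (consumer.partitions_for_topic(topic) or set())]
    end_total = sum(consumer.end_offsets(partitions).values())
    ingested = json.loads(r.get(topic) or "{}")
    ingested_total = sum(int(offset) for offset in ingested.values())
    return end_total - ingested_total

lag = offset_divergence("debz-schema-A.schema_A.table_B")
# statsd.gauge("trailblazer.offset_divergence", lag)  # alert if this keeps growing
print(lag)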

Hourly data checks

DASH runs hourly and serves as our first line of defence for detecting data quality issues within the streams. We issue queries to the source database and to our streaming layer to confirm that the ID counts of data created within the last hour match (a simplified sketch follows the next paragraph).

DASH helps in the early detection of upstream issues. We have noticed cases where our Debezium connectors failed and our checker reported fewer data than expected since there were no incoming messages to Kafka.
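
A simplified sketch of such an hourly check might look like the following, assuming a Spark job with JDBC access to a MySQL replica; the table layout, replica host, and Slack webhook URL are placeholders, not the actual DASH implementation.

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dash-hourly-check").getOrCreate()

def hourly_counts_match(table: str, hour_start: str, hour_end: str) -> bool:
    """Compare the ID count in the source MySQL replica with the streamed data lake copy."""
    count_sql = (f"(SELECT COUNT(id) AS c FROM {table} "
                 f"WHERE created_at >= '{hour_start}' AND created_at < '{hour_end}') AS t")
    source = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://replica-host:3306/schema_A")
              .option("dbtable", count_sql)
              .load()).first()["c"]
    lake = spark.sql(
        f"SELECT COUNT(id) AS c FROM lake.{table} "
        f"WHERE created_at >= '{hour_start}' AND created_at < '{hour_end}'").first()["c"]
    if source != lake:
        # Hypothetical Slack webhook for mismatch reporting.
        requests.post("https://hooks.slack.com/services/T000/B000/XXX",
                      json={"text": f"DASH mismatch on {table}: source={source}, lake={lake}"})
    return source == lake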

DASH matches and mismatches reported to Slack

 

Materializing tables through compaction

Having CDC data in our data lake does not conclude our responsibilities. Batched compaction applies all the captured CDC events and makes them available as Presto tables for downstream consumption. The job triggers hourly and processes all changes to the database from the past hour. For example, changes to a record are visible in real time, but the latest state of the record is not reflected in the materialised table until the next batch job runs. We addressed several issues during this phase.

Deduplication of data

Trailblazer was not built to deliver exactly-once guarantees. We ensure that duplicated CDC events are addressed during compaction.

Availability of all data up to a certain hour

We want to make sure that downstream pipelines use the output of the hourly batch job only when it contains all records for that hour. If an event is processed late by streaming, the pipeline waits until the data is complete. In this case, we consciously choose consistency over availability for our downstream users. For example, missing a few insert booking records in peak hours due to consumer processing delay can generate wrong downstream results, leading to miscalculated revenue. We want to start downstream processes only when the data for the hour or day is complete.

Need for latest state of each event

Our compaction job performs upserts on the data to ensure that our downstream users consume records in their latest state. A simplified PySpark sketch of the dedup-and-latest-state logic is shown below.
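
The sketch keeps only the latest CDC event per primary key and then merges it into the existing snapshot. The paths, column names (id, binlog_position, the Debezium-style op flag), and assumption that snapshot and upsert schemas match after dropping CDC metadata are ours for illustration; the real compaction job is more involved.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("trailblazer-compaction").getOrCreate()

# Raw CDC events streamed to the real-time bucket within the past hour.
changes = spark.read.parquet("s3://realtime-bucket/table_B/dt=2019-06-16/hour=14/")

# Deduplicate: keep only the latest event per primary key.
latest = Window.partitionBy("id").orderBy(col("binlog_position").desc())
upserts = (changes
           .withColumn("rn", row_number().over(latest))
           .filter(col("rn") == 1)
           .drop("rn"))

# Rows to (re)insert: latest non-delete versions, with CDC metadata dropped.
new_rows = upserts.filter(col("op") != "d").drop("op", "binlog_position")

# Merge with the existing snapshot: changed keys are replaced, deleted keys drop out.
snapshot = spark.read.parquet("s3://lake/table_B/")
compacted = (snapshot.join(upserts.select("id"), on="id", how="left_anti")
             .unionByName(new_rows))

compacted.write.mode("overwrite").parquet("s3://lake/table_B_new/")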

Future applications

Trailblazer is a milestone for the Data Engineering team as it represents our commitment to large-scale data streams that reduce latencies for our end users. Moving ahead, our team will explore how to further optimise streaming jobs by analysing data trends over time, and build applications such as snapshot tables on top of the CDC events being streamed into our data lake.

Save Your Place with Grab!

Post Syndicated from Grab Tech original https://engineering.grab.com/save-your-place-with-grab

Do you find it tedious to type and search for your destination, or have a hard time remembering the address of the friend you are going to meet? It can be really confusing to keep track of the many addresses you frequent on a regular basis. To solve this pain point, Grab rolled out a new feature called Saved Places in January 2019 across Southeast Asia.

With Saved Places, you can save an address and also add a label like “Home”, “Work”, “Gym”, etc which makes finding and selecting an address for booking a ride or ordering your favourite food a breeze!

Never forget your addresses again!

To use the feature, fire up your Grab app, head to the “Saved Places” section on the app navigation bar, and start adding all your favourite destinations such as your home, your office, your favourite mall or the airport, and you are done with the hassle of typing them again.

Save your place with Grab!

 

Hola! Your saved addresses are now just a click away when ordering a ride or your favourite meal.

Inspiration behind the work

We at Grab continuously engage with our customers to understand how we can outserve them better. Difficulty in choosing the correct address was one of the key pieces of feedback shared by our customers. Our drivers shared funny stories about places that have similar names but completely different locations, e.g. Sime Road is in Bukit Timah but Simei Road is in Simei, almost 20 km away; Nicoll Highway is in Kallang but Nicoll Drive is in Changi, almost 20 km away. In such cases, even though users use the address frequently, there remains scope for misselection.

Data-Driven Decisions

Our vast repository of data and insights has helped us identify and solve some challenging problems. Our analysis of millions of transport bookings and food orders revealed that customers usually visit five to seven unique locations and order food at one or two addresses.

One intriguing yet intuitive insight was a set pattern in users’ booking behaviour during weekdays. A set of passengers mostly commute between two addresses, probably going to the office in the morning and coming back home in the evening. These identifiable clusters of pick-up and drop-off locations during peak hours supported our hypothesis that users rely on a small set of locations for their Grab bookings. The pictures below show such clusters in Singapore and Jakarta, where passengers generally commute to and fro in the morning and evening respectively.

Save your place with Grab!

 

This insight also motivated us to test the concept of user-created labels, which allows users to mark their saved places with their own labels like “Home”, “Changi Airport”, “Sis’s House”, etc. Initial experiment results were extremely encouraging and we saw significantly higher usage and repeat rates from users.

A group of cross-functional teams (Analytics, Product, Design, Engineering, etc.) came together, worked backwards from the customer, brainstormed multiple ideas, and finalised a product approach. We then went on to conduct in-depth user research and usability testing to ensure that the final product met user expectations and was easy to understand and use.

And users love it!

Since the launch, we have seen significant user adoption of the feature. More than 14 million users have saved close to 45 million places. That’s ~3 places per user!

Customers from Singapore and Myanmar tend to save around 3 addresses each, whereas customers from Indonesia, Malaysia, the Philippines, Thailand, Vietnam and Cambodia save 2 addresses each. One customer from Indonesia has saved a whopping 1,191 addresses!

Users across South East Asia have adopted the feature and as of today, a significant portion of our bookings are made using a saved place for either pickup or drop off. If you were curious, here are the most frequently used labels for saving addresses in Singapore (left) and Indonesia (right):

Save your place with Grab!

 

Apart from saving home and office addresses, our customers are also saving their children’s school addresses and places of worship. Some are also saving their favourite shopping destinations.

Another observation, as some may have guessed, concerns the clustering of home addresses. Home addresses in Singapore are evenly scattered across the island (upper-left map), whereas in Jakarta they are concentrated in specific pockets of the city (lower-left map). Office addresses, however, are concentrated in specific areas in both cities: the CBD and Changi area in Singapore (upper-right map) and central Jakarta (lower-right map).

Save your place with Grab!

 

This is only the beginning

We’re constantly striving to improve the user experience with Grab and make it as seamless as possible. We have only taken the first steps with Saved Places and the path forward involves deeper understanding of user behaviour with the help of saved places data to create a more personalised experience. This is just the beginning and we’re planning to launch some very innovative features in the coming months.

No More Forgetting to Input ERP Charges – Hello Automated ERP!

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-erp-charges

ERP, standing for Electronic Road Pricing, is a system used to manage road congestion in Singapore. Drivers are charged when they pass through ERP gantries during peak hours. ERP rates vary for different roads and time periods based on the traffic conditions at the time. This encourages people to change their mode of transport, travel route or time of travel during peak hours. ERP is seen as an effective measure in addressing traffic conditions and ensuring drivers continue to have a smooth journey.

Did you know that Singapore has a total of 79 active ERP gantries? Did you also know that every ERP gantry changes its fare 10 times a day on average? For example, total ERP charges for a journey from Ang Mo Kio to Marina will cost $10 if you leave at 8:50am, but $4 if you leave at 9:00am on a working day!

Imagine how troublesome it would have been for Grab’s driver-partners who, on top of having to drive and check navigation, would also have had to remember each and every gantry they passed, calculate their total fare, and then manually add the charges to the total ride cost at the end of the ride.

In fact, based on our driver-partners’ feedback, missing out on ERP charges was listed as one of their top pain points. Not only did drivers find the entire process troublesome, it also led to earnings loss as they would have had to bear the cost of the ERP fares themselves.

We’re glad to share that, as of 15th March 2019, we’ve successfully resolved this pain point for our driver-partners by introducing automated ERP fare calculation!

So, how did we achieve automating the ERP fare calculation for our drivers-partners? How did we manage to reduce the number of trips where drivers would forget to enter ERP fare to almost zero? Read on!

How we approached the Problem

The question we wanted to answer was: how do we create an impactful feature that gives driver-partners one less thing to handle while they drive?

We started by looking at the problem at hand. ERP fares in Singapore are very dynamic; they change based on the day and time.

Caption: Example of ERP fare changes on a normal weekday in Singapore

 

We wanted to create a system which can identify the dynamic ERP fares at any given time and location, while simultaneously identifying when a driver-partner has passed through any of these gantries.

However, that wasn’t enough. We wanted this feature to be scalable to every country where Grab operates, like Indonesia, Thailand, Malaysia, the Philippines, and Vietnam. We started studying the ERP (or toll, as it is known locally) systems in other countries and realised that every country has its own style of calculating tolls. While in Singapore ERP charges for cars and taxis are the same, Malaysia applies different charges for cars and taxis. Similarly, Vietnam has different tolls for 4-seaters and 7-seaters. Indonesia and Thailand have paired gantries where you pay at only one of the two: if you pass through gantry A, you won’t need to pay at gantry B, and vice versa. This is where our Ops team came to the rescue!

Boots on the Ground!

Collecting all the ERP or toll data for every country is no small feat, recalls Robinson Kudali, program manager for the project. “We had around 15 people travelling across the region for 2-3 weeks, working on collecting data from every possible source in every country.”

Getting the right geographical coordinates for every gantry is very important. We track driver GPS pings frequently, identify the nearest road to that GPS ping and check the presence of a gantry using its coordinates. The entire process requires you to be very accurate; incorrect gantry location can easily lead to us miscalculating the fare.

Bayu Yanuaragi, our regional mapops lead, explains – “To do this, the first step was to identify all toll gates for all expressways & highways in the country. The team used various mapping software to locate and plot all entry & exit gates using map sources, open data and more importantly government data as references. Each gate was manually plotted using satellite imagery and aligned with our road layers in order to extract the coordinates with a unique gantry ID.”

Location precision is vital in creating the dataset as it dictates whether a toll gate will be detected by the Grab app or not. Next step was to identify the toll charge from one gate to another. Accuracy of toll charge per segment directly reflects on the fare that the passenger pays after the trip.

Caption: ERP gantries visualisation on our map – The purple bars are the gantries that we drew on our map

 

Once the data compilation was done, the team would then conduct fieldwork to verify its integrity. If data gaps were identified, modifications would be made accordingly.

Upon submission of the output, stack engineers would perform a higher-level quality check of the content in staging.

Lastly, we worked with a local team of driver-partners who volunteered to make sure the new system is fully operational and the prices are correct. Inconsistencies observed were reported by these driver-partners, and then corrected in our system.

Closing the loop

Creating a strong dataset did help us predict fares correctly, but we also needed something that could handle the dynamic behaviour of changing toll statuses. For example, the Singapore government revises ERP fares every quarter, and there can also be ad-hoc changes such as activating or deactivating gantries on an ongoing basis.

Garvee Garg, Product Manager for this feature explains: “Creating a system that solves the current problem isn’t sufficient. Your product should be robust enough to handle all future edge case scenarios too. Hence we thought of building a feedback mechanism with drivers.”

In case our ERP fare estimate isn’t correct or there are on-ground changes to ERP gantries, our driver-partners can provide feedback to us. This feedback flows directly to the Customer Experience team, who do the initial investigation, and from there to our Ops team. A dedicated person from the Ops team checks the validity of the feedback and recommends updates. On average, it takes only one day to update the data after we receive feedback from a driver-partner.

However, validating the driver feedback was a time-consuming process. We needed a tool that could make the Ops team’s life easier by helping them debug each and every case.

Hence the ERP Workflow tool came into the picture.

99% of the time, feedback from our driver-partners is about error cases. When feedback comes in, the tool allows the Ops team to check the driver’s entire ride history and map the ride trajectory against all the underlying ERP gantries at that particular point in time. The Ops team can then identify whether the ERP fare calculated by our system, or the one reported by the driver, is correct.

This is only the beginning

By creating a system that can automatically calculate and key in ERP fares for each trip, Grab is proud to say that our driver-partners can now drive with less hassle and focus more on the road, bringing the ride experience and safety for both drivers and passengers to a new level!

The Automated ERP feature is currently live in Singapore and we are now testing it with our driver-partners in Indonesia and Thailand. Next up, we plan to pilot it in the Philippines and Malaysia, and soon in every country where Grab operates, so stay tuned for even more innovative ideas to enhance your experience on our super app!

To know more about what Grab has been doing to improve the convenience and experience for both our driver-partners and passengers, check out other stories on this blog!