Tag Archives: data

Search indexing optimisation

Post Syndicated from Grab Tech original https://engineering.grab.com/search-indexing-optimisation

Modern applications commonly use several database engines, each serving a specific need. At Grab Deliveries, a MySQL database (DB) stores the canonical form of data, while ElasticSearch (ES) provides advanced search capabilities. MySQL serves as the primary storage for raw data, and ES as the derived storage.

Search data flow

Keeping the two stores in sync takes continuous effort. In this post, we introduce a series of techniques for optimising incremental search data indexing.

Background

The synchronisation of data from the primary data storage to the derived data storage is handled by Food-Puxian, a Data Synchronisation Platform (DSP). In a search service context, it is the synchronisation of data between MySQL and ES.
The data synchronisation process is triggered on every real-time data update to MySQL, which streams the updated data to Kafka. DSP consumes the list of Kafka streams and incrementally updates the respective search indexes in ES. This process is also known as Incremental Sync.

Kafka to DSP

DSP uses Kafka streams to implement Incremental Sync. A stream represents an unbounded, continuously updating data set: an ordered, replayable, fault-tolerant sequence of immutable data records.

Data synchronisation process using Kafka

The above diagram depicts the process of data synchronisation using Kafka. The Data Producer creates a Kafka stream event for every operation performed on MySQL and sends it to Kafka in real time. DSP creates a stream consumer for each Kafka stream; each consumer reads data updates from its respective stream and synchronises them to ES.

MySQL to ES

Indexes in ES correspond to tables in MySQL. MySQL data is stored in tables, while ES data is stored in indexes. Multiple MySQL tables are joined to form an ES index. The diagram below shows the entity-relationship mapping in MySQL and ES. Entity A has a one-to-many relationship with entity B. Entity A has multiple associated tables in MySQL, tables A1 and A2, which are joined into a single ES index A.

ER mapping in MySQL and ES

Sometimes a search index contains both entity A and entity B. In a keyword search query on this index, e.g. “Burger”, objects from both entity A and entity B whose name contains “Burger” are returned in the search response.

Original Incremental Sync

Original Kafka streams

The Data Producers create a Kafka stream for every MySQL table in the ER diagram above. Every time there is an insert/update/delete operation on the MySQL tables, a copy of the data after the operation executes is sent to its Kafka stream. DSP creates different stream consumers for every Kafka stream since their data structures are different.

Stream Consumer infrastructure

Stream Consumer consists of 3 components.

  • Event Dispatcher: Listens and fetches events from the Kafka stream, pushes them to the Event Buffer and starts a goroutine to run Event Handler for every event whose ID does not exist in the Event Buffer
  • Event Buffer: Caches events in memory by the primary key (aID, bID, etc). An event is cached in the Buffer until it is picked by a goroutine or replaced when a new event with the same primary key is pushed into the Buffer.
  • Event Handler: Reads an event from the Event Buffer and handles it; it runs in the goroutine started by the Event Dispatcher.
Stream consumer infrastructure

Event Buffer procedure

Event Buffer consists of many sub buffers, each with a unique ID which is the primary key of the event cached in it. The maximum size of a sub buffer is 1. This allows Event Buffer to deduplicate events having the same ID in the buffer.
The below diagram shows the procedure of pushing an event to the Event Buffer. When a new event is pushed to the buffer, the old event sharing the same ID is replaced and is therefore never handled.
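
To make this concrete, here is a minimal Go sketch of an Event Buffer with size-1 sub buffers; the type and method names are our own illustration, not DSP's actual code.

```go
package buffer

import "sync"

// Event is a simplified stream event; ID is the primary key (aID, bID, etc.).
type Event struct {
	ID      string
	Payload []byte
}

// EventBuffer keeps at most one pending event per ID (a sub buffer of size 1),
// so a newer event replaces an older, not-yet-handled one with the same ID.
type EventBuffer struct {
	mu     sync.Mutex
	events map[string]Event
}

func NewEventBuffer() *EventBuffer {
	return &EventBuffer{events: map[string]Event{}}
}

// Push caches an event, replacing any pending event with the same ID.
// It reports whether an old event was replaced (and therefore never handled).
func (b *EventBuffer) Push(e Event) (replaced bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	_, replaced = b.events[e.ID]
	b.events[e.ID] = e
	return replaced
}

// Pop removes and returns the pending event for id, if any; the goroutine
// started by the Event Dispatcher calls this before handling the event.
func (b *EventBuffer) Pop(id string) (Event, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	e, ok := b.events[id]
	if ok {
		delete(b.events, id)
	}
	return e, ok
}
```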

Pushing an event to the Event Buffer

Event Handler procedure

The below flowchart shows the procedures executed by the Event Handler. It consists of the common handler flow (in white) and additional procedures for object B events (in green). After creating a new ES document from data loaded from the database, the handler gets the original document from ES and compares the two to decide whether any field has changed and the new document needs to be sent to ES.
When an object B event is handled, on top of the common handler flow, the update is also cascaded to the related object A in the ES index. We call this kind of operation a Cascade Update.

Procedures executed by the Event Handler

Issues in the original infrastructure

Data in an ES index can come from multiple MySQL tables as shown below.

Data in an ES index

The original infrastructure came with a few issues.

  • Heavy DB load: Consumers read from Kafka streams, treat stream events as notifications, then use the IDs to load data from the DB to create new ES documents. The data in the stream events is not well utilised. Loading data from the DB for every document creation results in heavy traffic, and the DB becomes a bottleneck.
  • Data loss: Producers send data copies to Kafka in application code. Data changes made via MySQL command-line tool (CLT) or other DB management tools are lost.
  • Tight coupling with MySQL table structure: If producers add a new column to an existing table in MySQL and this column needs to be synchronised to ES, DSP is not able to capture the data changes of this column until the producers make the code change and add the column to the related Kafka Stream.
  • Redundant ES updates: ES data is a subset of MySQL data. Producers publish data to Kafka streams even if changes are made on fields that are not relevant to ES. These stream events that are irrelevant to ES would still be picked up.
  • Duplicate cascade updates: Consider a case where the search index contains both object A and object B. A large number of updates to object B are created within a short span of time. All the updates will be cascaded to the index containing both objects A and B. This will bring heavy traffic to the DB.

Optimised Incremental Sync

MySQL Binlog

MySQL binary log (Binlog) is a set of log files that contain information about data modifications made to a MySQL server instance. It contains all statements that update data. There are two types of binary logging:

  • Statement-based logging: Events contain SQL statements that produce data changes (inserts, updates, deletes).
  • Row-based logging: Events describe changes to individual rows.

The Grab Caspian team (Data Tech) has built a Change Data Capture (CDC) system based on MySQL row-based Binlog. It captures all the data modifications made to MySQL tables.

Current Kafka streams

The Binlog stream event definition is a common data structure with three main fields: Operation, PayloadBefore and PayloadAfter. The Operation enums are Create, Delete, and Update. Payloads are the data in JSON string format. All Binlog streams follow the same stream event definition. Leveraging PayloadBefore and PayloadAfter in the Binlog event makes it possible to optimise incremental sync in DSP.

Binlog stream event main fields
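
In Go terms, this common event definition could look like the following sketch; the type and JSON field names are assumptions for illustration, not the actual Grab schema.

```go
package binlog

// Operation enumerates the Binlog change types.
type Operation string

const (
	OperationCreate Operation = "CREATE"
	OperationUpdate Operation = "UPDATE"
	OperationDelete Operation = "DELETE"
)

// Event is the common Binlog stream event: the operation plus the row
// image before and after the change, both as JSON strings.
type Event struct {
	Operation     Operation `json:"operation"`
	PayloadBefore string    `json:"payloadBefore"`
	PayloadAfter  string    `json:"payloadAfter"`
}
```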

Stream Consumer optimisations

Event Handler optimisations

Optimisation 1

Recall the redundant ES updates issue mentioned above: the ES data is a subset of the MySQL data. The first optimisation is to filter out irrelevant stream events by checking whether any of the fields that differ between PayloadBefore and PayloadAfter belong to the ES data subset.
Since the payloads in the Binlog event are JSON strings, a data structure only with fields that are present in ES data is defined to parse PayloadBefore and PayloadAfter. By comparing the parsed payloads, it is easy to know whether the change is relevant to ES.
The below diagram shows the optimised Event Handler flows. As shown in the blue flow, when an event is handled, PayloadBefore and PayloadAfter are compared first, and the event is processed only if they differ. Since irrelevant events are filtered out, it is no longer necessary to get the original document from ES.
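
A sketch of this relevance check in Go, assuming a hypothetical index with name and price fields: both payloads are parsed into a struct holding only the ES-relevant fields and then compared.

```go
package indexing

import "encoding/json"

// esFields holds only the fields present in the ES index;
// Name and Price are hypothetical examples.
type esFields struct {
	Name  string `json:"name"`
	Price int64  `json:"price"`
}

// isRelevant reports whether a Binlog event changes any ES-relevant field.
// Events for which it returns false can be dropped without touching ES or the DB.
func isRelevant(payloadBefore, payloadAfter string) (bool, error) {
	var before, after esFields
	if err := json.Unmarshal([]byte(payloadBefore), &before); err != nil {
		return false, err
	}
	if err := json.Unmarshal([]byte(payloadAfter), &after); err != nil {
		return false, err
	}
	return before != after, nil
}
```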

Event Handler optimisation 1

Achievements

  • No data loss: Changes made via MySQL CLT or other DB management tools are captured.
  • No dependency on MySQL table definition: All the data is in JSON string format.
  • No redundant ES updates and DB reads.
  • ES read traffic reduced by 90%: There is no longer a need to get the original document from ES to compare with the newly created one.
  • 55% irrelevant stream events filtered out
  • DB load reduced by 55%
ES event updates for optimisation 1

Optimisation 2

The PayloadAfter field in the event provides the updated data. This made us question whether a completely new ES document, built from data read across several MySQL tables, is really needed each time. The second optimisation is to switch to partial updates using the data differences from the Binlog event.
The below diagram shows the Event Handler procedure flow with a partial update. As shown in the red flow, instead of creating a new ES document for each event, the handler first checks whether the document exists. If it does, which is the case most of the time, only the fields that changed in this event, derived by comparing PayloadBefore and PayloadAfter, are applied to the existing ES document.
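
A sketch of deriving the partial-update body in Go: unmarshal both payloads into generic maps and keep only the changed fields, which can then be sent to ES as a partial update. The field handling is simplified for illustration.

```go
package indexing

import (
	"encoding/json"
	"reflect"
)

// partialDoc returns only the fields whose values differ between the
// before and after payloads; the result is the body of an ES partial update.
func partialDoc(payloadBefore, payloadAfter string) (map[string]interface{}, error) {
	var before, after map[string]interface{}
	if err := json.Unmarshal([]byte(payloadBefore), &before); err != nil {
		return nil, err
	}
	if err := json.Unmarshal([]byte(payloadAfter), &after); err != nil {
		return nil, err
	}
	changed := map[string]interface{}{}
	for field, value := range after {
		if !reflect.DeepEqual(before[field], value) {
			changed[field] = value
		}
	}
	return changed, nil
}
```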

Event Handler optimisation 2

Achievements

  • Change most ES relevant events to partial update: Use data in stream events to update ES.
  • ES load reduced: Only fields that have been changed will be sent to ES.
  • DB load reduced: A further 80% reduction on top of Optimisation 1.
ES event updates for optimisation 2

Event Buffer optimisation

The size of each sub buffer in the Event Buffer is 1. In this optimisation, the stream event is no longer treated as a notification; we use the payloads in the event to perform partial updates, so the old procedure of simply replacing old events is no longer suitable for the Binlog stream. Instead of replacing the old event, we merge the new event with the old one when the new event is pushed to the Event Buffer.
When the Event Dispatcher pushes a new event B to a non-empty sub buffer already holding event A, it merges A and B into a new Binlog event C, whose PayloadBefore comes from A and whose PayloadAfter comes from B.
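
The merge itself is simple; a sketch reusing the Event struct from the earlier Binlog sketch:

```go
// merge combines the buffered event A with a newly arrived event B sharing
// the same primary key into event C: PayloadBefore comes from A and
// PayloadAfter from B, so C spans both changes in a single event.
func merge(a, b Event) Event {
	return Event{
		Operation:     b.Operation,
		PayloadBefore: a.PayloadBefore,
		PayloadAfter:  b.PayloadAfter,
	}
}
```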

Merge operation for Event Buffer optimisation

Cascade Update optimisation

Optimisation

Use a new stream to handle cascade update events.
When the producer sends data to the Kafka stream, records sharing the same ID are stored in the same partition. Every DSP service instance has only one stream consumer, and each Kafka partition is consumed by exactly one consumer. So the Cascade Update events sharing the same ID are consumed by one stream consumer on the same EC2 instance. With this mechanism, the in-memory Event Buffer is able to deduplicate most of the Cascade Update events sharing the same ID.
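
This co-location relies on standard key-based partitioning: events keyed by the same ID hash to the same partition. A sketch of the idea (Kafka's default partitioner uses a different hash, so this is illustrative only):

```go
package cascade

import "hash/fnv"

// partitionFor mimics key-based partitioning: events keyed by the same
// object ID always map to the same partition, and therefore to the same
// consumer instance, whose in-memory Event Buffer can deduplicate them.
func partitionFor(objectID string, numPartitions uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(objectID))
	return h.Sum32() % numPartitions
}
```
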
The flowchart below shows the optimised Event Handler procedure. The original flow is highlighted in green, while the current flow with Cascade Update events is highlighted in purple.
When handling an object B event, instead of cascading the update to the related object A directly, the Event Handler sends a Cascade Update event to the new stream. The consumer of the new stream handles the Cascade Update event and synchronises the data of object A to ES.

Event Handler with Cascade Update events

Achievements

  • Cascade Update events deduplicated by 80%
  • DB load introduced by Cascade Updates reduced
Cascade Update events

Summary

In this article, four different DSP optimisations were explained. After switching to the MySQL Binlog streams provided by the Coban team and optimising the Stream Consumer, DSP reduced DB reads by about 91% and ES reads by about 90%, while the average queries per second (QPS) of stream traffic processed by the Stream Consumer increased from 200 to 800, with peaks above 1,000 at peak hours. With a higher QPS, the processing duration and the latency of synchronising data from MySQL to ES were reduced. The data synchronisation ability of DSP has greatly improved after optimisation.


Special thanks to Jun Ying Lim and Amira Khazali for proofreading this article.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Protecting Personal Data in Grab’s Imagery

Post Syndicated from Grab Tech original https://engineering.grab.com/protecting-personal-data-in-grabs-imagery

Image Collection Using KartaView

A few years ago, we recognised a strong demand to better understand the streets where our drivers and clients go, with the purpose of better fulfilling their needs and quickly adapting to the rapidly changing environment of Southeast Asian cities.

One way to fulfil that demand was to create an image collection platform named KartaView which is Grab Geo’s platform for geotagged imagery. It empowers collection, indexing, storage, retrieval of imagery, and map data extraction.

KartaView is a public, partially open-sourced product, used both internally and externally by the OpenStreetMap community and other users. As of 2021, KartaView has public imagery in over 100 countries with various coverage degrees, and 60+ cities of Southeast Asia. Check it out at www.kartaview.com.

Figure 1 – KartaView platform

Why Image Blurring is Important

The collected images contain many incidental people and licence plates, whose privacy is a serious concern. We deeply respect their privacy and consequently use image obfuscation as the most effective anonymisation method for ensuring privacy protection.

Because manually annotating the regions in the picture where faces and licence plates are located is impractical, this problem should be solved using machine learning and engineering techniques. Hence we detect and blur all faces and licence plates which could be considered as personal data.

Figure 2 – Sample blurred picture

In our case, we have a wide range of picture types: regular planar, very wide and 360 pictures in equirectangular format collected with 360 cameras. Also, because we are collecting imagery globally, the vehicle types, licence plates, and human environments are quite diverse in appearance, and are not handled well by off-the-shelf blurring software. So we built our own custom blurring solution which yielded higher accuracy and better cost-efficiency overall with respect to blurring of personal data.

Figure 3 – Example of equirectangular image where personal data has to be blurred

Behind the scenes, KartaView runs a set of services that derive useful information from the pictures, such as image quality, traffic signs, and roads. Many of them use deep learning algorithms, which could potentially be negatively affected by running them on blurred pictures. In fact, based on the assessment we have done so far, the impact is extremely low, similar to the one reported in a well-known study of face obfuscation in ImageNet [9].

Outline of Grab’s Blurring Process

Roughly, the processing steps are the following:

  1. Transform each picture into a set of planar images. In this way, we further process all pictures, whatever the format they had, in the same way.
  2. Use an object detector able to detect all faces and licence plates in a planar image having a standard field of view.
  3. Transform the coordinates of the detected regions into original coordinates and blur those regions.
Figure 4 – Picture’s processing steps [8]

In the following sections, we describe in detail the interesting aspects of the second step, sharing the challenges and how we solved them. Let’s start with the first and most important part: the dataset.

Dataset

Our current dataset consists of images from a wide range of cameras, including normal perspective cameras from mobile phones, wide field of view cameras and also 360 degree cameras.

It is the result of a series of data collections contributed by Grab’s data tagging teams and contains the 2 label classes of interest to us: FACE and LICENSE_PLATE.

The data was collected using Grab internal tools and stored in queryable databases. This makes it possible to revisit the data and correct it if necessary, and lets data engineers select and filter the data of interest.

Dataset Evolution

Each iteration of the dataset was made to address issues discovered while using the models in a production environment and observing situations where they lacked performance.

                 Dataset v1   Dataset v2   Dataset v3
Nr. of images        15,226       17,636       30,538
Nr. of labels        64,119       86,676      242,534

The first version was basic, built with a rough tagging strategy, and we quickly noticed that it did not cover a special situation that appeared with the pandemic: people wearing masks.

This led to another round of data annotation to include those scenarios.
The third iteration addressed a broader range of issues:

  • Small regions of interest (objects far away from the camera)
  • Objects in very dark backgrounds
  • Rotated objects or even upside down
  • Variation of the licence plate design due to images from different countries and regions
  • People wearing masks
  • Faces in mirrors, for example the rear-view mirror of a motorcycle
  • The main reason, however, was a scenario where the recording, at the start or end (but not only), had close-ups of the operator checking the camera. This led to images with large regions of interest containing the camera operator’s face – too large to be detected by the model.

An investigation into the dataset structure, splitting the data into bins based on the bounding-box (bbox) sizes in pixels, made one thing clear: the dataset was unbalanced.

We made bins for tag sizes with a stride of 100 pixels, going up to the maximum present in the dataset, which was a single sample of size 2,000 pixels. The majority of the labels were small, and the larger the size, the fewer tags we had. This made it clear that we needed more targeted annotations to balance the dataset.
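
A sketch of this binning in Go; sizes are bounding-box sizes in pixels:

```go
package dataset

// binCounts buckets bounding-box sizes (in pixels) into bins of the given
// stride, e.g. stride 100 gives bin 0 = [0,100), bin 1 = [100,200), ...
func binCounts(sizes []int, stride int) map[int]int {
	counts := map[int]int{}
	for _, size := range sizes {
		counts[size/stride]++
	}
	return counts
}
```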

All these scenarios required the tagging team to revisit the data multiple times and to change the tagging strategy, including tags that had previously been considered borderline. It also required them to pay more attention to small details that may have been missed in a previous iteration.

Data Splitting

To better understand the strategy chosen for splitting the data we need to also understand the source of the data. The images come from different devices that are used in different geo locations (different countries) and are from a continuous trip recording. The annotation team used an internal tool to visualise the trips image by image and mark the faces and licence plates present in them. We would then have access to all those images and their respective metadata.

The chosen ratios for splitting are:

  • Train 70%
  • Validation 10%
  • Test 20%

Subset       Images   Labels
Train        12,733   60,630
Validation    1,682    7,658
Test          3,221   18,388

The split is not trivial, as we have several requirements and conditions to satisfy:

  • An image can have multiple tags from one or both classes but must belong to just one subset.
  • The tags should be split as close as possible to the desired ratios.
  • As different images can belong to the same trip and be geographically close, we need to force them into the same subset; otherwise, similar tags would appear in both the train and test subsets, leading to incorrect evaluations. A sketch of such a trip-grouped split follows this list.
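
Below is a minimal Go sketch of a trip-grouped split that approximately honours the target ratios: whole trips are assigned greedily to whichever subset is currently furthest below its target. The types and the greedy strategy are our own illustration, not Grab's actual tooling.

```go
package dataset

// Trip groups all images (and their tags) recorded on one continuous trip.
type Trip struct {
	ID      string
	NumTags int
}

// splitByTrip assigns whole trips to subsets so that images from the same
// trip never straddle train and test; ratios index 0/1/2 = train/val/test.
func splitByTrip(trips []Trip, ratios [3]float64) map[string]int {
	assigned := map[string]int{}
	counts := [3]float64{}
	total := 0.0
	for _, t := range trips {
		// Pick the subset with the largest deficit against its target share.
		best := 0
		for s := 1; s < 3; s++ {
			if ratios[s]*total-counts[s] > ratios[best]*total-counts[best] {
				best = s
			}
		}
		assigned[t.ID] = best
		counts[best] += float64(t.NumTags)
		total += float64(t.NumTags)
	}
	return assigned
}
```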

Data Augmentation

The application of data augmentation plays a crucial role while training the machine learning model. There are mainly three ways in which data augmentation techniques can be applied. They are:

  1. Offline data augmentation – enriching a dataset by physically multiplying some of its images and applying modifications to them.
  2. Online data augmentation – on the fly modifications of the image during train time with configurable probability for each modification.
  3. Combination of both offline and online data augmentation.

In our case, we are using the third option which is the combination of both.

The first method that contributes to offline augmentation is image view splitting. This is necessary due to our different image types: perspective camera images, wide field-of-view images, and 360-degree images in equirectangular format. All these formats and fields of view, with their respective distortions, would complicate the data, making it hard for the model to generalise and to handle new image types that might be added in the future.

For this, we defined the concept of image views: extracted portions (views) of an image with predefined properties, for example the perspective projection of a 75-by-75-degree field-of-view patch of the original image.

Here we can see a perspective camera image and the image views generated from it:

Figure 5 – Original image
Figure 6 – Two image views generated

The important thing here is that each generated view is an image in its own right, with its associated tags. The views also have an overlapping area, so the same tag can appear in two views from different perspectives. This is an indirect benefit of the first offline augmentation.

The second method for offline augmentation is oversampling some of the images (views). As mentioned above, we faced the problem of an unbalanced dataset: specifically, we were missing tags that occupied large regions of the image, and even though our tagging teams tried to annotate as many as they could find, these were still scarce.

As our object detection model is an anchor-based detector, we did not even have enough of them to generate the anchor boxes correctly. This could be clearly seen in the accuracy of the previous trained models, as they were performing poorly on bins of big sizes.

By randomly oversampling images that contained big tags, up to a minimum required number, we managed to get better anchors and increase the recall for those scenarios. As described below, the object detector chosen for blurring was YOLOv4, which offers a large variety of online augmentations. The online augmentations we use are saturation, exposure, hue, flip and mosaic.

Model

As of the summer of 2021, the go-to solution for object detection in images is the convolutional neural network (CNN), a mature approach able to fulfil our needs efficiently.

Architecture

Most CNN based object detectors have three main parts: Backbone, Neck and (Dense or Sparse Prediction) Heads. From the input image, the backbone extracts features which can be combined in the neck part to be used by the prediction heads to predict object bounding-boxes and their labels.

Figure 7 – Anatomy of one and two-stage object detectors [1]

The backbone is usually a CNN classification network pretrained on some dataset, like ImageNet-1K. The neck combines features from different layers to produce rich representations for both large and small objects. Since the objects to be detected vary in size, the topmost features are too coarse to represent smaller objects, so the first CNN-based object detectors were fairly weak at detecting small objects. The multi-scale pyramid hierarchy is inherent to CNNs, so [2] introduced the Feature Pyramid Network, which at marginal cost combines features from multiple scales and makes predictions on them. This technique, or improved variants of it, is used by most detectors nowadays. The head part predicts the bounding boxes and their labels.

YOLO is part of the anchor-based one-stage object detector family and was originally developed in Darknet, an open-source neural network framework written in C and CUDA. Back in 2015, it was the first end-to-end differentiable network of this kind that offered joint learning of object bounding boxes and their labels.

One reason for the big success of newer YOLO versions is that the authors carefully merged new ideas into one architecture, the overall speed of the model being always the north star.

YOLOv4 introduces several changes to its v3 predecessor:

  • Backbone – CSPDarknet53: YOLOv3 Darknet53 backbone was modified to use Cross Stage Partial Network (CSPNet [5]) strategy, which aims to achieve richer gradient combinations by letting the gradient flow propagate through different network paths.
  • Multiple configurable augmentation and loss function types, so called “Bag of freebies”, which by changing the training strategy can yield higher accuracy without impacting the inference time.
  • Configurable necks and different activation functions, which they call the “Bag of specials”.

Insights

For this task, we found that YOLOv4 offers a good compromise between speed and accuracy: it doubles the speed of a more accurate two-stage detector while maintaining very good overall precision/recall. For blurring, the main metric for model selection was overall recall, while precision and intersection over union (IoU) of the predicted box come second, as we want to catch all personal data even at the cost of some false positives. Having many possibilities to configure the detector architecture and train it on our own dataset, we conducted several experiments with different backbones, necks, augmentations and loss functions to arrive at our current solution.
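
For reference, a sketch of IoU for two axis-aligned boxes:

```go
package eval

import "math"

// Box is an axis-aligned bounding box in pixel coordinates.
type Box struct {
	X1, Y1, X2, Y2 float64
}

func (b Box) area() float64 {
	return (b.X2 - b.X1) * (b.Y2 - b.Y1)
}

// IoU returns the intersection over union of two boxes, in [0, 1].
func IoU(a, b Box) float64 {
	x1 := math.Max(a.X1, b.X1)
	y1 := math.Max(a.Y1, b.Y1)
	x2 := math.Min(a.X2, b.X2)
	y2 := math.Min(a.Y2, b.Y2)
	if x2 <= x1 || y2 <= y1 {
		return 0 // no overlap
	}
	inter := (x2 - x1) * (y2 - y1)
	return inter / (a.area() + b.area() - inter)
}
```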

We faced challenges in training a good model, as the dataset posed a large object/box-level scale imbalance, with small objects being over-represented. As described in [3] and [4], this affects the scales of the estimated regions and the overall detection performance. Of the solutions proposed in [3], the SPP [6] blocks and PANet [7] neck used in YOLOv4, together with heavy offline data augmentation, increased the performance of the current model compared to the previous ones.

Having evaluated the model, we found that it still has some issues:

  • Occlusion of the object, either by the camera view, head accessories or other elements:

These cases would need extra annotation in the dataset, just like the faces or licence plates that are really close to the camera and occupy a large region of interest in the image.

  • As we have a limited number of annotations of objects close to the camera, the model has learnt this incorrectly, sometimes producing false positives in these situations:

Again, one solution for this would be to include more of these scenarios in the dataset.

What’s Next?

Grab spends a lot of effort ensuring privacy protection for its users so we are always looking for ways to further improve our related models and processes.

As far as efficiency is concerned, there are multiple directions to consider for both the dataset and the model. There are two main factors that drive the costs and the quality: further development of the dataset for additional edge cases (e.g. more training data of people wearing masks) and the operational costs of the model.

As the vast majority of current models require a fully labelled dataset, this puts a large work effort on the Data Entry team before creating a new model. Our dataset grew 4x for its third version; still, there is room for improvement as described in the Dataset section.

As Grab extends its operations to more cities, new data is collected and has to be processed, which puts an increased focus on running detection models more efficiently.

Directions to pursue to increase our efficiency could be the following:

  • As plenty of unlabelled data is available from imagery collection, a natural direction to explore is self-supervised visual representation learning, to derive a general vision backbone with superior transfer performance for subsequent tasks such as detection and classification.
  • Experiment with optimisation techniques like pruning and quantisation to get a faster model without sacrificing too much on accuracy.
  • Explore new architectures: YOLOv5, EfficientDet or Swin-Transformer for Object Detection.
  • Introduce semi-supervised learning techniques to improve our model performance on the long tail of the data.

References

  1. Alexey Bochkovskiy et al. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934v1
  2. Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. arXiv:1612.03144v2
  3. Kemal Oksuz et al. Imbalance Problems in Object Detection: A Review. arXiv:1909.00169v3
  4. Bharat Singh, Larry S. Davis. An Analysis of Scale Invariance in Object Detection – SNIP. arXiv:1711.08189v2
  5. Chien-Yao Wang et al. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv:1911.11929v1
  6. Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729v4
  7. Shu Liu et al. Path Aggregation Network for Instance Segmentation. arXiv:1803.01534v4
  8. http://blog.nitishmutha.com/equirectangular/360degree/2017/06/12/How-to-project-Equirectangular-image-to-rectilinear-view.html
  9. Kaiyu Yang et al. Study of Face Obfuscation in ImageNet. arXiv:2103.06191
  10. Zhenda Xie et al. Self-Supervised Learning with Swin Transformers. arXiv:2105.04553v2

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Processing ETL tasks with Ratchet

Post Syndicated from Grab Tech original https://engineering.grab.com/processing-etl-tasks-with-ratchet

Overview

At Grab, the Lending team is focused on building products that help finance various segments of users, such as Passengers, Drivers, or Merchants, based on their needs. These products enable users to avail funds in a seamless and hassle-free way. To achieve this, multiple lending microservices continuously interact with each other, each handling different responsibilities such as providing offers, storing user information, and disbursing availed amounts to a user’s account.

In this tech blog, we will discuss what Data and Extract, Transform and Load (ETL) pipelines are and how they are used for processing multiple tasks in the Lending team at Grab. We will also discuss Ratchet, a Go library that helps us build data pipelines and handle ETL tasks. Let’s start by covering the basics of Data and ETL pipelines.

What is a Data Pipeline?

A Data pipeline describes a system or process that moves data from one platform to another. In between platforms, data passes through multiple steps based on defined requirements, where it may be subjected to some kind of modification. All the steps in a Data pipeline are automated, and the output of one step acts as the input of the next.

Data Pipeline (Source: Hazelcast)

What is an ETL Pipeline?

An ETL pipeline is a type of Data pipeline that consists of 3 major steps, namely extraction of data from a source, transformation of that data into the desired format, and finally loading the transformed data to the destination. The destination is also known as the sink.

Extract-Transform-Load (Source: TatvaSoft)

Together, the steps in an ETL pipeline ensure that the business requirements of the application are met.

Let’s briefly look at each of the steps involved in the ETL pipeline.

Data Extraction

Data extraction is used to fetch data from one or multiple sources with ease. The source of data can vary based on the requirement. Some of the commonly used data sources are:

  • Database
  • Web-based storage (S3, Google cloud, etc)
  • Files
  • User Feeds, CRM, etc.

The data format can also vary from one use case to another. Some of the most commonly used data formats are:

  • SQL
  • CSV
  • JSON
  • XML

Once data is extracted in the desired format, it is ready to be fed to the transformation step.

Data Transformation

Data transformation involves applying a set of rules and techniques to convert the extracted data into a more meaningful and structured format for use. The extracted data may not always be ready to use. In order to transform the data, one of the following techniques may be used:

  1. Filtering out unnecessary data.
  2. Preprocessing and cleaning of data.
  3. Performing validations on data.
  4. Deriving a new set of data from the existing one.
  5. Aggregating data from multiple sources into a single uniformly structured format.

Data Loading

The final step of an ETL pipeline involves moving the transformed data to a sink where it can be accessed for its use. Based on requirements, a sink can be one of the following:

  1. Database
  2. File
  3. Web-based storage (S3, Google cloud, etc)

An ETL pipeline may or may not have a load step, based on its requirements. When the transformed data needs to be stored for further use, the load step moves the transformed data to the storage of choice. However, in some cases the transformed data is not needed any further, and the load step can be skipped.

Now that you understand the basics, let’s go over how we, in the Grab Lending team, use an ETL pipeline.

Why Use Ratchet?

At Grab, we use Golang for most of our backend services. Due to Golang’s simplicity, execution speed, and concurrency support, it is a great choice for building data pipeline systems to perform custom ETL tasks.

Given that Ratchet is also written in Go, it allows us to easily build custom data pipelines.

Go channels connect each stage of processing, so the syntax for sending data is intuitive for anyone familiar with Go. All data sent and received is JSON, which provides a nice balance of flexibility and consistency.

Utilising Ratchet for ETL Tasks

We use Ratchet for multiple ETL tasks like batch processing, restructuring and rescheduling of loans, creating user profiles, and so on. One of the backend services, named Azkaban, is responsible for handling various ETL tasks.

Ratchet uses Data Processors for building a pipeline consisting of multiple stages. Data Processors each run in their own goroutine, so all of the data is processed concurrently. Data Processors are organised into stages, and those stages are run within a pipeline. For building an ETL pipeline, each of the three steps (Extract, Transform and Load) uses a Data Processor for its implementation. Ratchet provides a set of useful built-in Data Processors, while also providing an interface to implement your own. Usually, the transform stage uses a Custom Data Processor.
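
A minimal sketch of wiring a Ratchet pipeline with a custom Data Processor. The two-method interface and pipeline wiring follow Ratchet's documentation, but treat the built-in processor constructors and exact signatures as indicative; check the library before relying on them.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	"github.com/dailyburn/ratchet"
	"github.com/dailyburn/ratchet/data"
	"github.com/dailyburn/ratchet/processors"
)

// upperCaser is a custom Data Processor: ProcessData transforms each JSON
// payload and forwards it; Finish runs once the upstream stage is done.
type upperCaser struct{}

func (u *upperCaser) ProcessData(d data.JSON, outputChan chan data.JSON, killChan chan error) {
	outputChan <- data.JSON(strings.ToUpper(string(d)))
}

func (u *upperCaser) Finish(outputChan chan data.JSON, killChan chan error) {}

// stdoutWriter is a custom sink: it prints every payload it receives.
type stdoutWriter struct{}

func (w *stdoutWriter) ProcessData(d data.JSON, outputChan chan data.JSON, killChan chan error) {
	fmt.Println(string(d))
}

func (w *stdoutWriter) Finish(outputChan chan data.JSON, killChan chan error) {}

func run(db *sql.DB) {
	// Extract stage uses a built-in processor; transform and load stages are
	// our own. Each stage runs in its own goroutine, connected by channels.
	read := processors.NewSQLReader(db, "SELECT user_id FROM merchants")
	pipeline := ratchet.NewPipeline(read, &upperCaser{}, &stdoutWriter{})
	if err := <-pipeline.Run(); err != nil {
		log.Fatal(err)
	}
}
```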

Data Processors in Ratchet (Source: GitHub)

Let’s take a look at one of these tasks to understand how we utilise Ratchet for processing an ETL task.

Whitelisting Merchants Through ETL Pipelines

Whitelisting essentially means making the product available to a user by mapping an offer to the user ID. For example, if a merchant in Thailand receives the option to opt for a Cash Loan, it is because that merchant has been whitelisted. In order to whitelist our merchants, our Operations team uses an internal portal to upload a CSV file with the user IDs of the merchants and other required information. This CSV file is generated by our internal Data and Risk team and handed over to the Operations team. Once the CSV file is uploaded, the user IDs in the file are whitelisted within minutes. However, a lot of work goes on in the background to make this possible.

Data Extraction

Once the Operations team uploads the CSV containing a list of merchant users to be whitelisted, the file is stored in S3 and an entry is created on the Azkaban service with the document ID of the uploaded file.

File upload by Operations team

The data extraction step makes use of a Custom CSV Data Processor that uses the document ID to first create a pre-signed URL and then uses it to fetch the data from S3. The extracted data is in bytes, and we use commas as the delimiter to parse the CSV data.
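
A simplified sketch of this extraction step using only Go's standard library; generating the pre-signed URL is assumed to happen elsewhere (in the real service it comes from the S3 SDK).

```go
package extract

import (
	"encoding/csv"
	"fmt"
	"net/http"
)

// fetchCSVRows downloads the uploaded file from a pre-signed S3 URL and
// parses it into rows using comma as the delimiter.
func fetchCSVRows(preSignedURL string) ([][]string, error) {
	resp, err := http.Get(preSignedURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch CSV: unexpected status %s", resp.Status)
	}
	reader := csv.NewReader(resp.Body)
	reader.Comma = ',' // the default, stated explicitly for clarity
	return reader.ReadAll()
}
```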

Data Transformation

In order to transform the data, we define a Custom Data Processor that we call a Transformer for each ETL pipeline. Transformers are responsible for applying all necessary transformations to the data before it is ready for loading. The transformations applied in the merchant whitelisting transformers are:

  1. Convert data from bytes to struct.
  2. Check for presence of all mandatory fields in the received data.
  3. Perform validation on the data received.
  4. Make API calls to external microservices for whitelisting the merchant.

As mentioned earlier, the CSV file is uploaded manually by the Operations team. Since this is a manual process, it is prone to human error. Validating the data in the transformation step helps catch these errors early and stops them from propagating further down the pipeline. Since the CSV data consists of multiple rows, each row passes through all the steps mentioned above.
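
A sketch of such a Transformer's core logic; the struct fields, validation rules, and whitelisting client are hypothetical.

```go
package transform

import (
	"encoding/json"
	"fmt"
)

// merchantRow is a hypothetical shape for one CSV row after extraction.
type merchantRow struct {
	UserID  string `json:"userId"`
	OfferID string `json:"offerId"`
	Country string `json:"country"`
}

// validate checks the mandatory fields; manual uploads are error-prone,
// so invalid rows are rejected here rather than propagated downstream.
func (r merchantRow) validate() error {
	if r.UserID == "" || r.OfferID == "" {
		return fmt.Errorf("missing mandatory field in row: %+v", r)
	}
	return nil
}

// transformRow converts raw bytes to a struct, validates it, and calls the
// (hypothetical) whitelisting service for the merchant.
func transformRow(raw []byte, whitelist func(userID, offerID string) error) error {
	var row merchantRow
	if err := json.Unmarshal(raw, &row); err != nil {
		return err
	}
	if err := row.validate(); err != nil {
		return err
	}
	return whitelist(row.UserID, row.OfferID)
}
```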

Data Loading

Whenever merchants are whitelisted, we don’t need to store the transformed data. As a result, this ETL task has no load step, so we just use an Empty Data Processor. However, this is just one of our many use cases. Where the transformed data needs to be stored for further use, the load step has a Custom Data Processor responsible for storing the data.

Connecting All Stages

After defining our Data Processors for each of the steps in the ETL pipeline, the final piece is to connect all the stages together. As stated earlier, the ETL tasks have different ETL pipelines and each ETL pipeline consists of 3 stages defined by their Data Processors.

In order to connect these 3 stages, we define a Job Processor for each ETL pipeline. A Job Processor represents the entire ETL pipeline and encompasses Data Processors for each of the 3 stages. Each Job Processor implements the following methods:

  1. SetSource: Assigns the Data Processor for the Extraction stage.
  2. SetTransformer: Assigns the Data Processor for the Transformation stage.
  3. SetDestination: Assigns the Data Processor for the Load stage.
  4. Execute: Runs the ETL pipeline.
Job processors containing Data Processor for each stage in ETL

When the Azkaban service is initialised, we run the SetSource(), SetTransformer() and SetDestination() methods for each of the defined Job Processors. When an ETL task is triggered, the Execute() method of the corresponding Job Processor runs the ETL pipeline, executing its 3 stages in turn. For each stage, the Data Processor assigned during initialisation is executed.
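
A sketch of how a Job Processor might encapsulate the three stages on top of Ratchet, assuming the DataProcessor interface exported by the ratchet package; the types are illustrative, not Azkaban's actual code.

```go
package etl

import "github.com/dailyburn/ratchet"

// JobProcessor bundles the three Data Processors of one ETL pipeline.
type JobProcessor struct {
	source, transformer, destination ratchet.DataProcessor
}

func (j *JobProcessor) SetSource(p ratchet.DataProcessor)      { j.source = p }
func (j *JobProcessor) SetTransformer(p ratchet.DataProcessor) { j.transformer = p }
func (j *JobProcessor) SetDestination(p ratchet.DataProcessor) { j.destination = p }

// Execute wires the stages into a Ratchet pipeline and runs it to completion.
func (j *JobProcessor) Execute() error {
	pipeline := ratchet.NewPipeline(j.source, j.transformer, j.destination)
	return <-pipeline.Run()
}
```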

Conclusion

ETL pipelines help us in streamlining various tasks in our team. As showcased through the example in the above section, an ETL pipeline breaks a task into multiple stages and divides the responsibilities across these stages.

In cases where a task fails in the middle of the process, ETL pipelines help us determine the cause of the failure quickly and accurately. With ETL pipelines, we have reduced the manual effort required for validating data at each step and avoided propagating errors towards the end of the pipeline.

Through the use of ETL pipelines and schedulers, we at Lending have been able to automate the entire pipeline for many tasks to run at scheduled intervals without any manual effort involved at all. This has helped us tremendously in reducing human errors, increasing the throughput of the system and making the backend flow more reliable. As we continue to automate more and more of our tasks that have tightly defined stages, we foresee a growth in our ETL pipelines usage.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Data Engineers of Netflix — Interview with Kevin Wylie

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/data-engineers-of-netflix-interview-with-kevin-wylie-7fb9113a01ea

Data Engineers of Netflix — Interview with Kevin Wylie

This post is part of our “Data Engineers of Netflix” series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix.

Kevin Wylie is a Data Engineer on the Content Data Science and Engineering team. In this post, Kevin talks about his extensive experience in content analytics at Netflix since joining more than 10 years ago.

Kevin grew up in the Washington, DC area, and received his undergraduate degree in Mathematics from Virginia Tech. Before joining Netflix, he worked at MySpace, helping implement page categorization, pathing analysis, sessionization, and more. In his free time he enjoys gardening and playing sports with his 4 kids.

His favorite TV shows: Ozark, Breaking Bad, Black Mirror, Barry, and Chernobyl

What is your favorite project or one that you’re particularly proud of?

Since I joined Netflix back in 2011, my favorite project has been designing and building the first version of our entertainment knowledge graph. The knowledge graph enabled us to better understand the trends of movies, TV shows, talent, and books. Building the knowledge graph offered many interesting technical challenges such as entity resolution (e.g., are these two movie names in different languages really the same?), and distributed graph algorithms in Spark. After we launched the product, analysts and scientists began surfacing new insights that were previously hidden behind difficult-to-use data. The combination of overcoming technical hurdles and creating new opportunities for analysis was rewarding.

Kevin, what drew you to data engineering?

I stumbled into data engineering rather than making an intentional career move into the field. I started my career as an application developer with basic familiarity with SQL. I was later hired into my first purely data gig where I was able to deepen my knowledge of big data. After that, I joined MySpace back at its peak as a data engineer and got my first taste of data warehousing at internet-scale.

What keeps me engaged and enjoying data engineering is giving super-suits and adrenaline shots to analytics engineers and data scientists.

When I make something complex seem simple, or create a clean environment for my stakeholders to explore, research and test, I empower them to do more impactful business-facing work. I like that data engineering isn’t in the limelight, but instead can help create economies of scale for downstream analytics professionals.

What drew you to Netflix?

My wife came across the Netflix job posting in her effort to keep us in Los Angeles near her twin sister’s family. As a big data engineer, I found that there was an enormous amount of opportunity in the Bay Area, but opportunities were more limited in LA where we were based at the time. So the chance to work at Netflix was exciting because it allowed me to live closer to family, but also provided the kind of data scale that was most common for Bay Area companies.

The company was intriguing to begin with, but I knew nothing of the talent, culture, or leadership’s vision. I had been a happy subscriber of Netflix’s DVD-rental program (no late fees!) for years.

After interviewing, it became clear to me that this company culture was different than any I had experienced.

I was especially intrigued by the trust they put in each employee. Speaking with fellow employees allowed me to get a sense for the kinds of people Netflix hires. The interview panel’s humility, curiosity and business acumen was quite impressive and inspired me to join them.

I was also excited by the prospect of doing analytics on movies and TV shows, which was something I enjoyed exploring outside of work. It seemed fortuitous that the area of analytics that I’d be working in would align so well with my hobbies and interests!

Kevin, you’ve been at Netflix for over 10 years now, which is pretty incredible. Over the course of your time here, how has your role evolved?

When I joined Netflix back in 2011, our content analytics team was just 3 people. We had a small office in Los Angeles focused on content, and significantly more employees at the headquarters in Los Gatos. The company was primarily thought of as a tech company.

At the time, the data engineering team mainly used a data warehouse ETL tool called Ab Initio, and an MPP (Massively Parallel Processing) database for warehousing. Both were appliances located in our own data center. Hadoop was being lightly tested, but only in a few high-scale areas.

Fast forward 10 years, and Netflix is now the leading streaming entertainment service — serving members in over 190 countries. In the data engineering space, very little of the same technology remains. Our data centers are retired, Hadoop has been replaced by Spark, and Ab Initio and our MPP database no longer fit our big data ecosystem.

In addition to the company and tech shifting, my role has evolved quite a bit as our company has grown. When we were a smaller company, the ability to span multiple functions was valued for agility and speed of delivery. The sooner we could ingest new data and create dashboards and reports for non-technical users to explore and analyze, the sooner we could deliver results. But now, we have a much more mature business, and many more analytics stakeholders that we serve.

For a few years, I was in a management role, leading a great team of people with diverse backgrounds and skill sets. However, I missed creating data products with my own hands so I wanted to step back into a hands-on engineering role. My boss was gracious enough to let me make this change and focus on impacting the business as an individual contributor.

As I think about my future at Netflix, what motivates me is largely the same as what I’ve always been passionate about. I want to make the lives of data consumers easier and to enable them to be more impactful. As the company scales and as we continue to invest in storytelling, the opportunity grows for me to influence these decisions through better access to information and insights. The biggest impact I can make as a data engineer is creating economies of scale by producing data products that will serve a diverse set of use cases and stakeholders.

If I can build beautifully simple data products for analytics engineers, data scientists, and analysts, we can all get better at Netflix’s goal: entertaining the world.

Learning more

Interested in learning more about data roles at Netflix? You’re in the right place! Keep an eye out for our open roles in Data Science and Engineering by visiting our jobs site here. Our culture is key to our impact and growth: read about it here. To learn more about our Data Engineers, check out our chats with Dhevi Rajendran and Samuel Setegne.


Data Engineers of Netflix — Interview with Kevin Wylie was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The UEFA EURO 2020 final as seen online by Cloudflare Radar

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/the-uefa-euro-2020-final-as-seen-online-by-cloudflare-radar/

The UEFA EURO 2020 final as seen online by Cloudflare Radar

Last night’s Italy-England match was a nail-biter. 1-1 at full time, 1-1 at the end of extra time, and then an amazing penalty shootout with incredible goalkeeping by Pickford and Donnarumma.

Cloudflare has been publishing statistics about all the teams involved in EURO 2020 and traffic to betting websites, sports newspapers, streaming services and sponsors. Here’s a quick look at some specific highlights from England’s and Italy’s EURO 2020.

Two interesting peaks show up in UK visits to sports newspapers: the day after England-Germany and today after England’s defeat. Looks like fans are hungry for analysis and news beyond the goals. You can see all the data on the dedicated England EURO 2020 page on Cloudflare Radar.

But it was a quiet morning for the websites of the England team’s sponsors.

Turning to the winners, we can see that Italian readers are even more interested in knowing more about their team’s success.

And this enthusiasm spills over into visits to the Italian team’s sponsors.

You can follow along on the dedicated Cloudflare Radar page for Italy in EURO 2020.

Visit Cloudflare Radar for information on global Internet trends, trending domains, attacks and usage statistics.

Data Engineers of Netflix — Interview with Dhevi Rajendran

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/data-engineers-of-netflix-interview-with-dhevi-rajendran-a9ab7c7b36e5

Data Engineers of Netflix — Interview with Dhevi Rajendran

Dhevi Rajendran

This post is part of our “Data Engineers of Netflix” interview series, where our very own data engineers talk about their journeys to Data Engineering @ Netflix.

Dhevi Rajendran is a Data Engineer on the Growth Data Science and Engineering team. Dhevi joined Netflix in July 2020 and is one of many Data Engineers who have onboarded remotely during the pandemic. In this post, Dhevi talks about her passion for data engineering and taking on a new role during the pandemic.

Before Netflix, Dhevi was a software engineer at Two Sigma, where she was most recently on a data engineering team responsible for bringing in datasets from a variety of different sources for research and trading purposes. In her free time, she enjoys drawing, doing puzzles, reading, writing, traveling, cooking, and learning new things.

Her favorite TV shows: Atlanta, Barry, Better Call Saul, Breaking Bad, Dark, Fargo, Succession, The Killing

Her favorite movies: Das Leben der Anderen, Good Will Hunting, Intouchables, Mother, Spirited Away, The Dark Knight, The Truman Show, Up

Dhevi, so what got you into data engineering?

While my background has mostly been in backend software engineering, I was most recently doing backend work in the data space prior to Netflix. One great thing about working with data is the impact you can create as an engineer.

At Netflix, the work that data engineers do to produce data in a robust, scalable way is incredibly important to provide the best experience to our members as they interact with our service.

Beyond the really interesting technical challenges that come with working with big data, there are lots of opportunities to think about higher-level domain challenges as a data engineer. In college, I had done human-computer interaction research on subtitles for the Deaf and hard-of-hearing as well as computational genomics research on Alzheimer’s disease. I’ve always enjoyed learning about new areas and combining this knowledge with my technical skills to solve real-world problems.

What drew you to Netflix?

Netflix’s mission and its culture primarily drew me to Netflix. I liked the idea of being a part of a company that brings joy to so many members around the world with an incredibly powerful platform for their stories to be heard. The blend of creativity and a strong engineering culture at Netflix really appealed to me.

The culture was also something that piqued my interest. I was pretty skeptical of Netflix’s culture memo at first. Many companies have lofty ideals that don’t necessarily translate into the reality of the company culture, so I was surprised to see how consistently the culture memo aligns with the actual culture at the company. I’ve found the culture of freedom and responsibility empowering.

Rather than the typical top-down approach many companies use, Netflix trusts each person to make the right decisions for the company by using their deep knowledge of the problems they’re solving along with the context they gather from their leaders and stakeholders.

This means a lot less red tape, a lot less friction, and a lot more freedom for everyone at the company to do what’s best for the business. I also really appreciate the amount of visibility and input we get into broader strategic decisions that the company makes.

Finally, I was also really excited about joining the Growth Data Engineering team! My team is responsible for building data products relating to how we connect with our new members around the world, which is high-impact and has far-reaching global significance. I love that I get to help Netflix connect with new members around the world and help shape the first impression we make on them.

What is your favorite project or a project that you’re particularly proud of?

I have been primarily involved in the payments space. Not a project per se, but one of the things I’ve enjoyed being involved in is the cross-functional meetings with peers and stakeholders who are working in the payments space. These meetings include product managers, designers, consumer insights researchers, software engineers, data scientists, and people in a wide variety of other roles.

I love that I get to work cross-functionally with such a diverse group of people looking at the same set of problems from a variety of unique perspectives.

In addition to my day-to-day technical work, these meetings have provided me with the opportunity to be involved in the high-level product, design, and strategic discussions, which I value. Through these cross-functional efforts, I’ve also really gotten to learn and appreciate the nuances of payments. From using credit cards (which are fairly common in the US but not as widely adopted outside the US) to physically paying in person, members in different countries prefer to pay for our subscription in a wide variety of ways. It’s incredible to see the thoughtful and deeply member-driven approach we use to think about something as seemingly routine, straightforward, and often taken for granted as payments.

What was it like taking on a new role during the pandemic?

First off, I feel very lucky to have found a new role in this very difficult period. With the amount of change and uncertainty the past year brought, it somehow felt both fitting and imprudent to voluntarily add a career change to the mix. The prospect was daunting at first. I knew there would be a bunch for me to learn coming into Netflix, considering that I hadn’t worked with the technologies my team uses (primarily Scala and Spark). Looking back now, I’m incredibly grateful for the opportunity and glad that I took it. I’ve already learned so much in the past six months and am excited about how much more I can learn and the impact I can make going forward.

Onboarding remotely has been a unique experience as well. Building relationships and gathering broader context are more difficult right now. I’ve found that I’ve learned to be more proactive and actively seek out opportunities to get to know people and the business, whether through setting up coffee chats, reading memos, or attending meetings covering topics I want to learn more about. I still haven’t met anyone I work with in person, but my teammates, my manager, and people across the company have been really helpful throughout the onboarding process.

It’s been incredible to see how gracious people are with their time and knowledge. The amount of empathy and understanding people have shown to each other, including to those who are new to the company, has made taking the leap and joining Netflix a positive experience.

Learning more

Interested in learning more about data roles at Netflix? You’re in the right place! Keep an eye out for our open roles in Data Science and Engineering here. Our culture is key to our impact and growth: read about it here.


Securing and managing multi-cloud Presto Clusters with Grab’s DataGateway

Post Syndicated from Grab Tech original https://engineering.grab.com/data-gateway

Introduction

Data is the lifeblood of Grab and the insights we gain from it drive all the most critical business decisions made by Grabbers and our leaders every day.

Grab’s Data Engineering (DE) team is responsible for maintaining the data platform, which consists of data pipelines, job schedulers, and the query/computation engines that are the key components for generating insights from data. SQL is the core language for analytics at Grab and as of early 2020, our Presto platform serves about 200 user groups that add up to 500 users who run 350,000 queries every day. These queries span across 10,000 tables that process up to 1PB of data daily.

In 2016, we started the DataGateway project to enable us to manage data access for the hundreds of Grabbers who needed access to Presto for their work. Since then, DataGateway has grown to become much more than just an access control mechanism for Presto. In this blog, we want to share what we’ve achieved since the initial launch of the project.

The problems we wanted to solve

As we were reviewing the key challenges around data access in Grab and assessing possible solutions, we came up with this prioritized list of user requirements we wanted to work on:

  • Use a single endpoint to serve everyone.
  • Manage user access to clusters, schemas, tables, and fields.
  • Provide a seamless user experience when Presto clusters are scaled up/down, in/out, or provisioned/decommissioned.
  • Capture an audit trail of user activities.

To provide Grabbers with the critical need of interactive querying, as well as extract, transform, load (ETL) jobs, we evaluated several technologies. Presto was among the ones we evaluated, and it was what we eventually chose, although it didn’t meet all of our requirements out of the box. To address these gaps, we came up with the idea of a security gateway for the Presto compute engine that could also act as a load balancer/proxy; this is how we ended up creating DataGateway.

DataGateway is a service that sits between clients and Presto clusters. It is essentially a smart HTTP proxy server that acts as an abstraction layer on top of the Presto clusters and handles the following actions:

  1. Parse incoming SQL statements to get requested schemas, tables, and fields.
  2. Manage user Access Control List (ACL) to limit users’ data access by checking against the SQL parsing results.
  3. Manage users’ cluster access.
  4. Redirect users’ traffic to the authorized clusters.
  5. Show meaningful error messages to users whenever the query is rejected or exceptions from clusters are encountered.
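
To make this flow concrete, here is a minimal, hypothetical Python sketch of steps 1-4. The extract_tables helper, the in-memory ACL, and the cluster map are illustrative stand-ins only; the real SQL Parser matches Presto's grammar per version rather than using a regex.

    import re

    # Hypothetical in-memory stores; the real service backs these with a database.
    USER_ACL = {"user_a": {"locations.countries"}}                 # user -> readable tables
    USER_CLUSTERS = {"user_a": "http://presto-cluster-1:8080"}     # user -> authorized cluster

    def extract_tables(sql: str) -> set:
        """Grossly simplified table extraction, for illustration only."""
        return set(re.findall(r"(?:from|join)\s+([\w.]+)", sql, flags=re.IGNORECASE))

    def handle_query(user: str, sql: str) -> str:
        tables = extract_tables(sql)                    # 1. parse the SQL statement
        denied = tables - USER_ACL.get(user, set())     # 2. check the table-level ACL
        if denied:                                      # 5. meaningful error on rejection
            raise PermissionError(f"{user} has no read access to: {sorted(denied)}")
        cluster = USER_CLUSTERS.get(user)               # 3. check cluster access
        if cluster is None:
            raise PermissionError(f"{user} is not authorized on any cluster")
        return cluster                                  # 4. forward traffic to this cluster

    print(handle_query("user_a", "select * from locations.countries"))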

Anatomy of DataGateway

The DataGateway’s key components are as follows:

  • API Service
  • SQL Parser
  • Auth framework
  • Administration UI

We leveraged Kubernetes to run all these components as microservices.

Figure 1. DataGateway Key Components
Figure 1. DataGateway Key Components

API Service

This is the component that manages all user- and cluster-facing processes. We integrated this service with the Presto API, which means it appears to be the same as a Presto cluster to a client. It accepts query requests from clients, gets the parsing result from the SQL Parser, and runs authorization through the Auth Framework.

If everything is good to go, the API Service forwards queries to the assigned clusters and continues the entire query process.

Auth Framework

This handles both authentication and authorization requests. It stores the ACL of users and communicates with the API Service and the SQL Parser to run the entire authentication process. But why is it a microservice instead of a module in API Service, you ask? It’s because we keep evolving the security checks at Grab to ensure that everything is compliant with our security requirements, especially when dealing with data.

We wanted to make it flexible to fulfill ad-hoc requests from the security team without affecting the API Service. Furthermore, there are different authentication methods out there that we might need to deal with (OAuth2, SSO, you name it). The API Service supports multiple authentication frameworks that enable different authentication methods for different users.

SQL Parser

This is a SQL parsing engine that extracts the schemas, tables, and fields referenced in SQL statements. Since Presto SQL parsing works differently in each version, we compile multiple SQL Parsers that match the versions of the Presto clusters we run. The SQL Parser becomes the single source of truth.

Admin UI

This is a UI for Presto administrators to manage clusters and user access, as well as to select an authentication framework, making it easier for the administrators to deal with the entire ecosystem.

How we deployed DataGateway using Kubernetes

In the past couple of years, we’ve had significant growth in workloads from analysts and data scientists. As we were very enthusiastic about Kubernetes, DataGateway was chosen as one of the earliest services to be deployed on Kubernetes. DataGateway on Kubernetes is highly available and fully scalable to handle traffic from users and systems.

We also tested the Horizontal Pod Autoscaler (HPA) feature of Kubernetes, which dynamically scales the number of pods in or out based on actual traffic and resource consumption.

Figure 2. DataGateway deployment using Kubernetes
Figure 2. DataGateway deployment using Kubernetes

Functionality of DataGateway

This section highlights some of the ways we use DataGateway to manage our Presto ecosystem efficiently.

Restrict users based on Schema/Table level access

In a setup where a Presto cluster is deployed on Amazon Elastic MapReduce (EMR) or Elastic Kubernetes Service (EKS), we configure an IAM role and attach it to the EMR or EKS nodes. The IAM role is set to limit access to S3 storage. However, IAM only provides bucket-level and file-level control; it doesn’t meet our requirement for schema-, table-, and column-level ACLs. This is where DataGateway proves useful.

One of the DataGateway services is the SQL Parser. As previously covered, this is a service that parses SQL statements and digs out the schemas and tables involved in a query. The API Service receives the parsing result, checks it against the user’s ACL, and decides whether to allow or reject the query. This is a remarkable improvement in our security control since we now have another layer of access restriction on top of S3 storage. We’ve implemented SQL-based access control down to the table level.

As shown in Figure 3, user A is trying to run the SQL statement select * from locations.cities. The SQL Parser reads the statement and tells the API Service that user A is trying to read data from the table cities in the schema locations. The API Service then checks against user A’s ACL and finds that user A only has read access to the table countries in schema locations. The API Service therefore denies the attempt, because user A doesn’t have read access to the table cities in the schema locations.

Figure 3. An example of how to check user access to run SQL statements
Figure 3. An example of how to check user access to run SQL statements

The above flow shows an access denied result because the user doesn’t have the appropriate permissions.

Seamless User Experience during the EMR migration

We use AWS EMR to deploy Presto as an SQL query engine since deployment is really easy. However, without DataGateway, any EMR operation such as termination, new cluster deployment, config changes, or version upgrades would require quite a bit of user involvement. We would sometimes need users to make changes on their side, for example, asking them to change their endpoints to connect to the right clusters.

With DataGateway, an ACL exists for each user account. The ACL includes the list of EMR clusters that the user is allowed to access. As a Presto access management platform, DataGateway redirects user traffic to an appropriate cluster based on the ACL, like a proxy. Users always connect to the same endpoint we offer, which is the DataGateway. To switch over from one cluster to another, we just need to edit the cluster ACL and everything is handled seamlessly.

Figure 4. Cluster switching using DataGateway
Figure 4. Cluster switching using DataGateway

Figure 4 highlights the case when we’re switching EMR from one cluster to another. No changes are required from users.
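
Conceptually, the switch is nothing more than an ACL edit. Here is a small illustrative sketch; the ACL store and endpoint names below are invented for this example:

    # Hypothetical ACL: user group -> ordered list of EMR clusters they may use.
    cluster_acl = {"analyst_team": ["emr-cluster-old.internal:8889"]}

    def route(user: str) -> str:
        # Users always hit the same DataGateway endpoint; we pick the cluster.
        return cluster_acl[user][0]

    print(route("analyst_team"))   # emr-cluster-old.internal:8889

    # Migrating the team is a pure ACL edit; no client-side changes are required.
    cluster_acl["analyst_team"] = ["emr-cluster-new.internal:8889"]
    print(route("analyst_team"))   # emr-cluster-new.internal:8889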

We executed the migration of our entire Presto platform from one AWS EMR instance to another using the same methodology, with little to no disruption for our users. In a few phases over a couple of months, we were able to move 40 clusters with hundreds of users who issue millions of queries daily.

In most cases, users didn’t have to make any changes on their end; they just continued using Presto as usual while we made the changes in the background.

Multi-Cloud Data Lake/Presto Cluster maintenance

Recently, we started to build and maintain data lakes not just in one cloud, but two: AWS and Azure. Since most end-users are AWS-based, and each team has their own AWS sub-account to run their services and workloads, it would be a nightmare to bridge all the connections and access routes between these two clouds end-to-end, sub-account by sub-account.

Here, DataGateway plays the role of the multi-cloud gateway. Since all end-users’ AWS sub-accounts are peered to DataGateway’s network, everything becomes much easier to handle.

End-users retain the same Presto connection profile, while the DE team handles the connection setup from DataGateway to Azure, as well as the deployment of Presto clusters in Azure.

When all is set, end-users use the same endpoint to DataGateway. We offer a feature called Cluster Switch that allows users to switch between the AWS Presto cluster and the Azure Presto cluster on the fly by filling in parameters in the connection string. This feature lets users switch to their target Presto cluster without any endpoint changes, and the switch takes effect instantly. That means users can run different queries on different clusters based on their requirements.

This feature has helped the DE team maintain Presto clusters easily. We can spin up different Presto clusters for different teams, so that each team has its own query engine to run queries with dedicated resources.

Figure 5. Sub-account connections and Queries
Figure 5. Sub-account connections and Queries

Figure 5 shows an example of how sub-accounts connect to DataGateway and run queries on resources in different clouds and clusters.

Figure 6. Sample scenario without DataGateway
Figure 6. Sample scenario without DataGateway

Figure 6 shows a scenario of what would happen if DataGateway didn’t exist. Each of the accounts would have to maintain its own connections, Virtual Private Cloud (VPC) peering, and express links to connect to our Presto resources.

Summary

DataGateway plays a key role in Grab’s entire Presto ecosystem. It helps us manage user access and cluster selection behind a single endpoint, ensuring that everyone runs their Presto queries in one place. It also helps distribute workloads to different types and versions of Presto clusters.

When we started to deploy DataGateway on Kubernetes, our vision for the Presto ecosystem underwent an epic change, as it further motivated us to continuously improve. Since then, we’ve had new ideas on deployment methods/pipelines, microservice implementations, scaling strategies, and resource control. We even used Kubernetes to design an on-demand, container-based Presto cluster provisioning engine. We’ll share this in another engineering blog, so do stay tuned!

We also made crucial enhancements on data access control as we extended Presto’s access controls down to the schema/table-level.

In day-to-day operations, especially when we started to implement data lake in multiple clouds, DataGateway solved a lot of implementation issues. DataGateway made it simpler to switch a user’s Presto cluster from one cloud to another or allow a user to use a different Presto cluster using parameters. DataGateway allowed us to provide a seamless experience to our users.

Looking forward, we have more and more ideas for our Presto ecosystem, such as Spark DataGateway or AWS Athena integrations, to keep our data safe at all times and to provide our users with a smoother experience when dealing with data used for analysis or research.


Authored by Vinnson Lee on behalf of the Presto Development Team at Grab – Edwin Law, Qui Hieu Nguyen, Rahul Penti, Wenli Wan, Wang Hui and the Data Engineering Team.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Using data science and machine learning for improved customer support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/using-data-science-and-machine-learning-for-improved-customer-support/

In this blog post we’ll explore three tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context, and one for identifying Internet attack traffic.

Through these examples, we hope to demonstrate how invaluable data processing tricks, visualisations and tools can be before putting data into a machine learning algorithm. By refining data prior to processing, we are able to achieve dramatically improved results without needing to change the underlying machine learning strategies which are used.

Know the Limits (Language Classification)

When browsing a social media site, you may find the site prompts you to translate a post even though it is in your language.

We recently came across a similar problem at Cloudflare when we were looking into language classification for chat support messages. With an off-the-shelf classification algorithm, chats with short messages were often classified incorrectly, and our analysis found a correlation between the length of a message and the accuracy of the classification (based on the browser Accept-Language header and the languages of the country where the request was submitted):

[Chart: classification accuracy vs. message length]

On a subset of tickets, comparing the classified language against the web browser Accept-Language header, we found there was broad agreement between these two properties. When we considered the languages associated with the user’s country, we found another signal.

[Chart: agreement between classified language, Accept-Language header, and country languages]

In 67% of our sample, we found agreement between these three signals. In 15% of instances the classified language agreed with only the Accept-Language header and in 5% of cases there was only agreement with the languages associated with the user’s country.

[Chart: breakdown of agreement between the three language signals]

We decided the ideal approach was to train a machine learning model that would take all three signals (plus the confidence rate from the language classification algorithm) and use that to make a prediction. By knowing the limits of a given classification algorithm, we were able to develop an approach that helped complement it.

A naive approach may not even need a trained model: simply requiring agreement between two of the three signals (classified language, Accept-Language header, and country-based language) is enough to make a decision about the right language to use.
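
A sketch of that naive two-of-three vote, assuming the three signals have already been normalised to plain language codes:

    from collections import Counter

    def pick_language(classified: str, accept_language: str, country_language: str) -> str:
        """Majority vote across the three signals; fall back to the classifier."""
        votes = Counter([classified, accept_language, country_language])
        language, count = votes.most_common(1)[0]
        return language if count >= 2 else classified

    print(pick_language("de", "de", "fr"))   # "de": two of three signals agree
    print(pick_language("en", "de", "fr"))   # "en": no agreement, trust the classifier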

Hold Your Fire (Fuzzy String Matching)

Fuzzy String Matching is often used in natural language processing when trying to extract information from human text. For example, this can be used for extracting error messages from customer support tickets to do automatic classification. At Cloudflare, we use this as one signal in our natural language processing pipeline for support tickets.

Engineers often use the Levenshtein distance algorithm for string matching; for example, this algorithm is implemented in the Python fuzzywuzzy library. This approach has a high computational overhead (for two strings of length k and l, the algorithm runs in O(k * l) time).

To understand the performance of different string matching algorithms in a customer support context, we compared multiple algorithms (Cosine, Dice, Damerau, LCS and Levenshtein) and measured the true positive rate (TP), false positive rate (FP) and the ratio of false positives to true positives (FP/TP).

[Chart: TP, FP and FP/TP comparison across string matching algorithms]

We opted for the Cosine algorithm, not just because it outperformed the Levenshtein algorithm, but also because the computational difficulty was reduced to O(k + l) time. The Cosine similarity algorithm is very simple: it represents words or phrases as vectors in a multidimensional vector space, where each unique letter of an alphabet is a separate dimension. The smaller the angle between two vectors, the closer one word is to another.
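
A minimal character-level sketch of the idea (a production pipeline may tokenize differently, but the O(k + l) shape is the same, since each string is scanned once to build its frequency vector):

    from collections import Counter
    from math import sqrt

    def cosine_similarity(a: str, b: str) -> float:
        """Cosine similarity over character-frequency vectors."""
        va, vb = Counter(a), Counter(b)                     # one O(k) / O(l) pass each
        dot = sum(va[ch] * vb[ch] for ch in va)
        norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    print(round(cosine_similarity("connection timeout", "connection timed out"), 3))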

The mathematical definitions of each string similarity algorithm and a scientific comparison can be found in our paper: M. Pikies and J. Ali, “String similarity algorithms for a ticket classification system,” 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 2019, pp. 36-41. https://doi.org/10.1109/CoDIT.2019.8820497

There were other optimisations we introduced to the fuzzy string matching approach: the similarity threshold is determined by evaluating the true positive and false positive rates on various sample data, and we devised a new tokenization approach for handling phrases and numeric strings, whilst using the FastText natural language processing library to determine candidate values for fuzzy string matching and to improve overall accuracy. We will share more about these optimisations in a further blog post.

“Beyond it is Another Dimension” (Threat Identification)

Attack alerting is particularly important at Cloudflare – this is useful for both monitoring the overall status of our network and providing proactive support to particular at-risk customers.

DDoS attacks can be characterised at a granular level by a few different features, including differences in request or error rates over a temporal baseline, the relationship between errors and request volumes, and other metrics that indicate attack behaviour. One example of a metric we use to differentiate between whether a customer is under a low-volume attack or experiencing another issue is the relationship between 499 error codes and 5xx HTTP status codes. Cloudflare’s network edge returns a 499 status code when the client disconnects before the origin web server has an opportunity to respond, whilst 5xx status codes indicate an error handling the request. In the chart below, the x-axis measures the differential increase in 5xx errors over a baseline, whilst the y-axis represents the rate of 499 responses (each scatter point represents a 15-minute interval). During a DDoS attack we notice a linear correlation between these criteria, whilst origin issues typically have an increase in one metric instead of the other:

[Scatter plot: differential increase in 5xx errors (x-axis) vs. rate of 499 responses (y-axis)]

The next question is how this data can be used in more complicated situations. Take the following example of identifying a credential stuffing attack in aggregate. We looked at a small number of anonymised data fields for the most prolific attackers of WordPress login portals. The data is based purely on HTTP headers; in total we saw 820 unique IPs targeting 16,248 distinct zones (the IPs were hashed and requests were placed into “buckets” as they were collected). As WordPress returns an HTTP 200 when a login fails and an HTTP 302 on a successful login (redirecting away from the login panel), we’re able to analyse this just from the status code returned.

On the left-hand chart, the x-axis represents a normalised number of unique zones under attack (0 means the attacker is hitting the same site, whilst 1 means the attacker is hitting all different sites) and the y-axis represents the success rate (using HTTP status codes to identify the chance of a successful login). The right-hand chart swaps the x-axis for something called the “variety ratio”, which measures the rate of abnormal 4xx/5xx HTTP status codes (i.e. firewall blocks, rate-limiting HTTP headers or 5xx status codes). We see clear clusters on both charts:

[Charts: login success rate vs. zone spread (left) and vs. variety ratio (right)]

By plotting this data in three dimensions, with all three fields represented, the clusters become even more distinct. These clusters are then grouped using an unsupervised clustering algorithm (agglomerative hierarchical clustering):

[3D scatter plot: attacker clusters found by agglomerative hierarchical clustering]
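
For illustration, here is how such unsupervised grouping could be run with scikit-learn's agglomerative clustering; the feature rows are invented stand-ins for the aggregated per-attacker fields described above:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # One row per hashed attacker IP:
    # [zone spread (0..1), login success rate (0..1), variety ratio (0..1)]
    X = np.array([
        [0.02, 0.01, 0.05],   # hammering one site, almost never succeeding
        [0.03, 0.02, 0.04],
        [0.95, 0.10, 0.60],   # spraying many zones, often blocked
        [0.92, 0.12, 0.55],
    ])

    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    print(labels)   # e.g. [0 0 1 1]: two behavioural clusters emerge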

Cluster 1 has 99.45% of requests from the same country and 99.45% from the same User-Agent. This approach also has advantages when looking at other clusters; for example, Cluster 0 had 89% of requests coming from three User-Agents (75%, 12.3% and 1.7%, respectively). By using this approach we are able to correlate such attacks even when they would be hard to identify on a request-to-request basis (as they are being made from different IPs and with different request headers). Such strategies allow us to fingerprint attacks regardless of whether attackers continuously change how they make these requests to us.

By aggregating data together and then representing it in multiple dimensions, we are able to gain visibility into the data that would ordinarily not be possible on a request-to-request basis. In product-level functionality, it is often important to make decisions on a request-by-request basis (“should this request be challenged whilst this one is allowed?”), but by looking at the data in aggregate we are able to focus on the interesting clusters and provide alerting systems which identify anomalies. Performing this in multiple dimensions provides the tools to reduce false positives dramatically.

Conclusion

From natural language processing to intelligent threat fingerprinting, using data science techniques has improved our ability to build new functionality. Recently, new machine learning approaches and strategies have been designed to process this data more efficiently and effectively; however, preprocessing of data remains a vital tool for doing this. When seeking to optimise data processing pipelines, it often helps to look not just at the tools being used, but also the input and structure of the data you seek to process.

If you’re interested in using data science techniques to identify threats on a large scale network, we’re hiring for Support Engineers (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.

Does Southeast Asia run on coffee?

Post Syndicated from Grab Tech original https://engineering.grab.com/does-southeast-asia-run-on-coffee

This article was originally published in the Grab Medium account on December 4, 2019. Reposting it here for your reading pleasure.

There is no surprise as to why coffee is a go-to drink in the region. For one, almost a third of coffee is produced in Asia, giving us easy access to beans. Coupled with the plethora of local cafes and stores at every corner in Southeast Asia, coffee has become an accessible and affordable drink — and one that enjoys a huge following.

For many, a morning cuppa is fuel to kick start their day. For some it’s the secret weapon to a food coma, for others, it’s fuel to keep them going throughout the day.

To get a glimpse of how our fellow Southeast Asians refuel with coffee on a daily basis, we took a look (along with our ‘kopi’) at GrabFood data, and here is what we found.

Did you know: Coffee orders have grown 1,400% on GrabFood?

How much do we actually love our coffee? It seems like we do, a lot.

Coffee orders on GrabFood have been growing pervasively throughout the major cities, and a timelapse visualisation based on data from GrabFood orders shows the growth of orders across major cities over a 9-month period:

Timelapse visualisation

Time for coffee?

But how reliant are we on caffeine? We analysed the coffee consumption behaviour of GrabFood users from major SEA countries across a typical week.

Coffee Orders by Day of the Week: Singapore coffee orders peak on the weekends

Coffee Orders by Day of the Week - Chart

Turns out most coffee orders are placed on Wednesdays — clearly a much needed shot to overcome the dreaded hump day. And as we head into the weekend, orders begin to decline as Southeast Asians wind down from the work week.

However, the complete opposite happens for our friends in Singapore and the Philippines! Coffee orders actually spike on the weekends, and especially so on Sundays. It can only mean that Singaporeans and Filipinos surely enjoy their coffee catch-ups with friends and family.

AM- Coffee… PM- Still coffee

So when exactly do SEA coffee drinkers summon that life-saving cup from our delivery heroes in green?

Check out this trippy visualisation that resembles jumping coffee beans:

Coffee Orders by Hour of Day — Orders peak at 10am for Thailand and 2pm in Indonesia

Coffee Orders by Hour of Day

While other cities generally reach for the Grab app at noon for that extra boost to fight the food coma through the rest of the day, our friends in Thailand get their caffeine fix early, with most orders coming in at 10.00am, just before the lunch hour.

Interestingly, coffee orders for Singapore peak at about 4pm in the afternoon… are they working hard, or are they hardly working?

GrabFood’s love is in the air, and it smells like coffee

Curious as to what coffee flavours our SEA neighbours prefer? We spill the (coffee) beans!

Top 3 Coffee Flavours in each Country

Top 3 Coffee Flavours in each Country

What is a non-coffee drinker to do?

What non-coffee drinker drinks

Also known as Matcha Latte, Green Tea Latte seems to be the next big beverage fad in the region, serving as a perfect coffee alternative for non-coffee drinkers.

Matcha latte, made with concentrated shots of green tea and topped with frothy, steamed milk, is gaining popularity. While it offers the same quantity of caffeine as a cup of brewed coffee, the drink is perceived to be more energising because of the slower release of caffeine.

It has consistently been one of the top 10 beverage items ordered on GrabFood, and we’ve delivered over 25 million cups of these green, frothy and creamy ‘heaven in a cup’ over the last nine months!

Southeast Asians’ love of tea-based lattes (other than green tea) is apparent in Grab’s data! Some of the unique flavours being ordered on GrabFood include the following:

Unique flavours

GrabFood Coffee is a hug in a mug

Is your blood type coffee? Whether you feel like a caramelly and chocolatey Macchiato, the fruity and floral aroma of a freshly brewed Americano, or an intense and bitter double-shot Long Black — GrabFood has got you covered! May your coffee get delivered (and kick in) before reality does!

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GrabChat Much? Talk Data to me!

Post Syndicated from Grab Tech original https://engineering.grab.com/grabchat-much-talk-data-to-me

This article was originally published in the Grab Medium account on November 20, 2019. Reposting it here for your reading pleasure.

In September 2016 GrabChat was born: a platform designed to allow seamless communication between passenger and driver-partner. Since then, Grab has continuously improved the GrabChat experience by introducing features such as instant translation, images, and audio chats, and as a result reduced cancellation rates by up to 50%! We’ve even experimented with various features to deliver hyper-localised experiences in each country! So with all these features, how have our users responded? Let’s take a deeper look into this to uncover some interesting insights from our data in Singapore, Malaysia and Indonesia.

The Chattiest Country

Number of Chats by Country
Number of Chats by Country

In a previous blog post several years ago, we revealed that Indonesia was the chattiest nation in Southeast Asia. Our latest data is no different. Indonesia is still the chattiest country of the three, with an average of 5.5 chats per booking, while Singapore is the least chatty! Furthermore, passengers in Singapore tend to be chattier than driver-partners, while the reverse is true for the other two countries.

But what do people talk about?

Common words in Indonesia
Common words in Indonesia
Common words in Singapore
Common words in Singapore
Common words in Malaysia
Common words in Malaysia

As expected, most of the chats revolve around pick-up points. There are many similarities between the three countries, such as courtesies like ‘Hi’ and ‘Thank you’, and messages that the driver-partner/passenger is coming. However, there are slight differences between the countries. Can you spot them all?

In Indonesia, chats are usually in Bahasa Indonesia, and tend to be mostly driver-partners thanking passengers for using Grab.

Chats in Singapore, on the other hand, tend to be in English, and contain mostly pick-up locations, such as a car park. There are quite a few unique words in the Singapore context, such as ‘rubbish chute’ and ‘block’, that reflect features of the ubiquitous HDBs (public housing) found everywhere in Singapore, which serve as popular residential pickup points.

Malaysia seems to be a blend of the other two countries, with chats in a mix of English and Bahasa Malaysia. Many of the chats highlight pickup locations, such as a guard house, as well as the phrase all Malaysians know: being stuck in traffic.

Time Trend
Time Trend

Analysis of chat trends across the three countries revealed an unexpected insight: a trend of talking more from midnight until around 4am. Perplexed but intrigued, we dug further to discover what prompted our users to talk more at such odd hours.

From midnight to 4am, shops and malls are usually closed, and pickup locations become more obscure as people wander around town late at night. Driver-partners and passengers thus tend to have more conversations to determine the pickup point. This also explains why the proportion of pick-up-location-based messages out of all messages is highest between 12am and 6am. On the other hand, these messages are less common in the mornings (6am-12pm) as people tend to be picked up from standard residential locations.

GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) - Image 1
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) – Image 1
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019)  - Image 2
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) – Image 2
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) - Image 3
GrabChat’s Image-function uptake in Jakarta, Singapore, and Kuala Lumpur (Nov 2018 — March 2019) – Image 3

The ability to send images on GrabChat was introduced in September 2018, with the aim of helping driver-partners identify the exact pickup location of passengers. Within the first few weeks of release, 22,000 images were sent in Singapore alone. The increase in uptake of the image feature for the cities of Jakarta, Singapore and Kuala Lumpur can be seen in the images above.

From analysis, we found that areas that were more remote such as Tengah in Singapore tended to have the highest percentage of images sent, indicating that images are useful for users in unfamiliar places.

Safety First

Aside from images, Grab also introduced two other features, templates and audio chats, to prevent driver-partners from texting while driving.

Templates and audio features used by driver-partners, and a reduced number of typed texts by driver-partners per booking
Templates and audio features used by driver-partners, and a reduced number of typed texts by driver-partners per booking

“Templates” (pre-populated phrases) allow driver-partners to send templated messages with just a quick tap. In our recent data analysis, we discovered that almost 50% of driver-partner texts consisted of templates.

“Audio chat”, alongside “image chat”, was introduced in September 2018, and the use of this feature has been steadily increasing, with audio comprising an increasing percentage of driver-partner messages.

With both features being picked up by driver-partners across all three countries, Grab has seen a decrease in the overall number of typed (non-template) driver-partner texts per booking within a 3-month period.

A Brief Pick-up Guide

No one likes a cancelled ride, right? Well, after analysing millions of data points, we’ve unearthed some neat tips and tricks to help you complete your ride, and we’re sharing them with you!

Completed Rides
Completed Rides

This first tip might be a no-brainer, but replying to your driver-partner results in a higher completion rate. No one likes to be blue-ticked, do they?

Next, we discovered various things you could say that would result in higher completion rates, explained below in the graphic.

Tips for a Better Pickup Experience
Tips for a Better Pickup Experience

Informing the driver-partner that you’re coming, giving them directions, and telling them how to identify you results in almost double the chances of completing the ride!

Last but not least, let’s not forget our manners. Grab’s data analysis revealed that saying ‘thank you’ correlated with an increase in completion rates! Also, be at the pickup point on time — remember, time is money for our driver-partners!

Conclusion

Just like in Shakespeare’s Much Ado about Nothing, ample information can be gathered from the mere whim of a message. Grab is constantly aspiring to achieve the best experience for both passengers and driver-partners, and data plays a huge role in helping us achieve this.

This is just the first page of the book. The amount of information lurking between every page is endless. So stay tuned for more interesting insights about our GrabChat platform!

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

7 Fun Facts about Grab’s Driver-Partners in Singapore

Post Syndicated from Grab Tech original https://engineering.grab.com/seven-facts-about-grab-driver-partners-in-sg

This article was originally published in the Grab Medium account on June 17, 2019. Reposting it here for your reading pleasure.

Grab’s Big Data Story

Grab is on an incredible mission to empower our driver-partners in 336 cities in 8 countries.

Curious about what Grab’s data tells us about driver-partners on the platform?

Let us share with you the most interesting data points we found among our driver-partners in Singapore!

1. It’s a Small World

It's a Small World

Lim Chu Kang may feel like a world away from Katong, but Singapore is a small world for our driver-partners — a driver-partner has a 1 in 400 chance of having a repeat passenger amongst the 5.4 million population!

2. Saturday Night Fever

Saturday Night Fever

The annual average number of rides that a Grab driver-partner completes on Saturday nights is 110. But there was one special driver-partner who did 1,131 Saturday night trips in 2018! Weekend parties just wouldn’t be the same without you. Rock on!

3. Share the Love

Share the Love

Did you know that having more passengers in a car can yield more 5-star ratings? Our GrabShare passengers share more than just rides — they share their appreciation too! GrabShare rides have an average trip rating of 4.8!

4. The Road More Travelled

The Road More Travelled

Which neighbourhoods are painting the town green? Our driver-partners picked up the most passengers from Tampines, while the Orchard and Marina Bay areas were the most popular destinations in 2018!

5. Tricks of the Trade

Tricks of the Trade

Ever wondered if seasoned driver-partners, who have been with us for more than 2 years, have different driving preferences and habits? They tend to start their day 1 hour earlier, around 6-7am, and are on auto-accept most of the time. Did you know that drivers on auto-accept spend less time idle waiting for new jobs?

6. Busy Bee

Busy Bee

Did you know? Drivers are twice as likely to get back-to-back allocations during evening peak hours! Drivers with frequent back-to-back jobs earn about 50% more per hour.

7. Road Runner

Road Runner

One of our most active driver-partners covered 57,000km ferrying passengers in 2018 — that’s like driving every road, street, jalan, lorong and tanjong in Singapore more than 57 times!

How our driver-partners utilize the Grab platform to make a living (and break a few records along the way) never ceases to amaze us.

Interested to know more about the winning strategies among our driver-partners? Look out for the next data story!

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

JavaScript Libraries Are Almost Never Updated Once Installed

Post Syndicated from Zack Bloom original https://blog.cloudflare.com/javascript-libraries-are-almost-never-updated/

Cloudflare helps run CDNJS, a very popular way of including JavaScript and other frontend resources on web pages. With the CDNJS team’s permission we collect anonymized and aggregated data from CDNJS requests which we use to understand how people build on the Internet. Our analysis today is focused on one question: once installed on a site, do JavaScript libraries ever get updated?

Let’s consider jQuery, the most popular JavaScript library on Earth. This chart shows the number of requests made for a selected list of jQuery versions over the past 12 months:

[Chart: requests for selected jQuery versions over the past 12 months]

Spikes in the CDNJS data as you see with version 3.3.1 are not uncommon as very large sites add and remove CDNJS script tags.

We see a steady rise of version 3.4.1 following its release on May 2nd, 2019. What we don’t see is a substantial decline of old versions. Version 3.2.1 shows an average popularity of 36M requests at the beginning of our sample, and 29M at the end, a decline of approximately 20%. This aligns with a corpus of research which shows the average website lasts somewhere between two and four years. What we don’t see is a decline in old versions that comes close to the volume of growth of new versions when they’re released. In fact the release of 3.4.1, as popular as it quickly becomes, doesn’t change the trend of old version deprecation at all.

If you’re curious, the oldest version of jQuery that CDNJS includes is 1.10.0, released on May 25, 2013. The project still gets an average of 100k requests per day, and the sites which use it are growing in popularity:

[Chart: daily requests for jQuery 1.10.0]

To confirm our theory, let’s consider another project, TweenMax:

[Chart: requests for selected TweenMax versions over the past 12 months]

As this package isn’t as popular as jQuery, the data has been smoothed with a one week trailing average to make it easier to identify trends.

Version 1.20.4 begins the year with 18M requests, and ends it with 14M, a decline of about 23%, again in alignment with the loss of websites on the Internet. The growth of 2.1.3 shows clear evidence that the release of a new version has almost no bearing on the popularity of old versions; the trend line for those older versions doesn’t change even as 2.1.3 grows to 29M requests per day.

[Chart: TweenMax old-version trend lines alongside the growth of 2.1.3]

Clearly the new version cannot be replacing many of the legacy installations.

The clear conclusion is that whatever libraries you publish will exist on websites forever. The underlying web platform consequently must support aged conventions indefinitely if it is to continue supporting the full breadth of the web. Cloudflare is also, of course, very interested in how we can contribute to a web which is kept up-to-date. Please make suggestions in the comments below.

Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse and passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • keep fraudulent users away from our app or services,
  • ensure our customers’ safety, and
  • manage user identities for secure login to the Grab app.

The TIS team’s scope covers not just transport, but also food, deliveries and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are boolean conditions that determine if the result is true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them in our backend service code and deployed our service frequently to use the latest rules.

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. It was very lightweight to trigger the rule deployment and enforce the rules.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined(but paid off) more than twice in the last 3-months

THEN we would still not allow this booking

The system scans through the rules one by one, and even if it determines that a rule is tripped, it will still check the other rules. In the example above, if a credit card has been declined more than twice in the last 3 months, the passenger will not be allowed to book even though he has a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules, as shown in the example below. We first check if the device used by the passenger is a high-risk one, e.g. an emulator used for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined bookings on the credit card), and then make a decision on whether this booking should be precharged or not. If the passenger is using a low-risk device but is in a risky location where we traditionally see a lot of fraud bookings, we would then run some further checks on the passenger’s booking history to decide if a precharge is also needed.

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.
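
To make the maintenance burden concrete, here is an illustrative sketch of what such hard-coded, nested rule logic looks like (the field names are invented):

    def decide(booking: dict) -> str:
        # Every new fraud pattern adds another branch somewhere in this tree,
        # and every branch can interfere with the ones below it.
        if booking["device_is_high_risk"]:
            return "DECLINE"
        if booking["card_declined_today"]:
            return "PRECHARGE" if booking["good_booking_history"] else "DECLINE"
        if booking["in_risky_location"] and not booking["good_booking_history"]:
            return "PRECHARGE"
        return "ALLOW"

    print(decide({"device_is_high_risk": False, "card_declined_today": True,
                  "good_booking_history": True, "in_risky_location": False}))   # PRECHARGE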

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data, and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion based booking should be limited to 5 bookings and total finished bookings should be greater than 6, otherwise unallocate current ride)
  3. Developers implemented these rules and deployed the changes to production

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementation (e.g. for “X should be limited to 5”, should the implementation be X < 5 or X <= 5?)

Once a new rule was deployed, we monitored its performance. For example,

  • How often does the rule fire (after minutes, hours, or daily)?
  • Is it over-firing?
  • Does it conflict with other rules?

Based on the implementation, each rule had dependencies on other rules. For example, if Rule 1 is fired, we should not continue with Rule 2 and Rule 3.

As a result, we couldn’t keep each rule evaluation independent. We had no way to observe the performance of a rule without other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger rates would drop significantly. This means we would have unstable performance metrics for Rules 2 and 3 even though their logic did not change. It is very hard for a rule owner to monitor the performance of Rules 2 and 3.

When it comes to A/B testing a new rule, Data Scientists need to put a lot of effort into cleaning up noise from other rules, but most of the time, it is mission impossible.

After several misfiring events (wrong implementations of rules) and ever-longer rule development times (weekly), we realized: “No one can handle this manually.”

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data: e.g. “what is the credit card risk score”, or “how many food bookings has this user ordered in the last 7 days”, and transforming this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules that can automate the review of rules without manual intervention.

Griffin now predicts billions of events every day with 100K+ queries per second (QPS) at peak time (on only 6 regular EC2 instances).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our in-built tool, we did some research for common business rule engines available in the market such as Drools and checked if we should use them. In that process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas most of our users come from a Python background).
  2. Limited expressive power (https://en.wikipedia.org/wiki/Expressive_power_(computer_science)).
  3. Limited support for some common math functions (e.g. factorial/Greatest Common Divisor).
  4. Our nature of business needs dynamic datasets for predictions (for example, a rule may need only passenger booking history on Day 1, but it may use passenger booking history, passenger credit balance, and passenger favorite places on Day 2). On the other hand, Drools usually works well with a static list of datasets instead of a dynamic one.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal, and save the rules in the database.
  2. Griffin Rule Engine reloads rules immediately as long as it detects any rule changes.
  3. Data Orchestrator sends all the datasets (features) needed for a prediction (e.g. whether to block a ride based on the passenger’s past ride pattern and credit card risk) to the Rule Engine.
  4. Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input:JSON => Result:Boolean

We allow users (analysts, data scientists) to write Python-based rules on WebUI to accommodate some very complicated rules like:

len(list(filter(lambda x: x > 7, map(lambda x: math.factorial(x), [1, 2, 3, 4, 5, 6])))) > 2

This significantly increases the expressive power of rules.
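
As a rough illustration of the Input:JSON => Result:Boolean contract, here is a minimal sketch of evaluating a stored Python rule against a feature payload. The restricted eval shown here is a toy; a real engine needs proper sandboxing, plus the scenarios, segments, and treatments described below.

    import math

    # A rule is stored as a Python expression evaluated against the input payload.
    RULE_SOURCE = "features['declined_today'] and features['booking_history_score'] < 0.5"

    def evaluate(rule_source: str, features: dict) -> bool:
        # Expose only a whitelist of names to the rule expression.
        allowed = {"__builtins__": {}, "math": math, "len": len, "filter": filter, "map": map}
        return bool(eval(rule_source, allowed, {"features": features}))

    print(evaluate(RULE_SOURCE, {"declined_today": True, "booking_history_score": 0.3}))  # True
    print(evaluate(RULE_SOURCE, {"declined_today": True, "booking_history_score": 0.9}))  # False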

To match and evaluate a rule more efficiently, we also have other key components associated:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification. The example below shows the treatments for a checkpoint that detects credit-card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=[MY, PH] and verticals=[GrabBus, GrabCar]
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

When a rule is hit, users want more than just treatments returned; they may also want some dynamic values, e.g. the maximum ride distance allowed if we believe a booking is medium risk.

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The graph below shows the P99 latency of prediction requests from the load balancer side (the real latency for each prediction is < 6ms; the metric peaks at 30ms because some batch requests contain 50 predictions in a single call).

Prediction Request Latency P99

What we did to achieve this?

  • The key idea is to make all computations in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from the database for each prediction. Instead, we keep a record called dirty_key, which holds the latest rule update timestamp. The rule engine actively checks this timestamp and triggers a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time (a minimal sketch of this check follows the list).
  • Rule engine would not fetch any additional new data, instead, all data should be from Data Orchestrator.
  • So the whole prediction flow is only between CPU & memory (and if the data size is small, it could be on CPU cache only).
  • Python GIL essentially enforces a process to have up to one active thread running at a time, no matter how many cores a CPU has. We have Gunicorn to wrap our service, so on the Production machine, we have (2x$num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers). The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
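
A minimal sketch of the dirty_key check described above (the in-memory _db dict is a stand-in for the real database):

    import time

    _db = {"dirty_key": time.time(), "rules": ["rule-1", "rule-2"]}   # stand-in for the DB

    last_reload_at = 0.0
    rules: list = []

    def maybe_reload_rules() -> None:
        """Reload rules only when the DB's dirty_key is newer than our last reload."""
        global last_reload_at, rules
        if _db["dirty_key"] > last_reload_at:    # cheap timestamp comparison
            rules = list(_db["rules"])           # the expensive full reload is rare
            last_reload_at = time.time()

    maybe_reload_rules()                 # first call loads the rules
    maybe_reload_rules()                 # no-op: dirty_key unchanged
    _db["dirty_key"] = time.time() + 1   # an analyst edits a rule in the portal
    maybe_reload_rules()                 # triggers a reload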

The screenshot below is the process snapshot on a C5.large machine with 2 vCPUs. Note that only the green processes are active.

Process snapshot on C5.large machine

Performance tuning took a lot of trial and error:

  • We used to have python-jsonpath-rw for JSONPath queries, but its performance was not strong enough. We switched to jmespath and observed about a 10ms latency reduction.
  • We use sqlalchemy for DB queries and ORM. We enabled caching for some use cases, but it turned out to be over-optimized and served stale data. We ended up turning off some caching points to ensure data consistency.
  • For new dict/list creation, we prefer literal syntax (e.g. {} or []) over function calls (see the comparison below, and the timing sketch after this list).
Native call and Function call
  • Use built-in functions (https://docs.python.org/3/library/functions.html). They are written in C, and no one can beat them.
  • Add randomness to rule reloads so that machines do not all reload at the same time and cause latency spikes.
  • Cache atomic feature units as they are used, so that we do not have to re-query them each time a checkpoint uses them.
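
For instance, the literal-versus-constructor gap is easy to verify with `timeit`:

```python
import timeit

# Literal syntax avoids a global name lookup and a function call,
# so {} / [] are measurably faster than dict() / list().
literal = timeit.timeit("d = {}; l = []", number=1_000_000)
function_call = timeit.timeit("d = dict(); l = list()", number=1_000_000)
print(f"literal: {literal:.2f}s  function call: {function_call:.2f}s")
```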

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens a door for non-developers to make production changes in real time, which significantly boosts organisational productivity. In the past, a rule change needed a week for code change, testing, and deployment; now it takes just a minute.

But this also introduces extra risk: anyone could take a whole checkpoint down, whether unintentionally or maliciously.

Hence we implemented Shadow Mode and percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify its performance without any production impact, and, if needed, roll a rule out gradually from 1% all the way to 100%.
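
One common way to implement such a percentage-based rollout deterministically is to hash a stable identifier into a fixed bucket. The sketch below is our illustration of the idea, not Griffin’s actual code:

```python
import hashlib

def in_rollout(rule_id: str, entity_id: str, percentage: int) -> bool:
    """Deterministically place an entity into a rollout bucket (0-99).

    The same entity always lands in the same bucket, so ramping a rule
    from 1% to 100% only ever adds entities, never flips earlier ones.
    """
    digest = hashlib.md5(f"{rule_id}:{entity_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

print(in_rollout("rule_42", "passenger_123", 10))   # stable True/False
```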

We implemented version control for every rule change, so that if anything unexpected happens, we can quickly roll back to the previous version.

Version control
Rollback button

We also built an RBAC-based permission system, along with a Change Approval flow, to make sure any production change involves at least two people (and the approver role has higher permissions).

Closing thoughts

Griffin evolved from a fraud-focused rule engine into a generic one that can apply to any rule at Grab. For example, Grab launched appeal automation just a few days ago, reducing by 50% the human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are excited about it.

This was possible because, from the very beginning, we designed Griffin with minimal business context, so that it could be generic enough.

Since the launch, we have observed an impressive adoption rate across various fraud, safety, and identity use cases. More interestingly, people now treat Griffin as an automation layer for various integration points.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to an ACFE survey, companies lost USD 6.3 billion due to fraud, and organizations lose 5% of their revenue annually to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDK for engineering integration, and builds rules engines and consoles for fraud detection.

One example of fraudulent behavior is an individual who masquerades as both driver and passenger and makes cashless payments to get promotions, for example, a one-dollar rebate on the next transaction. In our system, we analyze real-time booking and payment signals, compare them with the historical data of the driver and passenger pair, and create rules using the rule engine. We count the number of bookings for a driver and passenger pair within a given time frame, and this counter is provided as an input to the rule. If the counter value exceeds a predefined threshold, the rule evaluates the transaction as fraudulent, and we send this verdict back to the booking service.
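
As a sketch of that flow (the counter name, threshold, and lookup are illustrative stand-ins for the real services):

```python
# Illustrative sketch of the counter-based rule described above; the
# counter lookup is a stub standing in for the real Counter service.
PAIR_PAYMENT_THRESHOLD = 5   # hypothetical threshold

def get_counter(counter: str, key: str, window: str) -> int:
    sample = {"driver_1:pax_9": 7}        # stubbed counter storage
    return sample.get(key, 0)

def is_fraudulent_pair(driver_id: str, passenger_id: str) -> bool:
    """Verdict sent back to the booking service."""
    count = get_counter("cashless_payments", f"{driver_id}:{passenger_id}", "7d")
    return count > PAIR_PAYMENT_THRESHOLD

print(is_fraudulent_pair("driver_1", "pax_9"))   # True: 7 > 5
```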

The conventional method

Fraud detection is a job that requires cross-functional teams like data scientists, data analysts, data engineers, and backend engineers to work together. Usually data scientists or data analysts come up with an offline idea and apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions by data scientists and data analysts. In the conventional method, the rule needs to be communicated to engineers.

Automated solution using the Counter service

To overcome the challenges of the conventional method, the Trust Platform team decided to build the Counter service: a self-service platform that provides management tools for users and a computing engine for integrating with backend services. The service provides an interface, such as a UI-based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also provides data contracts, APIs, and SDKs to engineers so that the business verticals can adopt it quickly and easily.

The major engineering challenges faced in designing the Counter service

There are millions of transactions happening at Grab every day, which implies we need to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many cashless payments happened for a driver and passenger pair. Due to the scale of Grab’s business, the potential combinations of drivers and passengers could be exponential, and this is only one use case; there could be hundreds of counters for different use cases. Hence it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, a single counter could potentially cover an exponential number of passenger and driver combinations, so it’s a great challenge to store the counters in the database and to read and query them in real time. With billions of counter keys spanning a long period of time, the Trust team had to find a scalable way to write and fetch keys effectively and meet the client’s SLA.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time data scientists need a new type of counter, developers have to manually make code changes, such as adding a new stream, capturing the related datasets for the counter, and storing them in the fraud service, then doing a deployment to make the counter ready. The whole iteration usually takes two or more weeks, and if there are any changes from the data analysts’ side, which happens often, the loop starts again. The team had to prevent this long loop of manual tasks by building a self-serving interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what was written in the counters. That’s because the conventional counter data was stored in a Redis database to satisfy the query SLA, so they could not track the correctness of a counter’s value or its history. With the new solution, stakeholders can get a real-time picture of what is stored in the counters using data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data consistency challenge

Most machine learning workflows need dedicated input data. However, when an anti-fraud model is trained on offline data from the data lake, it is difficult to use the same model in real time, because the model lacks a data contract and consistency with the data source. In this case, the Counter service becomes a data source by providing the values of counters to the file system.

ML featuring

Counters are important features for ML models. Imagine data scientists invent a new counter that they need to evaluate; we need to provide a historical dataset for the counter to work. The Counter service provides a counter replay feature, which allows data scientists to simulate counters over historical payloads.

In general, the Counter service is a bridge between online and offline datasets, data scientists, and engineers. There was technical debt with regards to data consistency and automation on the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We followed the principle of asynchronous data ingestion and synchronous transactions when designing the Counter service.

The diagram below shows how the counters are generated and saved to the database.

How the counters are generated and saved to the database

Counter creation workflow

  1. A user opens the Counter Creation UI and creates a new key, “fraud:counter:counter_name”.
  2. The user configures the required fields.
  3. The Counter service detects the new counter creation, puts the new counter into the load script storage, and starts processing new counter events (see the counter write workflow below).

Counter write workflow

  1. The Counter service monitors multiple streams and assembles extra data from online data services (e.g. the Common Data Service (CDS), passenger service, and hydra service), so that a rich dataset is available for editors on each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool (see the sketch after this list).
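
A sketch of that write path (the configuration shape and the Grab-Stats writer are assumptions for illustration, not the actual GrabPlugin interface):

```python
import time

# Illustrative sketch of the write path: evaluate the user-configured
# expression on each enriched stream event and emit a data point.
# The config shape and emit_to_grab_stats are assumptions.
counter_config = {
    "key": "fraud:counter:cashless_pair_payments",
    "expression": lambda event: event["payment_type"] == "cashless",
    "group_by": lambda event: f"{event['driver_id']}:{event['passenger_id']}",
}

def emit_to_grab_stats(counter_key: str, group: str, value: int, ts: float):
    print(f"{counter_key}[{group}] += {value} @ {ts:.0f}")   # stubbed writer

def process_event(event: dict):
    """Run one enriched stream event through the counter expression."""
    if counter_config["expression"](event):
        emit_to_grab_stats(counter_config["key"],
                           counter_config["group_by"](event), 1, time.time())

process_event({"payment_type": "cashless", "driver_id": "d1", "passenger_id": "p9"})
```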

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Grab-Stats runs on top of ScyllaDB, a distributed NoSQL data store. We chose ScyllaDB because of its good in-memory aggregation performance on time-series datasets. Compared with in-memory storage like AWS ElastiCache, it is 10 times cheaper and just as reliable in terms of stability. The p99 latency of reading from ScyllaDB is less than 150ms, which satisfies our SLA.

How we improved the Counter service performance

We used the multi-buckets strategy to improve the Counter service performance.

Background

Queries come with different time windows. Some counters are time-sensitive and need to know what happened in the last 30 or 60 minutes, while other counters focus on the long term and need to know the events of the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve short-range and long-term events at the same time: the higher the required accuracy and the longer the time range, the more aggregation needs to happen in the database. That means we would not be able to satisfy the SLA, or we would need to block other processes, which degrades the service.

Solution for improving the query

We resolved this problem by using tables of different granularities. We pre-aggregated the signals into different time buckets, such as 15 minutes, 1 hour, and 1 day.

When a request comes in, its time range is divided across the buckets and the partial results are merged. For example, for a request from 9/10 23:15:20 to 9/12 17:20:18, the handler queries 15-minute buckets within the first hour, hourly buckets for the rest of that day, and daily buckets for the remaining 2 days. This way, we avoid doing heavy aggregations but still keep 15-minute accuracy within a scalable response time.
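
The divide-and-merge step can be sketched like this (assuming requests are rounded to 15-minute boundaries, the finest granularity kept; the function names are ours):

```python
from datetime import datetime, timezone

# Sketch of the divide-and-merge read path: cover a requested range with
# the coarsest aligned buckets possible (15-minute at the edges, hourly
# and daily in the middle), then sum the partial results per bucket table.
SIZES = [900, 3600, 86400]   # 15 min, 1 hour, 1 day, in epoch seconds

def split(start: int, end: int, level: int = 2):
    """Yield (bucket_size, bucket_start) pairs covering [start, end)."""
    if start >= end:
        return
    size = SIZES[level]
    head_end = -(-start // size) * size     # round start up to this size
    tail_start = (end // size) * size       # round end down to this size
    if level == 0:
        for t in range(start, end, size):   # start/end assumed 15-min aligned
            yield size, t
    elif head_end >= tail_start:
        yield from split(start, end, level - 1)   # range too small: go finer
    else:
        yield from split(start, head_end, level - 1)        # head: finer
        for t in range(head_end, tail_start, size):         # middle: coarse
            yield size, t
        yield from split(tail_start, end, level - 1)        # tail: finer

start = int(datetime(2019, 9, 10, 23, 15, tzinfo=timezone.utc).timestamp())
end = int(datetime(2019, 9, 12, 17, 15, tzinfo=timezone.utc).timestamp())
for size, t in split(start, end):
    print(size, datetime.fromtimestamp(t, tz=timezone.utc))
```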

Counter service UI

We allow data analysts and data scientists to onboard counters by themselves from a dedicated web portal. After a counter is submitted, the Counter service takes care of the integration and parses the logic at runtime.

Counter service UI

Backend integration

We provide an SDK for quicker and better integration. Engineers only need to provide the counter identifier ID (shown in the UI) and the time duration in the query; under the hood we use a gRPC protocol to communicate across services (see the sketch below). We divide the query time window into smaller granularities, fetch from the different time-series tables, and then merge the results. We also provide a short-TTL cache layer to absorb uncommon traffic from clients, such as network retries or traffic throttling. The service is designed to target 100K QPS.
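
The SDK interface itself isn’t shown in this post, so the snippet below is only a sketch of what a client-side read might look like; every name in it is an assumption:

```python
from datetime import timedelta

# Hypothetical sketch of a client-side read through the Counter SDK;
# the class, endpoint, and method names are assumptions, not the real API.
class CounterClient:
    def __init__(self, endpoint: str, cache_ttl_secs: int = 5):
        self.endpoint = endpoint
        self.cache_ttl_secs = cache_ttl_secs   # short-TTL cache for retries

    def get(self, counter_id: str, duration: timedelta, key: str) -> int:
        # Under the hood: divide `duration` into 15min/hourly/daily buckets,
        # fetch each pre-aggregated table over gRPC, and merge the results.
        return 0   # stubbed; the real SDK performs the gRPC calls

client = CounterClient("counter-service.internal:8080")
count = client.get("fraud:counter:cashless_pair_payments",
                   timedelta(days=7), key="driver_1:pax_9")
```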

Monitoring the Counter service

The Counter service dashboard helps track human errors made while editing counters in real time. The Counter service sends alerts to a Slack channel to notify users if there is any error.

Counter service dashboard

We set up Datadog to monitor multiple system metrics. The figure below shows a portion of stream processing and counter writing. In this example, the total stream QPS reaches 5k at peak hour, and about 4k counters are saved to the storage tier. These numbers will keep climbing, without an upper limit, as more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the resource payload of the Counter service to the offline data lake in order to complete the counter replay function for data scientists. We are working on a project called “time traveler”. As the name indicates, it will not only serve online transactional data, but also support historical data analytics and provide more flexibility for counter inventions and experiments.

There are more automation steps we plan to do, such as adding a replay button on the web portal, and hooking up with the offline big data engine to trigger the analytics jobs. The performance metrics will be collected and displayed on the web portal. A single platform would be able to manage both the online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are often inputs to a rule, and one rule usually needs many counters working together. We need to provide better backend integration with Griffin, and we plan to minimize the engineering effort currently needed to use counters on Griffin. A counter will then become an automated input variable on Griffin, configurable on the web portal by any user.

Save Your Place with Grab!

Post Syndicated from Grab Tech original https://engineering.grab.com/save-your-place-with-grab

Do you find it tedious to type and search for your destination, or have a hard time remembering the address of the friend you are going to meet? It can be really confusing to keep track of the many addresses you frequent on a regular basis. To solve this pain point, Grab rolled out a new feature called Saved Places in January 2019 across Southeast Asia.

With Saved Places, you can save an address and also add a label like “Home”, “Work”, “Gym”, etc which makes finding and selecting an address for booking a ride or ordering your favourite food a breeze!

Never forget your addresses again!

To use the feature, fire up your Grab app, head to the “Saved Places” section on the app navigation bar and start adding all your favourite destinations such as your home, your office, your favourite mall or the airport and you are done with the hassle of typing them again.

Save your place with Grab!

 

Hola! Your saved addresses are just a click away when you order a ride or your favourite meal.

Inspiration behind the work

We at Grab continuously engage with our customers to understand how we can outserve them better. Difficulty in choosing the correct address was one of the key pieces of feedback shared by our customers. Our drivers shared funny stories about places that have similar names but completely different locations, e.g. Sime Road is in Bukit Timah but Simei Road is in Simei, almost 20 km away; Nicoll Highway is in Kallang but Nicoll Drive is in Changi, almost 20 km away. In such cases, even when users use an address frequently, there remains scope for misselection.

Data-Driven Decisions

Our vast repository of data and insights has helped us identify and solve some challenging problems. Our analysis of millions of transport bookings and food orders revealed that customers usually visit five to seven unique locations and order food at one or two addresses.

One intriguing yet intuitive insight was a set pattern in users’ booking behaviour during weekdays. A set of passengers mostly commute between two addresses, probably going to the office in the morning and coming back home in the evening. These identifiable clusters of pick-up and drop-off locations during peak hours supported our hypothesis that users rely on a small set of locations for their Grab bookings. The pictures below show such clusters in Singapore and Jakarta, where passengers generally commute to and fro in the morning and evening respectively.

Save your place with Grab!

 

This insight also motivated us to test out the concept of user created labels which allows the users to mark their saved places with their own labels like “Home”, “Changi Airport”, “Sis’s House” etc. Initial experiment results were extremely encouraging and we got significantly higher usage and repeat rates from users.

A group of cross-functional teams (Analytics, Product, Design, Engineering, etc.) came together, worked backwards from the customer, brainstormed multiple ideas, and finalised a product approach. We then conducted in-depth user research and usability testing to ensure that the final product met user expectations and was easy to understand and use.

And users love it!

Since the launch, we have seen significant user adoption of the feature. More than 14 million users have saved close to 45 million places. That’s ~3 places per user!

Customers from Singapore and Myanmar tend to save around 3 addresses each whereas customers from Indonesia, Malaysia, Philippines, Thailand, Vietnam and Cambodia save 2 addresses each. A customer from Indonesia has saved a whopping 1,191 addresses!

Users across South East Asia have adopted the feature and as of today, a significant portion of our bookings are made using a saved place for either pickup or drop off. If you were curious, here are the most frequently used labels for saving addresses in Singapore (left) and Indonesia (right):

Save your place with Grab!

 

Apart from saving home and office addresses our customers are also saving their child’s school address and places of worship. Some of them are also saving their favourite shopping destinations.

Another observation, as some may have guessed, concerns clusters of home addresses. Home addresses in Singapore are evenly scattered across the island (map on upper left), but they are concentrated in specific pockets of the city in Jakarta (map on lower left). However, office addresses are concentrated in specific areas in both cities: the CBD and Changi area in Singapore (map on upper right) and central Jakarta (map on lower right).

Save your place with Grab!

 

This is only the beginning

We’re constantly striving to improve the user experience with Grab and make it as seamless as possible. We have only taken the first steps with Saved Places and the path forward involves deeper understanding of user behaviour with the help of saved places data to create a more personalised experience. This is just the beginning and we’re planning to launch some very innovative features in the coming months.

No More Forgetting to Input ERP Charges – Hello Automated ERP!

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-erp-charges

ERP, standing for Electronic Road Pricing, is a system used to manage road congestion in Singapore. Drivers are charged when they pass through ERP gantries during peak hours. ERP rates vary for different roads and time periods based on the traffic conditions at the time. This encourages people to change their mode of transport, travel route or time of travel during peak hours. ERP is seen as an effective measure in addressing traffic conditions and ensuring drivers continue to have a smooth journey.

Did you know that Singapore has a total of 79 active ERP gantries? Did you also know that every ERP gantry changes its fare 10 times a day on average? For example, total ERP charges for a journey from Ang Mo Kio to Marina will cost $10 if you leave at 8:50am, but $4 if you leave at 9:00am on a working day!

Imagine how troublesome it would have been for Grab’s driver-partners who, on top of having to drive and check navigation, would also have had to remember each and every gantry they passed, calculate the total fare, and then manually add the charges to the total ride cost at the end of the ride.

In fact, based on our driver-partners’ feedback, missing out on ERP charges was listed as one of their top-most pain points. Not only did the drivers find the entire process troublesome, it also led to earnings loss as they would have had to bear the cost of the ERP fares.

We’re glad to share that, as of 15th March 2019, we’ve successfully resolved this pain point for our driver-partners by introducing automated ERP fare calculation!

So, how did we achieve automating the ERP fare calculation for our drivers-partners? How did we manage to reduce the number of trips where drivers would forget to enter ERP fare to almost zero? Read on!

How we approached the Problem

The question we wanted to solve was: how do we create an impactful feature to make sure that driver-partners have one less thing to handle when they drive?

We started by looking at the problem at hand. ERP fares in Singapore are very dynamic; they change based on the day and time.

Caption: Example of ERP fare changes on a normal weekday in Singapore

 

We wanted to create a system which can identify the dynamic ERP fares at any given time and location, while simultaneously identifying when a driver-partner has passed through any of these gantries.

However, that wasn’t enough. We wanted this feature to be scalable to every country where Grab operates, like Indonesia, Thailand, Malaysia, the Philippines, and Vietnam. We started studying the ERP (or toll, as it is known locally) systems in other countries and realized that every country has its own style of calculating tolls. While in Singapore ERP charges for cars and taxis are the same, Malaysia applies different charges for cars and taxis. Similarly, Vietnam has different tolls for 4-seaters and 7-seaters. Indonesia and Thailand have coupled gantries, where you pay at only one of the gantries: suppose A and B are coupled gantries; if you passed through A, you won’t need to pay at B, and vice versa. This is where our Ops team came to the rescue!

Boots on the Ground!

Collecting all the ERP or toll data for every country is no small feat, recalls Robinson Kudali, program manager for the project. “We had around 15 people travelling across the region for 2-3 weeks, working on collecting data from every possible source in every country.”

Getting the right geographical coordinates for every gantry is very important. We track driver GPS pings frequently, identify the nearest road to each GPS ping, and check for the presence of a gantry using its coordinates. The entire process has to be very accurate; an incorrect gantry location can easily lead to a miscalculated fare.
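
Conceptually, the detection step reduces to a proximity check between a GPS ping and the surveyed gantry coordinates. The sketch below simplifies away the nearest-road matching step, and the coordinates, gantry IDs, and detection radius are made up for illustration:

```python
import math

# Simplified sketch of gantry detection from a GPS ping; gantry data,
# IDs, and the detection radius are made up, and the nearest-road
# matching step described above is omitted.
GANTRIES = {
    "ERP_01": (1.2841, 103.8625),   # (lat, lon)
    "ERP_02": (1.3007, 103.8370),
}
DETECTION_RADIUS_M = 30

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    r = 6_371_000
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def gantries_passed(ping_lat: float, ping_lon: float):
    """Return IDs of gantries within the detection radius of a GPS ping."""
    return [gid for gid, (lat, lon) in GANTRIES.items()
            if haversine_m(ping_lat, ping_lon, lat, lon) <= DETECTION_RADIUS_M]

print(gantries_passed(1.28412, 103.86252))   # ['ERP_01']
```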

Bayu Yanuaragi, our regional mapops lead, explains – “To do this, the first step was to identify all toll gates for all expressways & highways in the country. The team used various mapping software to locate and plot all entry & exit gates using map sources, open data and more importantly government data as references. Each gate was manually plotted using satellite imagery and aligned with our road layers in order to extract the coordinates with a unique gantry ID.”

Location precision is vital in creating the dataset as it dictates whether a toll gate will be detected by the Grab app or not. Next step was to identify the toll charge from one gate to another. Accuracy of toll charge per segment directly reflects on the fare that the passenger pays after the trip.

Caption: ERP gantries visualisation on our map – The purple bars are the gantries that we drew on our map

 

Once the data compilation is done, the team conducts fieldwork to verify its integrity. If data gaps are identified, modifications are made accordingly.

Upon submission of the output, stack engineers perform a higher-level quality check of the content in staging.

Lastly, we worked with a local team of driver-partners who volunteered to make sure the new system is fully operational and the prices are correct. Inconsistencies observed were reported by these driver-partners, and then corrected in our system.

Closing the loop

Creating a strong dataset did help us predict correct fares, but we needed something that could also handle the dynamic behaviour of changing toll statuses. For example, the Singapore government revises ERP fares every quarter, and there can also be ad-hoc changes, like activating or deactivating gantries, on an ongoing basis.

Garvee Garg, Product Manager for this feature explains: “Creating a system that solves the current problem isn’t sufficient. Your product should be robust enough to handle all future edge case scenarios too. Hence we thought of building a feedback mechanism with drivers.”

In case our ERP fare estimate isn’t correct or there are changes to ERPs on the ground, our driver-partners can provide feedback to us. This feedback flows directly to the Customer Experience team, who do the initial investigation, and from there to our Ops team. A dedicated person from the Ops team checks the validity of the feedback and recommends updates. On average, it takes only 1 day to update the data from when we receive the feedback from the driver-partner.

However, validating the driver feedback was a time-consuming process. We needed a tool that could ease the life of the Ops team by helping them debug each and every case.

Hence the ERP Workflow tool came into the picture.

99% of the time, feedback from our driver-partners is about error cases. When feedback comes in, this tool allows the Ops team to check the entire ride history of the driver and map the driver’s ride trajectory against all the underlying ERP gantries at that particular point in time. The Ops team can then identify whether the ERP fare calculated by our system, or the one reported by the driver, is right.

This is only the beginning

By creating a system that can automatically calculate and key in ERP fares for each trip, Grab is proud to say that our driver-partners can now drive with less hassle and focus more on the road which will bring the ride experience and safety for both the driver and the passengers to a new level!

The Automated ERP feature is currently live in Singapore and we are now testing it with our driver-partners in Indonesia and Thailand. Next up, we plan to pilot in the Philippines and Malaysia and soon to every country where Grab is in – so stay tuned for even more innovative ideas to enhance your experience on our super app!

To know more about what Grab has been doing to improve the convenience and experience for both our driver-partners and passengers, check out other stories on this blog!

Tourists on GrabChat!

Post Syndicated from Grab Tech original https://engineering.grab.com/tourist-chat-data-story

Just over two years ago we introduced GrabChat, Southeast Asia’s first of its kind in-app messaging platform. Since then we’ve added all sorts of useful features to it. Auto-translated messages, the ability to send photos, and even voice messages! It’s been a great tool to facilitate smoother communications between our driver-partners and our passengers, and one group in particular has found it incredibly useful: tourists!

Now, we’ve analysed tourist data before, but we were curious about how GrabChat in particular has served this demographic. So we looked for interesting insights using sampled tourist chat data from Singapore, Malaysia, and Indonesia for the period of December 2018 to March 2019. That’s more than 3.7 million individual GrabChat messages sent by tourists! Here’s what we found.

Average chats per booking per country

Looking at the volume of the chats being transmitted per booking, we can see that the “chattiest” tourists are from East Timor, Nigeria, and Ukraine with averages of 6.0, 5.6, and 5.1 chats per booking respectively.

Then we wondered: if tourists from all over the world are talking this much to our driver-partners, how are they actually communicating if their mother-tongue is not the local language?

Need a Translator?

When we go to another country, we eat all the heavenly good food, fall in love with the culture, and admire the scenery. Language and communication barriers shouldn’t get in the way of all of that. That’s why Grab’s Chat feature has got it covered!

With Grab’s in-house translation solutions, any Grab passenger can send messages in their preferred language without fear of being misunderstood – or not understood at all! Their messages will be automatically translated into Bahasa Indonesia, Bahasa Melayu, Simplified Chinese, Thai, or Vietnamese depending on where they are. This applies not only to Grab’s transport services: GrabChat can be used when ordering GrabFood too!

Percentage of translated GrabChat messages
Indonesia saw the highest usage of translations on a by-booking basis!

 

Let’s look deeper into the tourist translation statistics for each country with the donut charts below. We can see that the most popular translation route for tourists in Indonesia was from English to Indonesian. The story is different for Singapore and Malaysia: we can see that there are translations to and from a more diverse set of languages, reflecting a more multicultural demographic.

Percentage of translated GrabChat messages
The most popular translation routes for tourist bookings in Indonesia, Malaysia, and Singapore.

 

Tap for Templates!

GrabChat also provides a chat template feature. Templates are prewritten messages that you can send with just one tap! Did we mention that they are translated automatically too? Passengers and drivers can have a fast, simple, and translated conversation with each other without typing a single word; sometimes, templates are really all you need.

Examples of chat templates, as they appear in GrabChat!
Examples of chat templates, as they appear in GrabChat!

 

As if all this wasn’t convenient enough, you can also make your own custom templates! Use them for those repetitive, identical messages you always seem to be sending out, like telling your driver where the hotel lobby is, how to navigate right to your doorstep, or even a quick description of what you look like to make it easier for the driver to find you!

Template message usage

Taking a look at individual country data, tourists in Indonesia used templates the most with almost 60% of all of them using a template in their conversations at least once. Malaysia and Singapore saw lower but still sizeable utilisation rates of this feature, at 53% and 33% respectively.

Template message usage percentage
Indonesia saw the highest usage of templates on a by-booking basis.

 

In our analysis, we found an interesting insight: there was a positive correlation between template usage and the success rate of rides. Overall, bookings that used templates in their conversations saw 10% more completions than bookings that didn’t.

Template vs completed bookings

Picture this: a hassle-free experience

A picture says a thousand words, and for tourists using GrabChat’s image feature, those thousand words don’t even need to be translated. Instead of typing out a description of where they are standing for pickup, they can just click, snap, and send an image!

Our data revealed that GrabChat’s image functionality is most frequently used in areas where tourist traffic is highest. In fact, the image function in GrabChat saw the most use in pickup areas such as airports, large shopping malls, public transport stations, and hotels, where it is harder for drivers to find their passengers in the crowds. Even with our super convenient Entrances feature, every little bit of information goes a long way to help your driver find you!

Pickup locations

If we take it a step further and look at the actual areas within the cities where images were sent the most, we see that our initial hypothesis still holds.

Pickup locations
The top 5 pickup areas per country in which images were the most prevalent in GrabChat (for tourists).

 

In Singapore, we see the most images being sent out at the Downtown Core area- this area contains the majestic Marina Bay Sands, the Merlion statue, and the Esplanade, amongst other iconic attractions.

In Malaysia, the highest image usage occurs at none other than the Kuala Lumpur City Centre (KLCC) itself. This area includes the Twin Towers, a plethora of malls and hotels, Bukit Bintang (a bustling and lively night-life zone), and even an aquarium.

Indonesia’s top location for image chats is Kuta. A beach village in Bali, Kuta is a tourist hotspot with surfing, water parks, bars, budget-friendly yet delicious food, and numerous cultural attractions.

Speak up!

Allowing for two-way communication via GrabChat empowers both passengers and drivers to improve their journeys by divulging useful information and asking clarifying questions: how many bags do you have? Does your car accommodate my pet dog? I’m standing by the lobby with my two kids. These are the sorts of things that are talked about in GrabChat messages.

During the analysis of our multitudes of wide-ranging GrabChat conversations, we picked up some pro-tips for you to get a Grab ride with even more convenience and ease, whether you’re a tourist or not:

Tip #1: Did some shopping on your trip? Swamped with bags? Send a message to your driver to let them know how many pieces of luggage you have with you.

As one might expect, chats that have keywords such as “luggage” or “baggage” (or any other related term) occur the most when riders are going to, or leaving, an airport. Most of the tourists on GrabChat asked the drivers if there was space for all of their things in the car. Interestingly, some of them also told the drivers how to recognise them for pickup based off of the descriptions of their bags!

Tip #2: Your children make good landmarks! If you’re in a crowded spot and you’re worried your driver can’t find you, drop them a message to let them know you’re that family with a baby and a little girl in pigtails.

When it comes to children, we found that passengers mainly use them to help identify themselves to the driver. Messages like “I’m with my two kids” or “We are a family with a baby” came up numerous times, and served as descriptions to facilitate fast pickup. These sorts of chats were the most prevalent in crowded areas like airports and shopping centres.

Tip #3: Don’t get caught off guard- be sure your furry friends have a seat!

Taking a look at pet related chats, we learned that our tourists have used GrabChat to ask clarifying questions to the driver. Passengers have likely considered that not every driver or vehicle is accommodating towards animals. The most common type of message was about whether pets are allowed in the vehicle. For example: “Is it okay if I bring a puppy?” or “I have a dog with me in a carrier, is that alright?”. Better safe than sorry! Alternatively, if you’re travelling with a pet, why not see if GrabPet is available in your country?

From the chat content analysis we have learned that tourists do indeed use GrabChat to talk to their drivers about specific details of their trip. We see that the chat feature is an invaluable tool that anyone can use to clear up any ambiguities and make their journeys more pleasant.

Bubble Tea Craze on GrabFood!

Post Syndicated from Grab Tech original https://engineering.grab.com/bubble-tea-craze-on-grabfood

Bigger and More Bubble Tea!

Bubble tea orders on GrabFood have been increasing constantly and dramatically, with an impressive regional average growth rate of 3,000% in 2018! Just look at the percentage increases over 2018, across all countries!

Bubble tea growth by country in 2018*:

  • Indonesia: >8,500% growth from Jan 2018 to Dec 2018
  • Philippines: >3,500% growth from Jun 2018 to Dec 2018
  • Thailand: >3,000% growth from Jan 2018 to Dec 2018
  • Vietnam: >1,500% growth from May 2018 to Dec 2018
  • Singapore: >700% growth from May 2018 to Dec 2018
  • Malaysia: >250% growth from May 2018 to Dec 2018

*Time period: January 2018 to December 2018, or from the time GrabFood was launched.

What’s driving this growth is not just die-hard bubble tea fans who can’t go a week without drinking this sweet treat, but a growing bubble tea fan club in Southeast Asia. The number of bubble tea lovers on GrabFood grew over 12,000% in 2018, and there’s no sign of it stopping!

With increasing consumer demand, how is Southeast Asia’s bubble tea supply catching up?  As of December 2018, GrabFood has close to 4,000 bubble tea outlets from a network of over 1,500 brands – a 200% growth in bubble tea outlets in Southeast Asia!

Bubble-Tea-Lover growth on GrabFood

If this stat doesn’t stick, here is a map to show you how much bubble tea orders in different Southeast Asian cities have grown!

Maps of bubble tea merchants on GrabFood

And here is a little shoutout to our star merchants including Chatime, Coco Fresh Tea & Juice, Macao Imperial Tea, Ochaya, Koi Tea, Cafe Amazon, The Alley, iTEA, Gong Cha, and Serenitea.

Just how much do you drink?

On average, Southeast Asians drink 4 cups of bubble tea per person per month on GrabFood. Thai consumers top the regional average by 2 cups, consuming about 6 cups per person per month. They are closely followed by Filipino consumers, who drink an average of 5 cups per person per month.

Average bubble tea consumption by cups per person per month

Favourite Flavours!

Have a look at the dazzling array of Bubble Tea flavours available on GrabFood today and you’ll find some uniquely Southeast Asian flavours like Chendol, Durian, and Gula Melaka, as well as rare flavours like salted cream and cheese! Can you spot your favourite flavours here?

Bubble tea flavour consumption per month

Let’s break it down by the country that GrabFood serves, and see who likes which flavours of Bubble Tea more!

Bubble tea flavour consumption per month by country

Top the Toppings!

Pearl is the unbeatable top topping in most countries, except Vietnam, where the No. 1 topping turned out to be cheese pudding! The top 3 toppings for your favourite bubble tea are:

Top list of toppings

Best Time for Bubble Tea!

Don’t we all need a cup of sweet bubble tea in the afternoon to get us through the day? Across Southeast Asia, GrabFood’s data reveals that most people order bubble tea to accompany their meals at lunch, or as a perfect midday energizer!

Times of the day when most people order bubble tea

Conclusion

So, hazelnut or chocolate? Pearl or (and) pudding (who says we can’t have the best of both worlds!)? The options are abundant and the choice is yours to enjoy!

If you have a sweet tooth, or simply want to reward yourself with Southeast Asia’s most popular drink, go ahead – you are only a couple of taps away from savouring this cup full of delight!

Guiding you Door-to-Door via our Super App!

Post Syndicated from Grab Tech original https://engineering.grab.com/poi-entrances-venues-door-to-door

Remember landing at an airport or going to your favourite mall and the hassle of finding the pickup spot when you booked a cab? When there are about a million entrances, it can get particularly annoying trying to find the right pickup location!

Rolling out across Southeast Asia is a brand new booking experience from Grab, designed to make it easier for you to make a booking at large venues like airports, shopping centres, and tourist destinations! With the new booking flow, it is not only easier to select one of the pre-designated Grab pickup points, you can also find text and image directions to help you navigate your way through the venue for a smoother rendezvous with your driver!

Inspiration behind the work

Finding the pick-up point closest to you, let alone predicting it, is incredibly challenging, especially when you are inside huge buildings or in crowded areas. Neeraj Mishra, Product Owner for Places at Grab, explains: “We rely on GPS data to understand a user’s location, which can be tricky when you are indoors or surrounded by skyscrapers. Since the satellite signal has to go through layers of concrete and steel, it becomes weak, which adds to the inaccuracy. Furthermore, ensuring that passengers and drivers have the same pick-up point in mind can be tricky, especially at venues that have multiple entrances.”

Marina One POI

Grab’s data analysis revealed that “rendezvous distance” (walking distance between the selected pick-up point and where the car is waiting) is more than twice the Grab average when the booking is made from large venues such as airports.

To solve this issue, Grab launched “Entrances” (the green dots on the map) last year, which lists the various pick-up points available at a particular building and shows them on the map, allowing users to easily choose the one closest to them and ensuring their drivers know exactly where they want to be picked up. Since then, Grab has created more than 120,000 such entrances, and we are delighted to report that the average rendezvous distance across all countries has been steadily going down!

Decreasing rendezvous distance across region

One problem remained

But there was still one common pain point to be solved. Just because a passenger has selected the pick-up point closest to them doesn’t mean it’s easy for them to find it. This is particularly challenging at very large venues like airports and shopping centres, and especially difficult if the passenger is unfamiliar with the venue, for example, a tourist landing at Jakarta Airport for the very first time. To deliver an even smoother booking and pick-up experience, Grab has rolled out a new feature called Venues – the first in the region – that gives passengers in-app photo and text directions to the pick-up point closest to them.

Let’s break it down! How does it work?

Whether you are a local or a foreigner on holiday or business trip, fret not if you are not too familiar with the place that you are in!

Let’s imagine that you are now at Singapore Changi Airport: your new booking experience will look something like this!

Step 1: Fire the Grab app and click on Transport. You will see a welcome screen showing you where you are!

Welcome to Changi Airport

Step 2: On booking screen, you will see a new pickup menu with a list of available pickup points. Confirm the pickup point you want and make the booking!

Booking screen at Changi Airport

Step 3: Once you’ve been allocated a driver, tap on the bubble to get directions to your pick-up point!

Driver allocated at Changi Airport

Step 4: Follow the landmarks and walking instructions and you’ve arrived at your pick-up point!

Directions to pick-up point at Changi Airport

Curious about how we got this done?

Data-Driven Decisions

Based on a thorough data analysis of historical bookings, Grab identified key venues across our markets in Southeast Asia. Then we dispatched our Operations team to the ground, to identify all pick up points and perform detailed on-ground survey of the venue.

Operations Team’s Leg Work

Nagur Hassan, Operations Manager at Grab, explains the process: “For the venue survey process, we send a team equipped with the tools required to capture the details, like cameras, wifi and bluetooth scanners etc. Once inside the venue, the team identifies strategic landmarks and clear direction signs that are related to drop-off and pick-up points. Team also captures turn-by-turn walking directions to make it easier for Grab users to navigate – For instance, walk towards Starbucks and take a left near H&M store. All the photos and documentations taken on the sites are then brought back to the office for further processing.”

Quality Assurance

Once the data is collected, our in-house team checks the quality of the images and data. We also mask people’s faces and vehicle number plates to hide any identity-related information. As of today, we have collected 3,400+ images for 1,900+ pick-up points belonging to 600 key venues! This effort took more than 3,000 man-hours in total! And we aim to cover more than 10,000 such venues across the region in the next few months.

This is only the beginning

We’re constantly striving to improve location accuracy for our passengers by using advanced machine learning and constant feedback mechanisms. We understand GPS may not always give the most accurate determination of your current location, especially in crowded areas and skyscraper districts. This is just the beginning, and we’re planning to launch some very innovative features in the coming months! So stay tuned for more!