All posts by Grab Tech

How we implemented domain-driven development in Golang

Post Syndicated from Grab Tech original https://engineering.grab.com/domain-driven-development-in-golang

Partnerships have always been core to Grab’s super app strategy. We believe in collaborating with partners who are the best at what they do, combining their expertise with what we’re good at so that we can bring high-quality new services to our customers, while at the same time creating new opportunities for the merchant- and driver-partners in our ecosystem.

That’s why we launched GrabPlatform last year: to make it easier for partners to either integrate Grab into their services, or integrate their services into Grab.

In view of that, part of the GrabPlatform team’s mission is to make it easy for partners to integrate with Grab services. These partners are external companies that would like to offer Grab’s services, such as ride-booking, through their own websites or applications. To do that, we decided to build a website to serve as a one-stop shop that allows them to self-serve these integrations.

The challenges we faced with the conventional approach

In the process of building this website, our team noticed that the majority of functions and responsibilities were added to files without proper segregation. A single file would contain more than 500 lines of code. Each of these files was imported by different parts of the source code, resulting in an unstructured codebase. Any change to the existing functions risked breaking existing functionality; we realized then that we needed to proactively plan for the future. Hence, we decided to use the principles of Domain-Driven Design (DDD) and idiomatic Go. This blog demonstrates how we leveraged those concepts to design a modern application.

How we implemented DDD in our codebase

Here’s how we went about solving our unstructured codebase using DDD principles.

Step 1: Gather domain (business) knowledge

We collaborated closely with our domain experts (in our case, this was our product team) to identify functionality and flow. From them, we discovered the following key points:

  • After creating a project, developers are added to the project.
  • The domain experts wanted the ability to add other products (e.g. Pricing service, ETA service, GrabPay service) to their projects.
  • They wanted the ability to create multiple authentication clients to access the above products.

Step 2: Break down domain knowledge into bounded context

Now that we had gathered the required domain knowledge (i.e. what our code needed to reflect to our partners), it was time to use the DDD strategic tool Bounded Context to break down problems into subcontexts. Here is a graphical representation of how we converted the problem into smaller units.

Bounded Context

We identified several dependencies between the units involved in the project. Here are some examples:

  • The project domain overlapped with the product and developer domains.
  • Our RideBooking project can only exist if it has products like the RideBooking APIs, and not the other way around.

What this means is a product can exist independent of the project, but a project will have no significance without any product. In the same way, a project is dependent on the developers, but developers can exist whether or not they belong to a project.

Step 3: Identify value objects or entities (lowest layer)

Looking at the above bounded contexts, we figured out the building blocks (i.e. value objects or entities) needed to model the above functionality and flow.

// ProjectDAO ...
type ProjectDAO struct {
  ID            int64
  UUID          string
  Status        ProjectStatus
  CreatedAt     time.Time
}

// DeveloperDAO ...
type DeveloperDAO struct {
  ID            int64
  UUID          string
  PhoneHash     *string
  Status        Status
  CreatedAt     time.Time
}

// ProductDAO ...
type ProductDAO struct {
  ID            int64
  UUID          string
  Name          string
  Description   *string
  Status        ProductStatus
  CreatedAt     time.Time
}

// DeveloperProjectDAO maps developers to a project
type DeveloperProjectDAO struct {
  ID            int64
  DeveloperID   int64
  ProjectID     int64
  Status        DeveloperProjectStatus
}

// ProductProjectDAO maps products to a project
type ProductProjectDAO struct {
  ID            int64
  ProjectID     int64
  ProductID     int64
  Status        ProjectProductStatus
}

All the objects shown above have an ID field and are identifiable, hence they are entities and not value objects. But if we apply domain knowledge, DeveloperProjectDAO and ProductProjectDAO are not actually independent entities. The Project object is the aggregate root, since it must exist before its child entities, DeveloperProjectDAO and ProductProjectDAO, can exist.
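
In the domain layer, this aggregate boundary can be made explicit by modelling Project as the root entity through which the developer and product mappings are reached. The sketch below is only illustrative; the field names and the ProjectProductStatusActive constant are assumptions rather than our exact production types.

// Illustrative sketch only: these names do not mirror our production types exactly.
// Project is the aggregate root; its child mappings can only be reached
// and modified through it, which keeps the invariants in one place.
type Project struct {
  UUID       string
  Status     ProjectStatus
  Developers []*DeveloperProject
  Products   []*ProjectProduct
}

// AddProduct attaches a product mapping to the project. The mapping cannot
// exist on its own; it is always created through the aggregate root.
func (p *Project) AddProduct(productID int64) {
  p.Products = append(p.Products, &ProjectProduct{
    ProductID: productID,
    Status:    ProjectProductStatusActive, // hypothetical status constant
  })
}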

Step 4: Create the repositories

As part of this layering, we created interfaces to abstract the working logic of a particular domain (i.e. repositories). Here is an example of how we designed the repositories:

// ProductRepositoryImpl responsible for product functionality
type ProductRepositoryImpl struct {
  productDao storage.IProductDao // private field
}

type ProductRepository interface {
  GetProductsByIDs(ctx context.Context, ids []int64) ([]IProduct, error)
}

// DeveloperRepositoryImpl
type DeveloperRepositoryImpl struct {
  developerDAO storage.IDeveloperDao // private field
}

type DeveloperRepository interface {
  FindActiveAllowedByDeveloperIDs(ctx context.Context, developerIDs []interface{}) ([]*Developer, error)
  GetDeveloperDetailByProfile(ctx context.Context, developerProfile *appdto.DeveloperProfile) (IDeveloper, error)
}

Here is a look at how we designed our repository for aggregate root project:

// Unexported Struct
type productProjectRepositoryImpl struct {
  productProjectDAO storage.IProjectProductDao // private field
}

type ProductProjectRepository interface {
  GetAllProjectProductByProjectID(ctx context.Context, projectID int64) ([]*ProjectProduct, error)
}

// Unexported Struct
type developerProjectRepositoryImpl struct {
  developerProjectDAO storage.IDeveloperProjectDao // private field
}

type DeveloperProjectRepository interface {
  GetDevelopersByProjectIDs(ctx context.Context, projectIDs []interface{}) ([]*DeveloperProject, error)
  UpdateMappingWithRole(ctx context.Context, developer IDeveloper, project IProject, role string) (*DeveloperProject, error)
}

// Unexported Struct
type projectRepositoryImpl struct {
  projectDao storage.IProjectDao // private field
}

type ProjectRepository interface {
  GetProjectsByIDs(ctx context.Context, projectIDs []interface{}) ([]*Project, error)
  GetActiveProjectByUUID(ctx context.Context, uuid string) (IProject, error)
  GetProjectByUUID(ctx context.Context, uuid string) (*Project, error)
}

type ProjectAggregatorImpl struct {
  projectRepositoryImpl           // private field
  developerProjectRepositoryImpl  // private field
  productProjectRepositoryImpl    // private field
}

type ProjectAggregator interface {
  GetProjects(ctx context.Context) ([]*dto.Project, error)
  AddDeveloper(ctx context.Context, request *appdto.AddDeveloperRequest) (*appdto.AddDeveloperResponse, error)
  GetProjectWithProducts(ctx context.Context, uuid string) (IProject, error)
}
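
To show how these layers connect, here is a rough sketch of how a repository method might delegate to the DAO and map a storage row into a domain object. The FindByUUID method and the field mapping are hypothetical; the point is that callers above the repository never depend on database details.

// Illustrative sketch only; the DAO method shown here is hypothetical.
func (r *projectRepositoryImpl) GetProjectByUUID(ctx context.Context, uuid string) (*Project, error) {
  // Delegate persistence concerns to the storage layer.
  dao, err := r.projectDao.FindByUUID(ctx, uuid)
  if err != nil {
    return nil, err
  }
  // Map the storage representation into a domain entity so that callers
  // above this layer work only with domain types.
  return &Project{
    UUID:   dao.UUID,
    Status: dao.Status,
  }, nil
}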

Step 5: Identify Domain Events

The functions described in Step 4 only return the IDs of the developers and products, which convey little information to users. In order to provide developer and product information, we use the domain-event technique to return the actual product and developer attributes.

A domain event is something that happened in a bounded context that you want another bounded context of the domain to be aware of. For example, if there are new updates in the developer domain, it’s important to convey these updates to the project domain. This propagation technique is termed a domain event. Domain events enable independence between different classes.

One way to implement it is seen here:

// file: project_aggregator.go
func (p *ProjectAggregatorImpl) GetProjects(ctx context.Context) ([]*dto.Project, error) {
  ....
  ....
  developers := p.EventHandler.Handle(DomainEvent.FindDeveloperByDeveloperIDs{developerIDs})
  ....
}

// file: event_type.go
type FindDeveloperByDeveloperIDs struct{ developerIDs []interface{} }

// file: event_handler.go
func (e *EventHandler) Handle(event interface{}) interface{} {
  switch op := event.(type) {
  case FindDeveloperByDeveloperIDs:
    developers, _ := e.developerRepository.FindDeveloperByDeveloperIDs(op.developerIDs)
    return developers
  case ....
    ....
  }
}
Domain Event

Some common mistakes to avoid when implementing DDD in your codebase:

  • Not engaging with domain experts. Not interacting with domain experts is a common mistake when using DDD. Talking to domain experts to get an understanding of the problem domain from their perspective is at the core of DDD. Starting with schemas or data modelling instead of talking to domain experts may create code based on a relational model instead of one built around a domain model.
  • Ignoring the language of the domain experts. Creating a ubiquitous language shared with domain experts is also a core DDD practice. This common language must be used in all discussions as well as in the code, e.g. in class and method names.
  • Not identifying bounded contexts. A common approach to solving a complex problem is breaking it down into smaller parts. Creating bounded contexts is breaking down a large domain into smaller ones, each handling one cohesive part of the domain.
  • Using an anaemic domain model. This is a common sign that a team is not doing DDD, and often a symptom of a failure in the modelling process. At first, an anaemic domain model often looks like a real domain model with correct names, but the classes lack functionality: they contain only Get and Set methods.

How the DDD model improved our software development

Thanks to this clean-up, we achieved the following:

  • Core functionality is evenly distributed across the codebase and not limited to just a few files.
  • The developers are aware of what each folder is responsible for by simply looking at the file naming and folder structure.
  • The risk of breaking major functionality through small changes is greatly reduced, and changing a feature is now more efficient.

The team now finds the code well structured, and new joiners require less hand-holding thanks to the simplicity of the structure.

Finally, and most importantly, we now have a system oriented towards our business needs. Everyone uses the same language and terms, and developers communicate better with the business team. Work is more efficient when it comes to building solutions around models that reflect how the business operates, instead of how the software operates.

Lessons Learnt

  • Use DDD to collaborate among all project disciplines (product, business, partner, and so on) and clearly understand the business requirements.
  • Establish a ubiquitous language to discuss domain-related concepts.
  • Use bounded contexts to break down complex domains into manageable parts.
  • Implement a layered architecture (i.e. DDD building blocks) to focus on particular aspects of the application.
  • To simplify your dependencies, use domain events to communicate between sub-bounded contexts.

Driving Southeast Asia forward through people-focused design

Post Syndicated from Grab Tech original https://engineering.grab.com/driving-sea-forward-through-people-focused-design

Southeast Asia is home to around 650 million people from diverse and comparatively different economic, political and social backgrounds. Many people in the region today rely on super apps like Grab to earn a daily living or get from A to B more efficiently and safely. This means that decisions made have real impact on people’s lives – so how do you know when your decisions are right or wrong?

In this post, I’ll share key customer insights that have guided my decisions and informed my design thinking over the last year whilst working as a product designer for Grab in Singapore. I’ve broken my learnings down into three areas for thinking about product development, and how each one addresses our customers’ needs.

  1. Relevance – does the design solve the customer problem? For example, loss of connectivity which is common in Southeast Asia should not completely prevent a customer from accessing the content on our app.
  2. Inclusivity – does the design consider the full range of customer diversity? For example, a driver waiting in the hot sun for his passenger can still use the product. Inclusive design covers people with a range of perspectives, disabilities and environments.
  3. Engagement – does the design evoke a feeling of satisfaction? For example, building a compelling narrative around your product that elicits higher engagement.

Under each of these areas, I’ll elaborate on how we’ve built empathy from customer insights and applied these to our design thinking.

But before jumping in, think about the lens which frames any customer experience – the mobile device. In Southeast Asia, the commonly used devices are inexpensive low-end devices made by OPPO, Xiaomi, and Samsung. Knowing which devices customers use helps us understand potential performance constraints, different screen resolutions, and custom Android UIs.

Designing for relevance  

Shopping mall in Medan, Indonesia
Shopping mall in Medan, Indonesia

Connectivity

In Southeast Asia, it’s not too hard to find public WiFi. However, the main challenge is finding a reliable network. Take this shopping mall in Medan, Indonesia. The WiFi connectivity didn’t live up to the modern infrastructure of the building. The locals knew this and used mobile data over spotty and congested connectivity. Mobile data is the norm for most people and 4G reach is high, but the power of the connections is relatively low.

Building empathy

To genuinely design for customers’ needs, designers at Grab regularly get out of the office to understand what people are doing in the real world. But how do we integrate empathy and compassion into the design process? Throughout this article, I’ll explain how the insights we gathered from around Southeast Asia can inform your decision-making process.

To simulate a loss of connectivity, switch to airplane mode and observe the current UI states and limitations. If you have the resources, create a 2G network to compare how bandwidth constrains page loading speeds. Network Link Conditioner for Mac and iOS, or Lighthouse in Chrome DevTools, can replicate a slower network.

Design implications

This diagram is from Scott Hurff’s book, Designing Products People Love. The book is amazing, but if you don’t have the time to read it, this article offers a quick overview.

Scott Hurff’s UI Stack
Scott Hurff’s UI Stack

An ideal state (the fully loaded experience) is primarily what a lot of designers think about when problem-solving. However, when connectivity is a common customer pain point, designers at Grab have to design for the less desirable Blank, Loading, Partial, and Error states in tandem with all the happy paths. Latency can make or break the user experience, so buffer wait times with visual progress to cushion each millisecond. The loading skeletons you see when you open Grab act as momentary placeholders for content and reduce the perceived latency of loading the full experience.

A loss of connectivity shouldn’t mean the end of your product’s experience. Prepare for connectivity issues by keeping screens alive through intuitive visual cues, messaging, and cached content.

Device type and condition

In Southeast Asia, people tend to opt for low-end or hand-me-down devices that can sometimes have cracked screens or degraded batteries. These devices are usually in circulation much longer than in developed markets, and the device’s OS might not be the latest version because of the perceived effort or risk of updating.

A driver’s device taken during research in Indonesia
A driver’s device taken during research in Indonesia

Building empathy

At Grab, we often use a range of popular, in-market devices to understand compatibility during the design process. Installing mainstream apps on a device with a small screen size, 512MB of internal memory, low resolution, and depleted battery life will provide insights into performance. If these apps have lite versions or Progressive Web Apps (PWA), try to understand the trade-offs in user experience compared to the parent app.

Grab’s passenger app on the left versus the driver app
Grab’s passenger app on the left versus the driver app

Design implications

Design for small screens first to reduce the chances of design debt later in the development lifecycle. For interactive elements, it’s important to think about all types of customers that will use the product and in what circumstances. For Grab’s driver-partners who may have their devices mounted to the dashboard, tap targets need to be larger and more explicit.  

Similarly, color contrast will vary depending on screen resolution and time of the day. Practical tests involve dimming the screen and standing near a window in bright sunshine (our HQ is in Singapore which helps!). To further improve accessibility, use a tool like Sketch’s Stark plugin to understand if contrast ratios are accessible to visually impaired customers. A general rule is to aim for higher contrast between essential UI components, text and interactive affordances.

Fancy transitions can look great on high-end devices but can appear choppy or delayed on older and less performant phones. Aim for simple animations to offer a more seamless experience.

Passenger verification to improve safety
Passenger verification to improve safety

Day-to-day budgeting

Many people in Southeast Asia earn a daily income, so it’s no surprise that prepaid mobile is more common than a monthly contract. This mindset of rationing on a day-to-day basis also extends to other essentials like washing powder and nappies. Data can be an expensive necessity, and customers are selective about the types of content that will consume a daily or weekly budget. Some customers might turn off data after getting a ride, and not turn it back on until another Grab service is required.

Building empathy

You can ration data consumption by not connecting to WiFi, or, more granularly, by turning off WiFi and using an app like Google’s Datally on Android to cap data usage. Starting low, at around 50MB per day, will increase your understanding of the data trade-offs you make and highlight the apps that require more data to perform certain actions.

Design implications

Where possible, avoid using video when SVG animations can be just as effective, scalable, and lightweight. For Grab’s passenger verification flow, we decided to move away from a video tutorial and keep data consumption to a minimum by utilising SVG animations. When a video experience is required, like Grab’s feed on the home screen, disabling autoplay and clearly distinguishing the media as video allows customers to decide whether to commit their data.

Design for inclusivity  

Mobile-only

The expression “mobile-first” has been bounced around for the last decade, but in Southeast Asia, “mobile-only” is probably more accurate. Most customers have never owned a tablet or laptop, and a mobile number is a more common method of registration than an email address. In the region, people rely more on social media and chat apps to keep up with news reports, events, and recommendations. Customers who sign up for a new Grab account prefer phone number and OTP (one-time password) registration over providing an email address and password. And anecdotally, from interviews conducted at Grab, customers didn’t feel the need for email when communication can take place via SMS, WhatsApp, or other messaging apps.

Building empathy

At Grab, we apply design thinking from a mobile-only perspective for our passenger, merchant,  and driver-partner experiences by understanding our customers’ journeys online and off.  These journeys are synthesized back in the office and sometimes recreated with video and physical artifacts to simulate the customer experience. It’s always helpful to remove smartwatches, put away laptops and use an in-market device that offers a similar experience to your customers.

Design implications

When onboarding new customers, offer a relevant sign-in method for a mobile-only customer, like phone number and social account registration. Grab’s passenger sign-up experience addresses these priorities with phone number first, social accounts second.  

Grab’s sign-in screen
Grab’s sign-in screen

PC-era icons are also widely misunderstood by mobile-only customers, so avoid a floppy disk to imply Save, or a folder to imply Change Directory, as these offer little symbolic meaning. Pairing icons with text can often reinforce meaning and quicken recognition. For example, a pencil icon alone can be confusing, so adding the word “Edit” provides more clarity.

Nightfall in Yogyakarta, Indonesia
Nightfall in Yogyakarta, Indonesia

Diversity and safety

This photo was taken in Yogyakarta, Indonesia. In the evening, women often formed groups to improve personal safety. In an online environment, women often face discrimination, harassment, blackmail, cyberstalking, and more.  Minorities in emerging markets are further marginalised due to employment, literacy, and financial issues.  

Building empathy

Southeast Asia has a very diverse population, and it’s important to understand gender, ethnic, and class demographics before you plan any research. Research recruitment at Grab involves working with local vendors to recruit diverse groups of customers for interviews and focus groups. When spending time with customers, we try to understand how diversity and safety factors contribute to the experience of the product.

If you don’t have the time and resources to arrange face-to-face interviews, I’d recommend this article for creating a survey: Respectful Collection of Demographic Data

Design for inclusivity

Allow people to control how they represent their identities through pseudonyms and avatars. But does this undermine trust on the platform? No, not really. Credit card registration, or more recently Grab’s passenger and driver selfie verification feature, has closed the loop on suspect accounts whilst maintaining everyone’s privacy and safety.

On the visual design side, our illustration and content guide incorporates diverse representations of ethnic backgrounds, clothing, physical ability, and social class. You can see examples in the app or through our Dribbble page. For user-generated content, allow people to report and flag abusive material. While data and algorithms can do a lot, facts and ethics cannot be policed by machine learning.

Language

In Southeast Asia and other emerging markets, customers may set their phone to a language which they aspire to learn but may not fully comprehend. Swipe, tap, drag, pinch, and other terms relating to specific interactions might not easily translate into the local language, and English might be the preferred language regardless of comprehension. It’s surprisingly common to attend an interview with a translator while the device’s UI is set to English.

A Grab pick-up point taken in Medan, Indonesia
A Grab pick-up point taken in Medan, Indonesia

Building empathy

If your app supports multiple languages, try setting your phone to a different language but know how to change it back again!  At Grab, we test design robustness by incorporating translated text strings into our mocks. Look for visual cues to infer meaning since some customers might be illiterate or not fully comprehend English.

Grab’s Safety Centre in different languages
Grab’s Safety Centre in different languages

Design for different languages, formats and visual cues

To reduce design debt later on, it’s a good idea to start with the smallest screen size and test the most vulnerable parts of the UI with translated text strings. Keep in mind, dates, times, addresses, and phone numbers may have different formats and require special attention. You can apply multiple visual cues to represent important UI states, such as a change in colour, shape and imagery.

Design for engagement

Sharing

From our research studies, word-of-mouth communication and consuming viral content via Instagram or Facebook was more popular than trawling through search results. The social aspect extends to the physical environment, where devices are sometimes shared between several people or, in some cases, one mobile is used by more than one user at a time. In numerous interviews, customers talked about not using biometric authentication so that family members could access their devices.

Building empathy

To understand the layers of personalisation, privacy, and security on a device, it’s worth borrowing a device from your research team or a friend’s phone (if they let you!). How far do you get before you need biometric authentication or a PIN to proceed? If you decide to wipe a personal device, what steps can you skip during setup, and how does that affect your experience afterwards?

Offline to Online: GrabNow connecting with driver
Offline to Online: GrabNow connecting with driver

Design for sharing

If necessary, facilitate device sharing through easy switching of accounts, and enable people to remove or hide private content after use. Allow content to be easily shared for both online and offline in-person situations. Using this approach, GrabNow allows passengers to find and connect with a nearby driver without having to pre-book and wait for a driver to arrive. This offline to online interaction also saves data and battery for the customer.

Support and tutoring

In Southeast Asia, people find troubleshooting issues through a help page troublesome and generally prefer human assistance, like speaking to someone through a call centre. The opportunity for face-to-face tutoring on how something works is often highly desired and is much more effective than the standard onboarding flows that many apps use. It’s not uncommon for people to visit one of the many physical phone stores to ask for help or get apps manually installed on their device.

Building empathy

Apart from speaking with your customers regularly, always look through the Play and App Store reviews for common issues. Understand your customers’ problems and the jargon they use to describe what happened. If you have a customer support team, the tickets created will be a key indicator of where your customers need the most support.

Help and Feedback: Design implications

Make support accessible through a variety of methods: online forms, email, and if possible, allow customers to call in. With in-app or online forms, try to use drop-downs or pre-populated quick responses to reduce typing, triage the type of support, and decrease misunderstanding when a request comes in.  When a customer makes a Grab transport booking for the first time, we assist the customer through step-by-step contextual call-outs.

Local aesthetics

This photo was taken in Medan, Indonesia, on the day of an important wedding. It was impressive to see hundreds of handcrafted, colourful placards lining the streets for miles, but perhaps more admirable that such an occasion was shared with the community and passers-by, and not only with the wedding guests.

A wedding celebration flower board in Medan, Indonesia
A wedding celebration flower board in Medan, Indonesia

These types of public displays are not exclusive to weddings in Southeast Asia; vibrant colours and decorative patterns are woven into the fabric of everyday life, indicative of a jovial spirit that many people in the region possess.

Building empathy

What are some of the immediate patterns and surfaces that exist in your workspace? Looking around your environment can provide a quick assessment of the visual stimuli that influence your decisions on a day-to-day basis.

Wall space can be incredibly valuable for displaying photos from your research trip, or online inspiration that recreates some of the visual imagery from your target markets. When speaking with your customers, ask to see their mobile wallpapers, and think about how fashion could also play a role in determining an aesthetic choice. Lastly, take time out when on a research trip to explore the streets and museums, and absorb some of the local culture.

Design to delight and surprise customers

Capture local inspiration on research trips and incorporate it into visual collections that can serve as a source of reference for colour, imagery, and textures. Find opportunities in your product to delight and engage customers through appropriate images and visuals. Grab’s marketing consent experience leverages illustrative visuals to help customers understand the different categories that require their consent.

For all our markets, we work with local teams on culturally sensitive visuals and imagery to ensure our content is not offensive and does not carry the wrong connotations.

My top 5 for guerrilla field research

If you don’t have enough time, stakeholder buy-in or budget to do research, getting out of the office to do your own is sometimes the only answer. Here are my top 5 things to keep in mind.

  1. Don’t jump in. Always start with observation to capture customers’ natural behaviour.
  2. Sanity check scripts. Your time and customers’ time is valuable; streamline your script and prepare for u-turns and potential Facebook and Instagram friend requests at the end!  
  3. Ask the right people. It’s difficult to know who wants to or has time for your 10-minute intercept. Look for individuals sitting around and not groups if possible (group feedback can be influenced by the most vocal person).
  4. Focus on the user. Never multitask when speaking to the user. Jotting notes on an answer sheet is less distracting than using your mobile or laptop (and less dangerous in some places!). Ask permission to record audio if you want to avoid note-taking altogether, but this does create more work later on.
  5. Use insights to enrich understanding. Insights are not trends and should be used in conjunction with quantitative data to validate decision making.

Feel inspired by this article and want to learn more? Grab is hiring across Southeast Asia and Seattle. Connect with me on LinkedIn or Twitter @PhilipMadeley to learn more about design at Grab.

Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse and passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • to keep fraudulent users away from our app or services,
  • to ensure our customers’ safety, and
  • to manage user identities so that users can securely log in to the Grab app.

The TIS team’s scope covers not just transport, but also food, delivery, and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are boolean conditions that determine whether the result is true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them in our backend service code and deployed our service frequently to use the latest rules.

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. It was very lightweight to trigger the rule deployment and enforce the rules.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined (but paid off) more than twice in the last 3 months

THEN we would still not allow this booking

The system scans through the rules one by one, and if it determines that a rule is tripped, it checks the other rules as well. In the example above, if a credit card has been declined more than twice in the last 3 months, the passenger will not be allowed to book even if they have a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules, as shown in the example below. We first check if the device used by the passenger is a high-risk one, e.g. an emulator used for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined bookings on the credit card), and then decide whether this booking should be precharged or not. If the passenger is using a low-risk device but is in a risky location where we traditionally see a lot of fraudulent bookings, we run further checks on the passenger’s booking history to decide if a precharge is also needed.
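
In code, this kind of nesting quickly becomes a tree of conditionals. The Go sketch below is purely illustrative (the types and helpers are invented, and our real rules were far more complex); it shows why each additional rule multiplies the paths a developer has to reason about.

// Illustrative only: the types and helpers are invented to show the shape of
// hard-coded rule logic, not our actual implementation.
func decide(b Booking) Decision {
  if isHighRiskDevice(b.Device) { // e.g. booking from an emulator
    return Decline
  }
  if hasDeclinedCard(b.Payment) {
    if hasGoodBookingHistory(b.Passenger) {
      return Precharge // Rule 2
    }
    return Decline // Rule 3
  }
  if isRiskyLocation(b.Pickup) && !hasGoodBookingHistory(b.Passenger) {
    return Precharge
  }
  return Allow // every new rule adds another branch to maintain
}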

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data, and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion-based bookings should be limited to 5 and total finished bookings should be greater than 6, otherwise unallocate the current ride)
  3. Developers implemented these rules and deployed the changes to production

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementation (e.g. for “X should be limited to 5”, should the implementation be X < 5 or X <= 5?).

Once a new rule was deployed, we monitored its performance. For example:

  • How often does the rule fire (after minutes, hours, or daily)?
  • Is it over-firing?
  • Does it conflict with other rules?

Depending on the implementation, each rule had dependencies on other rules. For example, if Rule 1 fired, we should not continue with Rules 2 and 3.

As a result, we couldn’t keep each rule evaluation independent. We had no way to observe the performance of a rule without other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger rates would drop significantly. This means we would have unstable performance metrics for Rules 2 and 3 even though their logic has not changed. It is very hard for a rule owner to monitor the performance of Rules 2 and 3.

When it comes to A/B testing a new rule, data scientists need to put a lot of effort into cleaning up the noise from other rules, and most of the time it is mission impossible.

After several misfiring events (wrong implementations of rules) and ever-longer rule development times (a week or more), we realized: “No one can handle this manually.”

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data: e.g. “what is the credit card risk score”, or “how many food bookings has this user ordered in the last 7 days”, and transforming this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules that can automate the review of rules without manual intervention.

Griffin now predicts billions of events every day, handling 100K+ queries per second (QPS) at peak time (on only 6 regular EC2 instances).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our own in-built tool, we researched common business rule engines available in the market, such as Drools, to check if we should use them. In the process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas most of our users come from a Python background).
  2. It has limited expressive power (https://en.wikipedia.org/wiki/Expressive_power_(computer_science)).
  3. It has limited support for some common math functions (e.g. factorial, greatest common divisor).
  4. The nature of our business requires dynamic datasets for predictions (for example, a rule may need only passenger booking history on Day 1, but passenger booking history, passenger credit balance, and passenger favorite places on Day 2). Drools, on the other hand, works best with a static dataset rather than a dynamic one.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal, and save the rules in the database.
  2. The Griffin Rule Engine reloads rules as soon as it detects any rule changes.
  3. The Data Orchestrator sends all the data (features) needed for a prediction (e.g. whether to block a ride based on the passenger’s past ride pattern and credit card risk) to the Rule Engine.
  4. Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input:JSON => Result:Boolean

We allow users (analysts, data scientists) to write Python-based rules on the WebUI to accommodate very complicated rules like:

len(list(filter(lambda x: x > 7, map(lambda x: math.factorial(x), [1, 2, 3, 4, 5, 6])))) > 2

This significantly optimizes the expressive power of rules.

To match and evaluate a rule more efficiently, we also have these other key components:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification. The example below shows the treatments for one such rule, at a checkpoint that detects credit-card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=[MY,PH] and verticals=[GrabBus, GrabCar].
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

 

When a rule is hit, users often want some dynamic values returned in addition to the treatments, e.g. the maximum ride distance allowed if we believe the booking is medium risk.

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The graph below shows the P99 latency of prediction requests measured from the load balancer side. (The real latency for each prediction is < 6ms; the metric peaks at 30ms because some batch requests contain 50 predictions in a single call.)

Prediction Request Latency P99

How did we achieve this?

  • The key idea is to make all computations in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from the database for each prediction. Instead, we keep a record called dirty_key, which stores the latest rule update timestamp. The rule engine actively checks this timestamp and triggers a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time.
  • The rule engine does not fetch any additional data; instead, all data comes from the Data Orchestrator.
  • So the whole prediction flow stays within CPU and memory (and if the data size is small, it can sit in the CPU cache only).
  • The Python GIL essentially limits a process to at most one active thread running at a time, no matter how many cores a CPU has. We use Gunicorn to wrap our service, so on the production machines we run (2 x $num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers); on a 2 vCPU machine, that works out to 5 worker processes. The formula is based on the assumption that, for a given core, one worker will be reading or writing to the socket while the other worker is processing a request.

The screenshot below is a process snapshot on a C5.large machine with 2 vCPUs. Note that only the green processes are active.

Process snapshot on C5.large machine

We also did a lot of trial-and-error performance tuning:

  • We used to use python-jsonpath-rw for JSONPath queries, but the performance was not strong enough. We switched to jmespath and observed about a 10ms latency reduction.
  • We use SQLAlchemy for DB queries and ORM. We enabled caching for some use cases, but it turned out to be over-optimized and served stale data. We ended up turning off some caching points to ensure data consistency.
  • For new dict/list creation, we prefer literal syntax (e.g. {} / []) over function calls such as dict() / list() (see the comparison below).
Native call and Function call
  • Use built-in functions (https://docs.python.org/3/library/functions.html). They are written in C and hard to beat.
  • We add randomness to the rule reload so that not all machines reload at the same time and cause latency spikes.
  • We cache atomic feature units as they are used, so that we don’t have to re-query them each time a checkpoint needs them.

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens the door for non-developers to make production changes in real time, which significantly boosts organisational productivity. In the past, a rule change needed a week for code change, testing, and deployment; now it takes just a minute.

But this also introduces extra risks. Anyone could take a whole checkpoint down, whether unintentionally or maliciously.

Hence, we implemented Shadow Mode and percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify its performance without any production impact, and if needed, a rule can be rolled out gradually from 1% all the way to 100%.
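
Conceptually, these two controls are a thin wrapper around rule evaluation. The Go sketch below is a simplified illustration only (Griffin itself is written in Python, and the names here are invented); it shows how shadow evaluation and a deterministic percentage gate might fit around a rule.

// Simplified illustration only; Griffin is implemented in Python.
func evaluate(rule Rule, payload Payload) Treatment {
  hit := rule.Run(payload) // the user-defined rule expression

  // Shadow Mode: record the outcome for analysis, but never act on it.
  if rule.ShadowMode {
    logShadowResult(rule.ID, hit)
    return DefaultTreatment
  }

  // Percentage rollout: only a deterministic slice of traffic is affected,
  // so a rule can be ramped from 1% to 100%.
  if hashToPercent(payload.ID) >= rule.RolloutPercent {
    return DefaultTreatment
  }

  if hit {
    return rule.Treatment
  }
  return DefaultTreatment
}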

We implemented version control for every rule change, so in case anything unexpected happens, we can quickly roll back to the previous version.

Version control
Rollback button

We also built an RBAC-based permission system, along with a change-approval flow, to make sure any production change involves at least two people (and the approver role has higher permissions).

Closing thoughts

Griffin evolved from a fraud-focused rule engine into a generic rule engine. It can apply to any rule at Grab. For example, Grab just launched appeal automation several days ago to cut in half the human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are very excited about it.

This was possible because, from the very beginning, we designed Griffin with minimal business context so that it could stay generic.

Since launch, we have observed an amazing adoption rate across various fraud, safety, and identity use cases. More interestingly, people now treat Griffin as an automation point for various integrations.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to an ACFE survey, companies lost USD 6.3 billion due to fraud, and organizations lose 5% of their revenue annually to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDKs for engineering integration, and builds rule engines and consoles for fraud detection.

One example of fraudulent behavior is an individual who masquerades as both driver and passenger and makes cashless payments to get promotions, for example to earn a one-dollar rebate in the next transaction. In our system, we analyze real-time booking and payment signals, compare them with the historical data of the driver-passenger pair, and create rules using the rule engine. We count the number of bookings for a driver-passenger pair within a given time frame, and this counter is provided as an input to the rule. If the counter value exceeds a predefined threshold, the rule evaluates the transaction as fraud, and we send this verdict back to the booking service.
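
As a rough sketch of how such a counter feeds a rule (the client API, key format, and threshold below are all hypothetical), the check boils down to comparing a windowed count against a threshold:

// Hypothetical sketch of a counter-backed rule check; the client API,
// key format, and threshold are illustrative only.
func isSuspiciousPair(ctx context.Context, driverID, passengerID string) (bool, error) {
  // Count cashless bookings for this driver-passenger pair over the last 7 days.
  key := fmt.Sprintf("fraud:counter:cashless_pair:%s:%s", driverID, passengerID)
  count, err := counterClient.Get(ctx, key, 7*24*time.Hour)
  if err != nil {
    return false, err
  }
  const threshold = 5 // tuned by analysts; not a real production value
  return count > threshold, nil
}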

The conventional method

Fraud detection is a job that requires cross-functional teams like data scientists, data analysts, data engineers, and backend engineers to work together. Usually, data scientists or data analysts come up with an idea offline and want to apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions between data scientists and data analysts. In the conventional method, the rule then needs to be communicated to engineers.

Automated solution using the Counter service

To overcome the challenges of the conventional method, the Trust Platform team decided to build the Counter service, a self-service platform that provides management tools for users and a computing engine for integrating with backend services. This service provides an interface, such as a UI-based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also decided to provide different data contracts, APIs, and SDKs to engineers so that the business verticals can use them quickly and easily.

The major engineering challenges faced in designing the Counter service

There are millions of transactions happening at Grab every day, which means we need to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many cashless payments happened for a driver-passenger pair. Due to the scale of Grab’s business, the potential combinations of drivers and passengers could be exponential. And this is only one use case; there could be hundreds of counters for different use cases. Hence, it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, we could potentially have an exponential number of passenger and driver combinations in a single counter. So it’s a great challenge to store the counters in the database and read and query them in real time. With billions of counter keys accumulating over a long period of time, the Trust team had to find a scalable way to write and fetch keys effectively and meet the clients’ SLAs.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time data scientists need a new type of counter, developers have to manually make code changes, such as adding a new stream, capturing the related data sets for the counter, and storing it in the fraud service, and then do a deployment to make the counter ready. The whole iteration usually takes two or more weeks, and if there are any changes from the data analysts’ side, which happens often, the loop starts again. The team had to eliminate this long loop of manual tasks by building a self-service interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what was written in the counters. That’s because the conventional counter data was stored in a Redis database to satisfy the query SLA, so they could not track the correctness of a counter’s value or its history. With the new solution, stakeholders get a real-time picture of what is stored in the counters using data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data Consistency Challenge/Issue

Most machine learning workflows need dedicated input data. However, when an anti-fraud model is trained using offline data from the data lake, it is difficult to use the same model in real time. This is because the model lacks a data contract and consistency with the data source. In this case, the Counter service becomes a data source by providing the counter values to the file system.

ML featuring

Counters are important features for ML models. Imagine a newly invented counter that data scientists need to evaluate; we need to provide a historical data set for it to work. The Counter service provides a counter replay feature, which allows data scientists to simulate counters using historical payloads.

In general, the Counter service is a bridge between online and offline datasets, data scientists, and engineers. There was technical debt with regard to data consistency and automation in the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We followed the principles of asynchronous data ingestion and synchronous transactions when designing the Counter service.

The diagram shows how the counters are generated and saved to the database.

How the counters are generated and saved to the database

Counter creation workflow

  1. User opens the Counter Creation UI and creates a new key “fraud:counter:counter_name”.
  2. The user configures the required fields.
  3. The Counter service detects the new counter creation, puts the new counter into the load script storage, and starts processing new counter events (see the counter write workflow below).

Counter write workflow

  1. The Counter service monitors multiple streams and assembles extra data from online data services (i.e. the Common Data Service (CDS), passenger service, hydra service, etc.), so that a rich dataset is also available to editors for each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool.

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Grab-Stats runs on top of ScyllaDB, a distributed NoSQL data store. We chose ScyllaDB because of its good in-memory aggregation performance for time series datasets. Compared with in-memory storage like AWS ElastiCache, it is 10 times cheaper and just as reliable in terms of stability. The p99 of reading from ScyllaDB is less than 150ms, which satisfies our SLA.

How we improved the Counter service performance

We used a multi-bucket strategy to improve the Counter service’s performance.

Background

Queries come with different time windows. Some counters are time sensitive and need to know what happened in the last 30 or 60 minutes. Other counters focus on the long term and need to know the events of the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve both small-range and long-term queries at the same time. The higher the accuracy required and the longer the time range, the more aggregation has to happen in the database, which means we would not be able to satisfy the SLA; otherwise we would need to block other processes, which degrades the service.

Solution for improving the query

We resolved this problem by using tables of different granularities. We pre-aggregated the signals into different time buckets, such as 15 minutes, 1 hour, and 1 day.

When a request comes in, its time range is divided across the buckets and the partial results are combined. For example, for a request from 9/10 23:15:20 to 9/12 17:20:18, the handler queries 15-minute buckets within the hour, hourly buckets for the rest of the same day, and daily buckets for the remaining 2 days. This way, we avoid heavy aggregations but still keep accuracy at the 15-minute level with a scalable response time.
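
A simplified sketch of this divide-and-conquer approach is shown below; the bucket selection and the readBucket helper are illustrative, not the actual implementation.

// Illustrative sketch: walk the requested range, always picking the largest
// pre-aggregated bucket (day, hour, or 15 minutes) that fits from the cursor.
func queryRange(ctx context.Context, key string, from, to time.Time) (int64, error) {
  var total int64
  for cursor := from; cursor.Before(to); {
    var bucket time.Duration
    switch remaining := to.Sub(cursor); {
    case remaining >= 24*time.Hour && cursor.Equal(cursor.Truncate(24*time.Hour)):
      bucket = 24 * time.Hour // daily table
    case remaining >= time.Hour && cursor.Equal(cursor.Truncate(time.Hour)):
      bucket = time.Hour // hourly table
    default:
      bucket = 15 * time.Minute // 15-minute table keeps accuracy at that level
    }
    count, err := readBucket(ctx, key, cursor, bucket) // hypothetical storage read
    if err != nil {
      return 0, err
    }
    total += count
    cursor = cursor.Add(bucket)
  }
  return total, nil
}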

Counter service UI

We allow data analysts and data scientists to onboard counters by themselves from a dedicated web portal. After a counter is submitted, the Counter service takes care of the integration and parses the logic at runtime.

Counter service UI

Backend integration

We provide an SDK for quicker and better integration. Engineers only need to provide the counter identifier (shown in the UI) and the time duration in the query. Under the hood, we use gRPC to communicate across services. We divide the query time window into smaller granularities, fetch from the different time series tables, and then combine the results. We also provide a short-TTL cache layer to absorb uncommon traffic from clients, such as network retries or traffic throttling. Our QPS target is 100K.

Monitoring the Counter service

The Counter service dashboard helps track human errors made while editing counters in real time. The Counter service sends alerts to a Slack channel to notify users if there is any error.

Counter service dashboard

We set up Datadog to monitor multiple system metrics. The figure below shows a portion of the stream processing and counter writing metrics. In this example, total stream QPS reaches 5k at peak hour, and about 4k counters per second are saved to the storage tier. These numbers will keep climbing, with no fixed upper limit, as more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the Counter service's resource payload to the offline data lake in order to complete the counter replay function for data scientists. We are working on a project called “time traveler”. As the name indicates, it will not only serve online transactional data, but also support historical data analytics and provide more flexibility for counter invention and experimentation.

There are more automation steps we plan to do, such as adding a replay button on the web portal, and hooking up with the offline big data engine to trigger the analytics jobs. The performance metrics will be collected and displayed on the web portal. A single platform would be able to manage both the online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are sometimes an input to a particular rule, and one rule usually needs many counters to work together. We need to provide better backend integration with Griffin, and we plan to minimize the engineering effort currently required to use counters there. A counter then becomes an automated input variable on Griffin, which any user can configure on the web portal.

About being a Principal Engineer at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/about-being-a-principal-engineer-at-grab

Over the past few years Grab has grown from a small startup to one of the largest technology companies in South-East Asia. Along with the company’s growth, the number of microservices, features and teams also grew substantially. At the time of writing this blog, we have around 350 microservices powering our super-app.

A great engineering team is a critical component of our success. As an engineer you have two career paths in front of you: an individual contributor role, or a management role. While a management role is generally better understood, this article clarifies what it means to be a principal engineer at Grab, which is one of the highest levels of our engineering career ladder.

Engineering Career Ladder

Improving the Quality

“You set the standard for engineering excellence in your technical family. Your architectures are exemplary in terms of efficiency, stability, extensibility, testability and the ability to evolve over time. Your software is robust in the presence of failures, scalable, and cost-effective. Your coding practices are exemplary in terms of code organization, clarity, simplicity, error handling, and documentation. You tackle intrinsically hard problems, acquiring expertise as needed. You decompose complex problems into straightforward solutions.” – Grab’s Engineering Career Ladder

 

So, what does a principal engineer do? As your career progresses from junior to senior to lead engineer, you take on more and more responsibilities and manage larger and larger systems. For example, a junior engineer might manage a specific component of a microservice, a senior engineer would be tasked with designing and operating an entire microservice or product, while a lead engineer would typically be concerned with the architecture at a team level.

The principal engineer level is akin to a senior manager: instead of indirectly managing people (manager of managers), you take care of the architecture of an entire sub-organisation, known as a Tech Family/Platform. These Tech Families usually have more than 50 engineers spread across multiple teams and function as a tiny company with their own business owners, designers, product managers, etc.

Challenging Projects

“You take engineering ownership of projects that may require the work of several teams to implement; you divide responsibilities so that each team can work independently and have the system come together into an integrated whole. Your projects often cross team, tech family, platform, and even R&D center boundaries. You solicit differing views and keep an open mind. You are adept at building consensus.” – Grab’s Engineering Career Ladder

 

As a principal engineer, your job is to solve larger problems and translate somewhat vague problems into a set of actionable items. You might be faced with a large problem such as “improve efficiency and interoperability of Grab’s transportation system.” You will need to understand the problem and the business impact, and see how it can be improved. It might require you to design new systems, change existing systems, understand the costs involved, and get the right people together to make it happen.

Solving such a problem all by yourself is pretty much impossible. You have to work with other managers and engineers as a team to make it happen. Help your lead and senior engineers design the right system by giving them a clear objective, but let them take care of the system-level architecture.

You will also need to work with managers, advise them to get things done, and get the right things prioritised by the team. While you don’t need to be well-versed in project management and agile methodologies, you do need to be able to plan ahead with your teams and have an understanding of how much time a project or migration will take.

A Tech Family can easily have 20 or more micro-services. You need to have a good understanding of their functional requirements and interactions. This is challenging as learning new things is always “uncomfortable” and takes time. You must reach out to engineers, product managers, and data scientists, ideally face-to-face to build empathy. Keep asking questions and try to understand how things work. You will also need to read the existing documentation and their code.

Technical Ownership

“You are the origin of significant technical contributions to our architecture and infrastructure.  You take technical ownership of the design and quality of the security, performance, availability, and operational aspects of the software built by one or more teams. You identify where your time is needed, transitioning between coding, design, and architecture based on project and team needs. You deliver software in ways that empower teams to self-service, providing clear adoption/migration paths.” – Grab’s Engineering Career Ladder

 

As a principal engineer you work together with the Head of Engineering and managers within the Tech Family and improve the quality of systems across the board. Typically, no-one tells you what needs to be done. You need to identify gaps, raise them and keep improving the systems.

You also need to learn how to manage your own time better so you can prioritise effectively. This boils down to knowing your strengths and weaknesses. For example, if you are really good at building distributed systems but have no clue about the latest and greatest designs in information security, get the right InfoSec engineers into that meeting and consider skipping it yourself. Avoid trying to do everything at once or attending every single meeting you get invited to – you still have to review code, design, and focus, so plan accordingly.

You will also need to understand the business impact of your decisions. For example, if you contribute to product features, know how impactful this feature is going to be to the organisation. If you don’t know it – ask the Product Manager responsible for it. If you work on a platform feature, for example improving the build system, know how it will help: saving 30 minutes of build time for every engineer each day is a huge achievement.

More often than not, you will have to drive migrations. This is akin to code refactoring, but at a system level, and involves a lot of collaboration with people. Understand what technical debt is and how it can be mitigated – a good architecture minimises technical debt, which in turn accelerates time-to-market and helps the business flourish.

Technical Leadership

“You amplify your impact by leading design reviews for complex software and/or critical features. You probe assumptions, illuminate pitfalls, and foster shared understanding. You align teams toward coherent architectural strategies.” – Grab’s Engineering Career Ladder

 

In Grab we have a process known as RFC (Request For Comments) which allows engineers to submit designs and ideas for a larger audience to debate. This is especially important given that our organisation is spread across several continents with research and development offices in Southeast Asia, the US, India and China. While any engineer is welcome to comment on these RFCs, it is a duty of lead and principal engineers’ to review them on a regular basis. This will help you to expand your knowledge of existing systems and help others with improving their designs.

Communication is a key skill that you need to keep improving and it is often the Achilles’ heel of many engineers who would rather be doing work in their corner without talking to anyone else. This is perfectly fine for a junior (or even some senior engineers) but it is critical for a principal engineer to communicate. Let’s break this down to a set of specific skills that you’d need to sharpen.

You need to be able to write effectively in order to convey your ideas to others. This includes knowing your audience and wording your message in such a way that readers can understand it. A technical design document whose audience is engineers is not written the same way as a design proposal aimed at product and business managers.

You need to be able to publicly present and talk about various projects that you are working on. This includes creation of slide decks with good visuals and distilling down months of work to just a couple of slides. The best way of learning this is to get out there and keep presenting your work – you will get better over time.

You also need to be able to drive meetings and discussions without wasting anyone’s time. As a technical leader, one of your key responsibilities is to get people moving in the same direction and driving consensus during meetings.

Teaching and Learning

“You educate other engineers, both at an individual level and at scale: keeping the engineering community up to date on advanced technical issues, technologies, and trends. Examples include onboarding bootcamps for new hires, interns, specific skill-gap training development, and sharing specialized knowledge to raise the technical bar for other engineers/teams/dev centers.”

 

A principal engineer is a technical leader, and as a leader you have the responsibility to mentor and coach fellow engineers, regardless of their level. In addition to code reviews, you can organise office hours in your team and knowledge-sharing sessions where everyone can present something. You can also help with bootcamps and help new hires get up to speed.

Most importantly, you will need to keep learning in whichever way works for you – reading journals, papers, and blog posts, watching recorded talks, attending conferences, and browsing through a variety of open-source projects. You will also learn from other Grabbers; even a junior engineer can teach you something, as we all have our strengths and weaknesses. Keep improving and working on yourself!

Data first, SLA always

Post Syndicated from Grab Tech original https://engineering.grab.com/data-first-sla-always

Introducing Trailblazer, the Data Engineering team’s solution for implementing change data capture (CDC) across all upstream databases. In this article, we explain why we needed to move away from periodic batch ingestion towards a real-time solution, and show how we achieved this through an end-to-end streaming pipeline.

Context

Our mission as Grab’s Data Engineering team is to fulfill 100% of SLAs for data availability to our downstream users. Our 40-person team is responsible for providing accurate and reliable data to data analysts and data scientists so that they can produce actionable reports that help Grab’s leadership team make data-driven decisions. We maintain data for a variety of business intelligence tools such as Tableau, Presto, and Holistics, as well as predictive algorithms for all of Grab.

We ingest data from multiple upstream sources, such as relational databases, Kafka, or third-party applications such as Salesforce or Zendesk. The majority of this source data exists in MySQL, and we run ETL pipelines to mirror any updates into our data lake. These pipelines are triggered on an hourly or daily basis and are powered by an in-house Loader application, which performs Spark batch ingestion and loading of data from source to sink.

Problems with the Loader application started to surface when Grab’s data exceeded the petabyte threshold. As such, for larger tables, the most practical method to ingest data was to perform ETL only on rows that were updated within a specified timeframe. This is akin to issuing the query

SELECT * FROM table WHERE updated >= [start_time] AND updated < [end_time]

Now imagine two situations. One: firing this query at a huge table without an updated field. Two: firing the same query at the huge table, this time without an index on the updated field. In the first scenario, the query will never work and we can never perform incremental ingestion on the table based on a timed window. The second scenario carries the danger of creating a high CPU load on the database replica that we are querying from. Neither has an ideal outcome.

One other problem that we identified was the unpredictability of growth in data volume. Tables smaller than one gigabyte were ingested by fully scanning the table and overwriting the data in the data lake. This worked out well for us until the table size increased exponentially, at which point our Spark jobs failed due to JDBC timeouts. If we were only dealing with a handful of tables, this issue could have been addressed by switching our data ingestion strategy from full scan to a timed window.

When assessing the issue, we discovered that there were hundreds of tables running under the full scan strategy, all of them potentially crashing our data system, all time bombs silently waiting to explode.

The team urgently needed a new approach to ETL. Our Loader application was highly coupled to upstream table characteristics. We needed to find solutions that were truly scalable, which meant decoupling our pipelines from the upstream.

Change data capture (CDC)

Much like event sourcing, any log change to the database is captured and streamed out for downstream applications to consume. This process is lightweight since any row level update to the table is instantly captured by a real time processor, avoiding the need for large chunked queries on the table. In addition, CDC works regardless of upstream table definition, so we do not need to worry about missing updated columns impacting our data migration process.

Binary Logs (binlogs) are the CDC agents of MySQL. All updates, insertions or deletions performed on the table are captured as a series of logged events containing the past state of the row and its newly modified state. Check out the binlogs reference to find out more.

In order to persist all binlogs generated upstream, our team created a Spark Structured Streaming application called Trailblazer. Trailblazer streams all MySQL binlogs to our data lake. These binlogs serve as a foundation for us to build Presto tables for data auditing and help to remove the direct dependency of our batch ETL jobs to the source MySQL.

Trailblazer is an amalgamation of various data streaming stacks. Binlogs are captured by Debezium, which runs on Kafka Connect clusters. All binlogs are sent to our Kafka cluster, which is managed by the Data Engineering Infrastructure team, and are streamed out to a real-time bucket via a Spark structured streaming application. Hourly or daily ETL compaction jobs ingest the change logs from the real-time bucket to materialize tables for downstream users to consume.

CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services

 

Some statistics

To date, we are streaming hundreds of tables across 60 Spark streaming jobs, and with the constant increase in Grab’s database instances, the numbers are expected to keep growing.

Designing Trailblazer streams

We built our streaming application using Spark structured streaming 2.3. Structured streaming was designed to remove the technical aspects of provisioning streams. Developers can focus on perfecting business logic without worrying about fundamentals such as checkpoint management or reading and writing to data sources.

Key architecture for Trailblazer streaming

 

In the design phase, we made sure to follow several key principles that helped in managing our streams.

Checkpoints have to be externally managed

Structured streaming manages checkpoints both in a local directory and in a ‘_metadata’ directory on S3 buckets, such that the state of the stream can be restored in the event of failure and restart.

This is all well and good, with two exceptions. First, changing the starting point of data ingestion meant ssh-ing into the machine and manipulating metadata, which could be extremely dangerous. Second, we could not assume the cluster would always be around, since clusters can die and be recreated with the data erased from their local disks or the distributed file system.

Our solution was a workaround at the application level. All checkpoints are stored in temporary directories with the current timestamp appended to the path (e.g. /tmp/checkpoint/job_A/1560697200/…). A monotonically increasing timestamp guarantees that the same directory will never be reused by new instances of the stream. This is why we never restore state from local disk; instead, we store all offsets in a highly available Redis cluster, with the Kafka topic as the key and a JSON map of partition : offset as the value.

Key

debz-schema-A.schema_A.table_B

Value

{"11":19183566,"12":19295602,"13":18992606[[a]](#cmnt1)[[b]](#cmnt2)[[c]](#cmnt3)[[d]](#cmnt4)[[e]](#cmnt5)[[f]](#cmnt6),"14":19269499,"15":19197199,"16":19060873,"17":19237853,"18":19107959,"19":19188181,"0":19193976,"1":19072585,"2":19205764,"3":19122454,"4":19231068,"5":19301523,"6":19287447,"7":19418871,"8":19152003,"9":19112431,"10":19151479}
Example of how offsets are stored in Redis as Key : Value pairs

 

Fortunately, structured streaming provides the StreamingQueryListener class, which we can use to record checkpoints after the completion of each microbatch. A hedged sketch of this approach is shown below.
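As a rough illustration, here is a hedged sketch of externalising Kafka offsets to Redis from a query listener. Note that the Python StreamingQueryListener API shown here only exists in newer PySpark releases (3.4+); on Spark 2.3 the equivalent listener would be written in Scala or Java. The Redis host and key format are assumptions.

```python
import json
import redis
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

r = redis.Redis(host="redis.internal", port=6379)   # assumed Redis endpoint
spark = SparkSession.builder.getOrCreate()

class OffsetCheckpointListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # Each Kafka source reports its end offsets as a JSON string, e.g.
        # {"debz-schema-A.schema_A.table_B": {"0": 19193976, "1": 19072585}}
        for source in event.progress.sources:
            if not source.endOffset:
                continue
            end_offsets = json.loads(source.endOffset)
            for topic, partition_offsets in end_offsets.items():
                r.set(topic, json.dumps(partition_offsets))

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(OffsetCheckpointListener())
```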

Streams must handle 0, 1 or 1 million records

Scalability is at the heart of all well-designed applications. Spark streaming jobs are built for scalability in the face of varying data volumes.

In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing

 

There are a few settings that we can configure to influence the degree of scalability for a streaming app

  • spark.dynamicAllocation.enabled=true gives spark autonomy to provision / revoke executors to suit the workload
  • spark.dynamicAllocation.maxExecutors controls the maximum job parallelism
  • maxOffsetsPerTrigger controls the maximum number of messages ingested from Kafka per microbatch
  • trigger controls the duration between microbatches and is a property of the DataStreamWriter class (the sketch after this list shows how these settings fit together)
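Here is a minimal sketch showing where each of these settings plugs in; the topic, paths, and values are illustrative assumptions only.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scalable-stream-sketch")
    .config("spark.dynamicAllocation.enabled", "true")     # executors scale with load
    .config("spark.dynamicAllocation.maxExecutors", "20")  # upper bound on parallelism
    .getOrCreate()
)

source = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "some-binlog-topic")
    .option("maxOffsetsPerTrigger", 50000)   # cap on messages ingested per microbatch
    .load()
)

writer = (
    source.writeStream
    .format("parquet")
    .option("path", "s3a://realtime-bucket/some-table/")
    .option("checkpointLocation", "/tmp/checkpoint/some-table/")
    .trigger(processingTime="30 seconds")    # duration between microbatches
)
# writer.start() would launch the query.
```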

Data as key health indicator

Scaling the number of streaming jobs without prior collection of performance metrics is a bad idea. There is a high chance that you will discover a dead stream when checking your stream hours after initialization. I’ll cite Murphy’s law as proof.

Thus we vigilantly monitored our data streams. We used tools such as Datadog for metric monitoring, Slack for on-call issue reporting, PagerDuty for urgent cases, and our in-house data auditor as a service (DASH) for reporting count discrepancies between streamed and source data. More details on monitoring are discussed later in this post.

Streams are ephemeral

Streams may die due to a hundred and one reasons so don’t blame yourself or your programming insecurities. Issues with upstream dependencies, such as a node within your Kafka cluster running out of disk space, could lead to partition unavailability which would crash the application. On one occasion, our streaming application was unable to resolve DNS when writing to AWS S3 storage. This amounted to multiple failures within our Spark job that eventually culminated in the termination of the stream.

In such cases, allow the stream to shut down gracefully, send out your alerts, and have a mechanism in place to retry the failed stream. We run all streaming jobs on Airflow, and any stream failure is automatically retried through a new task issued by the scheduler.

If you have had experience with large scale management of streams, please leave a comment so we can continue this discussion!

Monitoring data streams

Here are some key features that were set up to monitor our streams.

Running : Active jobs ratio

The number of streaming jobs could increase in the future, making it a challenge for the on-call team to track all the jobs that are supposed to be up and running.

One proposal is to track the number of jobs in production against the number of jobs that are actually running. By querying MySQL tables, we can filter out all the jobs that are meant to be active. Since Trailblazer streams are spark-submit jobs managed by YARN, we can query YARN’s ResourceManager REST API to retrieve all the jobs that are running. We then construct a ratio of running : active jobs and report it to Datadog. If the ratio is not 1 for an extended duration, an alert is issued for the on-call team to take action. A hedged sketch of this check is shown below.
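A hedged sketch of that ratio check might look like the following; the MySQL lookup is stubbed out, and the ResourceManager address, application name prefix, and metric name are assumptions.

```python
import requests
from datadog import statsd  # assumes the Datadog agent / dogstatsd is running locally

def count_active_jobs():
    # In practice this would query the MySQL table that tracks jobs meant to be
    # active; a hard-coded value stands in for that lookup here.
    return 60

def count_running_jobs(rm_host="http://yarn-rm.internal:8088"):
    # The ResourceManager REST API lists applications; filter to RUNNING Spark
    # apps whose names mark them as Trailblazer streams (assumed prefix).
    resp = requests.get(f"{rm_host}/ws/v1/cluster/apps",
                        params={"states": "RUNNING", "applicationTypes": "SPARK"})
    apps = resp.json().get("apps") or {}
    return sum(1 for app in apps.get("app", [])
               if app["name"].startswith("trailblazer-"))

def report_ratio():
    active = count_active_jobs()
    running = count_running_jobs()
    ratio = running / active if active else 0.0
    statsd.gauge("trailblazer.jobs.running_to_active_ratio", ratio)
    return ratio
```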

If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert

 

Microbatch runtime

We define a 30 second window for each microbatch and track the actual runtime using metrics reported by the query listener. A runtime that exceeds the designated window is a potential indicator that the streaming job is deprived of resources and needs to be scaled up.

Job liveliness

Each job reports its health by emitting a heartbeat count of 1 at the end of every microbatch, via the query listener. This is useful for detecting stale jobs (jobs that are registered as RUNNING in YARN but are actually hung).

Kafka offset divergence

In order to ensure that the rate at which messages are consumed keeps up with the rate at which they are produced, we sum up all presently ingested topic-partition offsets and compare that value to the sum of all topic-partition end offsets in Kafka. We then add alerting logic on top of these metrics to inform the on-call team if the difference between the two values grows too big.

It is important to track this offset divergence because streams can lag. Should the rate of consumption fall below the rate of message production, we run the risk of falling behind Kafka’s retention window, leading to data loss. A hedged sketch of this divergence check is shown below.
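As an illustration, a hedged sketch of such a divergence check using kafka-python and the Redis offsets described earlier might look like this; hosts and the alert threshold are assumptions.

```python
import json
import redis
from kafka import KafkaConsumer, TopicPartition

r = redis.Redis(host="redis.internal", port=6379)        # assumed Redis endpoint
consumer = KafkaConsumer(bootstrap_servers="kafka:9092")  # assumed brokers

def offset_divergence(topic):
    # Offsets we have already ingested, stored as {"0": 19193976, "1": ...}.
    ingested = json.loads(r.get(topic) or "{}")
    partitions = [TopicPartition(topic, p)
                  for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)        # current end offsets in Kafka
    total_end = sum(end_offsets.values())
    total_ingested = sum(int(v) for v in ingested.values())
    return total_end - total_ingested

lag = offset_divergence("debz-schema-A.schema_A.table_B")
if lag > 1_000_000:   # assumed threshold; in practice this feeds an alert
    print(f"Offset divergence too high: {lag}")
```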

Hourly data checks

DASH runs hourly and serves as our first line of defence to detect any data quality issues within the streams. We issue queries to the source database and our streaming layer to confirm that the ID counts of data created within the last hour match.

DASH helps in the early detection of upstream issues. We have noticed cases where our Debezium connectors failed and our checker reported less data than expected because there were no incoming messages to Kafka. A hedged sketch of such an hourly count check is shown below.
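A hedged sketch of an hourly DASH-style count comparison might look like the following; hosts, credentials, schema, and table names are assumptions, and the real DASH service is more involved.

```python
from datetime import datetime, timedelta
import pymysql
import prestodb

def hourly_counts(table="bookings"):
    end = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)

    # Count of IDs created in the last hour in the source MySQL table.
    mysql_conn = pymysql.connect(host="mysql.internal", user="dash",
                                 password="***", db="grab")
    with mysql_conn.cursor() as cur:
        cur.execute(
            f"SELECT COUNT(id) FROM {table} WHERE created_at >= %s AND created_at < %s",
            (start, end),
        )
        source_count = cur.fetchone()[0]

    # Count of the same IDs in the streamed/compacted table, via Presto.
    presto_conn = prestodb.dbapi.connect(host="presto.internal", port=8080,
                                         user="dash", catalog="hive", schema="datalake")
    cur = presto_conn.cursor()
    cur.execute(
        f"SELECT COUNT(id) FROM {table} "
        f"WHERE created_at >= timestamp '{start}' AND created_at < timestamp '{end}'"
    )
    lake_count = cur.fetchone()[0]
    return source_count, lake_count

source_count, lake_count = hourly_counts()
if source_count != lake_count:
    # In practice this would page the on-call team or post to Slack.
    print(f"Mismatch: source={source_count}, lake={lake_count}")
```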

DASH matches and mismatches reported to Slack

 

Materializing tables through compaction

Having CDC data in our data lake does not conclude our responsibilities. Batched compaction allows us to apply all captured CDC events and make them available as Presto tables for downstream consumption. The job triggers hourly and processes all changes made to the database within the past hour. For example, changes to a record are visible in the real-time layer immediately, but the latest state of the record will not be reflected in the compacted table until the next batch job runs. We addressed several streaming issues during this phase.

Deduplication of data

Trailblazer was not built to deliver exactly-once guarantees. We ensure that issues with duplicated CDC events are addressed during compaction.

Availability of all data until certain hour

We want to make sure that downstream pipelines use the output data of the hourly batch job only when the pipeline has all the records for that hour. If an event is processed late by streaming, the pipeline waits until the data is complete. In this case, we are consciously choosing consistency over availability for our downstream users. For example, missing a few insert booking records during peak hours due to consumer processing delays can generate wrong downstream results, leading to miscalculation of revenue. We want to start downstream processes only when the data for the hour or day is complete.

Need for latest state of each event

Our compaction job performs upserts on the data to ensure that our downstream users consume records in their latest state. A hedged sketch of this deduplication and upsert step is shown below.
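As an illustration, here is a hedged PySpark sketch of the deduplication and upsert idea, assuming the CDC events carry the full row together with an op flag and a binlog timestamp; paths and column names are assumptions.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

# CDC events for one hour; assumed to carry the full row plus 'op' and 'binlog_ts'.
cdc = spark.read.parquet("s3a://realtime-bucket/schema_A/table_B/dt=2019-09-10/hour=23/")

latest = (
    cdc.withColumn(
        "rn",
        F.row_number().over(Window.partitionBy("id").orderBy(F.col("binlog_ts").desc())),
    )
    .filter(F.col("rn") == 1)        # keep only the latest event per primary key
    .drop("rn")
    .filter(F.col("op") != "d")      # drop keys whose latest event is a delete
)
latest_rows = latest.drop("op", "binlog_ts")   # keep only the business columns

# Upsert: rows untouched this hour plus the latest version of updated rows.
existing = spark.read.parquet("s3a://datalake/schema_A/table_B/")
merged = (
    existing.join(latest_rows.select("id"), on="id", how="left_anti")
    .unionByName(latest_rows)
)
merged.write.mode("overwrite").parquet("s3a://datalake/schema_A/table_B_new/")
# In practice the new snapshot would be swapped in atomically for the old path.
```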

Future applications

Trailblazer is a milestone for the Data Engineering team as it represents our commitment to achieving large-scale data streams that reduce latencies for our end users. Moving ahead, our team will explore how we can further optimize streaming jobs by analysing data trends over time, and will build applications such as snapshot tables on top of the CDC events being streamed into our data lake.

Save Your Place with Grab!

Post Syndicated from Grab Tech original https://engineering.grab.com/save-your-place-with-grab

Do you find it tedious to type and search for your destination, or have a hard time remembering the address of the friend you are going to meet? It can be really confusing to keep track of the many addresses you frequent on a regular basis. To solve this pain point, Grab rolled out a new feature called Saved Places in January 2019 across Southeast Asia.

With Saved Places, you can save an address and also add a label like “Home”, “Work”, “Gym”, etc which makes finding and selecting an address for booking a ride or ordering your favourite food a breeze!

Never forget your addresses again!

To use the feature, fire up your Grab app, head to the “Saved Places” section on the app navigation bar and start adding all your favourite destinations such as your home, your office, your favourite mall or the airport and you are done with the hassle of typing them again.

Save your place with Grab!

 

Hola! Your saved addresses are just a click away when booking a ride or ordering your favourite meal.

Inspiration behind the work

We at Grab continuously engage with our customers to understand how we can outserve them better. Difficulty in choosing the correct address was one of the key feedback shared by our customers. Our drivers shared funny stories about places that have similar names but outrightly different locations e.g. Sime Road is in Bukit Timah but Simei Road is in Simei almost 20 km away, Nicoll Highway is in Kallang but Nicoll Drive is in Changi almost 20 km away. In this case, even though the users use the address frequently, there remains scope for misselection.

Data-Driven Decisions

Our vast repository of data and insights has helped us identify and solve some challenging problems. Our analysis of millions of transport bookings and food orders revealed that customers usually visit five to seven unique locations and order food at one or two addresses.

One intriguing yet intuitive insight that came out was a set pattern in users’ booking behaviour during weekdays. A set of passengers mostly commute between two addresses, probably going to the office in the morning and coming back home in the evening. These identifiable clusters of pick-up and drop-off locations during peak hours supported our hypothesis that users rely on a small set of locations for their Grab bookings. The pictures below show such clusters in Singapore and Jakarta, where passengers generally commute to and fro in the morning and evening respectively.

Save your place with Grab!

 

This insight also motivated us to test the concept of user-created labels, which allows users to mark their saved places with their own labels like “Home”, “Changi Airport”, “Sis’s House”, etc. Initial experiment results were extremely encouraging, and we got significantly higher usage and repeat rates from users.

A group of cross-functional teams – Analytics, Product, Design, Engineering, etc. – came together, worked backwards from the customer, brainstormed multiple ideas, and finalised a product approach. We then went on to conduct in-depth user research and usability testing to ensure that the final product met user expectations and was easy to understand and use.

And users love it!

Since the launch, we have seen significant user adoption of the feature. More than 14 million users have saved close to 45 million places. That’s ~3 places per user!

Customers from Singapore and Myanmar tend to save around 3 addresses each whereas customers from Indonesia, Malaysia, Philippines, Thailand, Vietnam and Cambodia save 2 addresses each. A customer from Indonesia has saved a whopping 1,191 addresses!

Users across South East Asia have adopted the feature and as of today, a significant portion of our bookings are made using a saved place for either pickup or drop off. If you were curious, here are the most frequently used labels for saving addresses in Singapore (left) and Indonesia (right):

Save your place with Grab!

 

Apart from saving home and office addresses our customers are also saving their child’s school address and places of worship. Some of them are also saving their favourite shopping destinations.

Another observation, as some may have guessed, concerns the clusters of home addresses. Home addresses in Singapore are evenly scattered across the island (map on upper left), but in Jakarta they are concentrated in specific pockets of the city (map on lower left). However, office addresses are concentrated in specific areas in both cities – the CBD and Changi area in Singapore (map on upper right) and along central Jakarta (map on lower right).

Save your place with Grab!

 

This is only the beginning

We’re constantly striving to improve the user experience with Grab and make it as seamless as possible. We have only taken the first steps with Saved Places and the path forward involves deeper understanding of user behaviour with the help of saved places data to create a more personalised experience. This is just the beginning and we’re planning to launch some very innovative features in the coming months.

No More Forgetting to Input ERP Charges – Hello Automated ERP!

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-erp-charges

ERP, standing for Electronic Road Pricing, is a system used to manage road congestion in Singapore. Drivers are charged when they pass through ERP gantries during peak hours. ERP rates vary for different roads and time periods based on the traffic conditions at the time. This encourages people to change their mode of transport, travel route or time of travel during peak hours. ERP is seen as an effective measure in addressing traffic conditions and ensuring drivers continue to have a smooth journey.

Did you know that Singapore has a total of 79 active ERP gantries? Did you also know that every ERP gantry changes its fare 10 times a day on average? For example, total ERP charges for a journey from Ang Mo Kio to Marina will cost $10 if you leave at 8:50am, but $4 if you leave at 9:00am on a working day!

Imagine how troublesome it would have been for Grab’s driver-partners who, on top of having to drive and check navigation, would also have had to remember each and every gantry they passed, calculating their total fare and then manually entering the charges to the total ride cost at the end of the ride.

In fact, based on our driver-partners’ feedback, missing out on ERP charges was listed as one of their top pain points. Not only did drivers find the entire process troublesome, it also led to earnings loss as they would have had to bear the cost of the ERP fares.

We’re glad to share that, as of 15th March 2019, we’ve successfully resolved this pain point for our driver-partners by introducing automated ERP fare calculation!

So, how did we achieve automating the ERP fare calculation for our drivers-partners? How did we manage to reduce the number of trips where drivers would forget to enter ERP fare to almost zero? Read on!

How we approached the Problem

The question we wanted to solve was – how do we create an impactful feature to make sure that driver-partners have one less thing to handle when they drive?

We started by looking at the problem at hand. ERP fares in Singapore are very dynamic; they change based on the day and time.

Caption: Example of ERP fare changes on a normal weekday in Singapore
Caption: Example of ERP fare changes on a normal weekday in Singapore

 

We wanted to create a system which can identify the dynamic ERP fares at any given time and location, while simultaneously identifying when a driver-partner has passed through any of these gantries.

However, that wasn’t enough. We wanted this feature to be scalable to every country where Grab operates – like Indonesia, Thailand, Malaysia, the Philippines, and Vietnam. We started studying the ERP systems (or tolls, as they are known locally) in other countries. We realized that every country has its own style of calculating tolls. While in Singapore ERP charges for cars and taxis are the same, Malaysia applies different charges for cars and taxis. Similarly, Vietnam has different tolls for 4-seaters and 7-seaters. Indonesia and Thailand have couple gantries, where you pay only at one of the two: suppose A and B are couple gantries; if you pass through A, you won’t need to pay at B, and vice versa. This is where our Ops team came to the rescue!

Boots on the Ground!

Collecting all the ERP or toll data for every country is no small feat, recalls Robinson Kudali, program manager for the project. “We had around 15 people travelling across the region for 2-3 weeks, working on collecting data from every possible source in every country.”

Getting the right geographical coordinates for every gantry is very important. We track driver GPS pings frequently, identify the nearest road to each GPS ping, and check for the presence of a gantry using its coordinates. The entire process requires you to be very accurate; an incorrect gantry location can easily lead to us miscalculating the fare. A simplified sketch of this kind of proximity check is shown below.
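As a simplified illustration, a proximity check between a GPS ping and known gantry coordinates could look like the sketch below; the coordinates and the 30-metre threshold are made up, and the real system first snaps pings to roads.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in metres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

GANTRIES = [
    # (gantry_id, latitude, longitude) – made-up coordinates for illustration
    ("ERP-01", 1.3000, 103.8000),
    ("ERP-02", 1.3521, 103.8198),
]

def gantries_near(ping_lat, ping_lon, threshold_m=30):
    """Return the IDs of gantries within threshold_m of a GPS ping."""
    return [gid for gid, lat, lon in GANTRIES
            if haversine_m(ping_lat, ping_lon, lat, lon) <= threshold_m]
```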

Bayu Yanuaragi, our regional mapops lead, explains – “To do this, the first step was to identify all toll gates for all expressways & highways in the country. The team used various mapping software to locate and plot all entry & exit gates using map sources, open data and more importantly government data as references. Each gate was manually plotted using satellite imagery and aligned with our road layers in order to extract the coordinates with a unique gantry ID.”

Location precision is vital in creating the dataset, as it dictates whether a toll gate will be detected by the Grab app or not. The next step was to identify the toll charge from one gate to another. The accuracy of the toll charge per segment directly reflects on the fare that the passenger pays after the trip.

Caption: ERP gantries visualisation on our map – The purple bars are the gantries that we drew on our map

 

Once the data compilation was done, the team would conduct fieldwork to verify its integrity. If data gaps were identified, modifications would be made accordingly.

Upon submission of the output, stack engineers would perform a higher-level quality check of the content in staging.

Lastly, we worked with a local team of driver-partners who volunteered to make sure the new system is fully operational and the prices are correct. Inconsistencies observed were reported by these driver-partners, and then corrected in our system.

Closing the loop

Creating a strong dataset did help us predict correct fares, but we also needed something that allows us to handle the dynamic behaviour of changing toll statuses. For example, the Singapore government revises ERP fares every quarter, and there can also be ad-hoc changes, like activating or deactivating gantries, on an ongoing basis.

Garvee Garg, Product Manager for this feature explains: “Creating a system that solves the current problem isn’t sufficient. Your product should be robust enough to handle all future edge case scenarios too. Hence we thought of building a feedback mechanism with drivers.”

If our ERP fare estimate isn’t correct or there are changes to ERPs on the ground, our driver-partners can provide feedback to us. This feedback flows directly to the Customer Experience team, which does the initial investigation, and from there to our Ops team. A dedicated person from the Ops team checks the validity of the feedback and recommends updates. On average, it takes only one day to update the data from when we receive feedback from a driver-partner.

However, validating the driver feedback was a time-consuming process. We needed a tool that could ease the life of the Ops team by helping them debug each and every case.

Hence the ERP Workflow tool came into the picture.

99% of the time, feedback from our driver-partners is about error cases. When feedback comes in, this tool allows the Ops team to check the driver’s entire ride history and map the ride trajectory against all the underlying ERP gantries at that point in time. The Ops team can then identify whether the ERP fare calculated by our system, or the one reported by the driver, is correct.

This is only the beginning

By creating a system that can automatically calculate and key in ERP fares for each trip, Grab is proud to say that our driver-partners can now drive with less hassle and focus more on the road, bringing the ride experience and safety for both the driver and the passengers to a new level!

The Automated ERP feature is currently live in Singapore and we are now testing it with our driver-partners in Indonesia and Thailand. Next up, we plan to pilot in the Philippines and Malaysia and soon to every country where Grab is in – so stay tuned for even more innovative ideas to enhance your experience on our super app!

To know more about what Grab has been doing to improve the convenience and experience for both our driver-partners and passengers, check out other stories on this blog!

How We Built A Logging Stack at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/how-built-logging-stack

And Solved Our Inhouse Logging Problem

Problem:

Let me take you back a year ago at Grab. When we lacked any visualizations or metrics for our service logs. When performing a query for a string from the last three days was something only run before you went for a beverage.

When a service stops responding, Grab’s core problems were and are:

  • We need to know it happened before the customer does.
  • We need to know why it happened.
  • We need to solve our customers’ problems fast.

We had a hodgepodge of log-based solutions for developers when they needed to figure out the above, or why a driver never showed up, or why a customer wasn’t receiving our promised promotions. These included logs in a cloud-based storage service (which could take hours to retrieve), a SaaS provider constantly timing out on our queries, or even asking our SREs to fetch logs from the likely machines for the service engineer, a rather laborious process.

Here’s what we did with our logs to solve these problems.

Issues:

Our current size and growth rate ruled out several available logging systems. By size, we mean a LOT of data and a LOT of users who search through hundreds of billions of logs to generate reports. Or who track down that one user who managed to find that pesky corner case in our code.

When we started this project, we generated 25TB of logging data a day. Our first thought was “Do we really need all of these logs?”. To this day our feeling is “probably not”.

However, we can’t always define what another developer can and cannot do. Besides, this gave us an amazing opportunity to build something to allow for all that data!

Some of our SREs had used the ELK Stack (Elasticsearch / Logstash / Kibana). They thought it could handle our data and access loads, so it was our starting point.

How We Built a Multi-Petabyte Cluster:

Information Gathering:

It started with gathering numbers. How much data did we produce each day? How many days were retained? What’s a reasonable response time to wait for?

Before starting a project, understand your parameters. This helps you spec out your cluster, get buy-in from higher ups, and increase your success rate when rolling out a product used by the entire engineering organization. Remember, if it’s not better than what they have now, why will they switch?

A good starting point was opening the floor to our users. What features did they want? If we offered a visualization suite so they can see ERROR event spikes, would they use it? How about alerting them about SEGFAULTs? Hands down the most requested feature was speed; “I want an easy webUI that shows me the user ID when I search for it, and get all the results in <5 seconds!”

Getting Our Feet Wet:

New concerns always pop up during a project. We’re sure someone has correlated the time spent in R&D with the number of problems. We had an always-moving target, since as our proof of concept began, our daily log volume kept increasing.

Thankfully, using Elasticsearch as our data store meant we could fully utilize horizontal scaling. This let us start with a simple 5 node cluster as we built out our proof-of-concept (POC). Once we were ready to onboard more services, we could move into a larger footprint.

The specs at the time called for about 80 nodes to handle all our data. But if we designed our system correctly, we’d only need to increase the number of Elasticsearch nodes as we enrolled more customers. Our key operating metrics were CPU utilization, heap memory needed for the JVM, and total disk space.

Initial Design:

First, we set up tooling to use Ansible both to launch a machine and to install and configure Elasticsearch. Then we were ready to scale.

Our initial goal was to keep the design as simple as possible, opting to allow each node in our cluster to perform all responsibilities. In this setup, each node behaves as all four of the available types:

  • Ingest: Used for transforming and enriching documents before sending them to data nodes for indexing.
  • Coordinator: Proxy node for directing search and indexing requests.
  • Master: Used to control cluster operations and determine a quorum on indexed documents.
  • Data: Nodes that hold the indexed data.

These were all design decisions made to move our proof of concept along, but in hindsight they might have created more headaches down the road with troubleshooting, indexing speed, and general stability. Remember to do your homework when spec’ing out your cluster.

It’s challenging to figure out why you are losing master nodes because someone filled up the field data cache performing a search. Separating your nodes can be a huge help in tracking down your problem.

We also decided to further reduce complexity by going with ingest nodes over Logstash. But at the time the documentation wasn’t great, so we had a lot of trial and error in figuring out how they work, particularly compared to something more battle-tested like Logstash.

If you’re unfamiliar with ingest node design, they are lightweight proxies to your data nodes that accept a bulk payload, perform post-processing on documents, and then send the documents to be indexed by your data nodes. In theory, this helps keep your entire pipeline simple. And in Elasticsearch’s defense, ingest nodes have made massive improvements since we began.

But adding more ingest nodes means ADDING MORE NODES! This can create a lot of chatter in your cluster and cause more complexity when troubleshooting problems. We’ve seen an ingest node failing in an odd way cause larger cluster concerns than just a failed bulk send request.

Monitoring:

This isn’t anything new, but we can’t overstate the usefulness of monitoring. Thankfully, we already had a robust tool called Datadog with an additional integration for Elasticsearch. Seeing your heap utilization over time, then breaking it into smaller graphs to display the field data cache or segment memory, has been a lifesaver. There’s nothing worse than a node falling over due to an OOM with no explanation and just hoping it doesn’t happen again.

At this point, we’ve built out several dashboards which visualize a wide range of metrics, from query rates to index latency. They tell us if log ingestion drops sharply or if circuit breakers are tripping. And yes, Kibana has some nice monitoring pages for some cluster stats. But to know each node’s JVM memory utilization on a 400+ node cluster, you need a robust metrics system.

Pitfalls:

Common Problems:

There are many blogs about the common problems encountered when creating an Elasticsearch cluster and Elastic does a good job of keeping blog posts up to date. We strongly encourage you to read them. Of course, we ran into classic problems like ensuring our Java objects were compressed (Hints: Don’t exceed 31GB of heap for your JVM and always confirm you’ve enabled compression).

But we also ran into some interesting problems that were less common. Let’s look at some major concerns you have to deal with at this scale.

Grab’s Problems:

Field Data Cache:

So, things are going well, all your logs are indexing smoothly, and suddenly you’re getting Out Of Memory (OOMs) events on your data nodes. You rush to find out what’s happening, as more nodes crash.

A visual representation of your JVM heap’s memory usage is very helpful here. You can always hit the Elasticsearch API, but after adding more than 5 nodes to your cluster, this kind of breaks down. Also, you don’t just want to know what’s going on while a node is down, but what happened before it died.

Using our graphs, we determined the field data cache went from virtually zero memory used in the heap to 20GB! This forced us to read up on how this value is set, and, as of this writing, the default value is still 100% of the parent heap memory. Basically, this breaks down to allowing 70% of your total heap being allocated to a single search in the form of field data.

Now, this should be a rare case and it’s very helpful to keep the field names and values in memory for quick lookup. But, if, like us, you have several trillion documents, you might want to watch out.

From our logs, we tracked down a user who was sorting by the _id field. We believe this is a design decision in how Kibana interacts with Elasticsearch. A good counter argument would be a user wants a quick memory lookup if they search for a document using the _id. But for us, this meant a user could load into memory every ID in the indices over a 14 day period.

The consequences? 20+GB of data loaded into the heap before the circuit breaker tripped. It then only took 2 queries at a time to knock a node over.

You can’t disable indexing that field, and you probably don’t want to. But you can prevent users from stumbling into this and disable the _id field in the Kibana advanced settings. And make sure you re-evaluate your circuit breakers. We drastically lowered the available field data cache and removed any further issues. A hedged sketch of lowering the fielddata circuit breaker is shown below.
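For illustration, a hedged sketch of tightening the fielddata circuit breaker through the cluster settings API might look like this; the 20% limit and the host are assumptions, not the values we actually used.

```python
import requests

# Lower the fielddata circuit breaker so a single heavy search trips it well
# before field data can exhaust the heap.
resp = requests.put(
    "http://elasticsearch.internal:9200/_cluster/settings",
    json={
        "persistent": {
            "indices.breaker.fielddata.limit": "20%"
        }
    },
)
resp.raise_for_status()
print(resp.json())
```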

Translog Compression:

At first glance, compression seems an obvious choice for shipping shards between nodes. Especially if you have the free clock cycles, why not minimize the bandwidth between nodes?

However, we found compression between nodes can drastically slow down shard transfers. By disabling compression, shipping time for a 50GB shard went from 1h to 20m. This was because Lucene segments are already compressed, a new issue we ran into full force and are actively working with the community to fix. But it’s also a configuration to watch out for in your setup, especially if you want a fast recovery of a shard.

Segment Memory:

Most of our issues involved the heap memory being exhausted. We can’t stress enough the importance of having visualizations around how the JVM is used. We learned this lesson the hard way around segment memory.

This is a prime example of why you need to understand your data when building a cluster. We were hitting a lot of OOMs and couldn’t figure out why. We had fixed the field cache issue, but what was using all our RAM?

There is a reason why having a 16TB data node might be a poorly spec’d machine. Digging into it, we realized we simply allocated too many shards to our nodes. Looking up the total segment memory used per index should give a good idea of how many shards you can put on a node before you start running out of heap space. We calculated on average our 2TB indices used about 5GB of segment memory spread over 30 nodes.

The numbers have since changed and our layout was tweaked, but we came up with calculations showing we could allocate about 8TB of shards to a node with 32GB of heap memory before running into issues. That’s if you really want to push it; we also use this as a metric to keep segment memory per node around 50% of heap. This leaves enough memory to run queries without knocking out your data nodes. Naturally this led us to ask “What is using all this segment memory per node?” A hedged sketch of this kind of calculation is shown below.
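As an illustration, a hedged sketch of that calculation, using the index stats API available on Elasticsearch versions of that era, might look like the following; the host, index pattern, and heap numbers are assumptions.

```python
import requests

ES = "http://elasticsearch.internal:9200"

def segment_memory_bytes(index_pattern="logs-*"):
    """Total segment memory per index, from the index stats API."""
    stats = requests.get(f"{ES}/{index_pattern}/_stats/segments").json()
    return {
        name: data["total"]["segments"]["memory_in_bytes"]
        for name, data in stats["indices"].items()
    }

heap_per_node_gb = 32
target_fraction = 0.5          # keep segment memory around 50% of heap
budget_gb = heap_per_node_gb * target_fraction

per_index = segment_memory_bytes()
total_gb = sum(per_index.values()) / 1024 ** 3
print(f"Total segment memory: {total_gb:.1f} GB; "
      f"budget per node: {budget_gb:.1f} GB; "
      f"minimum data nodes needed: {total_gb / budget_gb:.1f}")
```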

Index Mapping and Field Types:

Could we lower how much segment memory our indices used to cut our cluster operation costs? Using the segments data found in the ES cluster and some simple Python loops, we tracked down the total memory used per field in our index.

We used a lot of segment memory for the _id field (but can’t do much about that). It also gave us a good breakdown of our other fields. And we realized we indexed fields in completely unnecessary ways. A few fields should have been integers but were keyword fields. We had fields no one would ever search against and which could be dropped from index memory.

Most importantly, this began our learning process of how tokens and analyzers work in Elasticsearch/Lucene.

Picking the Wrong Analyzer:

By default, we use Elasticsearch’s Standard Analyzer on all analyzed fields. It’s great, offering a very close approximation to how users search and it doesn’t explode your index memory like an N-gram tokenizer would.

But it does a few things we thought unnecessary, so we thought we could save a significant amount of heap memory. For starters, it keeps the original tokens: the Standard Analyzer would break IDXVB56KLM into tokens IDXVB, 56,  and KLM. This usually works well, but it really hurts you if you have a lot of alphanumeric strings.

We never have a user search for a user ID as a partial value. It would be more useful to only return the entire match of an alphanumeric string. This has the added benefit of only storing the single token in our index memory. This modification alone stripped a whole 1GB off our index memory, or at our scale meant we could eliminate 8 nodes.

We can’t stress enough how cautious you need to be when changing analyzers on a production system. Throughout this process, end users were confused about why search results were no longer returning, or were returning weird results. There is a nice Kibana plugin that gives you a representation of how your tokens look with a different analyzer, or you can use the built-in ES tools to get the same understanding, as in the hedged example below.
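For example, a hedged comparison of analyzers using the _analyze API could look like this; the host is an assumption, and the exact tokens produced by the standard analyzer depend on the analyzer chain and version.

```python
import requests

ES = "http://elasticsearch.internal:9200"

def tokens(analyzer, text):
    """Return the tokens an analyzer produces for a piece of text."""
    resp = requests.post(f"{ES}/_analyze", json={"analyzer": analyzer, "text": text})
    return [t["token"] for t in resp.json()["tokens"]]

print(tokens("standard", "IDXVB56KLM"))  # tokens depend on the analyzer chain in use
print(tokens("keyword", "IDXVB56KLM"))   # the keyword analyzer emits one whole token
```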

Be Careful with Cloud Maintainers:

We realized that running a cluster at this scale is expensive. The hardware alone sets you back a lot, but our bigger hidden cost was cross-traffic between availability zones.

Most cloud providers offer different “zones” for your machines to entice you to achieve a High-Availability environment. That’s a very useful thing to have, but you need to do a cost/risk analysis. If you migrate shards from HOT to WARM to COLD nodes constantly, you can really rack up a bill. This alone was about 30% of our total cluster cost, which wasn’t cheap at our scale.

We re-worked how our indices sat in the cluster. This let us create a different index for each zone and pin logging data so it never left the zone it was generated in. One small tweak to how we stored data cut our costs dramatically. Plus, it was a smaller scope for troubleshooting. We’d know a zone was misbehaving and could focus there vs. looking at everything.

Conclusion:

Running our own logging stack started as a challenge. We roughly knew the scale we were aiming for; it wasn’t going to be trivial or easy. A year later, we’ve gone from pipe-dream to production and immensely grown the team’s ELK stack knowledge.

We could probably fill 30 more pages with odd things we ran into, hacks we implemented, or times we wanted to pull our hair out. But we made it through, and we now provide a superior logging platform to our engineers at a significantly reduced cost while maintaining stability.

There are many different ways we could have started knowing what we do now. For example, using Logstash over Ingest nodes, changing default circuit breakers, and properly using heap space to prevent node failures. But hindsight is 20/20 and it’s rare for projects to not change.

We suggest anyone wanting to revamp their centralized logging system look at the ELK solutions. There is a learning curve, but the scalability is outstanding and having subsecond lookup time for assisting a customer is phenomenal. But, before you begin, do your homework to save yourself weeks of troubleshooting down the road. In the end though, we’ve received nothing but praise from Grab engineers about their experiences with our new logging system.

Making Grab’s everyday app super

Post Syndicated from Grab Tech original https://engineering.grab.com/grab-everyday-super-app

Grab is Southeast Asia’s leading superapp, providing highly-used daily services such as ride-hailing, food delivery, payments, and more. Our goal is to give people better access to the services that matter to them, with more value and convenience, so we’ve been expanding our ecosystem to include bill payments, hotel bookings, trip planners, and videos – with more to come. We want to outserve our customers – not just by packing the Grab app with useful features and services, but by making the whole experience a unique and personalized one for each of them.

To realize our super app ambitions, we work with partners who, like us, want to help drive Southeast Asia forward.

A lot of the collaborative work we do with our partners can be seen in the Grab Feed. This is where we broadcast various types of content about Grab and our partners in an aggregated manner, adding value to the overall user experience. Here’s what the feed looks like:

Grab super app feed
Waiting for the next promo? Check the Feed.
Looking for news and entertainment? Check the Feed.
Want to know if it’s a good time to book a car? CHECK. THE. FEED.

 

As we continue to add more cards, services, and chunks of content into Grab Feed, there’s a risk that our users will find it harder to find the information relevant to them. So we work to ensure that our platform is able to distinguish and show information based on what’s most suited to the user’s profile. This goes back to what has always been our central focus – the customer – and is why we put so much importance on personalising the Grab experience for each of them.

To excel in a heavily diversified market like Southeast Asia, we leverage the depth of our data to understand what sorts of information users want to see and when they should see it. In this article we will discuss Grab Feed’s recommendation logic and strategies, as well as its future roadmap.

Start your Engines

Grab super app feed

The problem we’re trying to solve here is known as the recommendations problem. In a nutshell, this problem is about inferring the preference of consumers to recommend content and services to them. In Grab Feed, we have different types of content that we want to show to different types of consumers and our challenge is to ensure that everyone gets quality content served to them.

Grab super app feed

To solve this, we have built a recommendation engine, which is a system that suggests the type of content a user should consider consuming. In order to make a recommendation, we need to understand three factors:

  1. Users. There’s a lot we can infer about our users based on how they’ve used the Grab app, such as the number of rides they’ve taken, the type of food they like to order, the movie voucher deals they’ve purchased, the games they’ve played, and so on.
    This information gives us the opportunity to understand our users’ preferences better, enabling us to match their profiles with relevant and suitable content.
  2. Items. These are the characteristics of the content. We consider the type of the content (e.g. video, games, rewards) and consumability (e.g. purchase, view, redeem). We also consider other metadata such as store hours for merchants, points to burn for rewards, and GPS coordinates for points of interest.
  3. Context. This pertains to the setting in which a user is consuming our content. It could be the time of day, the user’s location, or the current feed category.

Using signals from all these factors, we build a model that returns a ranked set of cards to the user. More on this in the next few sections.

Understanding our User

Grab super app feed

Interpreting user preference from the signals mentioned above is a whole challenge in itself. It’s important here to note that we are in a constant state of experimentation. Slowly but surely, we are continuing to fine tune how to measure content preferences. That being said, we look at two areas:

  1. Action. We firmly believe that not all interactions are made equal. Does liking a card actually mean you like it? Do you like things at the same rate as your friends? What about transactions, are those more preferred? The feed introduces a lot of ways for the users to give feedback to the platform. These events include likes, clicks, swipes, views, transactions, and call-to-actions.

Depending on the model, we can take slightly different approaches. We can learn the importance of each event and aggregate them to have an expected rating, or we can predict the probability of each event and rank accordingly.

  2. Recency. Old interactions are probably not as useful as new ones. The feed is a product that is constantly evolving, and so are the preferences of our users. Failing to decay the weight of older interactions will give us recommendations that are no longer meaningful to our users. A simplified sketch combining event weights with recency decay follows below.
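To make the idea concrete, here is a deliberately simplified sketch (not our production model) that combines both signals: each event type carries a weight, and older events are exponentially decayed. The event weights and half-life are made-up numbers for illustration.

import time
from collections import defaultdict

# Illustrative weights; in practice these would be learned rather than hand-tuned.
EVENT_WEIGHTS = {"view": 0.1, "click": 0.5, "like": 0.7, "transaction": 1.0}
HALF_LIFE_DAYS = 14.0  # assumed decay half-life

def preference_scores(events, now=None):
    """events: iterable of (card_id, event_type, unix_timestamp) tuples."""
    now = now or time.time()
    scores = defaultdict(float)
    for card_id, event_type, ts in events:
        age_days = (now - ts) / 86400.0
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)  # halve the weight every HALF_LIFE_DAYS
        scores[card_id] += EVENT_WEIGHTS.get(event_type, 0.0) * decay
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)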

Optimising the Experience

Grab super app feed

Building a viable recommendation engine requires several phases. Working iteratively, we are able to create a few core recommendation strategies to produce the final model in determining the content’s relevance to the user. We’ll discuss each strategy in this section.

  1. Popularity. This strategy is better known as trending recommendations. We capture online clickstream events over a rolling time window and aggregate the events to show the user what’s popular to everyone at that point in time (a small sketch of this strategy follows after this list). Listening to the crowds is generally an effective strategy, but this particular strategy also helps us address the cold start problem by providing recommendations for new feed users.
  2. User Favourites. We understand that our users have different tastes and that users will have content that they engage with more than other users would. In this strategy, we capture that personal engagement and the user’s evolving preferences.
  3. Collaborative Filtering. A key goal in building our everyday super app is to let users experience different services. To allow discoverability, we study similar users to uncover a set of similar preferences they may have, which we can then use to guide what we show other users.
  4. Habitual Behaviour. There will be times when users only want to do a specific thing, and we wouldn’t want them to scroll all the way down just to do it. We’ve built in habitual recommendations to address this. So if users always use the feed to scroll through food choices at lunch or to take a peek at ride peaks (pun intended) on Sunday morning, we’ve still got them covered.
  5. Deep Recommendations. We’ve shown you how we use Feed data to drive usage across the platform. But what about using the platform data to drive the user feed behaviour? By embedding users’ activities from across our multiple businesses, we’re also able to leverage this data along with clickstream to determine the content preferences for each user.
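Here is the small sketch of the popularity strategy mentioned above: counting clickstream events per card over a rolling time window so that recently popular content surfaces first. The window length and event shape are assumptions for illustration.

import time
from collections import Counter, deque

class TrendingCounter:
    """Keeps per-card event counts over a rolling window (e.g. the last hour)."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = deque()   # (timestamp, card_id), oldest first
        self.counts = Counter()

    def record(self, card_id, ts=None):
        ts = ts or time.time()
        self.events.append((ts, card_id))
        self.counts[card_id] += 1
        self._expire(ts)

    def _expire(self, now):
        # Drop events that have fallen out of the rolling window.
        while self.events and now - self.events[0][0] > self.window:
            _, old_card = self.events.popleft()
            self.counts[old_card] -= 1
            if self.counts[old_card] <= 0:
                del self.counts[old_card]

    def top(self, k=10):
        return self.counts.most_common(k)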

We apply all these strategies to find out the best recommendations to serve the users either by selection or by aggregation. These decisions are determined through regular experiments and studies of our users.

Always Learning

We’re constantly learning and relearning about our users. There are a lot of ways to understand behaviour and a lot of different ways to incorporate different strategies, so we’re always iterating on these to deliver the most personal experience on the app.

To identify a user’s preferences and optimal strategy exposure, we capitalise on our Experimentation Platform to expose different configurations of our Recommendation Engine to different users. To monitor the quality of our recommendations, we measure the impact with online metrics such as interaction, clickthrough, and engagement rates, and offline metrics like Normalized Discounted Cumulative Gain (NDCG).
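For readers unfamiliar with NDCG, the small helper below shows how it can be computed for a single ranked list, assuming binary relevance labels (1 if the user engaged with the card, 0 otherwise). It is a textbook formulation rather than our exact offline evaluation code.

import math

def dcg_at_k(relevances, k):
    # Gains are discounted logarithmically by position (positions are 0-indexed here).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """relevances: relevance of each recommended card, in the order it was shown."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the user engaged with the 1st and 4th of five cards shown.
print(ndcg_at_k([1, 0, 0, 1, 0], k=5))  # approximately 0.88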

Future Work

Through our experience building out this recommendations platform, we realised that the space was large enough and that there are a lot of pieces that can continuously be built. To keep improving, we’re already working on the following items:

  1. Multi-objective optimisation for business and technical metrics
  2. Building out automation pipelines for hyperparameter optimisation
  3. Incorporating online learning for real-time model updates
  4. Multi-armed bandits for user personalised recommendation strategies
  5. Recsplanation system to allow stakeholders to better understand the system

Conclusion

Grab is one of Southeast Asia’s fastest growing companies. As its business, partnerships, and offerings continue to grow, the super app real estate problem will only keep on getting bigger. In this post, we discuss how we are addressing that problem by building out a recommendation system that understands our users and personalises the experience for each of them. This system (us included) continues to learn and iterate from our users’ feedback to deliver the best version for them.

If you’ve got any feedback, suggestions, or other great ideas, feel free to reach me at [email protected]. Interested in working on these technologies yourself? Check out our career page.

Catwalk: Serving Machine Learning Models at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-serving-machine-learning-models-at-scale

Introduction

Grab’s unwavering ambition is to be the best Super App in Southeast Asia that adds value to the everyday for our customers. In order to achieve that, the customer experience must be flawless for each and every Grab service. Let’s take our frequently used ride-hailing service as an example. We want fair pricing for both drivers and passengers, accurate estimation of ETAs, effective detection of fraudulent activities, and ensured ride safety for our customers. The key to perfecting these customer journeys is artificial intelligence (AI).

Grab has a tremendous amount of data that we can leverage to solve complex problems such as fraudulent user activity, and to provide our customers personalized experiences on our products. One of the tools we are using to make sense of this data is machine learning (ML).

As Grab made giant strides towards increasingly using machine learning across the organization, more and more teams were organically building model serving solutions for their own use cases. Unfortunately, these model serving solutions required data scientists to understand the infrastructure underlying them. Moreover, there was a lot of overlap in the effort it took to build these model serving solutions.

That’s why we came up with Catwalk: an easy-to-use, self-serve, machine learning model serving platform for everyone at Grab.

Goals

To determine what we wanted Catwalk to do, we first looked at the typical workflow of our target audience – data scientists at Grab:

  • Build a trained model to solve a problem.
  • Deploy the model to their project’s particular serving solution. If this involves writing to a database, then the data scientists need to programmatically obtain the outputs, and write them to the database. If this involves running the model on a server, the data scientists require a deep understanding of how the server scales and works internally to ensure that the model behaves as expected.
  • Use the deployed model to serve users, and obtain feedback such as user interaction data. Retrain the model using this data to make it more accurate.
  • Deploy the retrained model as a new version.
  • Use monitoring and logging to check the performance of the new version. If the new version is misbehaving, revert to the old version so that production traffic is not affected. Otherwise, run an A/B test between the new version and the previous one.

We discovered an obvious pain point – the process of deploying models requires additional effort and attention, which results in data scientists being distracted from their problem at hand. Apart from that, having many data scientists build and maintain their own serving solutions meant there was a lot of duplicated effort. With Grab increasingly adopting machine learning, this was a state of affairs that could not be allowed to continue.

To address the problems, we came up with Catwalk with goals to:

  1. Abstract away the complexities and expose a minimal interface for data scientists
  2. Prevent duplication of effort by creating an ML model serving platform for everyone in Grab
  3. Create a highly performant, highly available, model versioning supported ML model serving platform and integrate it with existing monitoring systems at Grab
  4. Shorten time to market by making model deployment self-service

What is Catwalk?

In a nutshell, Catwalk is a platform where we run Tensorflow Serving containers on a Kubernetes cluster integrated with the observability stack used at Grab.

In the next sections, we are going to explain the two main components in Catwalk – Tensorflow Serving and Kubernetes, and how they help us obtain our outlined goals.

What is Tensorflow Serving?

Tensorflow Serving is an open-source ML model serving project by Google. In Google’s own words, “Tensorflow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Tensorflow Serving provides out-of-the-box integration with Tensorflow models, but can be easily extended to serve other types of models and data.”

Why Tensorflow Serving?

There are a number of ML model serving platforms in the market right now. We chose Tensorflow Serving because of these three reasons, ordered by priority:

  1. Highly performant. It has proven performance handling tens of millions of inferences per second at Google according to their website.
  2. Highly available. It has a model versioning system to make sure there is always a healthy version being served while loading a new version into its memory
  3. Actively maintained by the developer community and backed by Google

Even though Tensorflow Serving only supports models built with Tensorflow by default, this is not a constraint for us, because Grab is actively moving toward using Tensorflow.

How are we using Tensorflow Serving?

In this section, we will explain how we are using Tensorflow Serving and how it helps abstract away complexities for data scientists.

Here are the steps showing how we are using Tensorflow Serving to serve a trained model:

  1. Data scientists export the model using tf.saved_model API and drop it to an S3 models bucket. The exported model is a folder containing model files that can be loaded to Tensorflow Serving.
  2. Data scientists are granted permission to manage their folder.
  3. We run Tensorflow Serving and point it to load the model files directly from the S3 models bucket. Tensorflow Serving supports loading models directly from S3 out of the box. The model is served!
  4. Data scientists come up with a retrained model. They export and upload it to their model folder.
  5. As Tensorflow Serving keeps watching the S3 models bucket for new models, it automatically loads the retrained model and serves. Depending on the model configuration, it can either gracefully replace the running model version with a newer version or serve multiple versions at the same time.
Tensorflow Serving Diagram

The only interface to data scientists is a path to their model folder in the S3 models bucket. To update their model, they upload exported models to their folder and the models will automatically be served. The complexities are gone. We’ve achieved one of the goals!
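Concretely, from a data scientist’s point of view the workflow might look like the hedged sketch below: export a trained model with the SavedModel API, upload the versioned folder to the S3 models bucket, and call the model through Tensorflow Serving’s REST API. The bucket name, model name, serving host, and feature shapes are placeholders rather than our real setup.

import requests
import tensorflow as tf

# 1. Export: Tensorflow Serving expects a versioned layout, i.e. <model_name>/<version>/.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
tf.saved_model.save(model, "/tmp/pricing/1")
# ...then upload /tmp/pricing/1 to s3://models-bucket/pricing/1/ (for example with the AWS CLI).

# 2. Inference: once Tensorflow Serving has loaded the model, clients call the REST API.
resp = requests.post(
    "http://tfserving.example.internal:8501/v1/models/pricing:predict",  # placeholder endpoint
    json={"instances": [[0.1, 0.2, 0.3, 0.4]]},
)
print(resp.json()["predictions"])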

Well, not really…

Imagine you are going to run Tensorflow Serving to serve one model in a cloud provider, which means you need a compute resource from that provider to run it. Running it on one box doesn’t provide high availability, so you need another box running the same model. Auto scaling is also needed in order to scale out based on the traffic. On top of these many boxes lies a load balancer. The load balancer evenly spreads incoming traffic across all the boxes, giving clients a single point of entry that abstracts away the horizontal scaling. The load balancer also exposes an HTTP endpoint to external users. As a result, we form a Tensorflow Serving cluster that is ready to serve.

Next, imagine you have more models to deploy. You have three options:

  1. Load the models into the existing cluster – having one cluster serve all models.
  2. Spin up a new cluster to serve each model – having multiple clusters, one cluster serves one model.
  3. Combination of 1 and 2 – having multiple clusters, one cluster serves a few models.

The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.

The second option will definitely work, but it doesn’t sound like an effective process, as you need to create a set of resources every time you have a new model to deploy. Additionally, how do you optimize resource usage? There might be under-utilized resources in some clusters that could be shared by the rest.

The third option looks promising: you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is that you have to manage it manually. Managing 100 models across 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause problems, as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other model won’t be able to serve anymore.

Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that is exactly what Kubernetes is meant to do!

So what is Kubernetes?

Kubernetes abstracts a cluster of physical/virtual hosts (such as EC2) into a cluster of logical hosts (pods in Kubernetes terms). It provides a container-centric management environment. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads.

Let’s look at some of the definitions of Kubernetes resources:

Tensorflow Serving Diagram
  • Cluster – a cluster of nodes running Kubernetes.
  • Node – a node inside a cluster.
  • Deployment – a configuration that tells Kubernetes the desired state of an application. It also takes care of rolling out updates (canary, percentage rollout, etc.), rolling back, and horizontal scaling.
  • Pod – a single processing unit. In our case, Tensorflow Serving runs as a container in a pod. A pod can have CPU/memory limits defined.
  • Service – an abstraction layer that abstracts out a group of pods and exposes the application to clients.
  • Ingress – a collection of routing rules that govern how external users access services running in a cluster.
  • Ingress Controller – a controller responsible for reading the ingress information and processing that data accordingly such as creating a cloud-provider load balancer or spinning up a new pod as a load balancer using the rules defined in the ingress resource.

Essentially, we deploy resources to tell Kubernetes the desired state of our application, and Kubernetes will make sure that it is always the case.

How are we using Kubernetes?

In this section, we will walk you through how we deploy Tensorflow Serving in a Kubernetes cluster and how it makes managing model deployments very convenient.

We used a managed Kubernetes service to create a Kubernetes cluster, and manually provisioned compute resources as nodes. As a result, we have a Kubernetes cluster with nodes that are ready to run applications.

An application to serve one model consists of:

  1. Two or more Tensorflow Serving pods that serve a model, with an autoscaler to scale pods based on resource consumption
  2. A load balancer to evenly spread incoming traffic across the pods
  3. An exposed HTTP endpoint for external users

In order to deploy the application, we need to:

  1. Deploy a deployment resource specifying:
     • The number of Tensorflow Serving pods
     • An S3 URL for Tensorflow Serving to load model files from
  2. Deploy a service resource to expose the pods
  3. Deploy an ingress resource to define an HTTP endpoint URL

Kubernetes then allocates Tensorflow Serving pods to the cluster, with the number of pods matching the value defined in the deployment resource. Pods can be allocated to any node inside the cluster; Kubernetes makes sure that the node it allocates a pod to has sufficient resources for what the pod needs. If no node has sufficient resources, we can easily scale out the cluster by adding new nodes to it.

In order for the rules defined in the ingress resource to work, the cluster must have an ingress controller running, which is what guided our choice of load balancer. What an ingress controller does is simple: it keeps watching the ingress resource, creates a load balancer, and defines rules based on the rules in the ingress resource. Once the load balancer is configured, it is able to redirect incoming requests to the Tensorflow Serving pods.

That’s it! We have a scalable Tensorflow Serving application that serves a model through a load balancer! In order to serve another model, all we need to do is deploy the same set of resources, but with the new model’s S3 URL and HTTP endpoint.
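To give a feel for what that “same set of resources” looks like, here is a simplified sketch that generates per-model Kubernetes manifests from just a model name and an S3 URL. The container image’s environment variables, resource limits, port numbers, and ingress details are assumptions for illustration rather than our production templates.

import yaml

def manifests_for(model_name, s3_url, replicas=2):
    """Build Deployment, Service and Ingress manifests for one Tensorflow Serving application."""
    labels = {"app": f"tfserving-{model_name}"}
    deployment = {
        "apiVersion": "apps/v1", "kind": "Deployment",
        "metadata": {"name": f"tfserving-{model_name}"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": "tensorflow-serving",
                    "image": "tensorflow/serving",
                    "ports": [{"containerPort": 8501}],  # REST port
                    # How the image consumes these is deployment-specific; shown for illustration.
                    "env": [{"name": "MODEL_NAME", "value": model_name},
                            {"name": "MODEL_BASE_PATH", "value": s3_url}],
                    "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
                }]},
            },
        },
    }
    service = {
        "apiVersion": "v1", "kind": "Service",
        "metadata": {"name": f"tfserving-{model_name}"},
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": 8501}]},
    }
    ingress = {
        "apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
        "metadata": {"name": f"tfserving-{model_name}"},
        "spec": {"rules": [{"http": {"paths": [{
            "path": f"/v1/models/{model_name}", "pathType": "Prefix",
            "backend": {"service": {"name": f"tfserving-{model_name}", "port": {"number": 80}}},
        }]}}]},
    }
    return yaml.dump_all([deployment, service, ingress])

print(manifests_for("pricing", "s3://models-bucket/pricing"))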

To illustrate what is running inside the cluster, let’s see how it looks when we deploy two applications: one serving a pricing model, and another serving a fraud-check model. Each application is configured to have two Tensorflow Serving pods and is exposed at /v1/models/<model name>.

Tensorflow Serving Diagram

There are two Tensorflow Serving pods that serve the fraud-check model, exposed through a load balancer. The same goes for the pricing model; the only differences are the model being served and the exposed HTTP endpoint URL. The load balancer rules for the pricing and fraud-check models look like this:

If                              Then forward to
Path is /v1/models/pricing      pricing pod ip-1, pricing pod ip-2
Path is /v1/models/fraud-check  fraud-check pod ip-1, fraud-check pod ip-2

Stats and Logs

The last piece is how stats and logs work. Before getting to that, we need to introduce DaemonSet. According to the Kubernetes documentation, a DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created.

We deployed datadog-agent and filebeat as DaemonSets. As a result, we always have one datadog-agent pod and one filebeat pod on every node, and they are accessible from the Tensorflow Serving pods on the same node. Tensorflow Serving pods emit a stats event for every request to the datadog-agent pod on the node they are running on.
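As a rough sketch of the reporting side, a serving wrapper could push per-request metrics to the node-local datadog-agent over DogStatsD, as below. The metric names, tags, and the NODE_IP environment variable (commonly injected through the Kubernetes downward API) are assumptions for illustration, not our actual instrumentation.

import os
import time
from datadog import initialize, statsd

# Point the DogStatsD client at the agent pod running on the same node.
initialize(statsd_host=os.environ.get("NODE_IP", "127.0.0.1"), statsd_port=8125)

def record_inference(model_name, version, infer_fn, *args, **kwargs):
    """Run one inference call and emit request count, error count and latency for it."""
    tags = [f"model:{model_name}", f"version:{version}"]
    start = time.time()
    try:
        return infer_fn(*args, **kwargs)
    except Exception:
        statsd.increment("model_serving.error", tags=tags)
        raise
    finally:
        statsd.increment("model_serving.request", tags=tags)
        statsd.histogram("model_serving.latency_ms", (time.time() - start) * 1000, tags=tags)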

Here is a sample of DataDog stats:

DataDog stats

And logs that we put in place:

Logs

Benefits Gained from Catwalk

Catwalk has become the go-to, centralized system to serve machine learning models. Data scientists are not required to take care of the serving infrastructure, so they can focus on what matters the most: coming up with models to solve customer problems. They only need to provide exported model files and an estimate of expected traffic so that we can prepare sufficient resources to run their model. In return, they get an endpoint to make inference calls to their model, along with all the necessary tools for monitoring and debugging. Updating the model version is self-service, and the model improvement cycle is much shorter than before. We used to count in days; now we count in minutes.

Future Plans

Improvement on Automation

Currently, the first deployment of any model still needs some manual work from the platform team. We aim to automate this process entirely. We’ll work with our awesome CI/CD team, who are making the best use of Spinnaker.

Model serving on mobile devices

As a platform, we are looking at setting standards for model serving across Grab. This includes model serving on mobile devices as well. Tensorflow also provides a Lite version to be used on mobile devices. It is a whole new paradigm with vastly different tradeoffs for machine learning practitioners. We are quite excited to set some best practices in this area.

gRPC support

Catwalk currently supports HTTP/1.1. We’ll hook into Grab’s service discovery mechanism to open up gRPC traffic, which Tensorflow Serving already supports.

If you are interested in building pipelines for machine learning related topics, and you share our vision of driving South East Asia forward, come join us!

React Native in GrabPay

Post Syndicated from Grab Tech original https://engineering.grab.com/react-native-in-grabpay

Overview

It wasn’t too long ago that Grab formed a new team, GrabPay, to improve the cashless experience in Southeast Asia and to venture into the promising mobile payments arena. To support the work, Grab also decided to open a new R&D center in Bangalore.

It was an exciting journey for the team from the very beginning, as it gave us the opportunity to experiment with new cutting edge technologies. Our first release was the GrabPay Merchant App, the first all React Native Grab App. Its success gave us the confidence to use React Native to optimize the Grab PAX app.

React Native is an open source mobile application framework. It lets developers use React (a JavaScript library for building user interfaces) with native platform capabilities. Its two big advantages are:

  • We could make cross-platform mobile apps and components completely in JavaScript.
  • Its hot reloading feature significantly reduced development time.

This post describes our work on developing React Native components for Grab apps, the challenges faced during implementation, what we learned from other internal React Native projects, and our future roadmap.

Before embarking on our work with React Native, these were the goals we set out. We wanted to:

  • Have reusable code between Android and iOS as well as across various Grab apps (Driver app, Merchant app, etc.).
  • Have a single codebase to minimize the effort needed to modify and maintain our code long term.
  • Match the performance and standards of existing Grab apps.
  • Use as few Engineering resources as possible.

Challenges

Many Grab teams located across Southeast Asia and in the United States support the App platform. It was hard to convince all of them to add React Native as a project dependency and write new feature code with React Native. In particular, having a React Native dependency significantly increases a project’s binary size.

But the initial cost was worth it. We now have only a few modules, all written in React Native:

  • Express
  • Transaction History
  • Postpaid Billpay

As there is only one codebase instead of two, the modules take half the maintenance resources. Debugging is faster with React Native’s hot reloading. And it’s much easier and faster to implement one of our modules in another app, such as DAX.

Another challenge was creating a universally acceptable format for a bridging library to communicate between existing code and React Native modules. We had to define fixed guidelines to create new bridges and define communication protocols between React Native modules and existing code.

Invoking a module written in React Native from a Native Module (written in a standard computer language such as Swift or Kotlin) should follow certain guidelines. Once all Grab’s tech families reached consensus on solutions to these problems, we started making our bridges and doing the groundwork to use React Native.

Foundation

On the native side, we used the Grablet architecture to add our React Native modules. Grablet gave us a wonderful opportunity to scale our Grab platform so it could be used by any tech family to plug and play their module. And the module could be built in Native, React Native, Flutter, or Web.

We also created a framework encapsulating all the project’s React Native Binaries. This simplified the React Native Upgrade process. Dependencies for the framework are react, react-native, and react-native-event-bridge.

We had some internal proof of concept projects for determining React Native’s performance on different devices, as discussed here. Many teams helped us make an extensive set of JS bridges for React Native in Android and iOS. Oleksandr Prokofiev wrote this bridge creation example:

public final class DeviceKitModule: NSObject, RCTBridgeModule {
  private let deviceKit: DeviceKitService

  public init(deviceKit: DeviceKitService) {
    self.deviceKit = deviceKit
    super.init()
  }

  public static func moduleName() -> String {
    return "DeviceKitModule"
  }

  public func methodsToExport() -> [RCTBridgeMethod] {
    let methods: [RCTBridgeMethod?] = [
      buildGetDeviceID()
    ]
    return methods.compactMap { $0 }
  }

  // Wraps the native DeviceKitService call so React Native can invoke "getDeviceID".
  private func buildGetDeviceID() -> BridgeMethodWrapper? {
    return BridgeMethodWrapper("getDeviceID", { [weak self] (_: [Any], _, resolve) in
      let value = self?.deviceKit.getDeviceID()
      resolve(value)
    })
  }
}

GrabPay Components and React Native

The GrabPay Merchant App gave us a good foundation for React Native in terms of

  • Component libraries
  • Networking layer and api middleware
  • Real world data for internal assessment of performance and stability

We used this knowledge to build the Transaction History and GrabPay Digital Marketplace components inside the Grab PAX app with React Native.

Component Library

We selected particularly useful components from the Merchant App codebase such as GPText, GPTextInput, GPErrorView, and GPActivityIndicator. We expanded that selection to a common (internal) component library of approximately 20 stateless and stateful components.

API Calls

We used to make API calls using axios (an approach we have since deprecated). We now make calls from the native side using bridges that return a promise, and make API calls using an existing framework. This removed the need to fetch an access token from native Android or native iOS before making the calls. It also helped us optimize the API requests, as suggested by Parashuram from Facebook’s React Native team.

Locale

We use React Localize Redux for all our translations and moment for our date-time conversion as per the device’s current Locale. We currently support translation in five languages: English, Simplified Chinese, Bahasa Indonesia, Malay, and Vietnamese. This Swift code shows how we get the device’s current Locale from the native side of the React Native bridge.

public func methodsToExport() -> [RCTBridgeMethod] {
    let methods: [RCTBridgeMethod?] = [
      BridgeMethodWrapper("getLocaleIdentifier", { (_, _, resolver) in
        let localeIdentifier = self.locale.getLocaleIdentifier()
        resolver(localeIdentifier)
      })]
    return methods.compactMap { $0 }
  }

Redux

Redux is an extremely lightweight predictable state container that behaves consistently in every environment. We use Redux with React Native to manage its state.

For in-app navigation we use react-navigation. It is very flexible in adapting to both the Android and iOS navigation and gesture recognition styles.

End Product

After setting up our foundation bridges and porting skeleton boilerplate code from the GrabPay Merchant App, we wrote two payments modules using React Native: GrabPay Digital Marketplace (also known as BillPay) and Transaction History.

Grab app - Selecting a company

The iOS version is on the left and the Android version is on the right. Not only do their UIs look identical, their code is identical too. A single codebase lets us debug faster, deliver quicker, and maintain smaller (codebases; apologies, but the joke was necessary).

Grab app - Company selected

We launched BillPay first in Indonesia, then in Vietnam and Malaysia. So far, it’s been a very stable product with little to no downtime.

Transaction History started in Singapore and is now rolling out in other countries.

Flow For BillPay

BillPay Flow

The above shows BillPay’s flow.

  1. We start with the first screen, called Biller List. It shows all the postpaid billers available for the current region. For now, we show Billers based on which country the user is in. The user selects a biller.
  2. We then ask for the customer ID (or prefill that value if the user has paid their bill before). The amount is either fetched from the backend or filled in by the user, depending on the region and biller type.
  3. Next, the user confirms all the entered details before they pay the dues.
  4. Finally, the user sees their bill payment receipt. It comes directly from the biller, and so it’s a valid proof of payment.

Our React Native version keeps the same experience as our natively developed app and helps users pay their bills seamlessly and hassle-free.

Future

We are moving code to Typescript to reduce compile-time bugs and clean up our code. In addition to reducing native dependencies, we will refactor modules as needed. We will also have 100% unit test code coverage. But most importantly, we plan to open source our component library as soon as we feel it is stable.

Connecting the Invisibles to Design Seamless Experiences

Post Syndicated from Grab Tech original https://engineering.grab.com/connecting-the-invisibles-to-design-seamless-experiences

Leonardo Da Vinci's Vitruvian Man
Leonardo Da Vinci’s Vitruvian Man (Source: Public Domain @Wikicommons)

Before we begin, what is service design anyway?

In the world of design jargon, meet “service design”. Unlike other design objectives, which aim to simplify and clarify, service design is not about building singular touchpoints. Rather, it is about bringing ease and harmony into large and often complex ecosystems.

Think of the human body. There are organ systems such as the cardiovascular, respiratory, musculoskeletal, and nervous systems. These systems perform key functions that we see and feel everyday, like breathing, moving, and feeling.

Service design serves as the connective tissue that brings the amazing systems together to work in harmony. Much of the work done by the service design team at Grab revolves around connecting online experiences to the offline world, connecting challenges across a complex ecosystem, and enabling effective collaboration across cross-functional teams.

Connecting online experiences to the offline world

We explore holistic experiences by visualizing the connections across features, both through the online-offline as well as internal-external interactions. At Grab, we have a collection of (very cool!) features that many teams have worked hard to build. However, equally important is how a person arrives from feature to feature seamlessly, from the app to their physical experiences, as well as how our internal teams at Grab support and execute behind-the-scenes throughout our various systems.

For example, placing an order on GrabFood requires much more work than sending information to the merchant through the Grab app. How might Grab

  • allocate drivers effectively,
  • support unhappy paths with our customer support team,
  • resolve discrepancies in our operations teams, and
  • store this data in a system that can continue to expand for future uses to come?
Connecting online experiences to the offline world

Connecting challenges across a complex ecosystem

Sometimes, as designers, we might get too caught up in solving problems through a singular lens, and overlook how it affects the rest of the system. Meanwhile, many problems are part of a connected network. Changing one part of the problem can potentially affect other parts of the network.

Considering those connections, or the “stuff in between”, makes service design a holistic practice – crossing boundaries between teams in search of a root cause, and considering how treating one problem might affect other parts of the network.

  • If this happens, then what?
  • Which point in the system is easiest to fix and has the greatest impact?

For example, if we want to introduce a feature for drivers to report restaurant closings, how might Grab

  • Ensure the report is accurate?
  • Deal with accidental closings or fraud?
  • Use that data for our operations team to make decisions?
  • Let drivers know when their report has led to a successful action?
  • Last but not least, is this the easiest point in the system to fix restaurant opening inaccuracies, or should this be tackled through an operational fix?
Connecting challenges across a complex ecosystem

Facilitating effective collaborations in cross-functional teams

Finally, we believe in the power of a participatory design process to unlock meaningful, customer-centric solutions. Working on the “stuff in between” often puts the service design team in the thick of alignment of priorities, creation of a common vision, and coherent action plans. Achieving this requires solid facilitation and processes for cross-team collaboration.

  • Who are the right stakeholders and how do we engage?
  • How does an initiative affect stakeholders, and how can they contribute?
  • How can we create visual processes that allow diverse stakeholders to have a shared understanding and co-create solutions?
Facilitating effective collaborations in cross-functional teams

What’s the ultimate goal? A Harmonious Backstage for a Delightful Customer Experience

By facilitating cross-functional collaborations and espousing a whole-of-Grab approach, the service design team at Grab helps to connect the dots in an interconnected ‘super-app’ service ecosystem. By empathising with our users, and having a deep understanding of how different parts of the Grab ecosystem affect one another, we hope to unleash the full power of Grab to deliver maximum value and delight to serve our users.

Tourists on GrabChat!

Post Syndicated from Grab Tech original https://engineering.grab.com/tourist-chat-data-story

Just over two years ago we introduced GrabChat, Southeast Asia’s first of its kind in-app messaging platform. Since then we’ve added all sorts of useful features to it. Auto-translated messages, the ability to send photos, and even voice messages! It’s been a great tool to facilitate smoother communications between our driver-partners and our passengers, and one group in particular has found it incredibly useful: tourists!

Now, we’ve analysed tourist data before, but we were curious about how GrabChat in particular has served this demographic. So we looked for interesting insights using sampled tourist chat data from Singapore, Malaysia, and Indonesia for the period of December 2018 to March 2019. That’s more than 3.7 million individual GrabChat messages sent by tourists! Here’s what we found.

Average chats per booking per country

Looking at the volume of the chats being transmitted per booking, we can see that the “chattiest” tourists are from East Timor, Nigeria, and Ukraine with averages of 6.0, 5.6, and 5.1 chats per booking respectively.

Then we wondered: if tourists from all over the world are talking this much to our driver-partners, how are they actually communicating if their mother-tongue is not the local language?

Need a Translator?

When we go to another country, we eat all the heavenly good food, fall in love with the culture, and admire the scenery. Language and communication barriers shouldn’t get in the way of all of that. That’s why Grab’s Chat feature has got it covered!

With Grab’s in-house translation solutions, any Grab passenger can send messages in their preferred language without fear of being misunderstood – or not understood at all! Their messages will be automatically translated into Bahasa Indonesia, Bahasa Melayu, Simplified Chinese, Thai, or Vietnamese depending on where they are. This applies not only to Grab’s transport services: GrabChat can be used when ordering GrabFood too!

Percentage of translated GrabChat messages
Indonesia saw the highest usage of translations on a by-booking basis!

 

Let’s look deeper into the tourist translation statistics for each country with the donut charts below. We can see that the most popular translation route for tourists in Indonesia was from English to Indonesian. The story is different for Singapore and Malaysia: we can see that there are translations to and from a more diverse set of languages, reflecting a more multicultural demographic.

Percentage of translated GrabChat messages
The most popular translation routes for tourist bookings in Indonesia, Malaysia, and Singapore.

 

Tap for Templates!

GrabChat also provides a chat template feature. Templates are prewritten messages that you can send with just one tap! Did we mention that they are translated automatically too? Passengers and drivers can have a fast, simple, and translated conversation with each other without typing a single word. And sometimes, templates are really all you need.

Examples of chat templates, as they appear in GrabChat!
Examples of chat templates, as they appear in GrabChat!

 

As if all this wasn’t convenient enough, you can also make your own custom templates! Use them for those repetitive, identical messages you always seem to be sending out like telling your drivers where the hotel lobby is, or how to navigate right to your doorstep, or even to send a quick description of what you look like to make it easier for a driver to find you!

Template message usage

Taking a look at individual country data, tourists in Indonesia used templates the most, with almost 60% of them using a template in their conversations at least once. Malaysia and Singapore saw lower but still sizeable utilisation rates of this feature, at 53% and 33% respectively.

Template message usage percentage
Indonesia saw the highest usage of templates on a by-booking basis.

 

In our analysis, we found an interesting insight! There was a positive correlation between template usage and the success rate of rides. Overall, bookings that used templates in their conversations saw 10% more completions over bookings that didn’t.

Template vs completed bookings

Picture this: a hassle-free experience

A picture says a thousand words, and for tourists using GrabChat’s image feature, those thousand words don’t even need to be translated. Instead of typing out a description of where they are standing for pickup, they can just click, snap, and send an image!

Our data revealed that GrabChat’s image functionality is most frequently used in areas where tourist traffic is the highest. In fact, the image function in GrabChat saw the most use in pickup areas such as airports, large shopping malls, public transport stations, and hotels, because it is harder for drivers to find their passengers in these crowded areas. Even with our super convenient Entrances feature, every little bit of information goes a long way to help your driver find you!

Pickup locations

If we take it a step further and look at the actual areas within the cities where images were sent the most, we see that our initial hypothesis still holds fast.

Pickup locations
The top 5 pickup areas per country in which images were the most prevalent in GrabChat (for tourists).

 

In Singapore, we see the most images being sent out in the Downtown Core area. This area contains the majestic Marina Bay Sands, the Merlion statue, and the Esplanade, amongst other iconic attractions.

In Malaysia, the highest image usage occurs at none other than the Kuala Lumpur City Centre (KLCC) itself. This area includes the Twin Towers, a plethora of malls and hotels, Bukit Bintang (a bustling and lively night-life zone), and even an aquarium.

Indonesia’s top location for image chats is Kuta. A beach village in Bali, Kuta is a tourist hotspot with surfing, water parks, bars, budget-friendly yet delicious food, and numerous cultural attractions.

Speak up!

Allowing for two-way communication via GrabChat empowers both passengers and drivers to improve their journeys by divulging useful information and asking clarifying questions: how many bags do you have? Does your car accommodate my pet dog? I’m standing by the lobby with my two kids. These are the sorts of things that are talked about in GrabChat messages.

During the analysis of our multitudes of wide-ranging GrabChat conversations, we picked up some pro-tips for you to get a Grab ride with even more convenience and ease, whether you’re a tourist or not:

Tip #1: Did some shopping on your trip? Swamped with bags? Send a message to your driver to let them know how many pieces of luggage you have with you.

As one might expect, chats that have keywords such as “luggage” or “baggage” (or any other related term) occur the most when riders are going to, or leaving, an airport. Most of the tourists on GrabChat asked the drivers if there was space for all of their things in the car. Interestingly, some of them also told the drivers how to recognise them for pickup based on descriptions of their bags!

Tip #2: Your children make good landmarks! If you’re in a crowded spot and you’re worried your driver can’t find you, drop them a message to let them know you’re that family with a baby and a little girl in pigtails.

When it comes to children, we found that passengers mainly use them to help identify themselves to the driver. Messages like “I’m with my two kids” or “We are a family with a baby” came up numerous times, and served as descriptions to facilitate fast pickup. These sorts of chats were the most prevalent in crowded areas like airports and shopping centres.

Tip #3: Don’t get caught off guard- be sure your furry friends have a seat!

Taking a look at pet-related chats, we learned that our tourists have used GrabChat to ask the drivers clarifying questions. Passengers have likely considered that not every driver or vehicle is accommodating towards animals. The most common type of message was about whether pets are allowed in the vehicle. For example: “Is it okay if I bring a puppy?” or “I have a dog with me in a carrier, is that alright?”. Better safe than sorry! Alternatively, if you’re travelling with a pet, why not see if GrabPet is available in your country?

From the chat content analysis we have learned that tourists do indeed use GrabChat to talk to their drivers about specific details of their trip. We see that the chat feature is an invaluable tool that anyone can use to clear up any ambiguities and make their journeys more pleasant.

Bubble Tea Craze on GrabFood!

Post Syndicated from Grab Tech original https://engineering.grab.com/bubble-tea-craze-on-grabfood

Bigger and More Bubble Tea!

Bubble tea orders on GrabFood have been increasing constantly and dramatically, with an impressive regional average growth rate of 3,000% in 2018! Just look at the percentage increase over the year, across all countries!

Countries      Bubble tea growth by percentage in 2018*
Indonesia      >8,500% growth from Jan 2018 to Dec 2018
Philippines    >3,500% growth from June 2018 to Dec 2018
Thailand       >3,000% growth from Jan 2018 to Dec 2018
Vietnam        >1,500% growth from May 2018 to Dec 2018
Singapore      >700% growth from May 2018 to Dec 2018
Malaysia       >250% growth from May 2018 to Dec 2018

*Time period: January 2018 to December 2018, or from the time GrabFood was launched.

What’s driving this growth is not just die-hard bubble tea fans who can’t go a week without drinking this sweet treat, but a growing bubble tea fan club in Southeast Asia. The number of bubble tea lovers on GrabFood grew over 12,000% in 2018 – and there’s no sign of stopping!

With increasing consumer demand, how is Southeast Asia’s bubble tea supply catching up? As of December 2018, GrabFood has close to 4,000 bubble tea outlets from a network of over 1,500 brands – a 200% growth in bubble tea outlets in Southeast Asia!

Bubble-Tea-Lover growth on GrabFood

If this stat doesn’t stick, here is a map to show you how much bubble tea orders in different Southeast Asian cities have grown!

Maps of bubble tea merchants on GrabFood

And here is a little shoutout to our star merchants including Chatime, Coco Fresh Tea & Juice, Macao Imperial Tea, Ochaya, Koi Tea, Cafe Amazon, The Alley, iTEA, Gong Cha, and Serenitea.

Just how much do you drink?

On average, Southeast Asians drink 4 cups of bubble tea per person per month on GrabFood. Thai consumers top the regional average by 2 cups, consuming about 6 cups of bubble tea per person per month. This is closely followed by Filipino consumers, who drink an average of 5 cups per person per month.

Average bubble tea consumption by cups per person per month

Favourite Flavours!

Have a look at the dazzling array of Bubble Tea flavours available on GrabFood today and you’ll find some uniquely Southeast Asian flavours like Chendol, Durian, and Gula Melaka, as well as rare flavours like salted cream and cheese! Can you spot your favourite flavours here?

Bubble tea flavour consumption per month

Let’s break it down by the country that GrabFood serves, and see who likes which flavours of Bubble Tea more!

Bubble tea flavour consumption per month by country

Top the Toppings!

Pearl seems to be the unbeatable top topping in most of the countries, except Vietnam, whose No. 1 topping turned out to be Cheese Pudding! The top 3 toppings crowning your favourite bubble tea are:

Top list of toppings

Best Time for Bubble Tea!

Don’t we all need a cup of sweet Bubble Tea in the afternoon to get us through the day? Across Southeast Asia, GrabFood’s data reveals that most people order bubble tea to accompany their meals at lunch, or as a perfect midday energizer!

Times of the day when most people order bubble tea

Conclusion

So hazelnut or chocolate, pearl or (and) pudding (who says we can’t have the best of both worlds!)? The options are abundant and the choice is yours to enjoy!

If you have a sweet tooth, or simply want to reward yourself with Southeast Asia’s most popular drink, go ahead – you are only a couple of taps away from savouring this cup full of delight!

Why you should organise an immersion trip for your next project

Post Syndicated from Grab Tech original https://engineering.grab.com/why-you-should-organise-an-immersion-trip-for-your-next-project

Sherizan Sheikh is a Design Lead at Grab Ventures, an incubation arm that looks at experiences beyond ride-hailing, for example, groceries, healthcare and autonomous deliveries.

Grab Ventures is where exciting initiatives are birthed in Grab. From strategic partnerships like GrabFresh, Grab’s first on-demand grocery delivery service, to exploratory concepts such as on-demand e-scooter rentals, there has never been a more exciting time to be in this unique space.

Cover GrabFresh

 

In my role as Design Lead for Grab Ventures, I juggle between both sides of the coin and whether it’s a partnership or exploratory concept, I ask myself:

“How do I know who my customers are, and what are their pain points?”

So I like to answer that question by starting with traditional research methods like desktop research and surveys, just to name a few. At Grab, that’s usually not enough to answer those questions.

That said, I find that some of the best insights are formed from immersion trips.

In one sentence, an immersion trip is getting deeply involved in a user’s life by understanding him or her through observation and conversation.

Our CEO, Anthony Tan, picking items for a customer, on an immersion trip.
Our CEO, Anthony Tan, picking items for a customer, on an immersion trip.

 

For designers and researchers in Singapore, it plucks you out of your everyday reality and drops you into someone else’s, somewhere else, where 99.9% of the time, everything you expect and anticipate gets thrown out in a matter of minutes. I’ve trained myself to switch my mindset, go back to basics, and learn (or relearn) everything I need to know about the country I’d be visiting even if I’ve been there countless times.

Fun fact: In 2018, I spent about 100 days in Indonesia. That means roughly 30% of 2018 was spent on the ground, observing, shadowing, interviewing people (and getting stuck in traffic) and loving it.

Why immersions?

Understanding a country, its culture and its people is something that gets me excited, as I continuously build empathy visit after visit, interview after interview.

I remember one immersion trip where we interviewed locals at different supermarkets to learn and understand their motivations: why they prefer visiting the supermarket to purchasing groceries online. One of our hypotheses was that the key motivator for Indonesians to buy groceries online must be convenience. We were wrong.

It boiled down to 2 key factors.

1) Freshness: We found out that many of the locals still felt the need to touch and feel the products before they buy. There were many instances where they felt the need to touch the fresh produce on the shelves, cutting off a piece of fruit or even poking the eyes of the fish to check its freshness.

Oranges

 

2) Price: This is the deciding factor for most locals, as they are price-sensitive. Every decision was made with the price tag in mind. They are willing to travel far and sit through traffic just to get to the wet market or supermarket that offers the lowest prices and best value for money. While shadowing at a local wet market, we also observed something interesting: most of the wet market vendors receive WhatsApp messages from their regular customers seeking fresh produce and placing orders. The transactions were mostly via e-wallets or bank transfers. The vendors then pack the orders and get bike drivers to help with the delivery. I couldn’t have gotten this valuable information if I were just sitting at my desk.

An immersion trip is an excellent opportunity to learn about our customers and the meaning behind their behaviours. There is only so much we can learn from white papers and reports. As soon as you are in the same environment as your users, seeing your users do everyday errands or acts, like grocery shopping or hopping on a bike, feeling their frustrations and experiencing them yourself, you’ll get so much more fruitful and valuable insights to help shape your next product. (Or even, improve an existing one!)

My colleagues trying to blend in.
My colleagues trying to blend in.

 

Now that I’ve sold you on this idea, here are some tips on how to plan and execute effective immersion trips, share your findings and turn them into actionable insights for your team and stakeholders.

Pro tip #1 – Generate a hypothesis

Generating a hypothesis is a valuable exercise. It enables you to focus on the “wants vs. needs” and to validate your assumptions beyond desktop research. Be sure to get your core team members together, including Business, Ops and Tech, to generate a hypothesis. I’ll give an example below.

Pro tip #2 – Have short immersion days with a debrief at the end for everyone

Scheduling really depends on your project. I have planned for trips that are from a few hours to up to fourteen days long. Be sure not to have too many locations in a single day and spread them out evenly in case there are unexpected roadblocks such as traffic jams that might contribute to rushed research.

Do include Brief and Debrief sessions in your schedule. I’d recommend shorter immersion days so that you have enough energy left for the critical Debrief session at the end of the day. The structure should be kept very simple, with a focus on collating ALL observations from the contextual inquiries into writing. It’s up to you how you structure the document.

Be prepared for the unexpected.
Be prepared for the unexpected.

 

Pro tip #3 – Recce locations beforehand

Once you’ve nailed down the locations, it is essential to get a local resident to recce the places first. In Southeast Asia, more often than not you’ll realise that information found online is unreliable or misleading, so doing a physical recce will save you a lot of time.

I’ve experienced a few time-wasting incidents when specific locations turned out not to be what we expected. For example, on one grocery run, we wanted to visit a local wet market that opens only very early in the morning. We got up at 5 am and drove about 1.5 hours, only to realize the wet market was not open to the public, and we eventually got chased out by the security guards.

Pro tip #4 – Never assume a customer’s journey

(even though you’ve experienced it before as a customer)

One of the most important processes throughout a product life cycle is understanding the customer’s journey. It’s particularly important to understand the journey if we are not familiar with the actual environment. Take our GrabFresh service as an example. It’s a complex journey that happens behind the scenes. Desktop research might not be enough to fully validate the journey; an immersion trip that puts you in the field ensures you go through the lifecycle of the entire process and observe all the phases that happen in the real environment.

GrabFresh user journey

 

Pro tip #5 – Be 100% sure of your open-ended, non-leading questions that will validate your hypothesis.

This part is essential to the quality of your immersion outcome. Not spending enough time crafting and vetting the questions thoroughly might lead to skewed insights and could jeopardise your entire immersion. Be sure your questions link up with your hypothesis, and provide backup questions to support your assumptions.

For example, don’t suggest answers in questions.

Bad: “Why do you like this supermarket? Cheap? Convenient?”

Good: “Tell me why you chose this particular supermarket?”

Pro tip #6 – Break into smaller groups of 2 to 3. Dress comfortably and like a local. Keep your expensive belongings out of sight.

During my recent trip, I visited a lot of places that unknowingly had very tight security. One of the mistakes I made was going as a group of 6 (foreign-looking, and – okay –  maybe a little touristy with our appearances and expensive gadgets).

Out of nowhere, once we started approaching customers for interviews, and snapping photos with our cameras and phones, we could see the security teams walking towards us. Unfortunately, we were asked to leave the premises when we could not provide a permit.

As luck would have it, we eyed a few customers and approached them when they were further away from the original location. Success!

Pro tip #7 – Find translators with people skills and interview experience.

Most of my immersion trips are overseas, where English is not the main language. I get annoyed at myself for not being able to interview non-English speaking customers. Having seasoned, outgoing translators does help a lot! If you feel awkward standing around waiting for a translated answer, feel free to step away and let the translator interview the customer without feeling pressured. Be sure it’s all recorded for transcription later.

Insights + Action plan= Strategy

Findings are significant; they are the basis of everything you do while you are in immersion. But what’s more important is the ability to connect those dots and extract value from them. It’s similar to how we can amass tons of raw data, yet it’s entirely pointless if nothing is done with it.

A good strategy usually comes from good insights that are actionable.

For example, we found out that a percentage of the customers we interviewed did not know that GrabFresh has a pool of professional shoppers who pick grocery items for customers. Their impression was that a driver would receive their order, drive to the location, get out of their vehicle and go into the store to do the picking. That’s not how it works, and the misconception hinders customers from making their first purchase through the app.

Observing a personal shopper interacting with a Grab driver-partner.

So, in this case, our hypothesis was: if customers are aware of personal shoppers, the number of orders will increase.

This misconception was widely shared and may have had an impact on our business. So we needed to take it back to the team, look at the data, brainstorm, and come up with a solid strategy to improve the perception and assess its impact on our business (whether good or bad).

Wrapping Up

After a full immersion, it is always important to ask each and every member some of these questions:

“What went well? What did you learn?”

“What can be improved? If you could change one thing, what would it be?”

I usually document the answers and reflect on them myself, so I can pick up on what worked, what didn’t, and continue to improve for my next immersion trip.

Following the Double Diamond framework, immersion trips are part of the "Discover" phase, where we gather customer insights. Typically, I follow up with a Design Sprint workshop where we start framing the problems, and where experts and researchers share their domain knowledge and the research insights uncovered through various methodologies, including immersions.

Then, hopefully, we will have some actionable changes that we can execute confidently.

So, good luck, bring some sunblock and see you on the ground!

If you’d like to connect with Sherizan, you can find him on LinkedIn.

Preventing Pipeline Calls from Crashing Redis Clusters

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-pipeline-calls-from-crashing-redis-clusters

Introduction

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure store, failed and required a replacement. During this period, only roughly 1 in 21 calls to Apollo, a primary transport booking service, succeeded. This brought Grab rides down significantly for the one minute it took the Redis Cluster to self-recover. The behavior was totally unexpected and defeated the very purpose of having multiple replicas.

This blog post describes Grab’s outage post-mortem findings.

Understanding the infrastructure

With Grab’s continuous growth, our services must handle large amounts of data traffic involving high processing power for reading and writing operations. To address this significant growth, reduce handler latency, and improve overall performance, many of our services use Redis – a common in-memory data structure storage – as a cache, database, or message broker. Furthermore, we use a Redis Cluster, a distributed implementation of Redis, for shorter latency and higher availability.

Apollo is our driver-side state machine. It is on almost all requests’ critical path and is a primary component for booking transport and providing great service for customer bookings. It stores individual driver availability in an AWS ElastiCache Redis Cluster, letting our booking service efficiently assign jobs to drivers. It’s critical to keep Apollo running and available 24/7.

Apollo's infrastructure

Because of Apollo’s significance, its Redis Cluster has 3 shards, each with 2 slaves. It hashes all keys and, according to the hash value, divides them into three partitions. Each partition has two replicas to increase reliability.

We use the Go-Redis client, a popular Redis library, to direct all written queries to the master nodes (which then write to their slaves) to ensure consistency with the database.

Master and slave nodes in the Redis Cluster

For reading related queries, engineers usually turn on the ReadOnly flag and turn off the RouteByLatency flag. These effectively turn on ReadOnlyFromSlaves in the Grab gredis3 library, so the client directs all reading queries to the slave nodes instead of the master nodes. This load distribution frees up master node CPU usage.

Client reading and writing from/to the Redis Cluster

When designing a system, we consider potential hardware outages and network issues. We also think of ways to ensure our Redis Cluster is highly efficient and available; setting the above-mentioned flags help us achieve these goals.

Ideally, this Redis Cluster configuration would not cause issues even if a master or slave node breaks. Apollo should still function smoothly. So, why did that February Apollo outage happen? Why did a single down slave node cause a 95+% call failure rate to the Redis Cluster during the dim-out time?

Let’s start by discussing how to construct a local Redis Cluster step by step, then try and replicate the outage. We’ll look at the reasons behind the outage and provide suggestions on how to use a Redis Cluster client in Go.

How to set up a local Redis Cluster

1. Download and install Redis from here.

2. Set up configuration files for each node. For example, in Apollo, we have 9 nodes, so we need to create 9 files like the one below, each with a different port number (x).

// file_name: node_x.conf (do not include this line in file)
port 600x
cluster-enabled yes
cluster-config-file cluster-node-x.conf
cluster-node-timeout 5000
appendonly yes
appendfilename node-x.aof
dbfilename dump-x.rdb

3. Initiate each node in an individual terminal tab with:

$PATH/redis-4.0.9/src/redis-server node_1.conf

4. Use this Ruby script to create a Redis Cluster. (Each master has two slaves.)

$PATH/redis-4.0.9/src/redis-trib.rb create --replicas 2 127.0.0.1:6001 ..... 127.0.0.1:6009

>>> Performing Cluster Check (using node 127.0.0.1:6001)
M: 7b4a5d9a421d45714e533618e4a2b3becc5f8913 127.0.0.1:6001
   slots:0-5460 (5461 slots) master
   2 additional replica(s)
S: 07272db642467a07d515367c677e3e3428b7b998 127.0.0.1:6007
   slots: (0 slots) slave
   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8
S: 65a9b839cd18dcae9b5c4f310b05af7627f2185b 127.0.0.1:6004
   slots: (0 slots) slave
   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913
M: 05363c0ad70a2993db893434b9f61983a6fc0bf8 127.0.0.1:6003
   slots:10923-16383 (5461 slots) master
   2 additional replica(s)
S: a78586a7343be88393fe40498609734b787d3b01 127.0.0.1:6006
   slots: (0 slots) slave
   replicates 72306f44d3ffa773810c810cfdd53c856cfda893
S: e94c150d910997e90ea6f1100034af7e8b3e0cdf 127.0.0.1:6005
   slots: (0 slots) slave
   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8
M: 72306f44d3ffa773810c810cfdd53c856cfda893 127.0.0.1:6002
   slots:5461-10922 (5462 slots) master
   2 additional replica(s)
S: ac6ffbf25f48b1726fe8d5c4ac7597d07987bcd7 127.0.0.1:6009
   slots: (0 slots) slave
   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913
S: bc56b2960018032d0707307725766ec81e7d43d9 127.0.0.1:6008
   slots: (0 slots) slave
   replicates 72306f44d3ffa773810c810cfdd53c856cfda893
[OK] All nodes agree about slots configuration.

5. Finally, we try to send queries to our Redis Cluster, e.g.

$PATH/redis-4.0.9/src/redis-cli -c -p 6001 hset driverID 100 state available updated_at 11111

What happens when nodes become unreachable?

Redis Cluster Server

As long as the majority of a Redis Cluster’s masters and at least one slave node for each unreachable master are reachable, the cluster is accessible. It can survive even if a few nodes fail.

Let’s say we have N masters, each with K slaves, and T random nodes become unreachable. The pseudocode below estimates the Redis Cluster’s availability:

if T <= K:
        availability = 100%
else:
        availability = 100% - (1/(N*K - T))
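
For a quick sanity check, here is a minimal Go translation of the pseudocode above, treating availability as a fraction between 0 and 1; the formula is the same rough estimate, not an exact model.

package main

import "fmt"

// clusterAvailability is a direct translation of the pseudocode above:
// n masters, each with k slaves, and t random nodes unreachable.
// It returns a rough availability fraction between 0 and 1.
func clusterAvailability(n, k, t int) float64 {
        if t <= k {
                return 1.0
        }
        return 1.0 - 1.0/float64(n*k-t)
}

func main() {
        // An Apollo-like topology: 3 masters, each with 2 slaves.
        fmt.Println(clusterAvailability(3, 2, 1)) // one node down   -> 1 (fully available)
        fmt.Println(clusterAvailability(3, 2, 3)) // three nodes down -> ~0.67
}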

If you successfully built your own Redis Cluster locally, try killing any node with a simple Ctrl-C. The Redis Cluster broadcasts to all nodes that the killed node is now unreachable, so the other nodes no longer direct traffic to that port.

If you bring this node back up, all nodes know it’s reachable again. If you kill a master node, the Redis Cluster promotes a slave node to a temp master for writing queries.

$PATH/redis-4.0.9/src/redis-server node_x.conf

This information alone can’t answer the big question of why a single slave node failure caused an over 95% failure rate in the Apollo outage; per the theory above, the Redis Cluster should still have been 100% available. We concluded that the Redis Cluster server handled the outage properly and wasn’t the cause of the failure rate, so we turned to the client side and Apollo’s queries.

Golang Redis Cluster Client & Apollo Queries

Apollo’s client side is based on the Go-Redis Library.

During the Apollo outage, we found some code returned many errors during certain pipeline GET calls. When Apollo tried to send a pipeline of HMGET calls to its Redis Cluster, the pipeline returned errors.

First, we looked at the pipeline implementation code in the Go-Redis library. In the function defaultProcessPipeline, the code assigns each command to a Redis node in this line: err := c.mapCmdsByNode(cmds, cmdsMap).

func (c *ClusterClient) mapCmdsByNode(cmds []Cmder, cmdsMap *cmdsMap) error {
        state, err := c.state.Get()
        if err != nil {
                setCmdsErr(cmds, err)
                return err
        }

        cmdsAreReadOnly := c.cmdsAreReadOnly(cmds)
        for _, cmd := range cmds {
                var node *clusterNode
                var err error
                if cmdsAreReadOnly {
                        _, node, err = c.cmdSlotAndNode(cmd)
                } else {
                        slot := c.cmdSlot(cmd)
                        node, err = state.slotMasterNode(slot)
                }
                if err != nil {
                        return err
                }
                cmdsMap.mu.Lock()
                cmdsMap.m[node] = append(cmdsMap.m[node], cmd)
                cmdsMap.mu.Unlock()
        }
        return nil
}

Next, since the ReadOnly flag is on, we look at the cmdSlotAndNode function. As mentioned earlier, you can get better performance by setting ReadOnlyFromSlaves to true, which sets RouteByLatency to false. With RouteByLatency off, latency-based routing never takes priority, and the master nodes do not receive the read commands.

func (c *ClusterClient) cmdSlotAndNode(cmd Cmder) (int, *clusterNode, error) {
        state, err := c.state.Get()
        if err != nil {
                return 0, nil, err
        }

        cmdInfo := c.cmdInfo(cmd.Name())
        slot := cmdSlot(cmd, cmdFirstKeyPos(cmd, cmdInfo))

        if c.opt.ReadOnly && cmdInfo != nil && cmdInfo.ReadOnly {
                if c.opt.RouteByLatency {
                        node, err := state.slotClosestNode(slot)
                        return slot, node, err
                }

                if c.opt.RouteRandomly {
                        node := state.slotRandomNode(slot)
                        return slot, node, nil
                }

                node, err := state.slotSlaveNode(slot)
                return slot, node, err
        }

        node, err := state.slotMasterNode(slot)
        return slot, node, err
}

Now, let’s try and better understand the outage.

  1. When a slave becomes unreachable, all commands assigned to that slave node fail.
  2. We found that, in Grab’s Redis library code, a single error among all the cmds could cause the entire pipeline to fail (see the snippet below).
  3. In addition, engineers return a failure in their code if err != nil. This explains the high failure rate during the outage.
func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

Our next question was, "Why did it take almost one minute for Apollo to recover?" The Redis Cluster broadcasts instantly to its other nodes when one node is unreachable, so we looked at how the client assigns jobs.

When the Redis Cluster client loads the node states, it only refreshes the state once a minute. So there’s a maximum one minute delay of state changes between the client and server. Within that minute, the Redis client kept sending queries to that unreachable slave node.

func (c *clusterStateHolder) Get() (*clusterState, error) {
        v := c.state.Load()
        if v != nil {
                state := v.(*clusterState)
                if time.Since(state.createdAt) > time.Minute {
                        c.LazyReload()
                }
                return state, nil
        }
        return c.Reload()
}

What happened to the write queries? Did we lose new data during that one-minute gap? That’s a very good question! The answer is no, because all write queries went only to the master nodes, and the Redis Cluster client keeps a watcher on the master nodes. So, whenever any master node becomes unreachable, the client is immediately aware of the change and always knows the current state. See the Watcher code.

How to use Go Redis safely?

Redis Cluster Client

One way to avoid a potential outage like our Apollo outage is to create another Redis Cluster client for pipelining only and with a true RouteByLatency value. The Redis Cluster determines the latency according to ping calls to its server.

In this case, all pipelining queries would read through the master nodes if the latency is less than 1ms (code), and as long as the majority side of the partition is alive, the client will get the expected results. More load goes to the masters with this setting, so watch the master nodes’ CPU usage when you make the change. A rough sketch of such a pipeline-only client is shown below.
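
This is a minimal sketch of the idea, assuming the go-redis v6-style API (no context argument); the addresses and keys are illustrative, not our production configuration.

package main

import (
        "fmt"

        "github.com/go-redis/redis"
)

func main() {
        // Illustrative cluster endpoints; replace with your own.
        addrs := []string{"127.0.0.1:6001", "127.0.0.1:6002", "127.0.0.1:6003"}

        // A second client used only for pipeline calls, with RouteByLatency
        // enabled so reads can be routed to the closest (often master) node.
        pipelineClient := redis.NewClusterClient(&redis.ClusterOptions{
                Addrs:          addrs,
                ReadOnly:       true,
                RouteByLatency: true,
        })

        pipe := pipelineClient.Pipeline()
        stateCmd := pipe.HMGet("driverID", "state", "updated_at") // same key as the earlier redis-cli example
        _, _ = pipe.Exec() // individual command errors are inspected below

        vals, err := stateCmd.Result()
        if err != nil {
                fmt.Println("HMGET failed:", err)
                return
        }
        fmt.Println("driver state:", vals)
}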

Pipeline Usage

In some cases, the master nodes might not be able to handle that much extra traffic. Another way to mitigate the impact of an outage is to check for errors on individual queries when errors happen in a pipeline call.

In Grab’s Redis Cluster library, the function Pipeline (PipelineReadOnly) returns a response with an error for each individual reply.

func (c *clientImpl) Pipeline(ctx context.Context, argsList [][]interface{}) ([]gredisapi.ReplyPair, error) {
        defer c.stats.Duration(statsPkgName, metricElapsed, time.Now(), c.getTags(tagFunctionPipeline)...)
        pipe := c.wrappedClient.Pipeline()
        cmds := make([]goredis.Cmder, len(argsList))
        for i, args := range argsList {
                cmd := goredis.NewCmd(args...)
                cmds[i] = cmd
                _ = pipe.Process(cmd)
        }
        _, _ = pipe.Exec()
        return c.wrappedClient.getResultFromCommands(cmds)
}

func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

type ReplyPair struct {
        Value interface{}
        Err   error
}

Instead of returning nil results and an error message as soon as err != nil, callers could check the error on each individual result so that successful queries are not discarded. This might have minimized the outage’s business impact. A rough sketch of this per-result handling follows.
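
Here is a minimal sketch of that per-result handling; ReplyPair mirrors the struct shown above, and the sample values are illustrative only.

package main

import "fmt"

// ReplyPair mirrors the struct shown above: one value/error pair per command.
type ReplyPair struct {
        Value interface{}
        Err   error
}

// handlePipelineResults keeps the successful replies and only reports the
// failed ones, instead of discarding the whole batch on the first error.
func handlePipelineResults(results []ReplyPair) []interface{} {
        ok := make([]interface{}, 0, len(results))
        for i, r := range results {
                if r.Err != nil {
                        fmt.Printf("command %d failed: %v\n", i, r.Err)
                        continue // skip only this command, not the whole pipeline
                }
                ok = append(ok, r.Value)
        }
        return ok
}

func main() {
        results := []ReplyPair{
                {Value: "available", Err: nil},
                {Value: nil, Err: fmt.Errorf("slave node unreachable")},
        }
        fmt.Println("usable replies:", handlePipelineResults(results))
}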

Go Redis Cluster Library

One way to fix the Redis Cluster library is to reload the nodes’ status when an error happens. In the go-redis library, defaultProcessor already has this logic, which can be applied to defaultProcessPipeline.

In Conclusion

We’ve shown how to build a local Redis Cluster server, explained how Redis Clusters work, and identified its potential risks and solutions. Redis Cluster is a great tool to optimize service performance, but there are potential risks when using it. Please carefully consider our points about how to best use it. If you have any questions, please ask them in the comments section.

Guiding you Door-to-Door via our Super App!

Post Syndicated from Grab Tech original https://engineering.grab.com/poi-entrances-venues-door-to-door

Remember landing at an airport or going to your favourite mall and the hassle of finding the pickup spot when you booked a cab? When there are about a million entrances, it can get particularly annoying trying to find the right pickup location!

Rolling out across Southeast Asia is a brand new booking experience from Grab, designed to make it easier for you to make a booking at large venues like airports, shopping centres, and tourist destinations! With the new booking flow, not only is it easier to select one of the pre-designated Grab pickup points, you’ll also find text and image directions to help you navigate your way through the venue for a smoother rendezvous with your driver!

Inspiration behind the work

Finding the pick-up point closest to you, let alone predicting it, is incredibly challenging, especially when you are inside huge buildings or in crowded areas. Neeraj Mishra, Product Owner for Places at Grab, explains: "We rely on GPS data to understand the user’s location, which can be tricky when you are indoors or surrounded by skyscrapers. Since the satellite signal has to go through layers of concrete and steel, it becomes weak, which adds to the inaccuracy. Furthermore, ensuring that passengers and drivers have the same pick-up point in mind can be tricky, especially with venues that have multiple entrances."

Marina One POI

Grab’s data analysis revealed that “rendezvous distance” (walking distance between the selected pick-up point and where the car is waiting) is more than twice the Grab average when the booking is made from large venues such as airports.

To solve this issue, Grab launched "Entrances" (the green dots on the map) last year, which lists the various pick-up points available at a particular building and shows them on the map, allowing users to easily choose the one closest to them and ensuring their drivers know exactly where they want to be picked up from. Since then, Grab has created more than 120,000 such entrances, and we are delighted to share that the average rendezvous distance across all countries has been steadily going down!

Decreasing rendezvous distance across region

One problem remained

But there was still one common pain-point to be solved. Just because a passenger has selected the pick-up point closest to them, doesn’t mean it’s easy for them to find it. This is particularly challenging at very large venues like airports and shopping centres, and especially difficult if the passenger is unfamiliar with the venue, for example – a tourist landing at Jakarta Airport for the very first time. To deliver an even smoother booking and pick-up experience, Grab has rolled out a new feature called Venues – the first in the region – that will give passengers in-app photo and text directions to the pick-up point closest to them.

Let’s break it down! How does it work?

Whether you are a local or a foreigner on holiday or business trip, fret not if you are not too familiar with the place that you are in!

Let’s imagine that you are now at Singapore Changi Airport: your new booking experience will look something like this!

Step 1: Fire up the Grab app and tap on Transport. You will see a welcome screen showing you where you are!

Welcome to Changi Airport

Step 2: On booking screen, you will see a new pickup menu with a list of available pickup points. Confirm the pickup point you want and make the booking!

Booking screen at Changi Airport

Step 3: Once you’ve been allocated a driver, tap on the bubble to get directions to your pick-up point!

Driver allocated at Changi Airport

Step 4: Follow the landmarks and walking instructions and you’ve arrived at your pick-up point!

Directions to pick-up point at Changi Airport

Curious about how we got this done?

Data-Driven Decisions

Based on a thorough data analysis of historical bookings, Grab identified key venues across our markets in Southeast Asia. Then we dispatched our Operations team to the ground, to identify all pick up points and perform detailed on-ground survey of the venue.

Operations Team’s Leg Work

Nagur Hassan, Operations Manager at Grab, explains the process: "For the venue survey process, we send a team equipped with the tools required to capture the details, like cameras, and wifi and bluetooth scanners. Once inside the venue, the team identifies strategic landmarks and clear direction signs related to drop-off and pick-up points. The team also captures turn-by-turn walking directions to make it easier for Grab users to navigate – for instance, walk towards Starbucks and take a left near the H&M store. All the photos and documentation taken on site are then brought back to the office for further processing."

Quality Assurance

Once the data is collected, our in-house team checks the quality of the images and data. We also mask people’s faces and number plates of the vehicles to hide any identity-related information. As of today, we have collected 3400+ images for 1900+ pick up points belonging to 600 key venues! This effort took more than 3000 man-hours in total! And we aim to cover more than 10,000 such venues across the region in the next few months.

This is only the beginning

We’re constantly striving to improve the location accuracy of our passengers by using advanced Machine Learning and constant feedback mechanism. We understand GPS may not always be the most accurate determination of your current location, especially in crowded areas and skyscraper districts. This is just the beginning and we’re planning to launch some very innovative features in the coming months! So stay tuned for more!

Loki, a dynamic mock server for HTTP/TCP testing

Post Syndicated from Grab Tech original https://engineering.grab.com/loki-dynamic-mock-server-http-tcp-testing

Background

In a previous article we introduced Mockers – an innovative tool for local box testing at Grab. Mockers used a Shift Left testing strategy, making testing more effective and cheaper for development teams. Mockers’ popularity and success motivated us to create Loki – a one-stop dynamic mock server for local box testing of mobile apps.

There are some unique challenges in mobile app testing at Grab. End-to-end testing of an app is difficult due to its heavy dependency on backend services and other apps. The staging environment, which hosts a plethora of backend services, is tough to manage and maintain. Issues such as downtime, configuration mismatches, and data corruption can affect staging, adding to the testing woes. Moreover, our apps are fairly complex, using multiple transport protocols such as HTTP, HTTPS, and TCP for various business flows.

The business flows are also complex, requiring exhaustive setup such as credit card payment configuration, location spoofing, and so on, which results in high maintenance costs for automated testing. Loki simulates these flows, so developers can easily test use cases that would take much longer to set up against a real staging backend.

Loki is our attempt to address challenges in mobile app testing by turning every developer local box into a full fledged pseudo backend environment where all mobile workflows can be tested without any external dependencies. It mocks backend services on developer local boxes, decoupling the mobile apps from real backend services, which provides several advantages such as:

No need to deploy frequently to staging

Testing is blocked if the app receives a bad response from staging. In these cases, code changes have to be deployed on staging to fix issues before resuming tests. In contrast, using Loki lets developers continue testing without any immediate need to deploy code changes to staging.

Allows parallel frontend and backend development

Loki acts as a mock backend service when the real backend is still evolving. It lets the frontend development run in parallel with backend development.

Overcome time limitations

In a one week regression-and-release scenario, testing time is limited. However, the application UI rendering and functionality still needs reasonable testing. Loki lets developers concentrate on testing in the available time instead of fixing dependencies on backend services.

Loki – Grab’s solution to simplify mobile apps testing

At Grab, we have multiple mobile apps that are dependent on each other. For example, our Passenger and Driver apps are two sides of a coin; the driver gets a job card only when a passenger requests a booking. These apps are developed by different teams, each with its own release cycle. This can make it tricky to confidently and repeatedly test the whole business flow across apps. Apps also depend on multiple backend services to execute a booking or food order and communicate over different protocols.

Here’s a look at how our mobile apps interact with backend services over different protocols:

Mobile app interaction with backend services

Loki is a dynamic mock server, written in Golang, running in a Docker container on the local box or in CI. It is easy to set up and run through standard Docker commands. In the context of mobile app testing, it plays the role of backend services, so you no longer need to set up an extensive staging environment.

The Loki architecture looks like this:

Loki architecture

The technical challenges we had to overcome

We wanted a comprehensive mocking solution so that teams don’t need to integrate multiple tools to achieve independent testing. It turned out that mocking TCP was most challenging because:

  • It is a long running client-server connection, and it doesn’t follow an HTTP-like request/response pattern.
  • Messages can be sent to the app without an incoming request as well, hence we had to expose a way via Loki to set a mock expectation which can send messages to the app without any request triggering it.
  • As TCP is a long running connection, we needed a way to delimit incoming requests so we know when we can truncate and deserialize the incoming request into JSON.

We engineered the Loki backend to support both HTTP and TCP protocols on different ports. Yet, the mock expectations are set up using RESTful APIs over HTTP for both protocols. A single point of entry for setting expectations made it more intuitive for our developers.

An in-memory cron implementation pushes scheduled messages to the app over a TCP connection. This enabled testing of complex use cases such as drivers getting new job cards, driver and passenger chat workflows, etc. The delimiter for TCP protocol is configurable at start up, so each team can decide when to truncate the request.
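
To make the framing idea concrete, here is a minimal sketch (not Loki’s actual implementation) of reading delimiter-separated JSON messages off a TCP connection in Go; the delimiter byte and port are assumptions for illustration.

package main

import (
        "bufio"
        "bytes"
        "encoding/json"
        "fmt"
        "net"
)

// readDelimited reads delimiter-terminated frames from a TCP connection and
// deserializes each frame into a generic JSON map.
func readDelimited(conn net.Conn, delim byte) error {
        scanner := bufio.NewScanner(conn)
        scanner.Split(func(data []byte, atEOF bool) (int, []byte, error) {
                if i := bytes.IndexByte(data, delim); i >= 0 {
                        return i + 1, data[:i], nil // one complete frame, delimiter dropped
                }
                if atEOF && len(data) > 0 {
                        return len(data), data, nil // trailing partial frame
                }
                return 0, nil, nil // need more data
        })

        for scanner.Scan() {
                var req map[string]interface{}
                if err := json.Unmarshal(scanner.Bytes(), &req); err != nil {
                        fmt.Println("skipping malformed frame:", err)
                        continue
                }
                fmt.Println("received request:", req)
        }
        return scanner.Err()
}

func main() {
        ln, err := net.Listen("tcp", ":9090") // illustrative port
        if err != nil {
                panic(err)
        }
        conn, err := ln.Accept()
        if err != nil {
                panic(err)
        }
        _ = readDelimited(conn, '\n') // assuming newline as the configured delimiter
}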

To enable Loki on our CI, we had to reduce its memory footprint, so we built Loki with pluggable storage. MySQL is used when running locally, and on CI we switch seamlessly to an in-memory cache or Redis.

For testing apps locally, developers must validate complex use cases such as:

  • Payment related flows, which require the response to include the same payment ID as sent in the request. This is a case of simple mapping of request fields in the response JSON.

  • Flows requiring runtime logic execution. For example, a job card sent to a driver must have a valid timestamp, requiring runtime computation on Loki.

To support these cases and many more, we added JavaScript injection capability to Loki. So, when we set an expectation for an HTTP request/response pair or for TCP events, we can specify JavaScript for computing the dynamic response. This is executed in a sandbox by an in-house JS execution library.

Grab follows a transactional workflow for bookings. Over the life of a ride, bookings go through different statuses. So, Loki had to address multiple HTTP requests to the same endpoint returning different responses. This feature is required for successfully mocking a whole ride end-to-end.

Loki uses an HTTP API, "httpTimesAndOrder", for this feature. For example, using "httpTimesAndOrder", you can configure the same status endpoint (/ride/status) to return different ride statuses such as "PICKING" for the first five requests, "IN_RIDE" for the next three requests, and so on.

Now, let’s look at how to use Loki to mock HTTP requests and TCP events.

Mocking HTTP requests

To mock HTTP requests, developers first point their app to send requests to the Loki mock server. Then, they set up expectations for all requests sent to the Loki mock server.

Loki mock server

For example, the Passenger app calls an HTTP dependency GET /closeby/drivers/ to get nearby drivers. To mock it with Loki, you set an expected response on the Loki mock server. When the GET /closeby/drivers/ request is actually made from the Passenger app, Loki returns the set response.

This snippet shows how to set an expected response for the GET /closeby/drivers/ request:

Loki API: POST `/api/v1/expectations`

Request Body :

{
  "uriToMock": "/closeby/drivers",
  "method": "GET",
  "response": {
    "drivers": [
      1001,
      1002,
      1010
    ]
  }
}
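
As a usage sketch, a developer could register this expectation over HTTP from a small Go program; the Loki host and port below are assumptions for a local setup.

package main

import (
        "bytes"
        "fmt"
        "net/http"
)

func main() {
        // Expectation payload matching the example above.
        expectation := []byte(`{
          "uriToMock": "/closeby/drivers",
          "method": "GET",
          "response": { "drivers": [1001, 1002, 1010] }
        }`)

        // localhost:8080 is illustrative; point this at your Loki container.
        resp, err := http.Post(
                "http://localhost:8080/api/v1/expectations",
                "application/json",
                bytes.NewReader(expectation),
        )
        if err != nil {
                panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("expectation registered, status:", resp.Status)
}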

Workflow for setting expectations and receiving responses

Mocking TCP events

Developers point their app to Loki over a TCP connection and set up the TCP expectations. Loki then generates scheduled events such as sending push messages (job cards, notifications, etc) to the apps pointing at Loki.

For example, if the Driver app, after it starts, wants to get a job card, you can set an expectation in Loki to push a job card over the TCP connection to the Driver app after a scheduled time interval.

This snippet shows how to set the TCP expectation and schedule a push message:

Loki API: POST `/api/v1/tcp/expectations/pushmessage`

Request Body :

{
  "name": "samplePushMsg",
  "msgSequence": [
    {
      "messages": {
        "body": {
          "jobCardID": 1001
        }
      }
    },
    {
      "messages": {
        "body": {
          "jobCardID": 1002
        }
      }
    }
  ],
  "schedule": "@every 1m"
}

Workflow for scheduling a push message over TCP

Some example use cases

Now that you know about Loki, let’s look at some example use cases.

Generating a custom response at runtime

Our first example is customizing a runtime response for both HTTP and TCP requests. This is helpful when developers need dynamic responses to requests. For example, you can add parameters from the request URL or request body to the runtime response.

It’s simple to implement this with a JavaScript function. Assume you want to embed a message parameter in the request URL to the response. To do this, you first use a POST method to set up the expectation (in JSON format) for the request on Loki:

Loki API: POST `/api/v1/feature/expectations`

Request Body :

{
  "expectations": [{
    "name": "Sample call",
    "desc": "v1/test/{name}",
    "tags": "v1/test/{name}",
    "resource": "/v1/test?name=user1",
    "verb": "POST",
    "response": {
      "body": "{ \"msg\": \"Hi \"}",
      "status": 200
    },
    "clientOptions": {
"javascript": "function main(req, resp) { var url = req.RequestURI; var captured = /name=([^&]+)/.exec(url)[1]; resp.msg =  captured ? resp.msg + captured : resp.msg + 'myDefaultValue'; return resp }"
    },
    "isActive": 1
  }]
}

When Loki receives the request, the JavaScript function set in the clientOptions key adds the name to the response at runtime. For example, this is the request’s fixed response:

{
    "msg": "Hi "
}

But, after using the JavaScript function to add the URL parameter, the dynamic response is:

{
    "msg": "Hi user1"
}

Similarly, you can use JavaScript to add other dynamic responses such as modifying the response’s JSON array, adding parameters to push messages, etc.

Defining a response sequence for mocked API endpoints

Here’s another interesting example – defining the response sequence for API endpoints.

A response sequence is useful when you need different responses from the same API endpoint. For example, a status endpoint should return different ride statuses such as ‘allocating’, ‘allocated’, ‘picking’, etc. depending on the stage of a ride.

To do this, developers set up their HTTP expectations on Loki. Then, they easily define the response sequence for an API endpoint using a Loki POST method.

In this example:

  • times – specifies the number of times the same response is returned.
  • after – specifies one or more expectations that must match before a specified expectation is matched.

Here, the expectations are matched in this sequence when a request is made to the endpoint – Allocating > Allocated > Pickuser > Completed. Further, Completed has times set to 2, so Loki returns that response twice.

Loki API: POST `/api/v1/feature/sequence`

Request Body :
  "httpTimesAndOrder": [
      {
          "name": "Allocating",
          "times": 1
      },
      {
          "name": "Allocated",
          "times": 1,
          "after": ["Allocating"]
      },
      {
          "name": "Pickuser",
          "times": 1,
          "after": ["Allocated"]
      },
      {
          "name": "Completed",
          "times": 2,
          "after": ["Pickuser"]
      }
  ]
}

In conclusion

Since Loki’s inception, we have set up full-range CI with proper end-to-end app UI tests and, to a great extent, decoupled our app releases from the staging backend. This has improved delivery cycles, helped us catch bugs faster, and enabled more exhaustive testing. Moreover, both developers and QAs can easily play with the apps to perform exploratory testing as well as manual functional validations. Teams are also using Loki to run automated scripts (Espresso and XCUITest) for validating the mobile app pages.

Loki’s adoption is growing steadily at Grab. With our frequent release of new mobile app features, Loki helps teams meet our high quality bar and achieve huge productivity gains.

If you have any feedback or questions on Loki, please leave a comment.

How we harnessed the wisdom of crowds to improve restaurant location accuracy

Post Syndicated from Grab Tech original https://engineering.grab.com/correcting-restaurant-locations-harnessing-wisdom-of-the-crowd

While studying GPS ping data to understand how long our driver-partners needed to spend at restaurants during a GrabFood delivery, we came across an interesting observation. We realized that there was a significant proportion of restaurants where our driver-partners were waiting for abnormally short durations, often for just seconds.

Considering that it typically takes a driver a few minutes to enter the restaurant, pick up the order and then leave, we decided to dig further into this phenomenon. What we uncovered was that these super short pit stops happened at restaurants that were registered at incorrect coordinates in our system, for reasons such as the restaurant having moved to a new location, or human error when onboarding the restaurant. Incorrectly registered locations impact all involved parties – eaters may not see the restaurant because it falls outside their delivery radius, or they may see an incorrect ETA; drivers may have trouble finding the restaurant and may end up having to cancel the order; and restaurants may get fewer orders without really knowing why.

So we asked ourselves – how can we improve this situation by leveraging the wealth of data that we have? 

The Solution

One of our biggest advantages is the huge driver-partner fleet we have on the ground in cities across Southeast Asia. They know the roads and cities like the back of their hand, and they are resourceful. As a result, they are often able to find the restaurants and complete orders even if the location was registered incorrectly. Knowing this, we looked at GPS pings and timestamps from these drivers, and combined this information with when they indicated that they had ordered or collected food from the restaurant. This is then used to infer the "pick-up location" from which the food was collected.

Inferring this location is not so straightforward, though. GPS ping quality can vary significantly across devices and is affected by whether the device is outdoors or indoors (e.g. if the restaurant is inside a mall). Hence, we compute metrics from the times and distances between pings, ping frequency, and ping quality to filter out orders where the GPS quality is sub-par. The thresholds for this filtering are set based on a statistical analysis of orders by region and time of day.

One of the outcomes of this analysis is that we deemed it acceptable to consider a driver "at" a restaurant if their GPS ping falls within a predetermined radius of the restaurant’s registered location. However, knowing that a driver is at the restaurant does not necessarily tell us "when" he or she is actually at the restaurant. See the following figure for an example.

Map showing driver paths and GPS location

 

As you can see from the area covered by the green circle, there are 3 distinct occurrences or “streaks” when the driver can be determined to be at the restaurant location – once when they are approaching the restaurant from the southwest before taking two right turns, then again when they are actually at the restaurant coming in from the northeast, and again when they leave the restaurant heading southwest before making a U-turn and then heading northeast. In this case, if the driver indicates that they have collected the food during the second streak, chronology is respected – the driver reaches the restaurant, the driver collects the food, the driver leaves the restaurant. However if the driver indicates that they have collected the food during one of the other streaks, that is an invalid pick-up even though it is “at” the restaurant.

Such potentially invalid pick-ups could result in noisy estimates of the restaurant location, as well as hamper our parent task of accurately estimating how long drivers need to wait at restaurants. Therefore, we modify the definition of the driver being at the restaurant to only include the time of the longest streak, i.e. the single longest continuous period the driver spent within the registered location radius.
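
As a rough illustration of this streak logic (a sketch under simplified assumptions, not our production pipeline), the snippet below walks through a driver’s time-ordered pings, groups consecutive pings falling within the registered radius into streaks using a haversine distance, and keeps only the longest one:

package main

import (
        "fmt"
        "math"
        "time"
)

// Ping is a single GPS ping from a driver, ordered by time.
type Ping struct {
        Lat, Lng float64
        At       time.Time
}

// haversineMeters returns the great-circle distance between two coordinates.
func haversineMeters(lat1, lng1, lat2, lng2 float64) float64 {
        const earthRadius = 6371000.0 // meters
        toRad := func(d float64) float64 { return d * math.Pi / 180 }
        dLat := toRad(lat2 - lat1)
        dLng := toRad(lng2 - lng1)
        a := math.Sin(dLat/2)*math.Sin(dLat/2) +
                math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLng/2)*math.Sin(dLng/2)
        return 2 * earthRadius * math.Asin(math.Sqrt(a))
}

// longestStreak returns the start and end of the longest run of consecutive
// pings that fall within radiusMeters of the registered location.
func longestStreak(pings []Ping, lat, lng, radiusMeters float64) (start, end time.Time, found bool) {
        var curStart, curEnd time.Time
        inStreak := false
        best := time.Duration(-1)
        for _, p := range pings {
                if haversineMeters(p.Lat, p.Lng, lat, lng) <= radiusMeters {
                        if !inStreak {
                                curStart, inStreak = p.At, true
                        }
                        curEnd = p.At
                        if d := curEnd.Sub(curStart); d > best {
                                best, start, end, found = d, curStart, curEnd, true
                        }
                } else {
                        inStreak = false
                }
        }
        return start, end, found
}

func main() {
        now := time.Now()
        pings := []Ping{
                {Lat: 1.35210, Lng: 103.81980, At: now},                       // near the restaurant
                {Lat: 1.35211, Lng: 103.81982, At: now.Add(30 * time.Second)}, // still near
                {Lat: 1.36000, Lng: 103.83000, At: now.Add(60 * time.Second)}, // driven away
        }
        // Illustrative registered location and a 150m radius.
        start, end, ok := longestStreak(pings, 1.3521, 103.8198, 150)
        fmt.Println(start, end, ok)
}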

Extending this across multiple orders and drivers, we can form a cluster of pick-up locations (both “at” and otherwise) for each restaurant. Each restaurant then gets ranked through a combination of:

Order volume: Restaurants which receive more orders are likely to have more valid signals for any predictions we make, increasing the confidence we have in our estimates.

Fraction of orders where the pick-up location was not "at" the restaurant: This fraction indicates the share of orders with a pick-up location not near the registered restaurant location (with "near" defined both spatially and temporally as above). A higher value indicates a higher likelihood that the restaurant is not at its registered location, subject to order volume.

Median distance between registered and estimated locations: This factor is used to rank restaurants by a notion of "importance". A restaurant which is just outside the fixed radius from above can be addressed after another restaurant which is a kilometer away. (A rough sketch of how these factors could be combined follows.)
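
This is only an illustrative sketch of such a ranking; the minimum order volume and the sort keys are assumptions, not the actual weighting we use.

package main

import (
        "fmt"
        "sort"
)

// RestaurantStats holds the three signals described above for one restaurant.
type RestaurantStats struct {
        Name        string
        OrderVolume int     // number of delivered orders observed
        OffsiteFrac float64 // fraction of orders picked up away from the registered location
        MedianDistM float64 // median distance (m) between registered and estimated locations
}

// rankForReview orders restaurants for manual verification. The order-volume
// threshold and sort keys below are illustrative choices only.
func rankForReview(stats []RestaurantStats, minOrders int) []RestaurantStats {
        candidates := make([]RestaurantStats, 0, len(stats))
        for _, s := range stats {
                if s.OrderVolume >= minOrders { // enough signal to trust the estimate
                        candidates = append(candidates, s)
                }
        }
        sort.Slice(candidates, func(i, j int) bool {
                if candidates[i].OffsiteFrac != candidates[j].OffsiteFrac {
                        return candidates[i].OffsiteFrac > candidates[j].OffsiteFrac // most suspicious first
                }
                return candidates[i].MedianDistM > candidates[j].MedianDistM // then by how far off they are
        })
        return candidates
}

func main() {
        ranked := rankForReview([]RestaurantStats{
                {Name: "A", OrderVolume: 120, OffsiteFrac: 0.85, MedianDistM: 900},
                {Name: "B", OrderVolume: 15, OffsiteFrac: 0.95, MedianDistM: 1500},
                {Name: "C", OrderVolume: 200, OffsiteFrac: 0.40, MedianDistM: 120},
        }, 50)
        fmt.Println(ranked) // B is filtered out for low volume; A outranks C
}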

This ranked list of restaurants is then passed on to our mapping operations team to verify. The team checks various sources to confirm whether a restaurant is incorrectly located; the verified findings are then fed back into the GrabFood system and the locations updated accordingly.

Results

  • We have a system to catch and fix obvious errors

The table below shows a few examples of errors we were able to catch and fix. The image on the left shows the distance between an incorrectly registered address and the actual location of the restaurant.

Examples (images): Sederhana Minang, Papa Ron's Pizza, and Rich-O Donuts & Cafe, each shown with the path from the registered location to the estimated location and a zoomed-in view of the estimated location.

Fixing these errors periodically greatly reduced the median error distance (measured as the straight line distance between the estimated location and registered location) in each city as restaurant locations were corrected.

Charts: median error distance in Bangkok and in Ho Chi Minh.
  • We helped to reduce cancellations

We also tracked the number of GrabFood orders cancelled because the restaurant could not be found by our driver-partners as indicated on the app. Once we started making periodic updates, we saw a 5x decrease in cancellations because of incorrect restaurant locations. 

Relative cancellation rate due to incorrect location
  • We discovered some interesting findings!

In some cases, we were actually stumped when trying to correct some of the locations according to what the system estimated. One of the most interesting examples was the restaurant “Waroeng Steak and Shake” in Bekasi. According to our system, the restaurant’s location was further up Jalan Raya Jatiwaringin than we thought it to be. 

Waroeng Steak and Shake map location

Examining this on Google Maps, we noticed that both locations oddly seemed to have a branch of the restaurant. What was going on here? 

Waroeng Steak and Shake map location on Google Maps

By looking at Google Reviews (credit to my colleague Kenneth Loh for the idea), we realized that  the restaurant seemed to have changed its location, and this is what our system was picking up on. 

Waroeng Steak and Shake Google Maps reviews

In summary, the system was able to pick up a change in the restaurant’s location without any action taken by the restaurant itself, even while other data sources still had duplicates.

What’s Next?

Going forward, we are looking to automate some aspects of this workflow. Currently, the validation part is handled by our mapping operations team, and we are looking to feed their validations and the actions taken back into the system so that we can fine-tune various hyperparameters (registered location radii, normalization factors, etc.) and/or train more advanced models that are aware of the different geographic and driver characteristics in each market.

Additionally, while we know to expect poor results in some scenarios (e.g. inside malls, due to poor GPS quality and often approximate registered locations), we can extract such information (that a restaurant is inside a mall, in this case) through a combination of manual feedback from operations teams and drivers, and automated NLP techniques such as name and address parsing and entity recognition.

In the end, it is always useful to question the predictions that a system makes. By looking at some abnormally small wait times at restaurants, we were able to discover incorrect locations, provide feedback, and continually update restaurant locations within the GrabFood ecosystem, resulting in an overall better experience for our eaters, driver-partners and merchant-partners.