Tag Archives: Engineering

Protecting Personal Data in Grab’s Imagery

Post Syndicated from Grab Tech original https://engineering.grab.com/protecting-personal-data-in-grabs-imagery

Image Collection Using KartaView

Starting a few years ago, we realised the strong demand to better understand the streets where our drivers and clients go, with the purpose to better fulfil their needs and also to be able to quickly adapt ourselves to the rapidly changing environment in the Southeast Asia cities.

One way to fulfil that demand was to create an image collection platform named KartaView which is Grab Geo’s platform for geotagged imagery. It empowers collection, indexing, storage, retrieval of imagery, and map data extraction.

KartaView is a public, partially open-sourced product, used both internally and externally by the OpenStreetMap community and other users. As of 2021, KartaView has public imagery in over 100 countries with various coverage degrees, and 60+ cities of Southeast Asia. Check it out at www.kartaview.com.

Figure 1 - KartaView platform
Figure 1 – KartaView platform

Why Image Blurring is Important

Many incidental people and licence plates are in the collected images, whose privacy is a serious concern. We deeply respect all of them and consequently, we are using image obfuscation as the most effective anonymisation method for ensuring privacy protection.

Because manually annotating the regions in the picture where faces and licence plates are located is impractical, this problem should be solved using machine learning and engineering techniques. Hence we detect and blur all faces and licence plates which could be considered as personal data.

Figure 2 - Sample blurred picture
Figure 2 – Sample blurred picture

In our case, we have a wide range of picture types: regular planar, very wide and 360 pictures in equirectangular format collected with 360 cameras. Also, because we are collecting imagery globally, the vehicle types, licence plates, and human environments are quite diverse in appearance, and are not handled well by off-the-shelf blurring software. So we built our own custom blurring solution which yielded higher accuracy and better cost-efficiency overall with respect to blurring of personal data.

Figure 3 - Example of equirectangular image where personal data has to be blurred
Figure 3 – Example of equirectangular image where personal data has to be blurred

Behind the scenes, in KartaView, there are a set of cool services which are able to derive useful information from the pictures like image quality, traffic signs, roads, etc. A big part of them are using deep learning algorithms which potentially can be negatively affected by running them over blurred pictures. In fact, based on the assessment we have done so far, the impact is extremely low, similar to the one reported in a well known study of face obfuscation in ImageNet [9].

Outline of Grab’s Blurring Process

Roughly, the processing steps are the following:

  1. Transform each picture into a set of planar images. In this way, we further process all pictures, whatever the format they had, in the same way.
  2. Use an object detector able to detect all faces and licence plates in a planar image having a standard field of view.
  3. Transform the coordinates of the detected regions into original coordinates and blur those regions.
Figure 4 - Picture’s processing steps [8]
Figure 4 – Picture’s processing steps [8]

In the following section, we are going to describe in detail the interesting aspects of the second step, sharing the challenges and how we were solving them. Let’s start with the first and most important part, the dataset.

Dataset

Our current dataset consists of images from a wide range of cameras, including normal perspective cameras from mobile phones, wide field of view cameras and also 360 degree cameras.

It is the result of a series of data collections contributed by Grab’s data tagging teams, which may contain 2 classes of dataset that are of interest for us: FACE and LICENSE_PLATE.

The data was collected using Grab internal tools, stored in queryable databases, making it a system that gives the possibility to revisit the data and correct it if necessary, but also making it possible for data engineers to select and filter the data of interest.

Dataset Evolution

Each iteration of the dataset was made to address certain issues discovered while having models used in a production environment and observing situations where the model lacked in performance.

Dataset v1 Dataset v2 Dataset v3
Nr. images 15226 17636 30538
Nr. of labels 64119 86676 242534

If the first version was basic, containing a rough tagging strategy we quickly noticed that it was not detecting some special situations that appeared due to the pandemic situation: people wearing masks.

This led to another round of data annotation to include those scenarios.
The third iteration addressed a broader range of issues:

  • Small regions of interest (objects far away from the camera)
  • Objects in very dark backgrounds
  • Rotated objects or even upside down
  • Variation of the licence plate design due to images from different countries and regions
  • People wearing masks
  • Faces in the mirror – see below the mirror of the motorcycle
  • But the main reason was because of a scenario where the recording, at the start or end (but not only), had close-ups of the operator who was checking the camera. This led to images with large regions of interest containing the camera operator’s face – too large to be detected by the model.

An investigation in the dataset structure, by splitting the data into bins based on the bbox sizes (in pixels), made something clear: the dataset was unbalanced.

We made bins for tag sizes with a stride of 100 pixels and went up to the max present in the dataset which accounted for 1 sample of size 2000 pixels. The majority of the labels were small in size and the higher we would go with the size, the less tags we would have. This made it clear that we would need more targeted annotations for our dataset to try to balance it.

All these scenarios required the tagging team to revisit the data multiple times and also change the tagging strategy by including more tags that were considered at a certain limit. It also required them to pay more attention to small details that may have been missed in a previous iteration.

Data Splitting

To better understand the strategy chosen for splitting the data we need to also understand the source of the data. The images come from different devices that are used in different geo locations (different countries) and are from a continuous trip recording. The annotation team used an internal tool to visualise the trips image by image and mark the faces and licence plates present in them. We would then have access to all those images and their respective metadata.

The chosen ratios for splitting are:

  • Train 70%
  • Validation 10%
  • Test 20%
Number of train images 12733
Number of validation images 1682
Number of test images 3221
Number of labeled classes in train set 60630
Number of labeled classes in validation set 7658
Number of of labeled classes in test set 18388

The split is not so trivial as we have some requirements and need to complete some conditions:

  • An image can have multiple tags from one or both classes but must belong to just one subset.
  • The tags should be split as close as possible to the desired ratios.
  • As different images can belong to the same trip in a close geographical relation we need to force them in the same subset, thus avoiding similar tags in train and test subsets, resulting in incorrect evaluations.

Data Augmentation

The application of data augmentation plays a crucial role while training the machine learning model. There are mainly three ways in which data augmentation techniques can be applied. They are:

  1. Offline data augmentation – enriching a dataset by physically multiplying some of its images and applying modifications to them.
  2. Online data augmentation – on the fly modifications of the image during train time with configurable probability for each modification.
  3. Combination of both offline and online data augmentation.

In our case, we are using the third option which is the combination of both.

The first method that contributes to offline augmentation is a method called image view splitting. This is necessary for us due to different image types: perspective camera images, wide field of view images, 360 degree images in equirectangular format. All these formats and field of views with their respective distortions would complicate the data and make it hard for the model to generalise it and also handle different image types that could be added in the future.

For this we defined the concept of image views which are an extracted portion (view) of an image with some predefined properties. For example, the perspective projection of 75 by 75 degrees field of view patches from the original image.

Here we can see a perspective camera image and the image views generated from it:

Figure 5 - Original image
Figure 5 – Original image
Figure 6 - Two image views generated
Figure 6 – Two image views generated

The important thing here is that each generated view is an image on its own with the associated tags. They also have an overlapping area so we have a possibility to contain the same tag in two views but from different perspectives. This brings us to an indirect outcome of the first offline augmentation.

The second method for offline augmentation is the oversampling of some of the images (views). As mentioned above, we faced the problem of an unbalanced dataset, specifically we were missing tags that occupied high regions of the image, and even though our tagging teams tried to annotate as many as they could find, these were still scarce.

As our object detection model is an anchor-based detector, we did not even have enough of them to generate the anchor boxes correctly. This could be clearly seen in the accuracy of the previous trained models, as they were performing poorly on bins of big sizes.

By randomly oversampling images that contained big tags, up to a minimum required number, we managed to have better anchors and increase the recall for those scenarios. As described below, the chosen object detector for blurring was YOLOv4 which offers a large variety of online augmentations. The online augmentations used are saturation, exposure, hue, flip and mosaic.

Model

As of summer of 2021, the “to go” solution for object detection in images are convolutional neural networks (CNN), being a mature solution able to fulfil the needs efficiently.

Architecture

Most CNN based object detectors have three main parts: Backbone, Neck and (Dense or Sparse Prediction) Heads. From the input image, the backbone extracts features which can be combined in the neck part to be used by the prediction heads to predict object bounding-boxes and their labels.

Figure 7 - Anatomy of one and two-stage object detectors [1]
Figure 7 – Anatomy of one and two-stage object detectors [1]

The backbone is usually a CNN classification network pretrained on some dataset, like ImageNet-1K. The neck combines features from different layers in order to produce rich representations for both large and small objects. Since the objects to be detected have varying sizes, the topmost features are too coarse to represent smaller objects, so the first CNN based object detectors were fairly weak in detecting small sized objects. The multi-scale, pyramid hierarchy is inherent to CNNs so [2] introduced the Feature Pyramid Network which at marginal costs combines features from multiple scales and makes predictions on them. This or improved variants of this technique is used by most detectors nowadays. The head part does the predictions for bounding boxes and their labels.

YOLO is part of the anchor-based one-stage object detectors family being developed originally in Darknet, an open source neural network framework written in C and CUDA. Back in 2015 it was the first end-to-end differentiable network of this kind that offered a joint learning of object bounding boxes and their labels.

One reason for the big success of newer YOLO versions is that the authors carefully merged new ideas into one architecture, the overall speed of the model being always the north star.

YOLOv4 introduces several changes to its v3 predecessor:

  • Backbone – CSPDarknet53: YOLOv3 Darknet53 backbone was modified to use Cross Stage Partial Network (CSPNet [5]) strategy, which aims to achieve richer gradient combinations by letting the gradient flow propagate through different network paths.
  • Multiple configurable augmentation and loss function types, so called “Bag of freebies”, which by changing the training strategy can yield higher accuracy without impacting the inference time.
  • Configurable necks and different activation functions, they call “Bag of specials”.

Insights

For this task, we found that YOLOv4 gave a good compromise between speed and accuracy as it has doubled the speed of a more accurate two-stage detector while maintaining a very good overall precision/recall. For blurring, the main metric for model selection was the overall recall, while precision and intersection over union (IoU) of the predicted box comes second as we want to catch all personal data even if some are wrong. Having a multitude of possibilities to configure the detector architecture and train it on our own dataset we conducted several experiments with different configurations for backbones, necks, augmentations and loss functions to come up with our current solution.

We faced challenges in training a good model as the dataset posed a large object/box-level scale imbalance, small objects being over-represented in the dataset. As described in [3] and [4], this affects the scales of the estimated regions and the overall detection performance. In [3] several solutions are proposed for this out of which the SPP [6] blocks and PANet [7] neck used in YOLOv4 together with heavy offline data augmentation increased the performance of the actual model in comparison to the former ones.

As we have evaluated the model; it still has some issues:

  • Occlusion of the object, either by the camera view, head accessories or other elements:

These cases would need extra annotation in the dataset, just like the faces or licence plates that are really close to the camera and occupy a large region of interest in the image.

  • As we have a limited number of annotations of close objects to the camera view, the model has incorrectly learnt this, sometimes producing false positives in these situations:

Again, one solution for this would be to include more of these scenarios in the dataset.

What’s Next?

Grab spends a lot of effort ensuring privacy protection for its users so we are always looking for ways to further improve our related models and processes.

As far as efficiency is concerned, there are multiple directions to consider for both the dataset and the model. There are two main factors that drive the costs and the quality: further development of the dataset for additional edge cases (e.g. more training data of people wearing masks) and the operational costs of the model.

As the vast majority of current models require a fully labelled dataset, this puts a large work effort on the Data Entry team before creating a new model. Our dataset increased 4x for it’s third version, still there is room for improvement as described in the Dataset section.

As Grab extends its operation in more cities, new data is collected that has to be processed, this puts an increased focus on running detection models more efficiently.

Directions to pursue to increase our efficiency could be the following:

  • As plenty of unlabelled data is available from imagery collection, a natural direction to explore is self-supervised visual representation learning techniques to derive a general vision backbone with superior transferring performance for our subsequent tasks as detection, classification.
  • Experiment with optimisation techniques like pruning and quantisation to get a faster model without sacrificing too much on accuracy.
  • Explore new architectures: YOLOv5, EfficientDet or Swin-Transformer for Object Detection.
  • Introduce semi-supervised learning techniques to improve our model performance on the long tail of the data.

References

  1. Alexey Bochkovskiy et al.. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934v1
  2. Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. arXiv:1612.03144v2
  3. Kemal Oksuz et al.. Imbalance Problems in Object Detection: A Review. arXiv:1909.00169v3
  4. Bharat Singh, Larry S. Davis. An Analysis of Scale Invariance in Object Detection – SNIP. arXiv:1711.08189v2
  5. Chien-Yao Wang et al. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv:1911.11929v1
  6. Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729v4
  7. Shu Liu et al. Path Aggregation Network for Instance Segmentation. arXiv:1803.01534v4
  8. http://blog.nitishmutha.com/equirectangular/360degree/2017/06/12/How-to-project-Equirectangular-image-to-rectilinear-view.html
  9. Kaiyu Yang et al. Study of Face Obfuscation in ImageNet: arxiv.org/abs/2103.06191
  10. Zhenda Xie et al. Self-Supervised Learning with Swin Transformers. arXiv:2105.04553v2

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Processing ETL tasks with Ratchet

Post Syndicated from Grab Tech original https://engineering.grab.com/processing-etl-tasks-with-ratchet

Overview

At Grab, the Lending team is focused towards building products that help finance various segments of users, such as Passengers, Drivers, or Merchants, based on their needs. The team builds products that enable users to avail funds in a seamless and hassle-free way. In order to achieve this, multiple lending microservices continuously interact with each other. Each microservice handles different responsibilities, such as providing offers, storing user information, disbursing availed amounts to a user’s account, and many more.

In this tech blog, we will discuss what Data and Extract, Transform and Load (ETL) pipelines are and how they are used for processing multiple tasks in the Lending Team at Grab. We will also discuss Ratchet, which is a Go library, that helps us in building data pipelines and handling ETL tasks. Let’s start by covering the basis of Data and ETL pipelines.

What is a Data Pipeline?

A Data pipeline is used to describe a system or a process that moves data from one platform to another. In between platforms, data passes through multiple steps based on defined requirements, where it may be subjected to some kind of modification. All the steps in a Data pipeline are automated, and the output from one step acts as an input for the next step.

Data Pipeline
Data Pipeline (Source: Hazelcast)

What is an ETL Pipeline?

An ETL pipeline is a type of Data pipeline that consists of 3 major steps, namely extraction of data from a source, transformation of that data into the desired format, and finally loading the transformed data to the destination. The destination is also known as the sink.

Extract-Transform-Load
Extract-Transform-Load (Source: TatvaSoft)

The combination of steps in an ETL pipeline provides functions to assure that the business requirements of the application are achieved.

Let’s briefly look at each of the steps involved in the ETL pipeline.

Data Extraction

Data extraction is used to fetch data from one or multiple sources with ease. The source of data can vary based on the requirement. Some of the commonly used data sources are:

  • Database
  • Web-based storage (S3, Google cloud, etc)
  • Files
  • User Feeds, CRM, etc.

The data format can also vary from one use case to another. Some of the most commonly used data formats are:

  • SQL
  • CSV
  • JSON
  • XML

Once data is extracted in the desired format, it is ready to be fed to the transformation step.

Data Transformation

Data transformation involves applying a set of rules and techniques to convert the extracted data into a more meaningful and structured format for use. The extracted data may not always be ready to use. In order to transform the data, one of the following techniques may be used:

  1. Filtering out unnecessary data.
  2. Preprocessing and cleaning of data.
  3. Performing validations on data.
  4. Deriving a new set of data from the existing one.
  5. Aggregating data from multiple sources into a single uniformly structured format.

Data Loading

The final step of an ETL pipeline involves moving the transformed data to a sink where it can be accessed for its use. Based on requirements, a sink can be one of the following:

  1. Database
  2. File
  3. Web-based storage (S3, Google cloud, etc)

An ETL pipeline may or may not have a loadstep based on its requirements. When the transformed data needs to be stored for further use, the loadstep is used to move the transformed data to the storage of choice. However, in some cases, the transformed data may not be needed for any further use and thus, the loadstep can be skipped.

Now that you understand the basics, let’s go over how we, in the Grab Lending team, use an ETL pipeline.

Why Use Ratchet?

At Grab, we use Golang for most of our backend services. Due to Golang’s simplicity, execution speed, and concurrency support, it is a great choice for building data pipeline systems to perform custom ETL tasks.

Given that Ratchet is also written in Go, it allows us to easily build custom data pipelines.

Go channels are connecting each stage of processing, so the syntax for sending data is intuitive for anyone familiar with Go. All data being sent and received is in JSON, providing a nice balance of flexibility and consistency.

Utilising Ratchet for ETL Tasks

We use Ratchet for multiple ETL tasks like batch processing, restructuring and rescheduling of loans, creating user profiles, and so on. One of the backend services, named Azkaban, is responsible for handling various ETL tasks.

Ratchet uses Data Processors for building a pipeline consisting of multiple stages. Data Processors each run in their own goroutine so all of the data is processed concurrently. Data Processors are organised into stages, and those stages are run within a pipeline. For building an ETL pipeline, each of the three steps (Extract, Transform and Load) use a Data Processor for implementation. Ratchet provides a set of built-in, useful Data Processors, while also providing an interface to implement your own. Usually, the transform stage uses a Custom Data Processor.

Data Processors in Ratchet
Data Processors in Ratchet (Source: Github)

Let’s take a look at one of these tasks to understand how we utilise Ratchet for processing an ETL task.

Whitelisting Merchants Through ETL Pipelines

Whitelisting essentially means making the product available to the user by mapping an offer to the user ID. If a merchant in Thailand receives an option to opt for Cash Loan, it is done by whitelisting that merchant. In order to whitelist our merchants, our Operations team uses an internal portal to upload a CSV file with the user IDs of the merchants and other required information. This CSV file is generated by our internal Data and Risk team and handed over to the Operations team. Once the CSV file is uploaded, the user IDs present in the file are whitelisted within minutes. However, a lot of work goes in the background to make this possible.

Data Extraction

Once the Operations team uploads the CSV containing a list of merchant users to be whitelisted, the file is stored in S3 and an entry is created on the Azkaban service with the document ID of the uploaded file.

File upload by Operations team
File upload by Operations team

The data extraction step makes use of a Custom CSV Data Processor that uses the document ID to first create a PreSignedUrl and then uses it to fetch the data from S3. The data extracted is in bytes and we use commas as the delimiter to format the CSV data.

Data Transformation

In order to transform the data, we define a Custom Data Processor that we call a Transformer for each ETL pipeline. Transformers are responsible for applying all necessary transformations to the data before it is ready for loading. The transformations applied in the merchant whitelisting transformers are:

  1. Convert data from bytes to struct.
  2. Check for presence of all mandatory fields in the received data.
  3. Perform validation on the data received.
  4. Make API calls to external microservices for whitelisting the merchant.

As mentioned earlier, the CSV file is uploaded manually by the Operations team. Since this is a manual process, it is prone to human errors. Validation of data in the data transformation step helps avoid these errors and not propagate them further up the pipeline. Since CSV data consists of multiple rows, each row passes through all the steps mentioned above.

Data Loading

Whenever the merchants are whitelisted, we don’t need to store the transformed data. As a result, we don’t have a loadstep for this ETL task, so we just use an Empty Data Processor. However, this is just one of many use cases that we have. In cases where the transformed data needs to be stored for further use, the loadstep will have a Custom Data Processor, which will be responsible for storing the data.

Connecting All Stages

After defining our Data Processors for each of the steps in the ETL pipeline, the final piece is to connect all the stages together. As stated earlier, the ETL tasks have different ETL pipelines and each ETL pipeline consists of 3 stages defined by their Data Processors.

In order to connect these 3 stages, we define a Job Processor for each ETL pipeline. A Job Processor represents the entire ETL pipeline and encompasses Data Processors for each of the 3 stages. Each Job Processor implements the following methods:

  1. SetSource: Assigns the Data Processor for the Extraction stage.
  2. SetTransformer: Assigns the Data Processor for the Transformation stage.
  3. SetDestination: Assigns the Data Processor for the Load stage.
  4. Execute: Runs the ETL pipeline.
Job processors containing Data Processor for each stage in ETL
Job processors containing Data Processor for each stage in ETL

When the Azkaban service is initialised, we run the SetSource(), SetTransformer() and SetDestination() methods for each of the Job Processors defined. When an ETL task is triggered, the Execute() method of the corresponding Job Processor is run. This triggers the ETL pipeline and gradually runs the 3 stages of ETL pipeline. For each stage, the Data Processor assigned during initialisation is executed.

Conclusion

ETL pipelines help us in streamlining various tasks in our team. As showcased through the example in the above section, an ETL pipeline breaks a task into multiple stages and divides the responsibilities across these stages.

In cases where a task fails in the middle of the process, ETL pipelines help us determine the cause of the failure quickly and accurately. With ETL pipelines, we have reduced the manual effort required for validating data at each step and avoiding propagation of errors towards the end of the pipeline.

Through the use of ETL pipelines and schedulers, we at Lending have been able to automate the entire pipeline for many tasks to run at scheduled intervals without any manual effort involved at all. This has helped us tremendously in reducing human errors, increasing the throughput of the system and making the backend flow more reliable. As we continue to automate more and more of our tasks that have tightly defined stages, we foresee a growth in our ETL pipelines usage.

References

https://www.alooma.com/blog/what-is-a-data-pipeline

http://rkulla.blogspot.com/2016/01/data-pipeline-and-etl-tasks-in-go-using

https://medium.com/swlh/etl-pipeline-and-data-pipeline-comparison-bf89fa240ce9

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

App Moduralisation at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/app-moduralisation-at-scale

Grab a coffee ☕️, sit back and enjoy reading. 😃

Wanna know how we improved our app’s build time performance and developer experience at Grab? Continue reading…

Where it all began

Imagine you are working on an app that grows continuously as more and more features are added to it, it becomes challenging to manage the code at some point. Code conflicts increase due to coupling, development slows down, releases take longer to ship, collaboration becomes difficult, and so on.

Grab superapp is one such app that offers many services like booking taxis, ordering food, payments using an e-wallet, transferring money to friends/families, paying at merchants, and many more, across Southeast Asia.

Grab app followed a monolithic architecture initially where the entire code was held in a single module containing all the UI and business logic for almost all of its features. But as the app grew, new developers were hired, and more features were built, it became difficult to work on the codebase. We had to think of better ways to maintain the codebase, and that’s when the team decided to modularise the app to solve the issues faced.

What is Modularisation?

Breaking the monolithic app module into smaller, independent, and interchangeable modules to segregate functionality so that every module is responsible for executing a specific functionality and will contain everything necessary to execute that functionality.

Modularising the Grab app was not an easy task as it brought many challenges along with it because of its complicated structure due to the high amount of code coupling.

Approach and Design

We divided the task into the following sub-tasks to ensure that only one out of many functionalities in the app was impacted at a time.

  • Setting up the infrastructure by creating Base/Core modules for Networking, Analytics, Experimentation, Storage, Config, and so on.
  • Building Shared Library modules for Styling, Common-UI, Utils, etc.
  • Incrementally building Feature modules for user-facing features like Payments Home, Wallet Top Up, Peer-to-Merchant (P2M) Payments, GrabCard and many others.
  • Creating Kit modules for the feature to feature module communication. This step helped us in building the feature modules in parallel.
  • Finally, the App module is used as a hub to connect all the other modules together using dependency injection (Dagger).
Modularised app structure
Modularised app structure

In the above diagram, payments-home, wallet top-up, and grabcard are different features provided by the Grab app. top-up-kit and grabcard-kit are bridges that expose functionalities from topup and grabcard modules to the payments-home module, respectively.

In the process of modularising the Grab app, we ensured that a feature module did not directly depend on other feature modules so that they could be built in parallel using the available CPU cores of the machine, hence reducing the overall build time of the app.

With the Kit module approach, we separated our code into independent layers by depending only on abstractions instead of concrete implementation.

Modularisation Benefits

  • Faster build times and hence faster CI: Gradle build system compiles only the changed modules and uses the binaries of all the non-affected modules from its cache. So the compilation becomes faster. Moreover, independent modules are run in parallel on different threads.
  • Fine dependency graph: Dependencies of a module are well defined.
  • Reusability across other apps: Modules can be used across different apps by converting them into an AAR SDK.
  • Scale and maintainability: Teams can work independently on the modules owned by them without blocking each other.
  • Well-defined code ownership: Clear responsibility on who owns which code.

Limitations

  • Requires more effort and time to modularise an app.
  • Separate configuration files to be maintained for each module.
  • Gradle sync time starts to grow.
  • IDE becomes very slow and its memory usage goes up a lot.
  • Parallel execution of the module depends on the machine’s capabilities.

Where we are now

There are more than 1,000 modules in the Grab app and are still counting.

At Grab, we have many sub-teams which take care of different features available in the app. Grab Financial Group (GFG) is one such sub-team that handles everything related to payments in the app. For example: P2P & P2M money transfers, e-Wallet activation, KYC, and so on.

We started modularising payments further in July 2020 as it was already bombarded with too many features and it was difficult for the team to work on the single payments module. The result of payments modularisation is shown in the following chart.

Build time graph of payments module
Build time graph of payments module

As of today, we have about 200+ modules in GFG and more than 95% of the modules take less than 15s to build.

Conclusion

Modularisation has helped us a lot in reducing the overall build time of the app and also, in improving the developer experience by breaking dependencies and allowing us to define code ownership. Having said that, modularisation is not an easy or a small task, especially for large projects with legacy code. However, with careful planning and the right design, modularisation can help in forming a well-structured and maintainable project.

Hope you enjoyed reading. Don’t forget to 👏.

References:

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

App Modularisation at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/app-modularisation-at-scale

Grab a coffee ☕️, sit back and enjoy reading. 😃

Wanna know how we improved our app’s build time performance and developer experience at Grab? Continue reading…

Where it all began

Imagine you are working on an app that grows continuously as more and more features are added to it, it becomes challenging to manage the code at some point. Code conflicts increase due to coupling, development slows down, releases take longer to ship, collaboration becomes difficult, and so on.

Grab superapp is one such app that offers many services like booking taxis, ordering food, payments using an e-wallet, transferring money to friends/families, paying at merchants, and many more, across Southeast Asia.

Grab app followed a monolithic architecture initially where the entire code was held in a single module containing all the UI and business logic for almost all of its features. But as the app grew, new developers were hired, and more features were built, it became difficult to work on the codebase. We had to think of better ways to maintain the codebase, and that’s when the team decided to modularise the app to solve the issues faced.

What is Modularisation?

Breaking the monolithic app module into smaller, independent, and interchangeable modules to segregate functionality so that every module is responsible for executing a specific functionality and will contain everything necessary to execute that functionality.

Modularising the Grab app was not an easy task as it brought many challenges along with it because of its complicated structure due to the high amount of code coupling.

Approach and Design

We divided the task into the following sub-tasks to ensure that only one out of many functionalities in the app was impacted at a time.

  • Setting up the infrastructure by creating Base/Core modules for Networking, Analytics, Experimentation, Storage, Config, and so on.
  • Building Shared Library modules for Styling, Common-UI, Utils, etc.
  • Incrementally building Feature modules for user-facing features like Payments Home, Wallet Top Up, Peer-to-Merchant (P2M) Payments, GrabCard and many others.
  • Creating Kit modules for the feature to feature module communication. This step helped us in building the feature modules in parallel.
  • Finally, the App module is used as a hub to connect all the other modules together using dependency injection (Dagger).
Modularised app structure
Modularised app structure

In the above diagram, payments-home, wallet top-up, and grabcard are different features provided by the Grab app. top-up-kit and grabcard-kit are bridges that expose functionalities from topup and grabcard modules to the payments-home module, respectively.

In the process of modularising the Grab app, we ensured that a feature module did not directly depend on other feature modules so that they could be built in parallel using the available CPU cores of the machine, hence reducing the overall build time of the app.

With the Kit module approach, we separated our code into independent layers by depending only on abstractions instead of concrete implementation.

Modularisation Benefits

  • Faster build times and hence faster CI: Gradle build system compiles only the changed modules and uses the binaries of all the non-affected modules from its cache. So the compilation becomes faster. Moreover, independent modules are run in parallel on different threads.
  • Fine dependency graph: Dependencies of a module are well defined.
  • Reusability across other apps: Modules can be used across different apps by converting them into an AAR SDK.
  • Scale and maintainability: Teams can work independently on the modules owned by them without blocking each other.
  • Well-defined code ownership: Clear responsibility on who owns which code.

Limitations

  • Requires more effort and time to modularise an app.
  • Separate configuration files to be maintained for each module.
  • Gradle sync time starts to grow.
  • IDE becomes very slow and its memory usage goes up a lot.
  • Parallel execution of the module depends on the machine’s capabilities.

Where we are now

There are more than 1,000 modules in the Grab app and are still counting.

At Grab, we have many sub-teams which take care of different features available in the app. Grab Financial Group (GFG) is one such sub-team that handles everything related to payments in the app. For example: P2P & P2M money transfers, e-Wallet activation, KYC, and so on.

We started modularising payments further in July 2020 as it was already bombarded with too many features and it was difficult for the team to work on the single payments module. The result of payments modularisation is shown in the following chart.

Build time graph of payments module
Build time graph of payments module

As of today, we have about 200+ modules in GFG and more than 95% of the modules take less than 15s to build.

Conclusion

Modularisation has helped us a lot in reducing the overall build time of the app and also, in improving the developer experience by breaking dependencies and allowing us to define code ownership. Having said that, modularisation is not an easy or a small task, especially for large projects with legacy code. However, with careful planning and the right design, modularisation can help in forming a well-structured and maintainable project.

Hope you enjoyed reading. Don’t forget to 👏.

References:

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Adding support for cross-cluster associations to Rails 7

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2021-07-12-adding-support-cross-cluster-associations-rails-7/

Ever since we made the leap at GitHub to upgrade off our fork of Rails and worked hard to stay up to date with the latest releases, we’ve consistently looked for ways to improve the Rails framework upstream. We do this in many ways – running GitHub off of Rails main, reporting and fixing bugs we find, and most importantly pushing functionality upstream that the entire Ruby community can benefit from.

Most recently, we extracted internal functionality to disable joining queries when an association crossed multiple databases. Prior to our work in this area, Rails had no support for handling associations that spanned across clusters; teams had to write SQL to achieve this.

Background

At GitHub, we have 30 databases configured in our Rails monolith—15 primaries and 15 replicas. We use “functional partitioning” to split up our data, which means that each of those 15 primaries has a different schema. In contrast, a “horizontal sharding” approach would have 15 shards with the same schema.

While there are some workarounds for joining across clusters in MySQL, they are usually not performant or else they require additional setup. Without these workarounds, attempting to join from a table in cluster A to a table in cluster B would result in an error. To work around this limitation, teams had to write SQL, selecting IDs from the first table to then use in the second query to find the appropriate records. This was extra work and could be error-prone. We had an opportunity to make this process smoother by implementing non-join queries in Rails.

Let’s look at some code to see how this works:

Let’s say we have three models: Dog, Human, and Treat.

# table dogs in database animals
class Dog < AnimalsRecord  
  has_many: treats, through: :humans
  has_many :humans
end

# table humans in database people
class Human < PeopleRecord
  has_many :treats
  has_many :dogs
end

# table treats in database default
class Treat < ApplicationRecord
  has_many :dogs, through: :humans
  has_many :humans
end

If our Rails application code loaded the dog.treats association, usually that would automatically perform a join query:

SELECT treats.* FROM treats INNER JOIN humans ON treats.human_id = humans.id WHERE humans.dog_id = 2

Looking at the inheritance chain, we can see that Dog, Treat, and Human all inherit from different base classes. Each of these base classes belongs to a different database connection, which means that records for all three models are stored in different databases.

Since the data is stored across multiple primaries, when the join on dog.treats is run we’ll see an application error:

ActiveRecord::StatementInvalid (Table 'people_db_cluster.humans' doesn't exist)

One of the best features Rails provides out of the box is generating SQL for you. But since GitHub’s data lives in different databases, we could no longer take advantage of this. We had an opportunity to improve Rails in a way that benefited our engineers and everyone else in the Rails community who uses multiple databases.

Implementation

Prior to our work in this area, engineers working on any associations that crossed database boundaries would be forced to manually query IDs rather than using Active Record’s association APIs. Writing SQL can be error prone and defeats the purpose of Active Record’s convenience methods like dog.treats.

A little over two years ago, we started experimenting with an internal gem to disable joins for cross-database associations. We chose to implement this outside of Rails first so that we could work out the majority of bugs before merging to Rails. We wanted to be sure that we could use it successfully in production and that it didn’t cause any significant friction in development or any performance concerns in production. This is how many of Rails’ popular features get developed. We often extract implementations from large production applications – if it’s something we need and something a lot of applications can benefit from, we make it stable first, then upstream it to Rails.

The overall implementation is relatively small. To accomplish disabling joins, we added an option to has_many :through associations called disable_joins. When set to true for an association, Rails will generate separate queries for each database rather than a join query.

This needed to be an option on the association rather than performed at runtime because Rails associations are lazily loaded – the SQL is generated when the association objects are created, which means that by the time Rails runs the SQL to load dog.treats the join will already be generated. After adding the option in Rails, we implemented a new scoping class that would handle the order, limit, scopes, and other options.

Now applications can add the following to their associations to make sure Rails generates two or more queries instead of joins:

class Dog < AnimalsRecord
  has_many: treats, through: :humans, disable_joins: true
  has_many :humans
end

And that’s all that’s needed to disable generating joins for associations that cross database servers!

Now, calls to dog.treats will generate the following SQL:

SELECT "humans"."id" FROM "humans" WHERE "humans"."dog_id" = ?  [["dog_id", 1]]
SELECT "treats".* FROM "treats" WHERE "treats"."human_id" IN (?, ?, ?)  [["human_id", 1], ["human_id", 2], ["human_id", 3]]

Caveats

There are a couple of important caveats to keep in mind when using this new feature. Applications that need to disable joins may see that those associations have slower database performance. This would be true whether you wrote the SQL manually or use Rails’ new disable_joins feature. Fundamentally, if you’re performing multiple queries across multiple databases, that can be slower than performing a single join query on one database. It’s really important to make sure that your queries are efficient and that proper indexes are in place before using this feature. And as always, when making major changes to your database queries, it’s important to benchmark and understand how those changes will affect your application.

Additionally, if your queries rely on an order and limit from the join database you may see a performance impact on requests. When two queries are joined, MySQL can perform the order based on the table that’s joined (i.e., order by humans.human_id which would order the returned treats by the human ID). However, when you’re splitting queries the order can’t be applied by the database. To solve this, Rails orders the records in-memory based on the order they would have been returned if there was a join. This preserves the expected behavior but since the order and limit are performed in-memory, you’ll usually want to avoid performing these actions on hundreds of thousands of records.

Conclusion

Four years ago, contributing to Rails at this level was just something we’d hoped to be able to do one day. We were so far behind in our upgrades that it was difficult to contribute changes to the framework. This might look like a small change but it’s a clear demonstration of the hard work we’ve done to improve the technical debt in our application and ensure that we’re giving back to the community whenever we can.

By adding support for handling associations across databases, we help empower other applications to scale as their traffic and data grow. Additionally, by pushing this code into Rails and out of our private internal gem we’ll find even more improvements and edge cases in applications that aren’t GitHub. As we continue to grow and improve Rails for us at GitHub, we’ll continue to improve it for the entire community. This pull request is just one example of how we intend to do that for years to come.

GitHub Availability Report: June 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-07-07-github-availability-report-june-2021/

In June, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub engineering blog.

Reshaping Chat Support for Our Users

Post Syndicated from Grab Tech original https://engineering.grab.com/reshaping-chat-support

Introduction

The Grab support team plays a key role in ensuring our users receive support when things don’t go as expected or whenever there are questions on our products and services.

In the past, when users required real-time support, their only option was to call our hotline and wait in the queue to talk to an agent. But voice support has its downsides: sometimes it is complex to describe an issue in the app, and it requires the user’s full attention on the call.

With chat messaging apps growing massively in the last years, chat has become the expected support channel users are familiar with. It offers real-time support with the option of multitasking and easily explaining the issue by sharing pictures and documents. Compared to voice support, chat provides access to the conversation for future reference.

With chat growth, building a chat system tailored to our support needs and integrated with internal data, seemed to be the next natural move.

In our previous articles, we covered the tech challenges of building the chat platform for the web, our workforce routing system and improving agent efficiency with machine learning. In this article, we will explain our approach and key learnings when building our in-house chat for support from a Product and Design angle.

A glimpse at agent and user experience
A glimpse at agent and user experience

Why Reinvent the Wheel

We wanted to deliver a product that would fully delight our users. That’s why we decided to build an in-house chat tool that can:

  1. Prevent chat disconnections and ensure a consistent chat experience: Building a native chat experience allowed us to ensure a stable chat session, even when users leave the app. Besides, leveraging on the existing Grab chat infrastructure helped us achieve this fast and ensure the chat experience is consistent throughout the app. You can read more about the chat architecture here.
  2. Improve productivity and provide faster support turnarounds: By building the agent experience in the CRM tool, we could reduce the number of tools the support team uses and build features tailored to our internal processes. This helped to provide faster help for our users.
  3. Allow integration with internal systems and services: Chat can be easily integrated with in-house AI models or chatbot, which helps us personalise the user experience and improve agent productivity.
  4. Route our users to the best support specialist available: Our newly built routing system accounts for all the use cases we were wishing for such as prioritising certain requests, better distribution of the chat load during peak hours, making changes at scale and ensuring each chat is routed to the best support specialist available.

Fail Fast with an MVP

Before building a full-fledged solution, we needed to prove the concept, an MVP that would have the key features and yet, would not take too much effort if it fails. To kick start our experiment, we established the success criteria for our MVP; how do we measure its success or failure?

Defining What Success Looks Like

Any experiment requires a hypothesis – something you’re trying to prove or disprove and it should relate to your final product. To tailor the final product around the success criteria, we need to understand how success is measured in our situation. In our case, disconnections during chat support was one of the key challenges faced so our hypothesis was:

Starting with Design Sprint

Our design sprint aimed to solutionise a series of problem statements, and generate a prototype to validate our hypothesis. To spark ideation, we run sketching exercises such as Crazy 8, Solution sketch and end off with sharing and voting.


Some of the prototypes built during the Design sprint

Defining MVP Scope to Run the Experiment

To test our hypothesis quickly, we had to cut the scope by focusing on the basic functionality of allowing chat message exchanges with one agent.

Here is the main flow and a sneak peek of the design:

Accepting chats
Accepting chats
Handling concurrent chats
Handling concurrent chats

What We Learnt from the Experiment

During the experiment, we had to constantly put ourselves in our users’ shoes as ‘we are not our users’. We decided to shadow our chat support agents and get a sense of the potential issues our users actually face. By doing so, we learnt a lot about how the tool was used and spotted several problems to address in the next iterations.

In the end, the experiment confirmed our hypothesis that having a native in-app chat was more stable than the previous chat in use, resulting in a better user experience overall.

Starting with the End in Mind

Once the experiment was successful, we focused on scaling. We defined the most critical jobs to be done for our users so that we could scale the product further. When designing solutions to tackle each of them, we ensured that the product would be flexible enough to address future pain points. Would this work for more channels, more users, more products, more countries?

Before scaling, the problems to solve were:

  • Monitoring the performance of the system in real-time, so that swift operational changes can be made to ensure users receive fast support;
  • Routing each chat to the best agent available, considering skills, occupancy, as well as issue prioritisation. You can read more about the our routing system design here;
  • Easily communicate with users and show empathy, for which we built file-sharing capabilities for both users and agents, as well as allowing emojis, which create a more personalised experience.

Scaling Efficiently

We broke down the chat support journey to determine what areas could be improved.

Reducing Waiting Time

When analysing the current wait time, we realised that when there was a surge in support requests, the average waiting time increased drastically. In these cases, most users would be unresponsive by the time an agent finally attends to them.

To solve this problem, the team worked on a dynamic queue limit concept based on Little’s law. The idea is that considering the number of incoming chats and the agents’ capacity, we can forecast the number of users we can handle in a reasonable time, and prevent the remaining from initiating a chat. When this happens, we ensure there’s a backup channel for support so that no user is left unattended.

This allowed us to reduce chat waiting time by ~30% and reduce unresponsive users by ~7%.

Reducing Time to Reply

A big part of the chat time is spent typing the message to send to the user. Although the previous tool had templated messages, we observed that 85% of them were free-typed. This is because agents felt the templates were impersonal and wanted to add their personal style to the messages.

With this information in mind, we knew we could help by providing autocomplete suggestions  while the agents are typing. We built a machine learning based feature that considers several factors such as user type, the entry point to support, and the last messages exchanged, to suggest how the agent should complete the sentence. When this feature was first launched, we reduced the average chat time by 12%!

Read this to find out more about how we built this machine learning feature, from defining the problem space to its implementation.


Reducing the Overall Chat Time

Looking at the average chat time, we realised that there was still room for improvement. How can we help our agents to manage their time better so that we can reduce the waiting time for users in the queue?

We needed to provide visibility of chat durations so that our agents could manage their time better. So, we added a timer at the top of each chat window to indicate how long the chat was taking.

Timer in the minimised chat
Timer in the minimised chat

We also added nudges to remind agents that they had other users to attend to while they were in the chat.

Timer in the maximised chat
Timer in the maximised chat

By providing visibility via prompts and colour-coded indicators to prevent exceeding the expected chat duration, we reduced the average chat time by 22%!

What We Learnt from this Project

  • Start with the end in mind. When you embark on a big project like this, have a clear vision of how the end state looks like and plan each step backwards. How does success look like and how are we going to measure it? How do we get there?
  • Data is king. Data helped us spot issues in real-time and guided us through all the iterations following the MVP. It helped us prioritise the most impactful problems and take the right design decisions. Instrumentation must be part of your MVP scope!
  • Remote user testing is better than no user testing at all. Ideally, you want to do user testing in the exact environment your users will be using the tool but a pandemic might make things a bit more complex. Don’t let this stop you! The qualitative feedback we received from real users, even with a prototype on a video call, helped us optimise the tool for their needs.
  • Address the root cause, not the symptoms. Whenever you are tasked with solving a big problem, break it down into its components by asking “Why?” until you find the root cause. In the first phases, we realised the tool had a longer chat time compared to 3rd party softwares. By iteratively splitting the problem into smaller ones, we were able to address the root causes instead of the symptoms.
  • Shadow your users whenever you can. By looking at the users in action, we learned a ton about their creative ways to go around the tool’s limitations. This allowed us to iterate further on the design and help them be more efficient.

Of course, this would not have been possible without the incredible work of several teams: CSE, CE, Comms platform, Driver and Merchant teams.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Debugging High Latency Due to Context Leaks

Post Syndicated from Grab Tech original https://engineering.grab.com/debugging-high-latency-market-store

Background

Market-Store is an in-house developed general purpose feature store that is used to serve real-time computed machine learning (ML) features. Market-Store has a stringent SLA around latency, throughput, and availability as it empowers ML models, which are used in Dynamic Pricing and Consumer Experience.

Problem

As Grab continues to grow, introducing new ML models and handling increased traffic, Market-Store started to experience high latency. Market-Store’s SLA states that 99% of transactions should be within 200ms, but our latency increased to 2 seconds. This affected the availability and accuracy of our models that rely on Market-Store for real-time features.

Latency Issue

We used different metrics and logs to debug the latency issue but could not find any abnormalities that directly correlated to the API’s performance. We discovered that the problem went away temporarily when we restarted the service. But during the next peak period, the service began to struggle once again and the problem became more prominent as Market-Store’s query per second (QPS) increased.

The following graph shows an increase in the memory used with time over 12 hours. Even as the system load receded, memory usage continued to increase.

The continuous increase in memory consumption indicated the possibility of a memory leak, which occurs when memory is allocated but not returned after its use is over. This results in consistently increasing consumed memory until the service runs out of memory and crashes.

Although we could restart the service and resolve the issue temporarily, the increasing memory use suggested a deeper underlying root cause. This meant that we needed to conduct further investigation with tools that could provide deeper insights into the memory allocations.

Debugging Using Go Tools

PPROF is a profiling tool by Golang that helps to visualise and analyse profiles from Go programmes. A profile is a collection of stack traces showing the call sequences in your programme that eventually led to instances of a particular event i.e. allocation. It also provides details such as Heap and CPU information, which could provide insights into the bottlenecks of the Go programme.

By default, PPROF is enabled on all Grab Go services, making it the ideal tool to use in our scenario. To understand how memory is allocated, we used PPROF to generate Market-Store’s Heap profile, which can be used to understand how inuse memory was allocated for the programme.

You can collect the Heap profile by running this command:

go tool pprof 'http://localhost:6060/debug/pprof/heap'

The command then generates the Heap profile information as shown in the diagram below:

From this diagram, we noticed that a lot of memory was allocated and held by the child context created from Async Library even after the tasks were completed.

In Market-Store, we used the Async Library, a Grab open-source library, which typically used to run concurrent tasks. Any contexts created by the Async Library should be cleaned up after the background tasks are completed. This way, memory would be returned to the service.

However, as shown in the diagram, memory was not being returned, resulting in a memory leak, which explains the increasing memory usage even as Market-Store’s system load decreased.

Uncovering the Real Issue

So we knew that Market-Store’s latency was affected, but we didn’t know why. From the first graph, we saw that memory usage continued to grow even as Market-Store’s system load decreased. Then, PPROF showed us that the memory held by contexts was not cleaned up, resulting in a memory leak.

Through our investigations, we drew a correlation between the increase in memory usage and a degradation in the server’s API latency. In other words, the memory leak resulted in a high memory consumption and eventually, caused the latency issue.

However, there was no change in our service that would have impacted how contexts are created and cleaned up. So what caused the memory leak?

Debugging the Memory Leak

We needed to look into the Async Library and how it worked. For Market-Store, we updated the cache asynchronously for the write-around caching mechanism. We use the Async Library for running the update tasks in the background.

The following code snippet explains how the Async Library works:


async.Consume(context.Background(), runtime.NumCPU()*4, buffer)

// Consume runs the tasks with a specific max concurrency

func Consume(ctx context.Context, concurrency int, tasks chan Task) Task {

   // code...

   return Invoke(ctx, func(context.Context) (interface{}, error) {

       workers := make(chan int, concurrency)

       concurrentTasks := make([]Task, concurrency)

       // code ...

       t.Run(ctx).ContinueWith(ctx, func(interface{}, error) (interface{}, error) {

       // code...

      })

    }

}

func Invoke(ctx context.Context, action Work) Task {

    return NewTask(action).Run(ctx)

}

func(t *task) Run(ctx context.Context) Task {

    ctx, t.cancel = context.WithCancel(ctx)

    go t.run(ctx)

    return t

}

Note: Code that is not relevant to this article was replaced with code.

As seen in the code snippet above, the Async Library initialises the Consume method with a background context, which is then passed to all its runners. Background contexts are empty and do not track or have links to child contexts that are created from them.

In Market-Store, we use background contexts because they are not bound by request contexts and can continue running even after a request context is cleaned up. This means that once the task has finished running, the memory consumed by child contexts would be freed up, avoiding the issue of memory leaks altogether.

Identifying the Cause of the Memory Leak

Upon further digging, we discovered an MR that was merged into the library to address a task cancellation issue. As shown in the code snippet below, the Consume method had been modified such that task contexts were being passed to the runners, instead of the empty background contexts.

func Consume(ctx context.Context, concurrency int, tasks chan Task) Task {

     // code...

     return Invoke(ctx, func(taskCtx context.Context) (interface{}, error) {

         workers := make(chan int, concurrency)

         concurrentTasks := make([]Task, concurrency)

         // code ...

         t.Run(taskCtx).ContinueWith(ctx, func(interface{}, error) (interface{}, error) {

            // code...

        })

     }

}

Before we explain the code snippet, we should briefly explain what Golang contexts are. A context is a standard Golang package that carries deadlines, cancellation signals, and other request-scoped values across API boundaries and between processes. We should always remember to cancel contexts after using them.

Importance of Context Cancellation

When a context is cancelled, all contexts derived from it are also cancelled. This means that there will be no unaccounted contexts or links and it can be achieved by using the Async Library’s CancelFunc.

The Async Library’s CancelFunc method will:

  • Cancel the created child context and its children
  • Remove the parent reference from the child context
  • Stop any associated timers

We should always make sure to call the CancelFunc method after using contexts, to ensure that contexts and memory are not leaked.

Explaining the Impact of the MR

In the previous code snippet, we see that task contexts are passed to runners and they are not being cancelled. The Async Library created task contexts from non-empty contexts, which means the task contexts are tracked by the parent contexts. So, even if the work associated with these task contexts is complete, they will not be cleaned up by the system (garbage collected).

As we started using task contexts instead of background contexts and did not cancel them, the memory used by these contexts was never returned, thus resulting in a memory leak.

It took us several tries to debug and investigate the root cause of Market-Store’s high latency issue and through this incident, we learnt several important things that would help prevent a memory leak from recurring.

  • Always cancel the contexts you’ve created. Leaving it to garbage collection (system cleanup) may result in unexpected memory leaks.

  • Go profiling can provide plenty of insights about your programme, especially when you’re not sure where to start troubleshooting.

  • Always benchmark your dependencies when integrating or updating the versions to ensure they don’t have any performance bottlenecks.


Special thanks to Chip Dong Lim for his contributions and for designing the GIFs included in this article.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

A framework for building Open Graph images

Post Syndicated from Jason Etcovich original https://github.blog/2021-06-22-framework-building-open-graph-images/

You know that feeling when you make your latest hack project public, and you’re ready to share it with the world? And when you go to Twitter to post a link to your repository, you just see a big picture of yourself? We wanted to make that a better experience.

We recently set about creating a framework and service for automatically generating social sharing images for repositories and other resources on GitHub.

Before the update

Before, when you shared a link to a repository on any social media platform, you’d see something like this:

Screenshot of an old Twitter preview for GitHub repo links

We heard from you that seeing the author’s face was unexpected. Plus, there’s not a lot of quick information here, aside from the plaintext title and description.

We do have custom repository images, and you can still use those to give your project some bespoke branding—but most people don’t upload a custom image for their repositories, so we wanted to create a better default experience for every repo on GitHub.

After the update

Now, we generate a new image for you on-the-fly when you share a link to a repository somewhere:

Screenshot of new Twitter preview card for NASA

We create similar cards for issues, pull requests and commits, with more resources coming soon (like Discussions, Releases and Gists):

Screenshot of open graph Twitter card for a pull request

Open Graph image for a pull request

Screenshot of open graph Twitter card for a commit

Open Graph image for a commit

Screenshot of open graph Twitter card for an issue link

Open Graph image for, you guessed it, an issue

What’s going on behind the scenes? A quick intro to Open Graph

Open Graph is a set of standards for websites to be able to declare metadata that other platforms can pick out, to get a TL;DR of the page. You’d declare something like this in your HTML:

<meta property="og:image" content="https://www.rd.com/wp-content/uploads/2020/01/GettyImages-454238885-scaled.jpg" />

In addition to the image, we also define a number of other meta tags that are used for rendering information outside of GitHub, like og:title and og:description.

When a crawler (like Twitter’s crawling bot, which activates any time you share a link on Twitter) looks at your page, it’ll see those meta tags and grab the image. Then, when that platform shows a preview of your website, it’ll use the information it found. Twitter is one example, but virtually all social platforms use Open Graph to unfurl rich previews for links.

How does the image generator work?

I’ll show you! We’ve leveraged the magic of open source technologies to string some tools together. There are a ton of services that do image templating on-demand, but we wanted to deploy our own within our own infrastructure, to ensure that we have the control we need to generate any kind of image.

So: our custom Open Graph image service is a little Node.js app that uses the GitHub GraphQL API to collect data, generates some HTML from a template, and pipes it to Puppeteer to “take a screenshot” of that HTML. This is not a novel idea—lots of companies and projects (like vercel/og-image) use a similar process to generate an image.

We have a couple of routes that match patterns similar to what you’d find on GitHub.com:

// https://github.com/rails/rails/pull/41080
router.get("/:owner/:repo/pull/:number", generateImageMiddleware(Pull));

// https://github.com/rails/rails/issues/41078
router.get("/:owner/:repo/issues/:number", generateImageMiddleware(Issue));

// https://github.com/rails/rails/commit/2afc9059c9eb509f47d94250be0a917059afa1ae
router.get("/:owner/:repo/commit/:oid", generateImageMiddleware(Commit));

// https://github.com/rails/rails/pull/41080/commits/2afc9059c9eb509f47d94250be0a917059afa1ae
router.get("/:owner/:repo/pull/:number/commits/:oid", generateImageMiddleware(Commit));

// https://github.com/rails/rails/*
router.get("/:owner/:repo*", generateImageMiddleware(Repository));

When our application receives a request that matches one of those routes, we use the GitHub GraphQL API to collect some data based on the route parameters and generate an image using code similar to this:

async function generateImage(template, templateData) {
 // Render some HTML from the relevant template
 const html = compileTemplate(template, templateData);
 
 // Create a new page
 const page = await browser.newPage();
 
 // Set the content to our rendered HTML
 await page.setContent(html, { waitUntil: "networkIdle0" });
 
 const screenshotBuffer = await page.screenshot({
   fullPage: false,
   type: "png",
 });
 
 await page.close();
 
 return screenshotBuffer;
}

Some performance gotchas

Puppeteer can be really slow—it’s launching an entire Chrome browser, so some slowness is to be expected. But we quickly saw some performance problems that we just couldn’t live with. Here are a couple of things we did to significantly improve performance of image generation:

waitUntil: networkIdle0 is aggressively patient, so we replaced it

One Saturday night, I was generating and digging through Chromium traces, as one does, to determine why this service was so slow. I dug into these traces with the help of Electron maintainer and semicolon enthusiast @MarshallOfSound. We discovered a huge, two-second block of idle time (in pink):

Screenshot showing two seconds of idle time in Chromium trace

That’s a trace of everything between browser.newPage() and page.close(). The giant pink bar is “idle time,” and (through trial and error) we determined that this was the waitUntil: networkidle0 option passed to page.setContent(). We needed to set this option to say “only continue once all images, fonts, etc have finished loading,” so that we don’t take screenshots before the pages are actually ready. However, it seemed to add a significant amount of idle time, despite the page being ready for a screenshot 300ms in. Per networkidle0‘s docs:

networkidle0 – consider setting content to be finished when there are no more than 0 network connections for at least 500 ms.

We deduced that that big pink block was due to Puppeteer’s backoff time, where it waits 500ms before considering all network connections complete; but the numbers didn’t really line up. That pink bar shouldn’t be nearly that big, at around two seconds instead of the expected 500-ish milliseconds.

So, how did we fix it? Well, we want to wait until all images/fonts have loaded, but clearly Puppeteer’s method of doing so was a little greedy. It’s hard to see in a still image, but the below screenshot shows that all images have been decoded and render by about ~115ms into the trace:

Screenshot showing images decoded and rendered

All we had to do was provide Puppeteer with a different heuristic to know when the page was “done” and ready for a screenshot. Here’s what we came up with:

   // Set the content to our rendered HTML
   await page.setContent(html, { waitUntil: "domcontentloaded" });
 
   // Wait until all images and fonts have loaded
   await page.evaluate(async () => {
     const selectors = Array.from(document.querySelectorAll("img"));
     await Promise.all([
       document.fonts.ready,
       ...selectors.map((img) => {
         // Image has already finished loading, let’s see if it worked
         if (img.complete) {
           // Image loaded and has presence
           if (img.naturalHeight !== 0) return;
           // Image failed, so it has no height
           throw new Error("Image failed to load");
         }
         // Image hasn’t loaded yet, added an event listener to know when it does
         return new Promise((resolve, reject) => {
           img.addEventListener("load", resolve);
           img.addEventListener("error", reject);
         });
       }),
     ]);
   });

This isn’t magic—it’s standard DOM practices. But it was a much better solution for our use-case than the abstraction provided by Puppeteer. We changed waitUntil to domcontentloaded to ensure that the HTML had finished being parsed, then passed a custom function to page.evaluate. This gets run in the context of the page itself but pipes the return value to the outer context. This meant that we could listen for image load events and pause execution until the Promises have been resolved.

You can see the difference in our performance graphs (going from ~2.25 seconds to ~600ms):

Screenshot of difference in performance graphs, difference in our performance graphs, going from ~2.25 seconds to ~600ms

Double your rendering speed with 1mb of memory

More memory means more speed, right? Sure! At GitHub, when we deploy a new service to our internal Kubernetes infrastructure, it gets a default amount of memory: 512MB (technically MiB, but who’s counting?). When we were scaling this service to be enabled for 100% of repositories, we wanted to increase our resource limits to ensure we didn’t see any performance degradations as the service saw more traffic. What we didn’t know was that 512MB was a magic number – and that setting our memory limit to at least 1MB more would unlock significantly better performance within Chromium.

When we bumped that limit, we saw this change:

Graph showing reduction in time to generate image

In production, that was a reduction of almost 500ms to generate an image. It stands to reason that more memory will be “faster” but not that much without any increase in traffic—so what happened? Well, it turns out that Chromium has a flag for devices with less than 512MB of memory and considers these low-spec devices. Chromium uses that flag to run some processes sequentially instead of in parallel, to improve reliability at the cost of performance on devices that couldn’t support increased performance anyway. If you’re interested in running a service like this on your own, check to see if you can bump the memory limit past 512MB – the results are pretty great!

Stats

Generating an image takes 280ms on average. We could go even lower if we wanted to make some other changes, like generating a JPEG instead of a PNG.

The image generator service generates around two million unique-ish images per day. We also return a cached image for 40% of the total requests.

And that’s it! I hope you’re enjoying these images in your Twitter feeds. I know it’s made mine a lot more colorful. If you have any questions or comments, feel free to ping me on Twitter: @JasonEtco!

GitHub Availability Report: May 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-06-02-github-availability-report-may-2021/

Introduction

In May, we experienced two incidents resulting in significant impact and degraded state of availability for API requests, GitHub Pages, GitHub Actions and the GitHub Packages service, specifically the GitHub Packages Container registry service.

May 8 06:46 UTC (46 minutes)

This incident was caused by failures in an underlying MySQL database, which caused some operations to time out for the GitHub Container registry service. During this incident, some customers viewing packages in the UI or interacting with the registry through “docker push” and “docker pull” may have experienced failures as the engineering team investigated the incident. After performing a failover to one of our database replicas, the affected systems were properly restored.

Our internal engineering team is now prioritizing work that will help ensure reduced impact to customers should such underlying outages happen again. This work includes creating internal documentation, dashboards, and enhanced alerts to quickly triage the cause of operation failures. We will also continue to actively maintain and increase replicas in different regions and availability zones that serve as a line of defense against unexpected region outages.

May 16 07:17 UTC (lasting 9 hours 48 minutes)

This incident was caused when a foreign key for scoped tokens exceeded max INT32, which resulted in high failure rates for GitHub Actions and GitHub Pages. It also prevented some access to operations against the GitHub API and low-level git commands, such as “push” and “pull”, using scoped tokens. We mitigated this with a long-running schema migration to change the foreign key to INT64.

Once the foreign key migration was successful, the internal engineering teams then worked to slowly remove token records stored in our cache layer that were considered invalid. After these cached records were removed, newly created API tokens were able to generate new records and API calls resumed working as expected.

Alerting and linting are already in place to help prevent integer overflows in the database. Unfortunately, these mechanisms were not sufficient in this case due to it being a foreign key that predated our linting. In response, we are manually auditing all our INT32 columns and investigating further improvements to our automation to help prevent these types of issues moving forward.

Given the nature of this overflow, only a single GitHub Action used on a single repository received unauthorized access grants for a short period of time. We revoked these grants and confirmed that no unauthorized access was gained through the use of this Action in this repository.

Our internal engineering teams are actively working on reducing the impact and likelihood of this class of issue happening in the future. This work includes tooling to prevent database inconsistencies and improved alerting to allow faster remediation.

In summary

From our open source release of the GitHub Artifact Exporter to our adoption of OpenTelemetry SDKs, you can learn more about what we are working on to improve our internal development tooling and infrastructure in the GitHub Engineering Blog.

Why (and how) GitHub is adopting OpenTelemetry

Post Syndicated from Wolfgang Hennerbichler original https://github.blog/2021-05-26-why-and-how-github-is-adopting-opentelemetry/

Over the years, GitHub engineers have developed many ways to observe how our systems behave. We mostly make use of statsd for metrics, the syslog format for plain text logs and OpenTracing for request traces. While we have somewhat standardized what we emit, we tend to solve the same problems over and over in each new system we develop.

And, while each component serves its individual purpose well, interoperability is a challenge. For example, several pieces of GitHub’s infrastructure use different statsd dialects, which means we have to special-case our telemetry code in different places – a non-trivial amount of work!

Different components can use different vocabularies for similar observability concepts, making investigatory work difficult. For example, there’s quite a bit of complexity around following a GitHub.com request across various telemetry signals, which can include metrics, traces, logs, errors and exceptions. While we emit a lot of telemetry data, it can still be difficult to use in practice.

We needed a solution that would allow us to standardize telemetry usage at GitHub, while making it easy for developers around the organization to instrument their code. The OpenTelemetry project provided us with exactly that!

Introducing OpenTelemetry

OpenTelemetry introduces a common, vendor-neutral format for telemetry signals: OTLP. It also enables telemetry signals to be easily correlated with each other.

Moreover, we can reduce manual work for our engineers by adopting the OpenTelemetry SDKs – their inherent extensibility allows engineers to avoid re-instrumenting their applications if we need to change our telemetry collection backend, and with only one client SDK per-language we can easily propagate best practices among codebases.

OpenTelemetry empowers us to build integrated and opinionated solutions for our engineers. Designing for observability can be at the forefront of our application engineer’s minds, because we can make it so rewarding.

What we’re working on, and why

At the moment, we’re focusing on building out excellent support for the OpenTelemetry tracing signal. We believe that tracing your application should be the main entry point to observability, because we live in a distributed-systems world and tracing illuminates this in a way that’s almost magical.

We believe that tracing allows us to naturally and easily add/derive additional signals once it’s in place: many metrics, for example, can be calculated by backends automatically, tracing events can be converted to detailed logs automatically, and exceptions can automatically be reported to our tracking systems.

We’re building opinionated internal helper libraries for OpenTelemetry, so that engineers can add tracing to their systems quickly, rather than spending hours re-inventing the wheel for unsatisfying results. For example, our helper libraries automatically make sure you’re not emitting traces during your tests, to keep your test suites running smoothly:

# in an initializer, or config/application.rb
OpenTelemetry::SDK.configure do |c|
  if Rails.env.test?
    c.add_span_processor(
      # In production, you almost certainly want BatchSpanProcessor!
      OpenTelemetry::SDK::Trace::Export::SimpleSpanProcessornew(
        OpenTelemetry::SDK::Trace::Export::NoopSpanExporter.new
      )
    )
  end
end

The OpenTelemetry community has started writing libraries that automatically add distributed tracing capabilities to other libraries we rely on – such as the Ruby Postgres adapter, for example. We believe that careful application of auto-instrumentation can yield powerful results that encourage developers to customize tracing to meet their specific needs. Here, in a Rails application that did not previously have any tracing, the following auto-instrumentation was enough for useful, actionable insights:

# in an initializer, or config/application.rb
OpenTelemetry::SDK.configure do |c|
  c.use 'OpenTelemetry::Instrumentation::Rails'
  c.use 'OpenTelemetry::Instrumentation::PG', enable_sql_obfuscation: true
  c.use 'OpenTelemetry::Instrumentation::ActiveJob'
  # This application makes a variety of outbound HTTP calls, with a variety of underlying
  # HTTP client libraries - you may not need this many!
  c.use 'OpenTelemetry::Instrumentation::Faraday'
  c.use 'OpenTelemetry::Instrumentation::Net::HTTP'
  c.use 'OpenTelemetry::Instrumentation::RestClient'
end

This short sample is the standard way to configure the OpenTelemetry SDK—and each c.use line tells the SDK to load and initialize a certain set of auto-instrumentation for us. After that’s happened, we can immediately gain insight into a poorly performing page! This request tracing view below demonstrates a full, crystal-clear trace of database operations from auto-instrumentation alone:

Screenshot of OpenTelemetry tracing to identify poor page performance

We’re building intelligent automatic correlation of the signals that we have today, with OpenTelemetry tracing as the root. By adding trace identifiers into log lines automatically we can connect request traces to logs for example. We have exciting plans about automatically building useful and relevant dashboards, alerts and tools for teams based on these signals. As our OpenTelemetry efforts progress, we’ll shift our focus towards our logging and metrics systems, for the times when tracing isn’t enough. And we’re contributing as much as we can back to the OpenTelemetry project, for everyone to use!

Why should you be excited and how should you get involved?

OpenTelemetry is a CNCF project and it’s been up and running for quite some time. It’s a project with an incredibly wide scope, and one that we believe has real potential to transform how our entire industry approaches one of the most necessary aspects of our work—how to observe and understand our systems. That’s why we’re so excited about it, and why we’re introducing it to our engineers.

Better, more observable distributed systems are possible, and together we have the opportunity to shape them. If it’s interesting to you, we invite you to join the OpenTelemetry community in improving observability for everyone. And if you want to help us make our internal developer experience better, we invite you to join us!

Building a Hyper Self-Service, Distributed Tracing and Feedback System for Rule & Machine Learning (ML) Predictions

Post Syndicated from Grab Tech original https://engineering.grab.com/building-hyper-self-service-distributed-tracing-feedback-system

Introduction

In Grab, the Trust, Identity, Safety, and Security (TISS) is a team of software engineers and AI developers working on fraud detection, login identity check, safety issues, etc. There are many TISS services, like grab-fraud, grab-safety, and grab-id. They make billions of business decisions daily using the Griffin rule engine, which determines if a passenger can book a trip, get a food promotion, or if a driver gets a delivery booking.

There is a natural demand to log down all these important business decisions, store them and query them interactively or in batches. Data analysts and scientists need to use the data to train their machine learning models. RiskOps and customer service teams can query the historical data and help consumers.

That’s where Archivist comes in; it is a new tracing, statistics and feedback system for rule and machine learning-based predictions. It is reliable and performant. Its innovative data schema is flexible for storing events from different business scenarios. Finally, it provides a user-friendly UI, which has access control for classified data.

Here are the impacts Archivist has already made:

  • Currently, there are 2 teams with a total of 5 services and about 50 business scenarios using Archivist. The scenarios include fraud prevention (e.g. DriverBan, PassengerBan), payment checks (e.g. PayoutBlockCheck, PromoCheck), and identity check events like PinTrigger.
  • It takes only a few minutes to onboard a new business scenario (event type), by using the configuration page on the user portal. Previously, it took at least 1 to 2 days.
  • Each day, Archivist logs down 80 million logs to the ElasticSearch cluster, which is about 200GB of data.
  • Each week, Customer Experience (CE)/Risk Ops goes to the user portal and checks Archivist logs for about 2,000 distinct customers. They can search based on numerous dimensions such as the Passenger/DriverID, phone number, request ID, booking code and payment fingerprint.

Background

Each day, TISS services make billions of business decisions (predictions), based on the Griffin rule engine and ML models.

After the predictions are made, there are still some tough questions for these services to answer.

  • If Risk Ops believes a prediction is false-positive, a consumer could be banned. If this happens, how can consumers or Risk Ops report or feedback this information to the new rule and ML model training quickly?
  • As CustomService/Data Scientists investigating any tickets opened due to TISS predictions/decisions, how do you know which rules and data were used? E.g. why the passenger triggered a selfie, or why a booking was blocked.
  • After Data Analysts/Data Scientists (DA/DS) launch a new rule/model, how can they track the performance in fine-granularity and in real-time? E.g. week-over-week rule performance in a country or city.
  • How can DA/DS access all prediction data for data analysis or model training?
  • How can the system keep up with Grab’s business launch speed, with maximum self-service?

Problem

To answer the questions above, TISS services previously used company-wide Kibana to log predictions.  For example, a log looks like: PassengerID:123,Scenario:PinTrigger,Decision:Trigger,.... This logging method had some obvious issues:

  • Logs in plain text don’t have any structure and are not friendly to ML model training as most ML models need processed data to make accurate predictions.
  • Furthermore, there is no fine-granularity access control for developers in Kibana.
  • Developers, DA and DS have no access control while CEs have no access at all. So CE cannot easily see the data and DA/DS cannot easily process the data.

To address all the Kibana log issues, we developed ActionTrace, a code library with a well-structured data schema. The logs, also called documents, are stored in a dedicated ElasticSearch cluster with access control implemented. However, after using it for a while, we found that it still needed some improvements.

  1. Each business scenario involves different types of entities and ActionTrace is not fully self-service. This means that a lot of development work was needed to support fast-launching business scenarios. Here are some examples:
    • The main entities in the taxi business are Driver and Passenger,

    • The main entities in the food business can be Merchant, Driver and Consumer.

    All these entities will need to be manually added into the ActionTrace data schema.

  2.  Each business scenario may have their own custom information logged. Because there is no overlap, each of them will correspond to a new field in the data schema. For example:
    • For any scenario involving payment, a valid payment method and expiration date is logged.
    • For the taxi business, the geohash is logged.
  3.   To store the log data from ActionTrace, different teams need to set up and manage their own ElasticSearch clusters. This increases hardware and maintenance costs.

  4. There was a simple Web UI created for viewing logs from ActionTrace, but there was still no access control in fine granularity.

Solution

We developed Archivist, a new tracing, statistics, and feedback system for ML/rule-based prediction events. It’s centralised, performant and flexible. It answers all the issues mentioned above, and it is an improvement over all the existing solutions we have mentioned previously.

The key improvements are:

  • User-defined entities and custom fields
    • There are no predefined entity types. Users can define up to 5 entity types (E.g. PassengerId, DriverId, PhoneNumber, PaymentMethodId, etc.).
    • Similarly, there are a limited number of custom data fields to use, in addition to the common data fields shared by all business scenarios.
  • A dedicated service shared by all other services
    • Each service writes its prediction events to a Kafka stream. Archivist then reads the stream and writes to the ElasticSearch cluster.
    • The data writes are buffered, so it is easy to handle traffic surges in peak time.
    • Different services share the same Elastic Cloud Enterprise (ECE) cluster, but they create their own daily file indices so the costs can be split fairly.
  • Better support for data mining, prediction stats and feedback
    • Kafka stream data are simultaneously written to AWS S3. DA/DS can use the PrestoDB SQL query engine to mine the data.
    • There is an internal web portal for viewing Archivist logs. Customer service teams and Ops can use no-risk data to address CE tickets, while DA, DS and developers can view high-risk data for code/rule debugging.
  • A reduction of development days to support new business launches
    • Previously, it took a week to modify and deploy the ActionTrace data schema. Now, it only takes several minutes to configure event schemas in the user portal.
  • Saves time in RiskOps/CE investigations
    • With the new web UI which has access control in place, the different roles in the company, like Customer service and Data analysts, can access the Archivist events with different levels of permissions.
    • It takes only a few clicks for them to find the relevant events that impact the drivers/passengers.

Architecture Details

Archivist’s system architecture is shown in the diagram below.

Archivist system architecture
Archivist system architecture
  • Different services (like fraud-detection, safety-allocation, etc.) use a simple SDK to write data to a Kafka stream (the left side of the diagram).
  • In the centre of Archivist is an event processor. It reads data from Kafka, and writes them to ElasticSearch (ES).
  • The Kafka stream writes to the Amazon S3 data lake, so DA/DS can use the Presto SQL query engine to query them.
  • The user portal (bottom right) can be used to view the Archivist log and update configurations. It also sends all the web requests to the API Handler in the centre.

The following diagram shows how internal and external users use Archivist as well as the interaction between the Griffin rule engine and Archivist.

Archivist use cases
Archivist use cases

Flexible Event Schema

In Archivist, a prediction/decision is called an event. The event schema can be divided into 3 main parts conceptually.

  1. Data partitioning: Fields like service_name and event_type categorise data by services and business scenarios.
    Field name Type Example Notes
    service_name string GrabFraud Name of the Service
    event_type string PreRide PaxBan/SafeAllocation
  2. Business decision making: request_id, decisions, reasons, event_content are used to record the business decision, the reason and the context (E.g. The input features of machine learning algorithms).
    Field name Type Example Notes
    request_id string a16756e8-efe2-472b-b614-ec6ae08a5912 a 32-digit id for web requests
    event_content string Event context
    decisions [string] [“NotAllowBook”, “SMS”] A list
    reasons string json payload string of the response from engine.
  3. Customisation: Archivist provides user-defined entities and custom fields that we feel are sufficient and flexible for handling different business scenarios.
    Field name Type Example Notes
    entity_type_1 string Passenger
    entity_id_1 string 12151
    entity_type_2 string Driver
    entity_id_2 string 341521-rdxf36767
    string
    entity_id_5 string
    custom_field_type_1 string “MessageToUser”
    custom_field_1 string “please contact Ops” User defined fields
    custom_field_type_2 “Prediction rule:”
    custom_field_2 string “ML rule: 123, version:2”
    string
    custom_field_6 string

A User Portal to Support Querying, Prediction Stats and Feedback

DA, DS, Ops and CE can access the internal user portal to see the prediction events, individually and on an aggregated city level.

A snapshot of the Archivist logs showing the aggregation of the data in each city
A snapshot of the Archivist logs showing the aggregation of the data in each city

There are graphs on the portal, showing the rule/model performance on individual customers over a period of time.

Rule performance on a customer over a period of time
Rule performance on a customer over a period of time

How to Use Archivist for Your Service

If you want to get onboard Archivist, the coding effort is minimal. Here is an example of a code snippet to log an event:

Code snippet to log an event
Code snippet to log an event

Lessons

During the implementation of Archivist, we learnt some things:

  • A good system needs to support multi-tenants from the beginning. Originally, we thought we could use just one Kafka stream, and put all the documents from different teams into one ElasticSearch (ES) index. But after one team insisted on keeping their data separately from others, we created more Kafka streams and ES indexes. We realised that this way, it’s easier for us to manage data and share the cost fairly.
  • Shortly after we launched Archivist, there was an incident where the ES data writes were choked. Because each document write is a goroutine, the number of goroutines increased to 400k and the memory usage reached 100% within minutes. We added a patch (2 lines of code) to limit the maximum number of goroutines in our system. Since then, we haven’t had any more severe incidents in Archivist.

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Artifact Exporter open source release

Post Syndicated from Jason Macgowan original https://github.blog/2021-05-18-github-artifact-exporter-open-source-release/

GitHub is the home for software development teams and is the place where they collaborate and build. For larger organizations, you might have a dedicated reporting team that wants to export this activity at a granular level, so it can be modified and presented for audits. GitHub provides a powerful API for accessing this data programmatically, but we know that may not be the perfect solution for the many people involved in a given organization. In fact, a common request we’ve seen is for the ability to download issues and other repository data as a CSV file. Sometimes, you just want a spreadsheet!

So, we built the GitHub Artifact Exporter to help reporting teams get the data they need without requiring them to know how to interact with the GitHub API.

What data can you export from GitHub?

GitHub Artifact Exporter provides a CLI and a simple GUI for exporting GitHub Issues and related comments based on a date range, and it supports GitHub’s full search syntax, allowing you to filter results based on your search parameters

The CLI also supports exporting:

  • Commits
  • Milestones, including associated Issues
  • Projects, including associated issues
  • Pull requests, including comments
  • Releases

Exporter format

Both the CLI and GUI support two formats for data exports, JSON and CSV.

JSON

Newline delimited JSON can be used to process each line.

Screenshot of JSON data export using GitHub Artifact Exporter

CSV

CSV provides a comma-delimited export where each line represents an issue and a single comment.

Screenshot of comma-delimited CSV data export using GitHub Artifact Exporter

Using the GUI

When you open the GUI, you’re greeted with the screen below. You’ll need to fill in a personal access token, the owner of the repository, and the name of the repository itself.

The owner of the repository will either be your personal account name or your organization name. The name of the repository will be the URL slug that you see in the URL bar. The GitHub Artifact Exporter’s Owner and Repository would be “github” and “github-artifact-exporter” respectively.

Next, input a search string to filter the issues in your repository, select whether you want CSV or JSON output, and hit export! You’ll be prompted with a dialogue allowing you to choose the location to save the file.

Screenshot of GUI for GitHub Artifact Exporter, showing the fields described above.

Using the CLI

The CLI can be used to generate the same JSON and CSV data as the GUI, in addition to implementing a handful of other search types. See the usage portion of the README for full details.

For example, to get all the pull requests in your repository, this command could be used:
github-artifact-exporter.exe repo:pulls --owner github --repo github-artifact-exporter --token $GITHUB_TOKEN --format JSON.

Try it out!

We hope that this tool helps your team export your data in an easier fashion. To get started, check out the prerequisites then download the GitHub Artifact Exporter. We would love any suggestions or feedback in the repository.

Security keys are now supported for SSH Git operations

Post Syndicated from Kevin Jones original https://github.blog/2021-05-10-security-keys-supported-ssh-git-operations/

GitHub has been at the forefront of security key adoption for many years. We were an early adopter of Universal 2nd Factor (“U2F”) and were also one of the first sites to transition to Webauthn. We’re always on the lookout for new standards that both increase security and usability. Today we’re taking the next step by shipping support for security keys when using Git over SSH.

What are security keys and how do they work?

Security keys, such as the YubiKey, are portable and transferable between machines in a convenient form factor. Most security keys connect via USB, NFC, or Bluetooth. When used in a web browser with two-factor authentication enabled, security keys provide a strong, convenient, and phishing-proof alternative to one-time passwords provided by applications or SMS. Much of the data on the key is protected from external access and modification, ensuring the secrets cannot be taken from the security key. Security keys should be protected as a credential, so keep track of them and you can be confident that you have usable, strong authentication. As long as you retain access to the security key, you can be confident that it can’t be used by anyone else for any other purpose.

Screenshot of GitHub identity verification screen with security key option

Use your existing security key for Git operations

When used for SSH operations, security keys move the sensitive part of your SSH key from your computer to a secure external security key. SSH keys that are bound to security keys protect you from accidental private key exposure and malware. You perform a gesture, such as a tap on the security key, to indicate when you intend to use the security key to authenticate. This action provides the notion of “user presence.”

Security keys are not limited to a single application, so the same individual security key is available for both web and SSH authentication. You don’t need to acquire a separate security key for each use case. And unlike web authentication, two-factor authentication is not a requirement when using security keys to authenticate to Git. As always, we recommend using a strong password, enrolling in two-factor authentication, and setting up account recovery mechanisms. Conveniently, security keys themselves happen to be a great recovery option for securely retaining access to your two-factor-enabled account if you lose access to your phone and backup codes.

The same SSH keys you already know and love, just a little different

Generating and using security keys for SSH is quite similar to how you generated and used SSH keys in the past. You can password-protect your key and require a security key! According to our data, you likely either use an RSA or ed25519 key. Now you can use two additional key types: ecdsa-sk and ed25519-sk, where the “sk” suffix is short for “security key.”

$ ssh-keygen -t ecdsa-sk -C <email address> 
Generating public/private ecdsa-sk key pair. 
You may need to touch your authenticator to authorize key generation.

Once generated, you add these new keys to your account just like any other SSH key. You’ll still create a public and private key pair, but secret bits are generated and stored in the security key, with the public part stored on your machine like any other SSH public key. There is a private key file stored on your machine, but your private SSH key is a reference to the security key device itself. If your private key file on your computer is stolen, it would be useless without the security key. When using SSH with a security key, none of the sensitive information ever leaves the physical security key device. If you’re the only person with physical access to your security key, it’s safe to leave plugged in at all times.

Screenshot of SSH keys associated with GitHub user account

Safer Git access and key management

With security keys, you can achieve a higher level of account security and protection from account takeover. You can take things a step further by removing your previously registered SSH keys, using only SSH keys backed by security keys. Using only SSH keys backed by security keys gives you strong assurance that you are the only person pulling your Git data via SSH as long as you keep the security key safe like any other private key.

Security keys provide meaningful safety assurances even if you only access Git on trusted, consistent systems. At the other end of the spectrum, you might find yourself working in numerous unfamiliar environments where you need to perform Git operations. Security keys dramatically reduce the impact of inadvertent exposure without the need to manage each SSH key on your account carefully. You can confidently generate and leave SSH keys on any system for any length of time and not have to worry about removing access later. We’ll remove unused keys from your account, making key management even easier. Remember to periodically use keys you want to retain over time so we don’t delete them for you.

Protecting against unintended operations

Every remote Git operation will require an additional key tap to ensure that malware cannot initiate requests without your approval. You can still perform local operations, such as checkout, branch, and merge, without interruption. When you’re happy with your code or ready to receive updates, remote operations like push, fetch, and pull will require that you tap your security key before continuing. As always, SSH keys must be present and optionally unlocked with a password for all Git operations. Unlike password-protected SSH keys, clients do not cache security key taps for multiple operations.

$ git clone [email protected]:github/secure_headers.git
Cloning into 'secure_headers'...
Confirm user presence for key ECDSA-SK SHA256:xzg6NAJDyJB1ptnbRNy8UxD6mdm7J/YBdu2p5+fCUa0
User presence confirmed

Already familiar with using SSH keys backed by security keys? In that case, you might wonder why we require verification (via the security key “tap”) when you can configure your security key to allow operations to proceed as long as the security key is present. While we understand the appeal of removing the need for the taps, we determined our current approach to require presence and intention is the best balance between usability and security.

Towards a future with fewer passwords

Today, you can use a password, a personal access token (PAT), or an SSH key to access Git at GitHub. Later this year, as we continue to iterate toward more secure authentication patterns, passwords will no longer be supported for Git operations. We recognize that passwords are convenient, but they are a consistent source of account security challenges. We believe passwords represent the present and past, but not the future. We would rather invest in alternatives, like our Personal Access Tokens, by adding features such as fine-grained access and more control over expiration. It’s a long journey, but every effort to reduce the use of passwords has improved the security of the entire GitHub ecosystem.

By removing password support for Git, as we already successfully did for our API, we will raise the baseline security hygiene for every user and organization, and for the resulting software supply chain. By adding SSH security key support, we have provided a new, more secure, and easy-to-use way to strongly authenticate with Git while preventing unintended and potentially malicious access. If you are ready to make the switch, log in to your account and follow the instructions in our documentation to create a new key and add it to your account.

We wanted to extend our gratitude to Yubico, with whom we’ve partnered several times over the years, for being an early collaborator with us on this feature and providing us valuable feedback to ensure we continue to improve developer security.

Our Journey to Continuous Delivery at Grab (Part 2)

Post Syndicated from Grab Tech original https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab-part2

In the first part of this blog post, you’ve read about the improvements made to our build and staging deployment process, and how plenty of manual tasks routinely taken by engineers have been automated with Conveyor: an in-house continuous delivery solution.

This new post begins with the introduction of the hermeticity principle for our deployments, and how it improves the confidence with promoting changes to production. Changes sent to production via Conveyor’s deployment pipelines are then described in detail.

Overview of Grab delivery process
Overview of Grab delivery process

Finally, looking back at the engineering efficiency improvements around velocity and reliability over the last 2 years, we answer the big question – was the investment on a custom continuous delivery solution like Conveyor the right decision for Grab?

Improving Confidence in our Production Deployments with Hermeticity

The term deployment hermeticity is borrowed from build systems. A build system is called hermetic if builds always produce the same artefacts regardless of changes in the environment they run on. Similarly, we call our deployments hermetic if they always result in the same deployed artefacts regardless of the environment’s change or the number of times they are executed.

The behaviour of a service is rarely controlled by a single variable. The application that makes up your service is an important driver of its behaviour, but its configuration is an important contributor, for example. The behaviour for traditional microservices at Grab is dictated mainly by 3 versioned artefacts: application code, static and dynamic configuration.

Conveyor has been integrated with the systems that operate changes in each of these parameters. By tracking all 3 parameters at every deployment, Conveyor can reproducibly deploy microservices with similar behaviour: its deployments are hermetic.

Building upon this property, Conveyor can ensure that all deployments made to production have been tested before with the same combination of parameters. This is valuable to us:

  • An outcome of staging deployments for a specific set of parameters is a good predictor of outcomes in production deployments for the same set of parameters and thus it makes testing in staging more relevant.
  • Rollbacks are hermetic; we never rollback to a combination of parameters that has not been used previously.

In the past, incidents had resulted from an application rollback not compatible with the current dynamic configuration version; this was aggravating since rollbacks are expected to be a safe recovery mechanism. The introduction of hermetic deployments has largely eliminated this category of problems.

Hermeticity is maintained by registering the deployment parameters as artefacts after each successfully completed pipeline. Users must then select one of the registered deployment metadata to promote to production.

At this point, you might be wondering: why not use a single pipeline that includes both staging and production deployments? This was indeed how it started, with a single pipeline spanning multiple environments. However, engineers soon complained about it.

The most obvious reason for the complaint was that less than 20% of changes deployed in staging will make their way to production. This meant that engineers would have toil associated with each completed staging deployment since the pipeline must be manually cancelled rather than continued to production.

The other reason is that this multi-environment pipeline approach reduced flexibility when promoting changes to production. There are different ways to apply changes to a cluster. For example, lengthy pipelines that refresh instances can be used to deploy any combination of changes, while there are quicker pipelines restricted to dynamic configuration changes (such as feature flags rollouts). Regardless of the order in which the changes are made and how they are applied, Conveyor tracks the change.

Eventually, engineers promote a deployment artefact to production. However they do not need to apply changes in the same sequence with which were applied to staging. Furthermore, to prevent erroneous actions, Conveyor presents only changes that can be applied with the requested pipeline (and sometimes, no changes are available). Not being forced into a specific method of deploying changes is one of added benefits of hermetic deployments.

Returning to Our Journey Towards Engineering Efficiency

If you can recall, the first part of this blog post series ended with a description of staging deployment. Our deployment to production starts with a verification that we uphold our hermeticity principle, as explained above.

Our production deployment pipelines can run for several hours for large clusters with rolling releases (few run for days), so we start by acquiring locks to ensure there are no concurrent deployments for any given cluster.

Before making any changes to the environment, we automatically generate release notes, giving engineers a chance to abort if the wrong set of changes are sent to production.

The pipeline next waits for a deployment slot. Early on, engineers adopted deployment windows that coincide with office hours, such that if anything goes wrong, there is always someone on hand to help. Prior to the introduction of Conveyor, however, engineers would manually ask a Slack bot for approval. This interaction is now automated, and the only remaining action left is for the engineer to approve that the deployment can proceed via a single click, in line with our hands-off deployment principle.

When the canary is in production, Conveyor automates monitoring it. This process is similar to the one already discussed in the first part of this blog post: Engineers can configure a set of alerts that Conveyor will keep track of. As soon as any one of the alerts is triggered, Conveyor automatically rolls back the service.

If no alert is raised for the duration of the monitoring period, Conveyor waits again for a deployment slot. It then publishes the release notes for that deployment and completes the deployments for the cluster. After the lock is released and the deployment registered, the pipeline finally comes to its successful completion.

Benefits of Our Journey Towards Engineering Efficiency

All these improvements made over the last 2 years have reduced the effort spent by engineers on deployment while also reducing the failure rate of our deployments.

If you are an engineer working on DevOps in your organisation, you know how hard it can be to measure the impact you made on your organisation. To estimate the time saved by our pipelines, we can model the activities that were previously done manually with a rudimentary weighted graph. In this graph, each edge carries a probability of the activity being performed (100% when unspecified), while each vertex carries the time taken for that activity.

Focusing on our regular staging deployments only, such a graph would look like this:

The overall amount of effort automated by the staging pipelines () is represented in the graph above. It can be converted into the equation below:

This equation shows that for each staging deployment, around 16 minutes of work have been saved. Similarly, for regular production deployments, we find that 67 minutes of work were saved for each deployment:

Moreover, efficiency was not the only benefit brought by the use of deployment pipelines for our traditional microservices. Surprisingly perhaps, the rate of failures related to production changes is progressively reducing while the amount of production changes that were made with Conveyor increased across the organisation (starting at 1.5% of failures per deployments, and finishing at 0.3% on average over the last 3 months for the period of data collected):

Keep Calm and Automate

Since the first draft for this post was written, we’ve made many more improvements to our pipelines. We’ve begun automating Database Migrations; we’ve extended our set of hermetic variables to Amazon Machine Image (AMI) updates; and we’re working towards supporting container deployments.

Through automation, all of Conveyor’s deployment pipelines have contributed to save more than 5,000 man-days of efforts in 2020 alone, across all supported teams. That’s around 20 man-years worth of effort, which is around 3 times the capacity of the team working on the project! Investments in our automation pipelines have more than paid for themselves, and the gains go up every year as more workflows are automated and more teams are onboarded.

If Conveyor has saved efforts for engineering teams, has it then helped to improve velocity? I had opened the first part of this blog post with figures on the deployment funnel for microservice teams at Grab, towards the end of 2018. So where do the figures stand today for these teams?

In the span of 2 years, the average number of build and staging deployment performed each day has not varied much. However, in the last 3 months of 2020, engineers have sent twice more changes to production than they did for the same period in 2018.

Perhaps the biggest recognition received by the team working on the project, was from Grab’s engineers themselves. In the 2020 internal NPS survey for engineering experience at Grab, Conveyor received the highest score of any tools (built in-house or not).


All these improvements in efficiency for our engineers would never have been possible without the hard work of all team members involved in the project, past and present: Tanun Chalermsinsuwan, Aufar Gilbran, Deepak Ramakrishnaiah, Repon Kumar Roy (Kowshik), Su Han, Voislav Dimitrijevikj, Stanley Goh, Htet Aung Shine, Evan Sebastian, Qijia Wang, Oscar Ng, Jacob Sunny, Subhodip Mandal and many others who have contributed and collaborated with them.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: April 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-05-05-github-availability-report-april-2021/

In April, we experienced two incidents resulting in significant impact and degraded state of availability for API requests and the GitHub Packages service, specifically the GitHub Packages Container registry service.

April 1 21:30 UTC (lasting one hour and 34 minutes)

This incident was caused by failures in our DNS resolution, resulting in a degraded state of availability for the GitHub Packages Container registry service. During this incident, some of our internal services that support the Container registry experienced intermittent failures when trying to connect to dependent services. The inability to resolve requests to these services resulted in users being unable to push new container images to the Container registry as well as pull existing images. The Container registry is currently in a public beta, and only beta users were impacted during this incident. The broader GitHub Packages service remained unaffected.

As a next step, we are looking at increasing the cache times of our DNS resolutions to decrease the impact of intermittent DNS resolution failures in the future.

April 22 17:16 UTC (lasting 53 minutes)

Our service monitors detected an elevated error rate when using API requests, which resulted in a degraded state of availability for repository creation. Upon further investigation of this incident, we identified the issue was caused by a bug from a recent data migration. In a data migration to isolate our secret scanning tables into their own cluster, a bug was discovered that broke the ability of the application to successfully write to the secret scanning database. The incident revealed a hitherto unknown dependency that repository creation had upon secret scanning, which makes a call for every repository created. Due to this dependency, repository creation was blocked until we were able to roll back the data migration.

As next steps, we are actively working with our vendor to update the data migration tool and have amended our migration process to include revised steps for remediation, in case similar incidents occur. Furthermore, our application code has been updated to remove the dependency on secret scanning for the creation of repositories.

In summary

From scaling the GitHub API to improving large monorepo performance, we will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

How we use Web Components at GitHub

Post Syndicated from Kristján Oddsson original https://github.blog/2021-05-04-how-we-use-web-components-at-github/

At GitHub, we pride ourselves on delivering a first-class developer experience. A considerable part of our work is on our front end, which we strive to keep as lightweight, fast, and accessible as possible. For a product as large as GitHub, this can be quite the task. Like many front-end codebases, we leverage components, independent, isolated, and reusable pieces of code that allow application teams to deliver high fidelity UI quickly and efficiently while still keeping to our high standards of quality.

We’re using Web Components in a big way at GitHub. We have over a dozen open-source Web Components and with dozens more that are closed source.

How we got here

When GitHub launched over a decade ago, we had a modest front-end codebase that mostly used jQuery. Ten years and nearly 85,000 lines of code later, we had a large front-end codebase that was starting to show growing pains. We ultimately transitioned away from jQuery (for reasons which we detailed in a blog post at the time) and started using new technologies which could better solve our problems.

We began to dabble with a new technology called Web Components, a set of native browser technologies that allow the development of customized HTML elements, progressively enhanced with JavaScript.

We chose to use Web Components because our codebase was already structured into component-like behaviors. Still, as the GitHub monolith grew in size, we saw the need to implement better encapsulations before the front-end became unmanageable – and Web Components fit the bill. Web Components offered better portability and encapsulation than our existing JavaScript behaviors. We were happy to experiment with Web Components alongside our existing front-end infrastructure since it doesn’t incur any upfront cost or “buy-in” to a specific framework.

Our first two custom elements shipped in 2014: <relative-time> and <local-time>, which show times and dates in friendly formats, and <include-fragment>, which allows us to lazy load HTML fragments. Slowly we realized just how powerful these elements could be and began replacing design patterns within the codebase wholesale, such as replacing our “facebox” modal dialog pattern with <details-dialog>. Our components now range from very generic, multi-purpose behaviors like <remote-input> to specific single-purpose components such as the <markdown-toolbar> element and its siblings.

For the power Web Components affords, there are still pain points and pitfalls. With such a large codebase owned by hundreds of engineers across dozens of teams, we need to provide as much support and tooling as possible, encoding best practices without manual code review becoming a bottleneck.

Improving how we author components

To make engineers effective at writing high-quality Web Components—and encourage best practices—we’ve been working on a few tools to make authoring Web Components much easier.

ViewComponent

We’ve been transitioning our Rails code to using ViewComponent, a framework for building reusable components within Rails. ViewComponent goes hand-in-hand with Web Components since a ViewComponent can have a one-to-one relationship with a Web Component, allowing our developers to work on a single abstraction for both front-end and backend.

Catalyst

Catalyst, our open source library that makes it easier to write web components, has been a driving force that ties together some of our best practices. Catalyst leverages TypeScript to add decorators, which save on a lot of the boilerplate necessary to write Web Components.

Catalyst took inspiration from the excellent Stimulus library, and Google’s LitElement. It is designed to answer the specific set of needs our developers require. Our internal developer experience surveys have shown success in providing a substantial improvement in authoring code over legacy patterns.

You can read more about Catalyst and its conventions, patterns, and anti-patterns in our guide.

Tooling

We provide a set of open-source linter configurations for developers. For general code practices we have eslint-plugin-github. We also have eslint-plugin-custom-elements, which provides further checks for authoring Web Components. Extracting these to open source repositories allows us to remove code from the monolith but remain consistent.

We also have internal tests to verify that developers follow best practices and ensure that they don’t continue to use deprecated patterns and behaviors. One of our tests makes sure that a deprecated “facebox” pattern isn’t introduced to the codebase and suggests using a <details-dialog> element as an alternative.

class FaceboxDeprecationTest < Test::Fast::TestCase
  EXPECTED_NUMBER_OF_FACEBOXES = 44

  # Find facebox triggers set with rel=facebox, in either HTML attributes or
  # as part of hash assignment (for rails helpers)
  REGEX_FOR_FACEBOX_BINDING = %r|rel\s*[=:]>?\s*["']?facebox|
  REGEX_FOR_DATA_FACEBOX = %r|data-facebox\s*=>?\s*|

  def test_limit_facebox
    actual_rel_facebox = grep(REGEX_FOR_FACEBOX_BINDING, options: %w[-In], paths: %w[app/**/*.erb])
    actual_data_facebox = grep(REGEX_FOR_DATA_FACEBOX, options: %w[-In], paths: %w[app/**/*.erb])
    count = actual_rel_facebox.count("\n") + actual_data_facebox.count("\n")

    assert_operator count, :<=, EXPECTED_NUMBER_OF_FACEBOXES, <<-EOL
It looks like you added a facebox. Please use <details-dialog> instead.

If you must increment EXPECTED_NUMBER_OF_FACEBOXES in this test, please
/cc @github/ui-frameworks-reviewers in your pull request, as we may be able to help! Thanks.
EOL

    assert_equal EXPECTED_NUMBER_OF_FACEBOXES, count, <<-EOL
It looks like you removed a facebox. YOU ARE AWESOME! 💖 💖 💖 💖 💖
Please decrement EXPECTED_NUMBER_OF_FACEBOXES in this test and treat yourself to
something special.  You deserve it.
EOL
  end
end

Our new lifecycle of Web Components

The road from an application-specific front-end behavior to an open-source Web Component starts with a Catalyst component inside the monolith codebase. Components that are good candidates for extraction get generalized into a robust, strictly behavioral, dependency-free Web Component.

In the monolith, engineers might prototype ideas and ship them slowly using feature flags while continuously revising them. After the component has been tested in production for some time, we will look for opportunities to lift the component into its own repository. We also regularly assess the codebase to find reusable patterns, generic behaviors, or components that otherwise have a compelling reason to be lifted into their own repositories.

Start with Catalyst

We encourage developers to write Catalyst components while developing user interfaces within the dotcom monolith. The benefits of utilizing Catalyst from the start are that the library abstracts away some common pitfalls of writing Web Components and enforces best practices.

Registering a Web Component can incur some boilerplate, but we make it easier with naming conventions and a sprinkle of TypeScript decorators. Actions in Catalyst make event listening easier than managing global event listeners. Mutating targets for existing HTML is better than rendering HTML templates in the browser.

Extracting a component from the monolith

While contained within our application, components might have application-specific code added to them as they get used in different contexts. Application-specific code is OK when you have specific needs that you need to fulfill, but if a component is supposed to work in other contexts, it needs to be flexible and generalized. Most often, this comes in the form of hard-coded options that should be made configurable. Generalizing a component so that it is more portable is critical to enabling re-use by other teams.

Before extracting the Catalyst component, we remove the Catalyst-specific functionality and convert it to a plain Web Component. Why remove the library that the team claims makes writing Web Components easier? While Catalyst is beneficial for developers, we want our components to have zero dependencies. Requiring developers outside of the GitHub organization to understand Catalyst before contributing code back to the component is extra friction that we don’t want to incur.

Open source components mean specific requirements

While monolith components might be tightly coupled to specific application logic, have third-party dependencies, and lean on existing tests, we have a different set of standards for open source components. Our open source components should have close to 0 dependencies, be framework and library agnostic, lightweight, style free, decoupled from any other components, and should do only one thing and one thing well.

While we want our components to be dependency-free, we also want the same robustness guarantees from the monolith—and that includes types. We include TypeScript definitions with our components, also written using ES modules, to enable bundlers to consume it easily. We also ensure there is a full test suite and linter setup using our standardized configurations.

An excellent example of a component that has gone through the open-source lifecycle is the <typing-effect> element. A product engineering team within GitHub recently prototyped a terminal-inspired UI in which text appeared as if someone were typing it. In a previous project, we used the excellent and robust typed.js for a typing animation, and the team initially reached for that library again.

The parts of the UI that the team already built were lightweight, and our tooling identified that adding typed.js to the page would increase the bundle size fivefold. The UI Systems team asked the product engineering team to consider other options. We found that the functionality we needed for this application was limited enough that it was worth trying to build it ourselves.

The product engineering team wrote the first version of our own typing animation for the new UI. Using Catalyst, it took less than a day and fewer than 40 lines of code. Realizing that other teams could use this effect, we decided to put it through the lifecycle steps. We refactored the component to have zero dependencies, “ejected” it out of the Catalyst library, and open-sourced the component as <typing-effect>.

Results

Overall, we’re thrilled with the changes that we’ve made to the GitHub front-end since our last post. According to the internal developer surveys that we’ve conducted, our developers are pleased with Catalyst and ViewComponent!

Developers enjoy the encapsulation of ViewComponent, making it easier to test UI and increasing developer confidence. Developers feel Catalyst is a welcome change from “old-style” JavaScript without the massive leap to a different framework or paradigm.

What’s next?

We’re continuing to open source more generic open-source behavioral Web Components under the name “GitHub Elements.” We have a collection of these elements on https://github.com/github/github-elements, which syncs to our page on webcomponents.org.

We’re excited about Web Components’ future and continue to monitor proposed changes to the HTML spec. The two proposals we are most excited about these days are the Template Parts and Declarative Shadow DOM proposals. These proposals would make it even easier for engineers to ship Web Components and would solve some common pain points that we have with the current state of Web Components. We’ve implemented the minimum viable bits of Template Parts in a ponyfill, which is being used in production today, and we’re keen to see it gain traction from the broader community.

Special thanks

Thanks to Keith Cirkel for kick-starting this post, Ben Scofield and Joel Hawksley for reviewing it, and Nick Holden for working with us to extract the <typing-effect> element.

Scaling monorepo maintenance

Post Syndicated from Taylor Blau original https://github.blog/2021-04-29-scaling-monorepo-maintenance/

At GitHub, we serve some of the largest Git repositories on the planet. We also serve some of the fastest-growing repositories. Each day, the largest repositories we host become even larger.

About a year ago, we noticed that the job we use to repack Git repositories began hitting our self-imposed timeouts on larger repositories. Even when expanding these timeouts, failing maintenance on these repositories has generally been the cause of degraded performance that is hard to mitigate.

Today, these problems do not exist. GitHub can repack even the largest repositories we host in a fraction of the time it used to take. In this post, we’ll talk about what problems we were encountering, the solutions we built, how we deployed them safely, and describe some possible future directions.

All of our work here is being contributed to the open source Git project, and will be available in an upcoming release.

The problem

Why is GitHub’s maintenance job so expensive in the first place? It’s because we chose to have maintenance repack the entire contents of each repository into a single packfile. Doing so is expensive, but having just one packfile carries some benefits, too. With only one packfile, looking up objects doesn’t require opening and searching through multiple packs to find it. It also means that all objects can be compressed as a delta relative to all other objects (Git’s packfile format supports cross-pack deltas, but currently Git will never store them on disk). But, the most important reason is that reachability bitmaps, a performance-critical data structure, are only compatible with a single pack.

A new feature in Git, multi-pack indexes solves the former problem by making all object lookups go through a single index, but didn’t yet solve the latter. So, we set out to fill in the gaps by bringing bitmap support to multi-pack indexes in order to remove the single-pack limitation on reachability bitmaps.

But in order to build multi-pack bitmaps, we had to solve a number of other interesting problems along the way. First, we had to decide how to arrange the objects in a multi-pack index to achieve good bitmap compression. We also had to figure out how to quickly invert that ordering to translate between bit positions back to the objects they refer to. Some of these steps also yielded notable performance improvements on single-pack repositories, too. Finally, we had to figure out a new repacking strategy that scaled with the size of recent pushes, rather than with the size of the entire repository.

But before we get into all of that, let’s start from the very beginning.

Objects, packs, and fetching

You might be aware that Git stores the contents of your repositories as a set of objects. Each object represents an individual piece of your repository: a single file, tree, commit, or tag. These objects may be stored individually as “loose” objects, or together in a packfile.

If you’ve ever heard that a Git repository is “nothing more than a directed acyclic graph,” then you know that these objects can refer to one another. A tree refers to a set of blobs (which correspond to files) and other trees (which correspond to sub-directories). A commit refers to a single tree (the repository root), and zero or more other commits which are its parents.

These links help Git figure out which objects it needs to transfer to fulfill a fetch or clone request. When you fetch a repository from GitHub, Git performs a negotiation to figure out which objects to send. The server advertises the objects at the tip of its references (basically the tips of branches and tags). The client does the same, along with the set of references that it wants from the server. Then, the server walks the links between the requested objects and the objects that the client already has in order to figure out what to send.

Above is an object graph. The client advertises its ref tips (indicated by the darker blue commits). The server’s advertised references are colored dark red. The blue shaded area represents the result of walking along the edges to obtain the reachability closure of the objects the client already has. The red shaded area represents the same closure from the server’s perspective, excluding objects that the client already has. Objects in this region are the ones which need to be sent to the client.

During a fetch, Git needs to send not just the commits in between what the client has and wants, but also all of the objects that are reachable from those commits. Because Git doesn’t store the list of every reachable object, this may be expensive, especially in the case of a clone. When cloning, the client doesn’t have any objects, so it asks the server for all of the objects reachable from any reference.

In an earlier blog post, we talked about how we use reachability bitmaps to accelerate this negotiation. In case you haven’t read that post, below is a quick primer.

Reachability bitmaps

How does Git handle this special case where the client has nothing and wants everything? Ultimately, the server needs to determine the reachability closure of all of the reference tips. In other words, it needs a list of all of the objects at the reference tips, and all of the ancestors of those objects in order to assemble a complete copy of the repository at the other end.

Unfortunately for us, the larger the repository is, the longer it takes Git to compute the list of objects to send. This isn’t feasible even for medium-sized repositories. Git could handle our case specially by just sending every object it has, but that might result in many unwanted objects being sent, too. (For example, GitHub stores the outcome of “test merges” in special refs which aren’t ordinarily sent during fetches, but whose objects are stored in the same object directory nonetheless.)

Instead, Git stores a set of reachability bitmaps corresponding to some of the commits in a packfile. The idea is rather simple: arrange the objects in a pack in some order (the particular order used is something we’ll discuss shortly in detail). Then, the ith bit in the bitmap corresponding to commit C is 1 if C can reach the ith object, and 0 otherwise.

Having a one-to-one correspondence between objects and bit positions has a couple of appealing properties: taking the union of reachable objects between commits is as simple as ORing their bitmaps together, and taking the difference is as simple as combining AND and NOT. So, when a bitmap exists, Git can dramatically speed up the object negotiation phase:

  • First, OR all of the bitmaps corresponding to the reference tips that the client wants. Call this new bitmap W.
  • Then, do the same with the bitmaps corresponding to the reference tips that the client advertised as already having. Call this bitmap H.
  • Finally, compute W AND NOT H to produce the set of objects the client needs (in other words, everything it wants but does not already have). Then, send those objects.

Because all of the reachability information is encoded directly into the bitmaps, Git saves time by avoiding the need to open up and parse objects, allowing it to produce the same result in a fraction of the time.

This idea has been used since at least the 1970s to speed up queries in relational databases. In Git, reachability bitmaps can provide dramatic speed-ups when walking objects that reside in the same pack: walking all of the objects in the Linux kernel repository took more than 33 seconds without bitmaps, but only 1.57 seconds to perform the same traversal with bitmaps.

The object order

How does Git turn a set of objects into a sequence of bit positions? One way you might imagine to do this is assign bit positions in lexicographic order. The first bit corresponds to 000023961a, the second to 0000d6543f, the third to 000182eacf, and so on in lexical order.

Why not do this? Recall that an object’s ID is determined by a SHA-1 of its contents, which means that reachability of any object in this order is only reachable to nearby objects by chance. And reachability of nearby objects matters: Git compresses the bitmaps using EWAH compression, which relies on having long runs of identical bits. If the object order makes reachability look essentially random from bit to bit, Git won’t be able to efficiently compress the bitmaps.

Pack order—that is, the physical arrangement of objects in a packfile—produces a sequence of bit positions that tends to place reachable objects next to each other. And this produces exactly the kind of long runs of identical bits that make EWAH compression perform well.

The problem

But, all of this creates a problem for us: if the order of bit positions is dictated by a pack, then bitmaps are coupled in implementation and in concept to the existence of a single packfile. So, any objects that accumulate outside of the bitmapped pack won’t benefit from the same speed-ups.

To address this, we periodically repack the repository’s entire contents into a single pack, and then generate a new reachability bitmap. This makes reachability queries in more recent parts of the repository’s history faster.

But generating that new pack takes time; in fact, it’s quadratic. Bigger repositories take longer to repack, but also grow at a faster rate, which means they run maintenance more often. This compounding effect sometimes makes it such that some repositories are constantly undergoing maintenance: by the time one maintenance job has finished, another is already sitting in the queue, waiting to be run.

Since the bottleneck for maintenance is the compression of an entire repository’s contents into a single packfile, what would it take to be able to repack the contents into multiple packfiles instead?

Multi-pack indexes

To order a set of objects spanning multiple packs, we looked to a recent Git feature: multi-pack indexes.

When Git wants to locate an object by name in a single pack, it uses that pack’s index (.idx) file, which provides a binary-searchable list of object locations. Multi-pack indexes work similarly to .idx files, but the location they indicate is a pair containing both the pack containing the object, as well as where in that pack the object can be found.

The figure above gives a flavor for what kind of data is organized in a multi-pack index. Here, there are three packs, each with a handful of objects. The multi-pack index stores the location of each unique object among the set of packs. When multiple copies of an object exist (like the green or red objects in packs xyz and abc), ties are broken in favor of the copy in the pack with the earliest modification time.

The order of objects in the multi-pack index differs from the order in each individual pack. Since a pack is free to store objects in any order it wants, the multi-pack index stores objects in lexicographic order so that an object can be found quickly by name using a binary search.

Ordering objects

Given a multi-pack index, how should we order the objects in the packs it contains? We discussed earlier that ordering objects lexicographically results in poor compression. We also noted that objects in packs are ordered topologically, which for our purposes we can consider to imply that individual objects tend to appear near other objects which they reach in this order.

So, any ordering of the objects in a multi-pack index should capture as much of those two properties as possible. With that in mind, we decided on the following order:

  • Objects are first grouped according to which packfile they appear in, and packs are ordered according to the multi-pack index.
  • Objects within the same pack should be ordered according to their locations within that pack.

This effectively concatenates the pack-order of multiple packs together, according to some other ordering defined on the packs themselves.

To see what this looks like, let’s overlay a portion of a bitmap that covers the objects in our earlier example:

The first three bits correspond to the red, yellow, and green objects, respectively. Each one of those objects comes from the xyz pack, which means that the xyz pack has the oldest modification among the three. Scanning the bitmap from left to right, these objects appear in pack order; that is, the byte offset of the red object is less than the byte offsets of the yellow and green objects that follow it.

The purple and blue objects come next, since they are in the pack that follows. But note that the copies of the red and green objects in the abc pack don’t correspond to any bits highlighted. Why? Because the multi-pack index selected the copy of those objects in the earlier pack.

Finally, the orange and pink objects appear, also in pack order. And, as we expect, the copy of the purple object that appears in pack 123 isn’t included in the bitmap, because the copy in pack abc was.

This ordering gives us great locality, but we still need to address how to map bit positions back to the objects they represent. For example, let’s look at the fifth bit position, which we know refers to the blue object: how could Git discover this same fact?

You could reasonably imagine that knowing how many total objects are in each pack would be good enough to figure out which objects each bit corresponds to. But that isn’t enough information; we don’t know how many unique objects selected by the multi-pack index appear in each pack, and we also don’t know which non-unique objects are missing. So it’s not good enough to count past the three bits corresponding to objects in pack xyz and then count two more bits up to the fifth bit, because that would point at the copy of the green object in pack abc.

Reverse indexes

To solve this problem, we introduced reverse indexes. In the same way that the pack index provides a mapping from object name to object location, the reverse index maps an object’s location back to its name.

The idea is simple: in addition to the pack’s contents (stored in a .pack file) and index (stored in an .idx file), we’ll write an array of numbers (which comprise the reverse index, and are stored in a new .rev file). These numbers provide a mapping between an object’s position in pack order and its position in lexicographic order.

To better understand this, let’s take a look at an example on a single pack.

The .idx file (shown in the lower left) lists objects in lexicographic order: the yellow object comes before the red one, and the red one comes before the green one. But their pack order is different: there, the red object comes before the yellow one instead of the other way around. The reverse index helps us unwind the two: it tells us that the red object is in the position 3 in lexicographic order, and the yellow object is in position 1.

The reverse index allows us to map quickly from offsets into the packfile to object positions they correspond to. That allows us to quickly determine the size of a packed object. For example, say that you want to figure out how large the red object is. Because Git doesn’t store that information directly, you have one of two options: either scan linearly through the packed data (inflating its contents until you locate the stream end), or locate the adjacent object (in this case, the yellow one) by name, and measure the difference of their offsets.

Without the reverse index, there is no way to figure out where the adjacent object starts. But with a reverse index, locating the red object is as simple as reading the adjacent entry in the reverse index to discover the offset. Here, that value is 1, which points at an entry in the .idx file, which in turn points at the location in the pack.

Before the on-disk reverse index, Git computed this table on the fly and stored the results in an array in memory. This had a couple of major downsides. To build a reverse index on the fly, Git has to allocate a pair of pack offset and index position. This requires memory and runtime, which both scale relative to the size of the pack. Even though this processes using radix sort, sorting the reverse index entries can be noticeably slow when done once per process.

Some initial testing of these reverse indexes showed that they could enable rather dramatic speed-ups when serving fetches on real-world repositories. To verify our early results, we gradually rolled out reverse indexes on a handful of repositories.

Below is the 50th percentile of CPU time for fetches to Homebrew/homebrew-core on our testing host before and after the change:

In this case, we reduced the amount of time it took to serve any fetch of that repository by around 80%. Here’s another plot from the same time which shows the resident set size of the program used to serve fetches by replicas of that same repository:

Encouraged by our first results in production, we proceeded to cautiously roll out on-disk reverse indexes to all other replicas. Once we completed the roll-out, we let a couple of days pass in order for replicas to have enough time to generate reverse indexes. Then, we sampled the overall CPU time it took to serve fetches across all repositories and observed the drop we were hoping for:

In the graph above, you can see three 24-hour cycles. The first day was before we rolled out .rev files, and the second two peaks are after. The per-day peaks dropped from around 10.8 seconds to 7 seconds, for a collective savings of around 35% on all repositories.

Multi-pack bitmaps

Now that we have a format for .rev files, we can reuse it as the missing piece we need to build multi-pack bitmaps. Instead of writing positions relative to a pack .idx file, a multi-pack reverse index writes positions which are relative to a multi-pack-index file.

This provides exactly the information we need to rediscover the mapping between bit positions and the objects they refer to in the multi-pack index. To see why, let’s take a look at another example:

To figure out that the fifth bit corresponds to the blue object in the above diagram, we read the fifth entry in the multi-pack reverse index and get back that the fifth bit maps to the eleventh object in the multi-pack index. And, sure enough, the 11th object points back to the blue object that we were looking for in pack abc.

Putting it all together, this gives us a bitmap which can refer to objects in multiple packs. Based on its filename, each bitmap knows whether or not it belongs to a pack (and if so, which one) or to a multi-pack index. And based on that information, it can translate its object lookups to be relative to a packfile, or to the multi-pack index via its reverse index.

Since we chose the ordering carefully, these multi-pack bitmaps compress exactly as well as their single-pack counterparts. And they decouple bitmaps from individual packs. So, a repository can still have at most one bitmap, but that bitmap can now correspond to multiple packs.

Geometric repacking

Now that we can include multiple packs in a single bitmap, what’s the best way to repack a repository during maintenance?

With single-pack bitmaps, the only option was to pack everything together into one enormous pack. But now that this restriction no longer exists, we have to figure out the best way to repack the objects. When deciding on a new repacking strategy, we wanted something that struck a balance between two properties:

  • On average, there is a relatively small number of packs in the repository.
  • On average, we create a pack that collects objects that were pushed since the last maintenance run.

We decided on an invariant to ensure that the packs in a repository form a geometric progression by object size. In particular, if you sort the packs by number of objects, with the first pack having the most and the final pack having the least, then each pack will contain at least twice as many objects as the next one.

We taught git repack a new --geometric= mode, which creates exactly this geometric progression. Soon (at the time of writing, these patches are still being submitted and reviewed), you’ll be able to try this yourself on your own repository by running:

$ packsizes() {
    find .git/objects/pack -type f -name '*.pack' |
    while read pack; do
      printf "%7d %s\n" \
        "$(git show-index < ${pack%.pack}.idx | wc -l)" "$pack"
    done | sort -rn
  }
$ packsizes # before
$ git repack --write-midx --write-bitmap-index -d --geometric=2
$ packsizes # after

How does this command work? We select a set of packs and combine their objects into one new pack that replaces the set. But picking this set optimally is NP-hard, so we have to approximate it.

To see how we perform that approximation, let's walk through an example. The first step is to figure out how many packs already form a geometric progression. To do this, imagine ordering packs by how many objects they contain. Then, consider each adjacent pair of packs from largest to smallest. At each step, ask: "is the larger pack at least twice the size of the smaller one?". If so, then every pack from that pair onwards already forms a geometric progression.

In this example, the second and third packs (each containing one object) violate our progression. At that point, we know that the second pack, and any packs below it must be repacked together in order to restore the progression.

But we can't just repack the first two packs together, since the combined pack would still be too big (it would contain two objects, and the third pack only contains one). So we grow the set of packs to combine until the invariant is restored:

Here, we had to combine the first four packs together in order to restore a geometric progression. Those packs together contain 7 objects, which is less than half of the next-largest pack (which contains 32 objects). And we can't combine any more packs, since doing so would violate the remainder of the progression.

At this point, we can write a new pack containing just the objects in the set of packs we combined in the previous step. After discarding the now-redundant packs, the remaining packs again form a geometric progression:

This means that the number of packs grow logarithmically over time, so a repository will never have too many packs at any one point in time. It also has the appealing property that older objects tend to get pushed into the larger packs, meaning they get repacked less frequently as time passes. An important corollary of this is that each repack tends to focus on the most recently pushed objects. In other words, this strategy tends to make repacking take time proportional to the number of new objects, not the number of overall objects.

In order to keep the repository performing well over many geometric repacks, we intersperse an all-into-one repack once for every eight geometric repacks.

Importantly, this means that even though the classically-slow repacks are still slow, we aren't forced to run them every time we want to repack a repository.

Deploying to production

Now that we have a way to write a bitmap that covers the objects in multiple packs, a way to quickly map between bit positions and the objects they refer to in a multi-pack index, and a repacking strategy which only requires modifying recent additions to the repository, we are ready to put everything together.

Before rolling out a change as large as this one, we performed extensive local testing to ensure that writing bitmaps worked correctly, and that our new repacking strategy wasn't silently corrupting repositories. But our local testing can only go so far: there are endless corner cases when serving fetches and clones, so the real test occurs only after putting real traffic in front of these new paths. Our goal was to design a deployment strategy that simultaneously exercised enough of those cases, while also ensuring that any potential corruption could never occur on a majority of repository replicas.

Our first test cases were internal repositories. We broke our original tests into two phases. In the first phase, we wrote "multi-pack bitmaps" which contained only a single pack. This allowed us to exercise the most basic case of multi-pack bitmaps (having only one pack) without running our new repack code. Once we had built confidence in that approach, we expanded our test to alternate between geometric and full repacks.

After a couple of weeks without any issues, we were confident enough in our change to start testing it on other repositories external to GitHub. We selected first an individual host, and then a whole rack of hosts on which every repack would alternate between geometric and full. At this stage, results were encouraging: the average time to repack had dropped significantly, as had the overall amount of time spent repacking.

By this stage, our roll-out proceeded for several weeks in only one of three data centers. Because we never place a majority of repository replicas in any single datacenter, this configuration made it impossible for our changes to corrupt a majority set of replicas in an unrecoverable fashion, while still putting our changes in the request path for a large amount of traffic.

Finally, after adopting this configuration for a week, we proceeded to enroll percentages of replicas hosted in other datacenters to also use multi-pack bitmaps until all repositories were using multi-pack reachability bitmaps.

After rolling out our new repacking strategy with multi-pack bitmaps, we saved on average 5.67 CPU days every hour compared to the old strategy.

Likewise, the average time spent repacking any single repository also dropped considerably. Below, the plot is broken out per-site, and you can see when we began testing in a single site, as well as when we expanded our deployment to all sites.

There, the average dropped from around 1 minute to repack a repository to just 15 seconds.

Future directions

There are two major open areas we're considering for the future that will make it possible for further performance improvements:

One open area is the bitmap computation itself. Git's bitmap generation code can reuse existing bitmaps by permuting their bits into a new order, but this operation can still take time proportional to the size of the repository. One way to solve this would be to write bitmaps incrementally, only walking the new objects introduced since the last time a bitmap was written. This problem is tricky because it requires not only that the bitmap file be able to be written incrementally, but also that the object ordering we select is stable: that is, that introducing new bitmaps won't render the existing ones meaningless.

Another open area is the pack structure. Creating packs that form a geometric sequence is a promising step forward that allows us to trade off between full and partial repacks. But some repositories are so large that repacking the whole repository isn't feasible, much less desirable. Designing a strategy which freezes the packs containing the oldest objects in a repository's history will allow us to grow to support even larger repositories in the future.

Thank you

This project would not have been possible without help from the upstream Git community, as well as many engineering teams within GitHub. Each of these changes required extensive review, both to the Git project, as well as to internal GitHub services. Special thanks to Jeff King, Derrick Stolee, Jonathan Tan, Junio Hamano, and others for making this possible.

GitHub Desktop supports hiding whitespace, expanding diffs, and creating repository aliases

Post Syndicated from Billy Griffin original https://github.blog/2021-04-28-github-desktop-hiding-whitespace-expanding-diffs-repo-aliases/

GitHub Desktop 2.8 now includes several new features to make it easier to work with diffs and easier for people who have multiple copies of the same repository.

Expand diffs to get more context around changes

We hear a lot about how people love the way GitHub Desktop displays diffs beautifully, but you’re only able to see a few lines around the changes that you or someone else made. Now you can click to expand the diffs above or below your changes to get a more complete picture of the rest of the file around the changes made. You can also use a context menu on the diff to expand the whole file.

Hide whitespace in diffs

Similar to being able to see more context around your changes, sometimes there are a lot of whitespace changes in a file that don’t allow you to get a clear picture of the substantive changes that happened. Now, in both changes and history, you can optionally hide whitespace changes to allow you to focus just on the more meaningful changes to your code. This feature was built almost entirely by Steven Yeh (@say25), a fantastic and close community contributor to GitHub Desktop. Steven is a long-time open source contributor to GitHub Desktop, and we’re immensely grateful for him continuing to help improve the product.

Create aliases for repositories locally

Many developers keep more than one copy of a repository in GitHub Desktop, and the way repositories are displayed makes it tricky to differentiate between them. In GitHub Desktop 2.8, you can create aliases for your local repositories to easily tell them apart in the list.

We appreciate your input

All of the recent features we’ve shipped have come as a direct result of the great feedback we get from talking with users and hearing from you in our open source repository. With more than one million users actively using GitHub Desktop every month and more than 200 community contributors, we so appreciate all the feedback and contributions that help make GitHub Desktop even better every day. Thank you!

And if you’ve never tried it, or have not tried it in awhile, download GitHub Desktop today.

How we ship code faster and safer with feature flags

Post Syndicated from Alberto Gimeno original https://github.blog/2021-04-27-ship-code-faster-safer-feature-flags/

At GitHub, we’re continually working to improve existing features and shipping new ones all the time. From our launch of GitHub Discussions to the release of manual approvals for GitHub Actions—in order to ship new features and improvements faster while lowering the risk in our deployments, we have a simple but powerful tool: feature flags.

Reducing deployment risk

We deploy changes around the clock, and we need to keep the service running with no disruptions. For these reasons, deploying needs to be an almost risk-free process. Any problem that comes up during a deployment requires a rollback of the code. During this rollback period, users could be impacted and other deployments delayed; the more time that elapses the bigger the impact.

We use feature flags to reduce the risk of deploying to production. Any potentially risky change is put behind a feature flag in the code and then, when the deployment is done, we enable the feature flag to everyone or to a percentage of actors. This way we minimize the impact of the new changes, and if something goes wrong we can disable the feature flag completely in a matter of seconds without interrupting other deployments. This is fundamental to us for the following reasons:

  • It isolates and reduces the deployment risk.
  • We have the ability to disable changes in seconds without rolling back a deployment, which could take minutes.

What kind of things can go wrong during a deployment, where a feature flag can help? These are some examples:

  • Changes in database queries can potentially make existing functionality slower.
  • Changes in the logic on existing features can produce unexpected behavior, edge cases not considered, and more.
  • Changes in how we store or process information, such as saving new columns in the database can accidentally start inserting invalid values or generate too many writes to the database.

Working on new features

Feature flags are not only useful to reduce deployment risk when fixing or improving existing features, we also use them when developing new features!

Using feature flags allows us to work on features incrementally. Only staff members working on the project have the corresponding feature flag enabled, so they can see the new feature that is under development while other users cannot. We don’t use long-lived feature branches. Instead, feature flags allow us to work on small batches, which brings us many benefits:

  • Small batches are easier to review by other engineers in pull requests.
  • The smaller the change, the lesser the chance to get something wrong during the production deployment.
  • Not using long-lived feature branches avoids lots of potential merge conflicts and clashes with other features under development.

In order to adopt this way of building new features, you need to do a little bit of additional planning upfront to divide the work. For example, a new feature typically needs new data models or changes in existing ones. If you create a pull request only for the data model changes, then you may get blocked until those changes are merged. There are a couple of things we usually do to prevent this:

  • Create a main pull request. A spike, where you put the data model changes, and continue working on other areas that require changes: UI, API, background jobs, etc. Then, start extracting those changes to smaller pull requests that your peers can review.
  • Alternatively, you can create an initial pull request, and then create new branches off of that first branch, even if it’s not merged to the main branch yet. If you need to make changes to the first branch, you’ll need to rebase the other branches.

This way of working could introduce some friction due to having to ask for reviews more frequently. Sometimes it can take days until a team reviews your pull request. To prevent this from being a blocker, we follow one or more of these strategies:

Keep working even if the changes are not merged into the main branch. Either on that “spike” pull request or branching off the branch that is waiting for reviews.

Have a first responder person in each team that is focused on reviewing pull requests quickly to unblock the progress of their own team or other teams.
Sometimes if a critical pull request is ready to be deployed sooner rather than later, but has outstanding non-critical feedback waiting to be addressed, we do that in follow up pull requests.

Testing features

We have to make sure that the new functionality works as expected while also maintaining the existing functionality. In order to do that, we have many mechanisms to enable or disable feature flags at all levels as follows:

  • In our development environments, we can toggle feature flags from the command line.
  • In automated tests, we can enable or disable feature flags in the code.
  • In our CI, we have two different builds: one that runs with all feature flags disabled by default, and another one that runs with all feature flags enabled by default. This drastically reduces the chances of not covering most code paths in automated tests properly.
  • In production, we can enable or disable feature flags in the query string of a request.

Different shipping strategies

We already mentioned that we can flag particular actors, or we can enable a feature flag for a percentage of actors. Those are the most common options, but others include:

  • Individual actors: We use this mechanism to flag employees working on a feature or to enable the flag to customers that are experimenting with a feature, or if they have a problem we are trying to fix.
  • Staff shipping: When a feature is almost ready to be released, we first enable it for all GitHub staff, publish an internal post for awareness, and create an internal issue for gathering feedback or possible bugs.
  • Early access maintainers or other beta groups: When we are about to release an important feature that will impact the workflows of open source software (OSS) maintainers or other types of users, we test the feature with a small group first. After some time, we may interview them and gather their feedback to confirm our hypothesis, and validate the implementation.
  • Percentage of actors: As mentioned earlier, we can specify a percentage of actors for which to enable the feature flag. In this case when an actor is flagged, it remains flagged, unless the feature flag is rolled back. When the percentage of actors is changed, a message is sent to our deployments channel on Slack so other engineers are aware of the change.
  • Dark shipping: This allows us to enable the feature flag for a percentage of calls. This is different from the previous mechanism because an actor can get the feature enabled in one call (one request, for example) but not in the next one. This mechanism is not meant for features that are visible to the users, but for internal changes, such as a performance improvement in a query.

All these strategies and the full administration of feature flags is done through a web UI. Not all staff members have access to this UI, but most engineers do. The interface allows us to manage the shipping status of a feature flag, creation of new ones, and deletion. Additionally, we can see a history of changes made to the feature flag: who made the change, when and what changes were made, and other metadata, such as which team owns it.

What to flag

When we say that we can flag an actor, we mean that we can flag not only users, but other entities. In particular, we can flag users, organizations, teams, enterprises, repositories, or GitHub Apps. When writing the code that checks if the flag is enabled or not, we need to specify which actor has the flag enabled or not. These are some examples:

  • If the flag is purely visual, we flag the user, and we check the current logged user. Examples of this could be: dark mode, layout or navigation changes, and new components in the UI that don’t depend on data changes.
  • If the flag changes the way we store data, it is usually better to flag the repository for consistency for all users using the repository. For example, if we start storing timestamps for specific actions, we don’t want some users to produce the new timestamp while others do not.
  • For API changes, we also flag GitHub Apps.
  • Sometimes, especially for some risky changes, we create custom, context-specific actor types, where the existing actor types just wouldn’t have worked.

Sometimes we do more sophisticated checks, such as checking multiple actors or checking if a repository is public or not. For example, when we staff ship a feature it may make a new feature available for all repositories of an employee. If the feature must not be visible to other users, we want to enable the feature only for private repositories to prevent public repositories of an employee leaking the new feature.

The cost of a feature flag

As we have seen, using feature flags extensively changes the way we work, and requires a bit of additional planning and coordination. But there’s also additional costs associated with feature flags that we need to keep in mind. First, we have a runtime cost. When we check if a feature flag is enabled to an actor, we need to first load the metadata information of the flag, which is stored in MySQL but cached in memcached. We are now considering having the feature metadata and the list of actors always in memory to reduce this overhead. There’s an exception for this: we consider some feature flags “large.” An example of this was GitHub Actions, a feature that we rolled out to tens of thousands of users every week. For these large feature flags, when they are still not fully enabled, we need to do a per-actor query to MySQL to check if the actor is in the list of enabled actors.

For runtime, there’s also a cost in the form of technical debt. Once a feature flag is fully enabled, it leaves a lot of dead code and old tests in the repository that need to be deleted. Usually we do this manually after a feature has been rolled out for a few days or weeks depending on the case. But we are now automating the process: we have created a script that can be either invoked locally or by triggering a workflow manually.

The script uses regular expressions to find usages of the feature flag using git grep. Then for the matched Ruby files, it modifies the code using rubocop-ast. It is capable of deleting code blocks, such as if/else statements, and modifying boolean expressions, in addition to reindenting the code afterwards. When the file is a test, it deletes the code that enables the feature flag. For tests that check the behavior of the application when the feature flag is disabled, it just leaves a comment and makes the test fail, since in many cases we don’t want to just delete the test, but simply update it. As a result, even when the script requires some intervention from an engineer, it drastically reduces the manual work required in the process.

When the script is run using a workflow dispatch, it also creates a new branch and a pull request. Since we do extensive use of CODEOWNERS, the pull request will have the proper teams and people as reviewers. In the near future, we want to automate this process even more by running the script automatically when we can consider a feature flag stale.

Conclusion

Feature flags allow us to ship code faster with higher confidence. Using them we reduce deployment risk and make code reviews and development on new features easier, because we can work on small batches. There are also associated costs, but the benefits highly exceed them.