Tag Archives: Engineering

GitHub Availability Report: February 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-03-03-github-availability-report-february-2021/

Introduction

In February, we experienced no incidents resulting in service downtime to our core services.

This month’s GitHub Availability Report will provide initial details around an incident from March 1 that caused significant impact and a degraded state of availability for the GitHub Actions service.

March 1 09:59 UTC (lasting one hour and 42 minutes)

Our service monitors detected a high error rate on creating check suites for workflow runs. This incident resulted in the failure or delay of some queued jobs for a period of time. Additionally, some customers using the search/filter functionality to find workflows may have experienced incomplete search results.

In February, we experienced a few smaller and unrelated incidents affecting different parts of the GitHub Actions service. While these were localized to a subset of our customers and did not have broad impact, we take reliability very seriously and are conducting a thorough investigation into the contributing factors.

Due to the recency of the March 1 incident, we are still investigating the contributing factors and will provide a more detailed update in the March Availability Report, which will be published the first Wednesday of April. We will also share more about our efforts to minimize the impact of future incidents and increase the performance of the Actions service.

In summary

GitHub Actions has grown tremendously, and we know that the availability and performance of the service is critical to its continued success. We are committed to providing excellent service, reliability, and transparency to our users, and will continue to keep you updated on the progress we’re making to ensure this.

For more information, you can check out our status page for real-time updates on the availability of our systems and the GitHub Engineering blog for deep-dives around how we’re improving our engineering systems and infrastructure.

One small step closer to containerising service binaries

Post Syndicated from Grab Tech original https://engineering.grab.com/reducing-your-go-binary-size

Grab’s engineering teams currently own and manage more than 250 microservices. Depending on the business problems that each team tackles, our development ecosystem ranges from Golang to Java and everything in between.

Although there are centralised systems that help automate most of the build and deployment tasks, there are still some teams working on different technologies that manage their own build, test, and deployment systems at different maturity levels. Managing such a varied build and deploy ecosystem brings its own challenges.

Build challenges

  • Broken external dependencies.
  • Non-reproducible builds due to changes in AMI, configuration keys and other build parameters.
  • Missing security permissions between different repositories.

Deployment challenges

  • Varied deployment environments necessitating a steeper learning curve.
  • Managing the underlying infrastructure as code.
  • Higher downtime when bringing the systems up after a scale-down event.

Grab’s appetite for customer obsession and quality drives the engineering teams to innovate and deliver value rapidly. The time that teams spend fixing build issues or handling deployment-related tasks directly reduces the time they can spend on delivering business value.

Introduction to containerisation

Using a container architecture helps the team deploy and run multiple applications, isolated from each other, on the same virtual machine or server with much less overhead.

At Grab, both the platform and the core engineering teams wanted to move to the containerisation architecture to achieve the following goals:

  • Support building and pushing container images during the CI process.
  • Create a standard virtual machine image capable of running container workloads. The AMI is maintained by a central team and comes with Grab infrastructure components such as DataDog, Filebeat, and Vault.
  • A deployment experience which allows existing services to migrate to container workloads safely by initially running both types of workloads concurrently.

The core engineering teams wanted to adopt container workloads to achieve the following benefits:

  • Provide a containerised version of the service that can be run locally and on different cloud providers without any dependency on Grab’s internal (runtime) tooling.
  • Allow reuse of common Grab tools in different projects by running the zero dependency version of the tools on demand whenever needed.
  • Allow a more flexible staging/dev/shadow deployment of new features.

Adoption of containerisation

Engineering teams at Grab use the containerisation model to build and deploy services at scale. Our containerisation effort helps the development teams move faster by:

  • Providing a consistent environment across development, testing and production
  • Deploying software efficiently
  • Reducing infrastructure cost
  • Abstracting OS dependency
  • Increasing scalability between cloud vendors

When we started using containers, we realised that building smaller containers had some benefits over bigger containers. For example, smaller containers:

  • Include only the needed libraries and therefore are more secure.
  • Build and deploy faster as they can be pulled to the running container cluster quickly.
  • Utilise disk space and memory efficiently.

During the course of containerising our applications, we noted that some service binaries appeared to be bigger (~110 MB) than they should be. For a statically-linked Golang binary, that’s pretty big! So how do we figure out what’s bloating the size of our binary?

Go binary size visualisation tool

In the course of poking around for tools that would help us analyse the symbols in a Golang binary, we found go-binsize-viz, based on this article. We particularly liked this tool because it utilises the existing Golang toolchain (specifically, go tool nm) to analyse imports, and provides a straightforward mechanism for traversing the symbols present via a treemap. We will briefly outline the steps we took to analyse a Golang binary here.

  1. First, build your service using the following command (important for consistency between builds):

    $ go build -a -o service_name ./path/to/main.go
    
  2. Next, copy the binary over to the cloned directory of go-binsize-viz repository.
  3. Run the following script that covers the steps in the go-binsize-viz README.

    #!/usr/bin/env bash
    #
    # This script needs more input parsing, but it serves the needs for now.
    #
    mkdir -p dist
    # step 1
    go tool nm -size "$1" | c++filt > "dist/$1.symtab"
    # step 2
    python3 tab2pydic.py "dist/$1.symtab" > "dist/$1-map.py"
    # step 3
    # must be data.js
    python3 simplify.py "dist/$1-map.py" > "dist/$1-data.js"
    rm -f data.js
    ln -s "dist/$1-data.js" data.js
    

    Running this script creates a dist folder where each intermediate step is deposited, and a data.js symlink in the top-level directory which points to the .js file consumed by treemap.html.

    # top-level directory
    $ ll
    -rw-r--r--   1 stan.halka  staff   1.1K Aug 20 09:57 README.md
    -rw-r--r--   1 stan.halka  staff   6.7K Aug 20 09:57 app3.js
    -rw-r--r--   1 stan.halka  staff   1.6K Aug 20 09:57 cockroach_sizes.html
    lrwxr-xr-x   1 stan.halka  staff        65B Aug 25 16:49 data.js -> dist/v2.0.709356.segments-paxgroups-macos-master-go1.13-data.js
    drwxr-xr-x   8 stan.halka  staff   256B Aug 25 16:49 dist
    ...
    # dist folder
    $ ll dist
    total 71728
    drwxr-xr-x   8 stan.halka  staff   256B Aug 25 16:49 .
    drwxr-xr-x  21 stan.halka  staff   672B Aug 25 16:49 ..
    -rw-r--r--   1 stan.halka  staff   4.2M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13-data.js
    -rw-r--r--   1 stan.halka  staff   3.4M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13-map.py
    -rw-r--r--   1 stan.halka  staff    11M Aug 25 16:37 v2.0.709356.segments-paxgroups-macos-master-go1.13.symtab
    

    As you can probably tell from the file names, these steps were explored on the segments-paxgroups service, which is a microservice used for segment information at Grab. You can ignore the versioning metadata, branch name, and Golang information embedded in the name.

  4. Finally, run a local python3 server to visualise the binary components.

    $ python3 -m http.server
    Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
    

    So now that we have a methodology to consistently generate a service binary, and a way to explore the symbols present, let’s dive in!

  5. Open your browser and visit http://localhost:8000/treemap_v3.html:

    Of the 103MB binary produced, 81MB is recognisable, with 66MB recognised as Golang. (An UNKNOWN category is present, and parsing produced a fair number of warnings; we haven’t spent enough time with the tool to understand why we aren’t able to recognise and index all the symbols present.)

    Treemap

    The next step is to figure out where the symbols are coming from. There’s a bunch of Grab-internal stuff that for the sake of this blog isn’t necessary to go into, and it was reasonably easy to come to the right answer based on the intuitiveness of the go-binsize-viz tool.

    This visualisation shows us how 11 MB of symbols are sneaking into the segments-paxgroups binary.

    Visualisation

    Every message format for any service that reads from, or writes to, streams at Grab is included in every service binary! Not cloud native!

How did this happen?

The short answer is that Golang doesn’t import only the symbols that it requires, but rather all the symbols defined within an imported directory and transitive symbols as well. So, when we think we’re importing just one directory, if our code structure doesn’t follow principles of encapsulation or isolation, we end up importing 11 MB of symbols that we don’t need! In our case, this occurred because a generic Message interface was included in the same directory with all the auto-generated code you see in the pretty picture above.
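
To make this concrete, here is a hedged sketch (the package layout and names are hypothetical, not the actual Grab code) of the kind of restructuring that avoids the problem: the generic interface lives in its own small package, so services that only need the interface no longer link in the generated message types.

// Before: the interface shares a package with generated code, so importing
// it pulls every generated type into the binary:
//
//   stream/
//     message.go          // type Message interface { ... }
//     orders.pb.go        // large auto-generated types
//     drivers.pb.go
//
// After: the interface is isolated in a tiny package of its own.
package message // e.g. stream/message

// Message is the generic contract that stream consumers depend on; services
// importing only this package no longer pull in the auto-generated code.
type Message interface {
	Key() string
	Payload() []byte
}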

The Streams team did an awesome job of restructuring the code, which, when built again, led to this outcome:

$ ll | grep paxgroups
-rwxr-xr-x   1 stan.halka  staff   110M Aug 21 14:53 v2.0.709356.segments-paxgroups-macos-master-go1.12
-rwxr-xr-x   1 stan.halka  staff   103M Aug 25 16:34 v2.0.709356.segments-paxgroups-macos-master-go1.13
-rwxr-xr-x   1 stan.halka  staff    80M Aug 21 14:53 v2.0.709356.segments-paxgroups-macos-tinkered-go1.12
-rwxr-xr-x   1 stan.halka  staff    78M Aug 25 16:34 v2.0.709356.segments-paxgroups-macos-tinkered-go1.13

Not a bad reduction in service binary size!

Lessons learnt

The go-binsize-viz utility offers a treemap representation for imported symbols, and is very useful in determining what symbols are contributing to the overall size.

Code architecture matters: Keep binaries as small as possible!

To reduce your binary size, follow these best practices:

  • Structure your code so that the interfaces and common classes/utilities are imported from different locations than auto-generated classes.
  • Avoid huge, flat directory structures.
  • If it’s a platform offering with many interwoven dependencies, try to decouple the actual platform offering from the company-specific instantiations. This fosters isolated, minimal code.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

New global ID format coming to GraphQL

Post Syndicated from Wissam Abirached original https://github.blog/2021-02-10-new-global-id-format-coming-to-graphql/

The GitHub GraphQL API has been publicly available for over 4 years now. Its usage has grown immensely over time, and we’ve learned a lot from running one of the largest public GraphQL APIs in the world. Today, we are introducing a new format for object identifiers and a roll out plan that brings it to our GraphQL API this year.

What is driving the change?

As GitHub grows and reaches new scale milestones, we have come to the realization that the current format of Global IDs in our GraphQL API will not support our projected growth over the coming years. The new format gives us more flexibility and scalability in handling your requests even faster.

What exactly is changing?

We are changing the Global ID format in our GraphQL API. As a result, all object identifiers in GraphQL will change and some identifiers will become longer than they are now. Since you can get an object’s Global ID via the REST API, these changes will also affect an object’s node_id returned via the REST API. Object identifiers will continue to be opaque strings and should not be decoded.
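
If you persist these identifiers, the safest posture is to store them as opaque, variable-length strings and make no assumptions about their length or contents. Here is a minimal, hedged sketch in Go (the table and column names are our own, purely for illustration):

package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // any SQL driver works; MySQL is only an example
)

func main() {
	db, err := sql.Open("mysql", "user:password@/example")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Store node IDs as unbounded TEXT and never decode or parse them.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS tracked_objects (
		node_id TEXT NOT NULL, -- opaque Global ID; its length may grow
		name    TEXT NOT NULL
	)`)
	if err != nil {
		log.Fatal(err)
	}
}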

How will this be rolled out?

We understand that our APIs are a critical part of your engineering workflows, and our goal is to minimize the impact as much as possible. In order to give you time to migrate your implementations, caches, and data records to the new Global IDs, we will go through a gradual rollout and thorough deprecation process that includes three phases.

  1. Introduce new format: This phase will introduce the new Global IDs into the wild on a type-by-type basis, for newly created objects only. Existing objects will continue to have the same ID. We will start by migrating the least requested object types, working our way towards the most popular types. Note that the new Global IDs may be longer, so if you store IDs, you should ensure you can store the longer values. During this phase, as long as you can handle the potentially longer IDs, no action is required by you. The expected duration of this phase is 3 months.
  2. Migrate: In this phase you should look to update your caches and data records. We will introduce migration tools allowing you to toggle between the two formats making it easy for you to update your caches to the new IDs. The migration tools will be detailed in a separate blog post, closer to launch date. You will be able to use the old or new IDs to refer to an object throughout this phase. The expected duration of this phase is 3 months.
  3. Deprecate: In this phase, all REST API requests and GraphQL queries will return the new IDs. Requests made with the old IDs will continue to work, but the response will include only the new ID, along with a deprecation warning. The expected duration of this phase is 3 months.

Once the three migration phases are complete, we will sunset the old IDs. All requests made using the old IDs will result in an error. Overall, the whole process should take 9 months, with the goal of giving you plenty of time to adjust and migrate to the new format.

Tell us what you think

If you have any concerns about the rollout of this change impacting your app, please contact us and include information such as your app name so that we can better assist you.

Customer Support workforce routing

Post Syndicated from Grab Tech original https://engineering.grab.com/customer-support-workforce-routing

Introduction

With Grab’s wide range of services, we receive large volumes of queries every day. Our Customer Support teams address concerns ranging from safety issues to general FAQs. The teams delight our customers through quick resolutions, resulting from a world-class support framework and an efficient workforce routing system.

Our workforce routing system ensures that available resources are efficiently assigned to a request based on the right skill set and deciding factors such as department, country, and request priority. Scalability to work across support channels (e.g. voice, chat, or digital) is another factor considered when routing a request to a particular support specialist.

Sample flow of how it works today

Having an efficient workforce routing system ensures that requests are directed to the relevant support specialists who are best suited to handle a certain type of issue, resulting in quicker resolutions, happier customers, and reduced support costs.

We initially implemented a third-party solution; however, there were a few limitations, such as prioritisation, that motivated us to build our very own routing solution, one that provides better routing configuration controls and reduces licensing costs.

This article describes how we built our in-house workforce routing system at Grab and focuses on Livechat, one of the domains of customer support.

Problem

Let’s run through the issues with our previous routing solution in the next sections.

Priority management

The third-party solution didn’t allow us to prioritise a group of requests over others. This was particularly important for handling safety issues, which should not be delayed by other low-priority requests like enquiries. So our goal for the in-house solution was to ensure that we were able to configure the priority of the request queues.

Bespoke product customisation

With the third-party solution being a generic service provider, customisations often required long lead times, as not all product requests from Grab were well received by the mass market. Building this in-house meant Grab had full control over the design and configuration of routing. Here are a few sample use cases that were addressed by customisation:

  • Bulk configuration changes – Previously, it was challenging to assign the same configuration to multiple agents. So, we introduced another layer of grouping for agents that share the same configuration. For example, which queues the agents receive chats from and what the proficiency and max concurrency should be.
  • Resource Constraints – To avoid overwhelming resources with unlimited chats and to maintain reasonable wait times for our customers, we introduced a dynamic queue limit on the number of chat requests enqueued. This limit was based on factors like the number of incoming chats and agent performance over the last hour.
  • Remote Work Challenges – With the pandemic situation and more of our agents working remotely, network issues were common. So we released an enhancement on the routing system to reroute chats handled by unavailable agents (due to disconnection for an extended period) to another available agent. The seamless experience helped increase customer satisfaction.

Reporting and analytics

Similar to the previous point, a solution addressing generic use cases doesn’t allow you to add customisations to monitoring at will. With the custom implementation, we were able to add more granular metrics, which are very useful for assessing agent productivity and performance and for planning resources ahead of time; this is why reporting and analytics were so valuable for workforce planning. A few of the customisations we added were:

  • Agent Time Utilisation – While basic agent tracking was available in the out-of-the-box solution, it limited users to three states (online, away, and invisible). With the custom routing solution, we were able to create customised statuses to capture the time an agent spent in a particular status due to chat connection issues and failures, and to reflect this on dashboards for immediate attention.
  • Chat Transfers – The number of chat transfers could only be tabulated manually. We then automated this process with a custom implementation.

Solution

Now that we’ve covered the issues we’re solving, let’s cover the solutions.

Prioritising high-priority requests

During routing, the constraint is on the number of resources available. The incoming requests cannot simply be assigned to the first available agent. The issue with this approach is that we would eventually run out of agents to serve the high-priority requests.

One way to prevent this is to have a separate group of agents solely handle high-priority requests. On its own, this does not solve the problem, as high-priority and low-priority requests still share the same queue and are de-queued in First-In, First-Out (FIFO) order. As a result, low-priority requests that arrived earlier are processed ahead of high-priority requests waiting behind them. Because of this queuing issue, prioritisation of requests is critical.

The need to prioritise

High-priority requests, such as safety issues, must not be in the queue for a long duration and should be handled as fast as possible even when the system is filled with low-priority requests.

There are two different kinds of queues: one handles requests at the priority level, and the other handles individual issue types; the latter are the business queues on which the queue limit constraints apply.

To illustrate further, here are two different scenarios of enqueuing/de-queuing:

Different issues with different priorities

In this scenario, the priority is set to dequeue safety issues, which are in the high-priority queue, before picking up the enquiry issues from the low-priority queue.

Different issues with different priorities

Identical issues with different priorities

In this scenario, where identical issues have different priorities, the reallocated enquiry issue in the high-priority queue is dequeued before a low-priority enquiry issue is picked up. Reallocations happen when a chat is transferred to another agent or when it was not accepted by the allocated agent. When reallocated, it goes back to the queue with a higher priority.

Identical issues with different priorities

Approach

To implement different levels of priorities, we decided to use separate queues for each of the priorities and denoted the request queues by groups, which could logically exist in any of the priority queues.

For de-queueing, time slices of varied lengths were assigned to each of the queues to make sure the de-queueing worker spends more time on a higher priority queue.

The architecture uses multiple de-queueing workers running in parallel, with each worker looping over the queues and waiting for a message in a queue for a certain amount of time, and then allocating it to an agent.

for i := startIndex; i < len(consumer.priorityQueue); i++ {
  queue := consumer.priorityQueue[i]
  duration := queue.config.ProcessingDurationInMilliseconds
  for now := time.Now(); time.Since(now) < time.Duration(duration)*time.Millisecond; {
    consumer.processMessage(queue.client, queue.config)
    // cool down
    time.Sleep(time.Millisecond * 100)
  }
}

The above code snippet iterates over the individual priority queues, waiting for a message for a certain duration in each and processing messages upon receipt. There is also a cool-down period of 100ms between processing attempts.

The caveat with the above approach is that the worker may end up spending more time than expected when it receives a message at the end of the waiting duration. We addressed this by having multiple workers running concurrently.

Request starvation

Now when priority queues are used, there is a possibility that some of the low-priority requests remain unprocessed for long periods of time. To ensure that this doesn’t happen, the workers are forced to run out of sync by tweaking the order in which priority queues are processed, such that when worker1 is processing a high-priority queue request, worker2 is waiting for a request in the medium-priority queue instead of the high-priority queue.
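
A hedged sketch of that staggering, reusing the structure of the earlier snippet (the worker count and the wrap-around logic are illustrative assumptions):

// Each worker starts its sweep at a different queue index, so at any given
// moment the workers are watching different priority levels and low-priority
// queues still get attention.
for workerID := 0; workerID < numWorkers; workerID++ {
  go func(startIndex int) {
    for {
      for offset := 0; offset < len(consumer.priorityQueue); offset++ {
        queue := consumer.priorityQueue[(startIndex+offset)%len(consumer.priorityQueue)]
        duration := queue.config.ProcessingDurationInMilliseconds
        for now := time.Now(); time.Since(now) < time.Duration(duration)*time.Millisecond; {
          consumer.processMessage(queue.client, queue.config)
          time.Sleep(time.Millisecond * 100)
        }
      }
    }
  }(workerID % len(consumer.priorityQueue))
}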

Customising to our needs

We wanted to make sure that agents with the adequate skills are assigned to the right queues to handle the requests. On top of that, we wanted to ensure that there is a limit on the number of requests that a queue can accept at a time, guaranteeing that the system isn’t flooded with too many requests, which can lead to longer waiting times for request allocation.

Approach

The queues are configured with a dynamic queue limit, which is the upper limit on the number of requests that a queue can accept. Additionally, attributes such as country, department, and skills are defined on the queue.

The dynamic queue limit takes into account the utilisation factor of the queue and the agents available at the given time, which ensures an appropriate waiting time at the queue level.
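
The exact formula isn’t spelled out in this post, but a hedged sketch of how such a limit could be derived from those two inputs (the weighting is an illustrative assumption, not Grab’s actual configuration) looks like this:

// dynamicQueueLimit is an illustrative sketch: more free agent capacity allows
// a larger queue, while a busier queue (higher utilisation) shrinks the limit.
func dynamicQueueLimit(onlineAgents, maxConcurrencyPerAgent int, utilisationFactor float64) int {
	capacity := float64(onlineAgents * maxConcurrencyPerAgent)
	limit := capacity * (1 - utilisationFactor) // utilisationFactor is in [0, 1]
	if limit < 1 {
		return 1 // always allow at least one queued request
	}
	return int(limit)
}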

A simple approach to assign which queues the agents can receive the requests from is to directly assign the queues to the agents. But this leads to another problem to solve, which is to control the number of concurrent chats an agent can handle and define how proficient an agent is at solving a request. Keeping this in mind, it made sense to have another grouping layer between the queue and agent assignment and to define attributes, such as concurrency, to make sure these groups can be reused.

There are three entities in agent assignment:

  • Queue
  • Agent Group
  • Agent

When the request is de-queued, the agent list mapped to the queue is found and then some additional business rules (e.g. checking for proficiency) are applied to calculate the eligibility score of each mapped agent to decide which agent is the best suited to cater to the request.

The factors impacting the eligibility score are proficiency, whether the agent is online/offline, the current concurrency, max concurrency and the last allocation time.
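
As a rough sketch of how those factors could be combined (the struct fields and weights are hypothetical; the real business rules are more involved):

// (assumes the standard time package is imported)

// Agent captures the attributes considered during allocation.
type Agent struct {
	Online             bool
	Proficiency        float64 // proficiency for the issue type being routed, e.g. 0.0-1.0
	CurrentConcurrency int
	MaxConcurrency     int
	LastAllocationTime time.Time
}

// eligibilityScore ranks the agents mapped to a queue; agents that are offline
// or already at max concurrency are not eligible at all.
func eligibilityScore(a Agent, now time.Time) (float64, bool) {
	if !a.Online || a.CurrentConcurrency >= a.MaxConcurrency {
		return 0, false
	}
	spareCapacity := float64(a.MaxConcurrency-a.CurrentConcurrency) / float64(a.MaxConcurrency)
	idleMinutes := now.Sub(a.LastAllocationTime).Minutes()
	// Illustrative weighting: proficiency first, then spare capacity, then how
	// long the agent has waited since their last allocation.
	return 3*a.Proficiency + 2*spareCapacity + 0.1*idleMinutes, true
}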

Ensuring the concurrency is not breached

To make sure that an agent doesn’t receive more chats than their defined concurrency, a locking mechanism is used at the per-agent level. During agent allocation, the worker acquires a lock on the agent record with an expiry, preventing other workers from allocating a chat to this agent. Only once the allocation process is complete (whether it failed or succeeded) is the concurrency updated and the lock released, allowing other workers to assign more chats to the agent depending on the bandwidth.
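
The post doesn’t name the lock implementation; assuming a Redis-backed lock (a common choice for this pattern), a minimal sketch with the go-redis client might look like the following — the key prefix and expiry are illustrative:

package allocation

import (
	"context"
	"time"

	"github.com/go-redis/redis/v8"
)

// tryLockAgent acquires a short-lived lock on an agent so that only one worker
// can allocate a chat to them at a time; the expiry guarantees the lock is
// eventually released even if the worker crashes mid-allocation.
func tryLockAgent(ctx context.Context, rdb *redis.Client, agentID string) (bool, error) {
	return rdb.SetNX(ctx, "agent-lock:"+agentID, "locked", 10*time.Second).Result()
}

// unlockAgent releases the lock after the allocation attempt completes
// (successfully or not) and the agent's concurrency has been updated.
func unlockAgent(ctx context.Context, rdb *redis.Client, agentID string) error {
	return rdb.Del(ctx, "agent-lock:"+agentID).Err()
}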

A similar approach was used to ensure that a queue doesn’t exceed its desired limit.

Reallocation and transfers

With the routing configuration set up, the reallocation of agents is done using the same steps as agent allocation.

To transfer a chat to another queue, the request goes back to the queue with a higher priority so that it is assigned faster.

Unaccepted chats

If the agent fails to accept the request in a given period of time, then the request is put back into the queue, but this time with a higher priority. This is why there’s a corresponding re-allocation queue with a higher priority than the normal queue: it makes sure that those unaccepted requests don’t have to wait long in the queue again.

Informing the frontend about allocation

When an agent is allocated, the routing system needs to inform the frontend by sending messages over a websocket connection. This is done with our super reliable messaging system called Hermes, which operates at scale, supporting 12k concurrent connections, and establishes real-time communication between agents and customers.

Finding the online agents

The routing system should only send the allocation message to the frontend when the agent is online and accepting requests. The frontend uses the same websocket connection used to receive the allocation message to inform the routing system about agent availability. This means that if the websocket connection is broken, for example due to internet connection issues, the agent stops receiving any new chat requests.

Enriched reporting and analytics

The routing system is able to push monitoring metrics, such as the number of online agents, the number of chat requests assigned to each agent, and so on. The fine-grained control that comes with building this system in-house gives us the ability to push more custom metrics.

There are two levels of monitoring offered by this system: real-time monitoring, and non-real-time monitoring that can be used for analytics, such as calculating agent productivity and the time spent on each chat.

We achieved the discussed solutions with the help of StatsD for real-time monitoring; for analytical purposes, we sent the data used for Tableau visualisations and reporting to Presto tables.

Given that the bottleneck for this system is the number of resources (i.e. the number of agents), real-time monitoring helps identify which configuration needs to be adjusted when there is a spike in the number of requests. Moreover, the persistent analytical data allows us to predict traffic and plan workforce management so that requests are handled efficiently.

Scalability

Ensuring that the system behaves appropriately when rolled out to multiple regions was a critical consideration. To make sure that there are enough workers to handle the requests, instances can be scaled horizontally if CPU utilisation increases.

To understand the system’s limitations and behaviour before releasing it to multiple regions, we ran load tests with 10x more traffic than expected. This gave us an understanding of what monitors and alerts we should add to make sure the system functions efficiently and to reduce our recovery time if something goes wrong.

Next steps

The enhancements lined up after building this routing solution focus on reducing customers’ waiting time and reducing the time agents spend on unresponsive customers who have waited too long in the queue. Aside from chats, we would like to extend this solution to handle digital issues (social media and emails) and voice requests (calls).


Special thanks to Andrea Carlevato and Karen Kue for making sure that the blogpost is interesting and represents the problem we solved accurately.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub reduces Marketplace transaction fees, revamps Technology Partner Program

Post Syndicated from Ryan J. Salva original https://github.blog/2021-02-04-github-reduces-marketplace-transaction-fees-revamps-technology-partner-program/

At GitHub, our community is at the heart of everything we do. We want to make it easier to build the things you love, with the tools you prefer to use—which is why we’re committed to maintaining an open platform for developers. Launched in 2017 and now home to the world’s largest DevOps ecosystem, GitHub Marketplace is the single destination for developers to find, sell, and share tools and solutions that help simplify and improve the process of building software.

Whether buying or selling, our goal is to provide the best possible Marketplace experience for developers. Today, we’re announcing some changes worth celebrating 🎉; changes to increase your revenue, simplify the application verification process, and make it easier for everyone to build with GitHub.

Supporting our Marketplace partners

In the spirit of helping developers both thrive and profit, we’re increasing developers’ take-home pay for apps sold in the Marketplace from 75% to 95%. GitHub will keep only a 5% transaction fee. This change puts more revenue in the pockets of the developers who are doing the work of building tools that support the GitHub community.

Learn more

Simplifying app verification process on the Marketplace

We know our partners are excited to get on Marketplace, and we’ve made changes to make this as easy as possible. Previously, a deep review of app security and functionality was required before an app could be added to Marketplace. Moving forward, we’ll verify your organization’s identity and common-sense security precautions by:

  1. Validating your domain with a simple DNS TXT record
  2. Validating the email address on record
  3. Requiring two-factor authentication for your GitHub organization

You can track your app submission’s progress from your organization’s profile settings to fix issues faster. Now developers can get their solutions added to the Marketplace faster and the community can moderate app quality.

Screenshot of app publisher verification process in Marketplace

Soon, we’ll move all “verified apps” to the validated publisher model, updating the green “verified” badge to indicate that publishers, not apps, are scrutinized. Learn more

GitHub Technology Partner Program updates

We’ve also made some updates to our Technology Partner Program. If you’re interested in the GitHub Marketplace but unsure how to build integrations with the GitHub platform, want to co-market with us, or want to learn about partner events and opportunities, our technology partner program can help you get started. You can also check out the partner-centric resources section or reach out to us at [email protected].

Screenshot of Technology Partner Program Resource page

You’re now one step away from the technical and go-to-market resources you need to integrate with GitHub and help improve the lives of all software developers. Looking forward to seeing you on the Marketplace.

Happy coding. 👾

Deployment reliability at GitHub

Post Syndicated from Raffaele Di Fazio original https://github.blog/2021-02-03-deployment-reliability-at-github/

Welcome to another deep dive of the Building GitHub blog series, providing a look at how teams across the GitHub engineering organization identify and address opportunities to improve our internal development tooling and infrastructure.

In a previous post of this series, we described how we improved the deployment experience for github.com. When we describe deployments at GitHub, the deployment experience is an important part of what it takes to ship applications to production, especially at GitHub’s scale, but there is more to it: the actual deployment mechanics need to be fast and reliable.

Deploying GitHub

GitHub is deployed to two types of “targets”: multiple Kubernetes clusters and directly to bare metal hosts. Those two targets have different needs and characteristics, such as different numbers of replicas, different runtimes, and so on.

The deployment process of GitHub is designed to be an invisible event for users—we deploy GitHub tens of times a day (yes, even on a Friday) without impact on our users.

When implementing deployments for a monolithic application, we must also take into account the impact that the deployment process has on the internal users of the tool. Hundreds of GitHub engineers work at the same time on new features and bug fixes in the same codebase, and it’s critical that they can reliably deploy to production. If deployments take too long or are prone to failure (even if there is no impact on users), developers at GitHub will spend more time getting those features out to users.

For these reasons, we asked ourselves the following questions:

  • How long does it take to get code successfully running in production?
  • How often do we roll back changes?
  • How often do deployments require any kind of manual intervention?

One thing was sure from the beginning, we needed data to answer those questions.

Measure all the things

We instrumented our tooling to send metrics on several important key aspects, including, but not limited to:

  • Duration of CI builds.
  • Duration of individual steps of the deployment pipeline.
  • Total duration of a deployment pipeline.
  • Final state of a deployment pipeline.
  • Number of deployments that are rolled back.
  • Occurrences of deployment retries in one of the steps of the pipeline.

As well as more general metrics related to the overall delivery:

  • How many pull requests we deploy/merge every week.
  • How long it takes to get a pull request from “ready to be deployed” to “merged”.

We used these metrics to implement several improvements to our deployment tooling: we generally made our deployments more reliable by analyzing those metrics, but we also introduced changes that allow us to tolerate some classes of intermittent deployment failures, introducing automatic retries in case of problems.

Additionally, instrumenting our deployment tooling allowed us to identify problems sooner when they happen so that we can react in a timely fashion.

Better visibility into the deployment process

As we mentioned, GitHub is a monolithic Rails app that is deployed to Kubernetes and bare metal servers, with the customer-facing part of GitHub being 100% deployed to Kubernetes. When we deploy a new version of GitHub, we need to start hundreds of pods in multiple Kubernetes clusters.

A few months ago, our deploy tooling did not print much information on what was going on behind the scenes with our Kubernetes deployments. This meant that whenever a deployment failed, for example due to an issue that we didn’t previously detect in stages before canary, we would have to dig into what happened by directly asking Kubernetes.

At GitHub, we don’t require engineers deploying to understand the internals of Kubernetes. We abstract Kubernetes in a way that is easier to deal with and we have tooling in place to debug GitHub without directly accessing specific Kubernetes clusters.

We analyzed internal support requests to our infrastructure teams and found an opportunity to reduce toil by making it easier to figure out what went wrong when deploying to Kubernetes.

For those reasons, we introduced changes to our tooling to provide better information on a deployment while it is being rolled out and to proactively provide specific lower-level information in case of failures, including a view of the Kubernetes events without the need to directly access Kubernetes itself.

This change gave us more detailed information on the progress of a deployment and increased visibility into errors in case of failures, which reduces the time to detect a problem and ultimately reduces the overall time needed to deploy to production.

An SLO based approach to deployment reliability

When deployments fail, there is no impact for GitHub customers: deploys are automatically halted before there can be customer-facing issues. To do so, we rely heavily on Kubernetes, for example by using readiness probes.

However, deploying GitHub tens of times a day means that the longer a deployment takes, the less things we can ship!

To make sure that we can successfully keep shipping new features to our customers, we defined a few service level objectives (SLOs) to keep track of how fast and reliable deploying GitHub is.

SLOs are usually defined for things like the success rate or latency of a web application, but they can be used for pretty much anything else. In our case, we started using SLOs to set reliability objectives and keep track of how much time it takes to deploy PRs to production, which allows us to understand when we need to shift our focus from new features to improvements to the overall shipping flow of GitHub.

At GitHub we have a dedicated team that is responsible for the continuous deployment of applications, which means that we develop tools and best practices but ultimately also help Hubbers ship their applications. These SLOs are now an integral part of the team dynamics and influence the priorities of the team, to ensure that we can keep shipping hundreds of pull requests every week.

Conclusion

In this post we discussed how we make sure that we keep Hubbers shipping new features and improvements over time. Since we started taking a look at the problem, we significantly improved our deployment process, but more importantly we introduced SLOs that can guide investments to further improve our tools and processes so that GitHub users can keep getting fresh new features all year round.

GitHub Availability Report: January 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-02-02-github-availability-report-january-2021/

Introduction

In January, we experienced one incident resulting in significant impact and a degraded state of availability for the GitHub Actions service.

January 28 04:21 UTC (lasting 3 hours 53 minutes)

Our service monitors detected abnormal levels of errors affecting the Actions service. This incident resulted in the failure or delay of some queued jobs for a period of time. Jobs that were queued during the incident were run successfully after the issue was resolved.

We identified the issue as caused by an infrastructure error in our SQL database layer. The database failure impacted one of the core microservices that facilitates authentication and communication between the Actions microservices, which affected queued jobs across the service. In normal circumstances, automated processes would detect that the database was unhealthy and failover with minimal or no customer impact. In this case, the failure pattern was not recognized by the automated processes, and telemetry did not show issues with the database, resulting in a longer time to determine the root cause and complete mitigation efforts.

To help avoid this class of failure in the future, we are updating the automation processes in our SQL database layer to improve error detection and failovers. Furthermore, we are continuing to invest in localizing failures to minimize the scope of impact resulting from infrastructure errors.

In summary

We’ll continue to keep you updated on the progress we’re making on ensuring reliability of our services. To learn more about how teams across GitHub identify and address opportunities to improve our engineering systems, check out the GitHub Engineering blog.

Making GitHub’s new homepage fast and performant

Post Syndicated from Tobias Ahlin original https://github.blog/2021-01-29-making-githubs-new-homepage-fast-and-performant/

This post is the third installment of our five-part series on building GitHub’s new homepage:

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How we illustrate at GitHub
  5. How we designed the homepage and wrote the narrative

Creating a page full of product shots, animations, and videos that still loads fast and performs well can be tricky. Throughout the process of building GitHub’s new homepage, we’ve used the Core Web Vitals as one of our North Stars and measuring sticks. There are many different ways of optimizing for these metrics, and we’ve already written about how we optimized our WebGL globe. We’re going to take a deep-dive here into two of the strategies that produced the overall biggest performance impact for us: crafting high performance animations and serving the perfect image.

High performance animation and interactivity

As you scroll down the GitHub homepage, we animate in certain elements to bring your attention to them:

Traditionally, a typical way of building this relied on listening to the scroll event, calculating the visibility of all elements that you’re tracking, and triggering animations depending on the elements’ position in the viewport:

// Old-school scroll event listening (avoid)
window.addEventListener('scroll', () => checkForVisibility())
window.addEventListener('resize', () => checkForVisibility())

function checkForVisibility() {
  animatedElements.map(element => {
    const distTop = element.getBoundingClientRect().top
    const distBottom = element.getBoundingClientRect().bottom
    const distPercentTop = Math.round((distTop / window.innerHeight) * 100)
    const distPercentBottom = Math.round((distBottom / window.innerHeight) * 100)
    // Based on this position, animate element accordingly
  })
}

There’s at least one big problem with an approach like this: calls to getBoundingClientRect() will trigger reflows, and utilizing this technique might quickly create a performance bottleneck.

Luckily, IntersectionObservers are supported in all modern browsers, and they can be set up to notify you of an element’s position in the viewport without ever listening to scroll events or calling getBoundingClientRect. An IntersectionObserver can be set up in just a few lines of code to track whether an element is shown in the viewport and trigger animations depending on its state, using each entry’s isIntersecting property:

// Create an intersection observer with default options, that 
// triggers a class on/off depending on an element’s visibility 
// in the viewport
const animationObserver = new IntersectionObserver((entries, observer) => {
  for (const entry of entries) {
    entry.target.classList.toggle('build-in-animate', entry.isIntersecting)
  }
});

// Use that IntersectionObserver to observe the visibility
// of some elements
for (const element of document.querySelectorAll('.js-build-in')) {
  animationObserver.observe(element);
}

Avoiding animation pollution

As we moved over to IntersectionObservers for our animations, we also went through all of our animations and doubled down on one of the core tenets of optimizing animations: only animate the transform and opacity properties, since these properties are easier for browsers to animate (generally computationally less expensive). We thought we did a fairly good job of following this principle already, but we discovered that in some circumstances we did not, because unexpected properties were bleeding into our transitions and polluting them as elements changed state.

One might think a reasonable implementation of the “only animate transform and opacity” principle might be to define a transition in CSS like so:

/* Don’t do this */
.animated {
  opacity: 0;
  transform: translateY(10px);
  transition: all 0.6s ease;
}
}

.animated:hover {
  opacity: 1;
  transform: translateY(0);
}

In other words, we’re only explicitly changing opacity and transform, but we’re defining the transition to animate all changed properties. These transitions can lead to poor performance since other property changes can pollute the transition (you may have a global style that changes the text color on hover, for example), which can cause unnecessary style and layout calculations. To avoid this kind of animation pollution, we moved to always explicitly defining only opacity and transform as animatable:

/* Be explicit about what can animate (and not) */
.animated {
  opacity: 0;
  transform: translateY(10px);
  transition: opacity 0.6s ease, transform 0.6s ease;
}

.animated:hover {
  opacity: 1;
  transform: translateY(0);
}

As we rebuilt all of our animations to be triggered through IntersectionObservers and to explicitly specify only opacity and transform as animatable, we saw a drastic decrease in CPU usage and style recalculations, helping to improve our Cumulative Layout Shift score:

Lazy-loading videos with IntersectionObservers

If you’re powering any animations through video elements, you likely want to do two things: only play the video while it’s visible in the viewport, and lazy-load the video when it’s needed. Sadly, the loading="lazy" attribute doesn’t work on videos, but if we use IntersectionObservers to play videos as they appear in the viewport, we can get both of these features in one go:

<!-- HTML: A video that plays inline, muted, w/o autoplay & preload -->
<video loop muted playsinline preload="none" class="js-viewport-aware-video" poster="video-first-frame.jpg">
  <source type="video/mp4" src="video.h264.mp4">
</video>

// JS: Play videos while they are visible in the viewport
const videoObserver = new IntersectionObserver((entries, observer) => {
  for (const entry of entries) entry.isIntersecting ? entry.target.play() : entry.target.pause();
});

for (const element of document.querySelectorAll('.js-viewport-aware-video')) {
  videoObserver.observe(element);
}

Together with setting preload to none, this simple observer setup saves us several megabytes on each page load.

Serving the perfect image

We visit web pages with a myriad of different devices, screens, and browsers, and something as simple as displaying an image becomes increasingly complex if you want to cover all bases. Our particular illustration style also happens to fall between the classic JPG, PNG, and SVG formats. Take this illustration, for example, that we use to transition from the main narrative to the footer:

To render this illustration, we would ideally need the transparency of PNGs combined with the compression of JPGs, as saving an illustration like this as a PNG would weigh in at several megabytes. Luckily, WebP is, as of iOS 14 and macOS Big Sur, supported in Safari on both desktops and phones, which brings browser support up to a solid 90%+. WebP does in fact give us the best of both worlds: we can create compressed, lossy images with transparency. What about support for older browsers? Even a new Mac running the latest version of Safari on macOS Catalina can’t render WebP images, so we have to do something.

This challenge eventually led us to develop a somewhat obscure solution: two JPGs embedded in an SVG (one for the image data and one for the mask), embedded as base64 data—essentially creating a transparent JPG with one single HTTP request. Take a look at this image. Download it, open it up, and inspect it. Yes, it’s a JPG with transparency, encoded in base64, wrapped in an SVG.

Part of the SVG specification is the mask element. With it, you can mask out parts of an SVG. If we embed an SVG in a document, we can use the mask element in tandem with the image element to render an image with transparency:

<svg viewBox="0 0 300 300">
  <defs>
    <mask id="mask">
      <image width="300" height="300" href="mask.jpg"></image>
    </mask>
  </defs>
  <image mask="url(#mask)" width="300" height="300" href="image.jpg"></image>
</svg>

This is great, but it won’t work as a fallback for WebP. Since the paths for these images are dynamic (see href in the example above), the SVG needs to be embedded inside the document. If we instead save this SVG in a file and set it as the src of a regular img, the images won’t be loaded, and we’ll see nothing.

We can work around this limitation by embedding the image data inside the SVG as base64. There are services online where you can convert an image to base64, but if you’re on a Mac, base64 is available by default in your Terminal, and you can use it like so:

base64 -i <in-file> -o <outfile>

Here, the in-file is your image of choice, and the outfile is a text file where the base64 data will be saved. With this technique, we can embed the images inside of the SVG and use the SVG as a src on a regular image.

These are the two images that we’re using to construct the footer illustration—one for the image data and one for the mask (black is completely transparent and white is fully opaque):

We convert the mask and the image to base64 using the Terminal command and then paste the data into the SVG:

<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 2900 1494">
  <defs>
    <mask id="mask">
      <image width="300" height="300" href="data:image/png;base64,/* your image in base64 */"></image>
    </mask>
  </defs>
  <image mask="url(#mask)" width="300" height="300" href="data:image/jpeg;base64,/* your image in base64 */"></image>
</svg>

You can save that SVG and use it like any regular image. We can then safely use WebP with lazy loading and a solid fallback that works in all browsers:

<picture>
  <source srcset="compressed-transparent-image.webp" type="image/webp">
  <img src="compressed-transparent-image.svg" loading="lazy">
</picture>

This somewhat obscure SVG hack saves us hundreds of kilobytes on each page load, and it enables us to utilize the latest technologies for the browsers and operating systems that support them.

Towards a faster web

We’re working throughout the company to create a faster and more reliable GitHub, and these are some of the techniques that we’re utilizing. We still have a long way to go, and if you’d like to be part of that journey, check out our careers page.

 

Serving driver-partners data at scale using mirror cache

Post Syndicated from Grab Tech original https://engineering.grab.com/mirror-cache-blog

Since the early beginnings, driver-partners have been the centerpiece of the wide range of services and features provided by the Grab platform. Over time, many backend microservices were developed to support our driver-partners, such as earnings, ratings, insurance, and so on. All of these different microservices require certain information, such as name, phone number, email, active car types, and so on, to curate the services provided to the driver-partners.

We built the Drivers Data service to provide driver-partners data to other microservices. The service attracts a high QPS and handles 10K requests during peak hours. Over the years, we have tried different strategies to serve driver-partners data in a resilient and cost-effective manner while maintaining low response times. In this blog post, we talk about mirror cache, an in-memory local caching solution built to serve driver-partners data efficiently.

What we started with

Figure 1. Drivers Data service architecture

Our Drivers Data service previously used MySQL DB as persistent storage and two caching layers – standalone local cache (RAM of the EC2 instances) as primary cache and Redis as secondary for eventually consistent reads. With this setup, the cache hit ratio was very low.

Figure 2. Request flow chart

We opted for a cache-aside strategy, so when a client request comes in, the Drivers Data service responds in the following manner (a simplified sketch of this read path follows the list):

  • If data is present in the in-memory cache (local cache), then the service directly sends back the response.
  • If data is not present in the in-memory cache and found in Redis, then the service sends back the response and updates the local cache asynchronously with data from Redis.
  • If data is not present either in the in-memory cache or Redis, then the service responds back with the data fetched from the MySQL DB and updates both Redis and local cache asynchronously.
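
A simplified sketch of that read path (the store and method names are illustrative, not the actual service code):

// getDriver implements the cache-aside read path described above.
func (s *DriversDataService) getDriver(ctx context.Context, driverID string) (*Driver, error) {
	// 1. Local in-memory cache hit: respond immediately.
	if d, ok := s.localCache.Get(driverID); ok {
		return d, nil
	}
	// 2. Redis hit: respond, then backfill the local cache asynchronously.
	if d, err := s.redis.GetDriver(ctx, driverID); err == nil {
		go s.localCache.Set(driverID, d)
		return d, nil
	}
	// 3. Fall back to MySQL, then backfill Redis and the local cache asynchronously.
	d, err := s.db.GetDriver(ctx, driverID)
	if err != nil {
		return nil, err
	}
	go func() {
		s.redis.SetDriver(context.Background(), driverID, d)
		s.localCache.Set(driverID, d)
	}()
	return d, nil
}
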
Figure 3. Percentage of response from different sources

The measurement of the response source revealed that during peak hours ~25% of the requests were being served via standalone local cache, ~20% by MySQL DB, and ~55% via Redis.

The low cache hit rate is caused by the driver-partners data loading patterns: low frequency per driver over time, but high frequency in a short amount of time. When a driver-partner is a candidate for a job or is involved in an ongoing job, different services make multiple requests to the Drivers Data service to fetch that specific driver-partner’s information. The frequency of calls for a specific driver-partner drops if they are not involved in the job allocation process or are not doing any job at the moment.

While low frequency per driver over time impacts the Redis cache hit rate, high frequency in short amounts of time mostly contributes to the in-memory cache hit rate. In our investigations, we found that the local caches of different nodes in the Drivers Data service cluster were making redundant calls to Redis and the DB to fetch data that was already present in another node’s local cache.

By making in-memory cache entries available on every instance while the data is in active use, we could greatly increase the in-memory cache hit rate, and that’s what we did.

Mirror cache design goals

We set the following design goals:

  • Support a local least recently used (LRU) cache use-case.
  • Support active cache invalidation.
  • Support best effort replication between local cache instances (EC2 instances). If any instance successfully fetches the latest data from the database, then it should try to replicate or mirror this latest data across all the other nodes in the cluster. If replication fails and the item is expired or not found, then the nodes should fetch it from the database.
  • Support async data replication across nodes to ensure updates for the same key happens only with more recent data. For any older updates, the current data in the cache is ignored. The ordering of cache updates is not guaranteed due to the async replication.
  • Ability to handle auto-scaling.

The building blocks

Figure 4. Mirror cache

The mirror cache library runs alongside the Drivers Data service inside each of the EC2 instances of the cluster. The two main components are in-memory cache and replicator.

In-memory cache

The in-memory cache stores multiple key/value pairs in RAM, each with an associated TTL. We wanted a cache that provides a high hit ratio, a bounded memory footprint, high throughput, and good concurrency. After evaluating several options, we went with Dgraph's open-source concurrent caching library Ristretto as our in-memory local cache. We were particularly impressed by its use of the TinyLFU admission policy to achieve a high hit ratio.
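Here is a minimal sketch of how a Ristretto-backed local cache might be initialized and used; the sizing numbers and the key format are purely illustrative, not the values we run in production.

package main

import (
    "fmt"
    "time"

    "github.com/dgraph-io/ristretto"
)

func main() {
    // Illustrative sizing only: frequency counters for admission plus a memory budget expressed as max cost.
    cache, err := ristretto.NewCache(&ristretto.Config{
        NumCounters: 1e7,     // number of keys to track frequency of
        MaxCost:     1 << 30, // maximum total cost (roughly 1 GB if cost is measured in bytes)
        BufferItems: 64,      // default recommended by the Ristretto docs
    })
    if err != nil {
        panic(err)
    }

    // Each entry carries a TTL, matching the mirror cache behaviour described above.
    cache.SetWithTTL("driver:123", []byte(`{"name":"A"}`), 1, 10*time.Minute)
    cache.Wait() // Set is buffered/asynchronous; wait so the value is visible in this example

    if v, found := cache.Get("driver:123"); found {
        fmt.Printf("%s\n", v)
    }
}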

Replicator

The replicator is responsible for mirroring/replicating each key/value entry among all the live instances of the Drivers Data service. The replicator has three main components: Membership Store, Notifier, and gRPC Server.

Membership Store

The Membership Store registers callbacks with our service discovery service to notify mirror cache in case any nodes are added or removed from the Drivers Data service cluster.

It maintains two maps: one for the nodes in the same AZ (AWS availability zone) as the current node (the node of the Drivers Data service in which this mirror cache instance is running), and one for the nodes in the other AZs.

Notifier

Each Drivers Data service node runs a single instance of mirror cache, so effectively each node has one notifier. The notifier is responsible for the following (a simplified sketch of the batching and fan-out appears below):

  • Combine several key/value pair updates to form a batch.
  • Propagate the batch updates among all the nodes in the same AZ as itself.
  • Send the batch updates to exactly one notifier (node) in each of the other AZs, which in turn is responsible for updating all the nodes in its own AZ with the latest batch of data. This communication pattern helps reduce cross-AZ data transfer overheads.

In the case of auto-scaling, there is a warm-up period during which the notifier doesn’t notify the other nodes in the cluster. This is done to minimize duplicate data propagation. The warm-up period is configurable.
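A rough sketch of the batching and fan-out logic a notifier could use is shown below. This is not the actual implementation: the Entity type mirrors the gRPC message shown in the next section, and the membership and sender interfaces are hypothetical stand-ins for the Membership Store and the gRPC client.

package mirrorcache

import (
    "context"
    "time"
)

// Entity is a trimmed-down version of the gRPC cache update message shown below.
type Entity struct {
    Key   string
    Value []byte
}

// membership is a hypothetical view of the cluster, split by availability zone.
type membership interface {
    SameAZNodes() []string       // other nodes in our AZ
    OnePeerPerOtherAZ() []string // exactly one node in each of the other AZs
}

// sender is a hypothetical wrapper around the gRPC client.
type sender interface {
    SendBatch(ctx context.Context, node string, batch []Entity) error
}

// notify collects updates into batches and replicates them: to all nodes in the
// same AZ, plus one node per other AZ (which then fans out within its own AZ).
func notify(ctx context.Context, updates <-chan Entity, m membership, s sender) {
    ticker := time.NewTicker(100 * time.Millisecond) // illustrative flush interval
    defer ticker.Stop()

    var batch []Entity
    flush := func() {
        if len(batch) == 0 {
            return
        }
        for _, node := range append(m.SameAZNodes(), m.OnePeerPerOtherAZ()...) {
            go s.SendBatch(ctx, node, batch) // best effort; errors are ignored in this sketch
        }
        batch = nil
    }

    for {
        select {
        case e, ok := <-updates:
            if !ok {
                flush()
                return
            }
            batch = append(batch, e)
        case <-ticker.C:
            flush()
        }
    }
}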

gRPC Server

An exclusive gRPC server runs for mirror cache. The different nodes of the Drivers Data service use this server to receive new cache updates from the other nodes in the cluster.

Here’s the structure of each cache update entity:

message Entity {
    string key = 1; // Key for cache entry.
    bytes value = 2; // Value associated with the key.
    Metadata metadata = 3; // Metadata related to the entity.
    replicationType replicate = 4; // Further actions to be undertaken by the mirror cache after updating its own in-memory cache.
    int64 TTL = 5; // TTL associated with the data.
    bool delete = 6; // If delete is set to true, then mirror cache needs to delete the key from its local cache.
}

enum replicationType {
    Nothing = 0; // Stop propagation of the request.
    SameRZ = 1; // Notify the nodes in the same Region and AZ.
}

message Metadata {
    int64 updatedAt = 1; // Same as updatedAt time of DB.
}

The server first checks whether the local cache should accept the new value. It tries to fetch the existing value for the key. If no value is found, the new key/value pair is added. If there is an existing value, it compares the updatedAt timestamps to ensure that stale data does not overwrite fresher data in the cache.

If the replicationType is Nothing, then mirror cache stops further replication. If the replicationType is SameRZ, then mirror cache propagates the cache update to all the nodes in the same AZ as itself.
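A simplified sketch of this freshness check on the receiving side is shown below, assuming the cached value keeps the updatedAt timestamp alongside the payload; field and function names are illustrative and the real implementation may differ.

package mirrorcache

import (
    "time"

    "github.com/dgraph-io/ristretto"
)

// cacheEntity mirrors the fields of the gRPC Entity message above (names are illustrative).
type cacheEntity struct {
    Key       string
    Value     []byte
    UpdatedAt int64 // metadata.updatedAt, same as the DB's updatedAt
    TTL       int64 // seconds
    Delete    bool
}

// cachedValue is what we store locally so that freshness can be compared later.
type cachedValue struct {
    Value     []byte
    UpdatedAt int64
}

// applyUpdate sketches the check a receiving node performs before accepting a
// replicated update into its local cache.
func applyUpdate(cache *ristretto.Cache, e cacheEntity) {
    if e.Delete {
        cache.Del(e.Key)
        return
    }
    if existing, found := cache.Get(e.Key); found {
        if cur, ok := existing.(cachedValue); ok && cur.UpdatedAt >= e.UpdatedAt {
            return // incoming data is not newer than what we already hold; ignore it
        }
    }
    cache.SetWithTTL(e.Key, cachedValue{Value: e.Value, UpdatedAt: e.UpdatedAt}, 1, time.Duration(e.TTL)*time.Second)
}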

Run at scale

Figure 5. Drivers Data Service new architecture

The behavior of the service hasn’t changed and the requests are being served in the same manner as before. The only difference here is the replacement of the standalone local cache in each of the nodes with mirror cache. It is the responsibility of mirror cache to replicate any cache updates to the other nodes in the cluster.

After mirror cache was fully rolled out to production, we rechecked our metrics related to the response source and saw a huge improvement. During peak hours, ~75% of responses were served from the in-memory local cache, about 15% by the MySQL DB, and a further 10% via Redis.

The local cache hit ratio was at 0.75, a jump of 0.5 from before and there was a 5% drop in the number of DB calls too.

Figure 6. New percentage of response from different sources

Limitations and future improvements

Mirror cache is eventually consistent, so it is not a good choice for systems that need strong consistency.

Mirror cache stores all the data in volatile memory (RAM), which is wiped out during deployments, resulting in a temporary load increase on Redis and the DB.

Also, many new driver-partners are added every day to the Grab system, and we might need to increase the cache size to maintain a high hit ratio. To address these issues, we plan to use SSDs in the future to store part of the data and use RAM only for hot data.

Conclusion

Mirror cache really helped us scale the Drivers Data service better and serve driver-partners data to the different microservices at low latencies. It also helped us achieve our original goal of an increase in the local cache hit ratio.

We have also extended mirror cache to some other services and seen similarly promising results.


A huge shout out to Haoqiang Zhang and Roman Atachiants for their inputs into the final design. Special thanks to the Driver Backend team at Grab for their contribution.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Improving how we deploy GitHub

Post Syndicated from Julian Nadeau original https://github.blog/2021-01-25-improving-how-we-deploy-github/

Over the last year GitHub has doubled the number of developers contributing to the main GitHub.com application. While this seems like a purely positive thing on the surface, the 2x increase in people contributing to the core software exposed some problems in our tooling: tooling that worked for us a year ago no longer functioned at the same capacity. While GitHub itself has been a fantastic vehicle to drive change for GitHub, the deployment tooling and coordination had not enjoyed the same level of success. One of those areas was our deployments.

GitHub.com is deployed primarily through chatops using a branch deploy model (we deploy branches before merging into the main branch). Developers could add changes to a queue, check the status of the queue, and organize groups of pull requests to be deployed and merged. This all happened via chatops in a Slack room called #dotcom-ops. While this was a very simple system, it started to cause confusion when monitoring a deploy, because a single chat room managed the queue, the deploy, and many other tasks for many people at the same time. All of a sudden, a channel that had once been a central part of a crucial information system was overwhelmed by the amount of information being pushed through it. Developers could no longer track their change through the system, which reduced their capacity and increased GitHub's risk profile.

This is just one step of about a dozen spread across hundreds of messages; it is hard to keep track of and validate the state of a deploy.

The Build

At the beginning of this summer, we set out to completely revamp the way we monitored deploy changes for GitHub.com. With the problem in mind, namely the multi-step, information-dense deploy process via chatops, we set out with a few main goals:

Simplify the Deploy for Developers

One of the main issues we experienced with the previous system was that deploys were tracked across a number of different messages within the Slack channel. This made it very difficult to piece together the messages that made up a single deploy; sometimes there could be as many as a few hundred messages in between subsequent messages from the deploy system.

Second Canary Stage

The second main issue was that our canary stage only deployed to up to 2% of GitHub.com traffic. This meant a whole slew of problems would never get caught in the canary stage before we rolled out to 100% of production, and we would instead have to start an incident and roll back. Availability and uptime are of the utmost concern to GitHub, so this risk became crucial to address as GitHub continued to grow. With this in mind, we set out to introduce a second canary stage at a higher percentage so that we could catch more issues in earlier stages and reduce the impact of future incidents.

At the end of this project, we had two canary stages. The first is at 2% and aims to capture the majority of issues. This low percentage keeps the risk profile at a tolerable level, as only a small amount of traffic would be impacted by an issue. A second canary stage was introduced at 20%, which lets us direct a much larger amount of traffic to the new code while still in canary. This has a higher risk profile, but it is mitigated by the initial 2% canary stage and allows us to transition to the 100% production stage with less risk.

Automate

The last issue was that developers had to sit with the deploy and nudge it along at every step by running independent chatops commands to queue up, deploy canary, and deploy production, making judgement calls every step of the way. Developers were fumbling with chatops commands multiple times in every deployment, and often getting them wrong.

We asked ourselves, "what if there was one command and everything else just happened?". So we did just that: we automated the entire process.

The solution

We already had internal deploy software that was capable of tracking a single part of a deploy. This system had a proven track record and solid foundations, and so we aimed to add the ability to automate the entire deploy sequence from a single chatops command, rather than running multiple commands per deploy, and link these records together within the existing deployment infrastructure. Moreover, this new solution would provide an easy to use interface for a quick overview of any given deployment rather than piecing together multiple chatops commands.

We came up with two basic concepts:

  1. A deploy is made up of many stages (canary, production, etc.)
  2. There are gates between different stages that perform some sort of check to validate we can progress to the next stage

These concepts allowed us to model the entire system in a state-machine-like fashion:

This diagram shows an automatic progression between a 2% canary, a 20% canary, a production deploy, and a ready-to-merge stage, separated by automated five-minute timers. A pointer progresses automatically across the data model as the timer gates complete and stages are deployed.
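Our deploy tooling is internal, but the stage-and-gate model can be sketched in a few lines of Go. Everything below (stage names, the five-minute timer gates, the run loop) is illustrative rather than the actual implementation.

package main

import (
    "fmt"
    "time"
)

// Gate is a check that must pass before the deploy may advance to the next stage.
type Gate func() bool

// Stage is one step of the deploy (canary 2%, canary 20%, production, merge).
type Stage struct {
    Name   string
    Deploy func() error
}

// timerGate is an illustrative automatic gate that simply waits.
func timerGate(d time.Duration) Gate {
    return func() bool {
        time.Sleep(d)
        return true
    }
}

// run advances a pointer through the stages, evaluating the gate between each pair.
func run(stages []Stage, gates []Gate) error {
    for i, s := range stages {
        fmt.Println("deploying stage:", s.Name)
        if err := s.Deploy(); err != nil {
            return fmt.Errorf("stage %s failed: %w", s.Name, err) // a real system would roll back here
        }
        if i < len(gates) && !gates[i]() {
            return fmt.Errorf("gate after %s did not pass", s.Name)
        }
    }
    return nil
}

func main() {
    noop := func() error { return nil }
    stages := []Stage{{"canary-2%", noop}, {"canary-20%", noop}, {"production", noop}, {"ready-to-merge", noop}}
    gates := []Gate{timerGate(5 * time.Minute), timerGate(5 * time.Minute), timerGate(5 * time.Minute)}
    _ = run(stages, gates)
}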

What resulted was a state machine backed deploy system with a first party UI component. This was combined with the traditional chatops with which our developers were already familiar. Below, you can see an overview of deploys which have recently been deployed to GitHub.com. Rather than tracking down various messages in a noisy Slack channel, you can go to a consolidated UI. You can see the state machine progression mentioned above in the overview on the right.

Drilling down into a specific deployment, you can see everything that has happened during a specific deployment. You can see that each stage of this deployment has an automatic 5-minute timer between them, and the ability to pause the deployment to give a developer more time to test. We also made sure that, in the case something is wrong, we have a quick way to rollback or revert the changes with the dropdown in the top right corner.

Finally, the entire system could be monitored and started from Slack – just like before. We found that this is how developers typically want to start their deploys, followed by monitoring in the UI component:

The results

These changes have revolutionized the way we deploy GitHub.com. Where confusion and frustration had once set in, we now have joy and contentment in using an automated system. We have received overwhelmingly positive feedback internally, and even attracted some attention from our very own Actions team. Our learnings, in this case, helped inform and influence decisions in the recent GitHub Actions CD offering announced at GitHub Universe. Our focus on our own developers means that we can apply our learnings to continue creating the best possible system for the 56M+ developers around the world.

Our work this past summer could not have been possible without the dedication and work from many people and teams. We’d like a special shoutout to the GitHub internal deploy team: their existing system, advice, and ongoing help was crucial in making sure our deploys were successful.

Part of the Building GitHub blog series.

The best of Changelog • 2020 Edition

Post Syndicated from Michelle Mannering original https://github.blog/2021-01-21-changelog-2020-edition/

If you haven’t seen it, the GitHub Changelog helps you keep up-to-date with all the latest features and updates to GitHub. We shipped a tonne of changes last year, and it’s impossible to blog about every feature. In fact, we merged over 90,000 pull requests into the GitHub codebase in the past 12 months!

Here’s a quick recap of the top changes made to GitHub in 2020. We hope these changes are helping you build cooler things better and faster. Let us know what your favourite feature of the past year has been.

GitHub wherever you are

While we haven’t exactly been travelling a lot recently, one of the things we love is the flexibility to work wherever we want, however we want. Whether you want to work on your couch, in the terminal, or check your notifications on the go, we’ve shipped some updates for you.

GitHub CLI

Do you like to work in the command line? In September, we brought GitHub to your terminal. Having GitHub available in the command line reduces the need to switch between applications or various windows and helps simplify a bunch of automation scenarios.

The GitHub CLI allows you to run your entire GitHub workflow directly from the terminal. You can clone a repo, create, view and review PRs, open issues, assign tasks, and so much more. The CLI is available on Windows, macOS, and Linux. Best of all, the GitHub CLI is open source. Download the CLI today, check out the repo, and view the Docs for a full list of the CLI commands.

GitHub for Mobile

It doesn’t stop there. Now you can also have GitHub in your pocket with GitHub for Mobile!

This new native app makes it easy to create, view, and comment on issues, check your notifications, merge a pull request, explore, organise your tasks, and more. One of the most used features of GitHub for Mobile is push notification support. Mobile alerts mean you'll never miss a mention or review again, and they can help keep your team unblocked.

GitHub for Mobile is available on iOS and Android. Download it today if you’re not already carrying the world’s development platform in your pocket.

Oh and did you know, GitHub for Mobile isn’t just in English? It’s also available in Brazilian Portuguese, Japanese, Simplified Chinese, and Spanish.

 

GitHub Enterprise Server

With the release of GitHub Enterprise Server 2.21 in 2020, there was a host of amazing new features. There are new features for PRs, a new notification experience, and changes to issues. These are all designed to make it easier to connect, communicate, and collaborate within your organisation.

And now we’ve made Enterprise Server even better with GitHub Enterprise Server 3.0 RC. That means GitHub Actions, Packages, Code Scanning, Mobile Support, and Secret Scanning are now available in your Enterprise Server. This is the biggest release we’ve done of GitHub Enterprise Server in years, and you can install it now with full support.

Working better with automation

GitHub Actions was launched at the end of 2019 and is already the most popular CI/CD service on GitHub. Our team has continued adding features and improving ways for you to automate common tasks in your repository. GitHub Actions is so much more than simply CI/CD. Our community has really stepped up to help you automate all the things with over 6,500 open source Actions available in the GitHub Marketplace.

Some of the enhancements to GitHub Actions in 2020 include:

Workflow visualisation

We made it easy for you to see what’s happening with your Actions automation. With Workflow visualisation, you can now see a visual graph of your workflow.

This workflow visualisation allows you to easily view and understand your workflows no matter how complex they are. You can also track the progress of your workflow in real time and easily monitor what’s happening so you can access deployment targets.

On top of workflow visualisation, you can also create workflow templates. This makes it easier to promote best practices and consistency across your organisation. It also cuts down time when using the same or similar workflows. You can even define rules for these templates that work across your repo.

Self-hosted runners

Right at the end of 2019, we announced GitHub Actions supports self-hosted runner groups. It offered developers maximum flexibility and control over their workflows. Last year, we made updates to self-hosted runners, making self-hosted runners shareable across some or all of your GitHub organisations.

In addition, you can separate your runners into groups, and add custom labels to the runners in your groups. Read more about these Enterprise self-hosted runners and groups over on our GitHub Docs.

Environments & Environment Secrets

Last year we added environment protection rules and environment secrets across our CD capabilities in GitHub Actions. This new update ensures there is separation between the concerns of deployment and concerns surrounding development to meet compliance and security requirements.

Manual Approvals

With Environments, we also added the ability to pause a job that’s trying to deploy to the protected environment and request manual approval before that job continues. This unleashes a whole new raft of continuous deployment workflows, and we are very excited to see how you make use of these new features.

Other Actions Changes

Yes there’s all the big updates, and we’re committed to making small improvements too. Alongside other changes, we now have better support for whatever default branch name you choose. We updated all our starter workflows to use a new $default-branch macro.

We also added the ability to re-run all jobs after a successful run, as well as change the retention days for artifacts and logs. Speaking of logs, we updated how the logs are displayed. They are now much easier to read, have better searching, auto-scrolling, clickable URLs, support for more colours, and full screen mode. You can now disable or delete workflow runs in the Actions tab as well as manually trigger Actions runs with the workflow_dispatch trigger.

While having access to all 6,500+ actions in the marketplace helps integrate with different tools, some enterprises want to limit which actions you can invoke to a limited trusted sub-set. You can now fine-tune access to your external actions by limiting control to GitHub-verified authors, and even limit access to specific Actions.

There were so many amazing changes and updates to GitHub Actions that we couldn’t possibly include them all here. Check out the Changelog for all our GitHub Actions updates.

Working better with Security

Keeping your code safe and secure is one of the most important things for us at GitHub. That’s why we made a number of improvements to GitHub Advanced Security for 2020.

You can read all about these improvements in the special Security Highlights from 2020. There are new features such as code scanning, secret scanning, Dependabot updates, Dependency review, and NPM advisory information.

If you missed the talk at GitHub Universe on the state of security in the software industry, don’t forget to check it out. Justin Hutchings, the Staff Product Manager for Security, walks through the latest trends in security and all things DevSecOps. It’s definitely worth carving out some time over the weekend to watch this:

Working better with your communities

GitHub is about building code together. That’s why we’re always making improvements to the way you work with your team and your community.

Issues improvements

Issues are important for keeping track of your project, so we have been busy making issues work better and faster on GitHub.

You can now also link issues and PRs via the sidebar, and issues now have list autocompletion. When you’re looking for an issue to reference, you can use multiple words to search for that issue inline.

Sometimes when creating an issue, you might like to add a GIF or short video to demo a bug or new feature. Now you can do it natively by adding an *.mp4 or *.mov into your issue.

GitHub Discussions

Issues are a great place to talk about feature updates and bug fixes, but what about when you want to have an open-ended conversation or have your community help answer common questions?

GitHub Discussions is a place for you and your community to come together and collaborate, chat, or discuss something in a separate space, away from your issues. Discussions allows you to have threaded conversations. You can even convert Issues to Discussions, mark questions as answered, categorise your topics, and pin your Discussions. These features help you provide a welcoming space to new people as well as quick access to the most common discussion points.

If you are an admin or maintainer of a public repo you can enable Discussions via repo settings today. Check out our Docs for more info.

Speaking of Docs, did you know we recently published all our documentation as an open source project? Check it out and get involved today.

GitHub Sponsors

We launched GitHub Sponsors in 2019, and people have been loving this program. It’s a great way to contribute to open source projects. In 2020, we made GitHub Sponsors available in even more countries. Last year, GitHub Sponsors became available in Mexico, Czech Republic, Malta, and Cyprus.

We also added some other fancy features to GitHub Sponsors. This includes the ability to export a list of your sponsors. You can also set up webhooks for events in your sponsored account and easily keep track of everything that’s happening via your activity feed.

At GitHub Universe, we also announced Sponsors for Companies. This means organisations can now invest in open source projects via their billing arrangement with GitHub. Now is a great time to consider supporting your company’s most critical open source dependencies.

Working better with code

We’re always finding ways to help developers. As Nat said in his GitHub Universe keynote, the thing we care about the most is helping developers build amazing things. That’s why we’re always trying to make it quicker and easier to collaborate on code.

Convert pull requests to drafts

Draft pull requests are a great way to let your team know you are working on a feature. It helps start the conversation about how it should be built without worrying about someone thinking it’s ready to merge into main. We recently made it easy to convert an existing PR into a draft anytime.

Multi-line code suggestions

Not only can you do multi-line comments, you can now suggest a specific change to multiple lines of code when you’re reviewing a pull request. Simply click and drag and then edit text within the suggestion block.

Default branch naming

Alongside the entire Git community, we’ve been trying to make it easier for teams wanting to use more inclusive naming for their default branch. This also gives teams much more flexibility around branch naming. We’ve added first-tier support for renaming branches in the GitHub UI.

This helps take care of retargeting pull requests and updating branch protection rules. Furthermore, it provides instructions to people who have forked or cloned your repo to make it easier for them to update to your new branch names.

Re-directing to the new default branch

We provided re-directs so links to deleted branch names now point to the new default branch. In addition, we updated GitHub Pages to allow it to publish from any branch. We also added a preference so you can set the default branch name for your organization. If you need to stay with ‘master’ for compatibility with your existing tooling and automation, or if you prefer to use a different default branch, such as ‘development,’ you can now set this in a single place.

For new organizations to GitHub, we also updated the default to ‘main’ to reflect the new consensus among the Git community. Existing repos are also not affected by any of these changes. Hopefully we’ve helped make it easier for the people who do want to move away from the old ‘master’ terminology in Git.

Design updates for repos and GitHub UI

In mid-2020, we launched a fresh new look for the GitHub UI. The way repos are shown on the homepage and the overall look and feel of GitHub is super sleek. There's a responsive layout, improved UX in the mobile web experience, and more. We also made lots of small improvements. For example, the way your commits are shown in the pull request timeline has changed: commits were previously ordered by author date, and now they show up in chronological order in the head branch.

If you’ve been following a lot of our socials, you’ll know we’ve also got a brand new look and feel to GitHub.com. Check out these changes, and we hope it gives you fresh vibes for the future.

Go to the Dark Side

Speaking of fresh vibes, you’ve asked for it, and now it’s here! No longer will you be blinded by the light. Now you can go to the dark side with dark mode for the web.

Changelog 2020

These are just some of the highlights for 2020. We’re all looking forward to bringing you more great updates in 2021.

Keep an eye on the Changelog to stay informed and ensure you don’t miss out on any cool updates. You can also follow our changes with @GHChangelog on Twitter and see what’s coming soon by checking out the GitHub Roadmap. Tweet us your favourite changes for 2020, and tell us what you’re most excited to see in 2021.

Trident – Real-time event processing at scale

Post Syndicated from Grab Tech original https://engineering.grab.com/trident-real-time-event-processing-at-scale

Ever wondered what goes behind the scenes when you receive advisory messages on a confirmed booking? Or perhaps how you are awarded with rewards or points after completing a GrabPay payment transaction? At Grab, thousands of such campaigns targeting millions of users are operated daily by a backbone service called Trident. In this post, we share how Trident supports Grab’s daily business, the engineering challenges behind it, and how we solved them.

60-minute GrabMart delivery guarantee campaign operated via Trident

What is Trident?

Trident is essentially Grab’s in-house real-time if this, then that (IFTTT) engine, which automates various types of business workflows. The nature of these workflows could either be to create awareness or to incentivize users to use other Grab services.

If you are an active Grab user, you might have noticed new rewards or messages that appear in your Grab account. Most likely, these originate from a Trident campaign. Here are a few examples of types of campaigns that Trident could support:

  • After a user makes a GrabExpress booking, Trident sends the user a message that says something like “Try out GrabMart too”.
  • After a user makes multiple ride bookings in a week, Trident sends the user a food reward as a GrabFood incentive.
  • After a user is dropped off at his office in the morning, Trident awards the user a ride reward to use on the way back home on the same evening.
  • If a GrabMart order delivery keeps the user waiting for over an hour, Trident awards the user a free-delivery reward as compensation.
  • If the driver cancels the booking, then Trident awards points to the user as a compensation.
  • With the current COVID pandemic, when a user makes a ride booking, Trident sends a message to both the passenger and driver reminding about COVID protocols.

Trident processes events based on campaigns, which are basically a logic configuration on what event should trigger what actions under what conditions. To illustrate this better, let’s take a sample campaign as shown in the image below. This mock campaign setup is taken from the Trident Internal Management portal.

Trident process flow

This sample setup basically translates to: for each user, count his/her number of completed GrabMart orders. Once he/she reaches 2 orders, send him/her a message saying “Make one more order to earn a reward”. And if the user reaches 3 orders, award him/her the reward and send a congratulatory message. 😁

Other than the basic event, condition, and action, Trident also allows more fine-grained configuration, such as setting an overall campaign budget, adding limits to avoid over-awarding, running A/B experiments, delaying actions, and so on.

An IFTTT engine is nothing new or fancy, but building a high-throughput real-time IFTTT system poses a challenge due to the scale that Grab operates at. We need to handle billions of events and run thousands of campaigns on an average day. The amount of actions triggered by Trident is also massive.

In the month of October 2020, more than 2,000 events were processed every single second during peak hours. Across the entire month, we awarded nearly half a billion rewards, and sent over 2.5 billion communications to our end-users.

Now that we covered the importance of Trident to the business, let’s drill down on how we designed the Trident system to handle events at a massive scale and overcame the performance hurdles with optimization.

Architecture design

We designed the Trident architecture with the following goals in mind:

  • Independence: It must run independently of other services, and must not bring performance impacts to other services.
  • Robustness: All events must be processed exactly once (i.e. no event missed, no event gets double processed).
  • Scalability: It must be able to scale up processing power when the event volume surges and withstand when popular campaigns run.

The following diagram depicts how the overall system architecture looks like.

Trident architecture

Trident consumes events from multiple Kafka streams published by various backend services across Grab (e.g. GrabFood orders, Transport rides, GrabPay payment processing, GrabAds events). Given the nature of Kafka streams, Trident is completely decoupled from all other upstream services.

Each processed event is given a unique event key and stored in Redis for 24 hours. For any event that triggers an action, its key is persisted in MySQL as well. Before storing records in both Redis and MySQL, we make sure any duplicate event is filtered out. Together with the at-least-once delivery guaranteed by Kafka, we achieve exactly-once event processing.
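As a sketch of the dedup step, a 24-hour SET NX key in Redis is enough to decide whether an event was already seen. The snippet below uses go-redis and a hypothetical key prefix; the real service additionally persists action-triggering keys to MySQL in the same flow.

package trident

import (
    "context"
    "time"

    "github.com/go-redis/redis/v8"
)

// isDuplicate returns true if this event key was already processed in the last
// 24 hours. SETNX only succeeds for the first writer, so concurrent consumers
// of the same event agree on a single winner.
func isDuplicate(ctx context.Context, rdb *redis.Client, eventKey string) (bool, error) {
    ok, err := rdb.SetNX(ctx, "trident:event:"+eventKey, 1, 24*time.Hour).Result()
    if err != nil {
        return false, err
    }
    return !ok, nil // ok == false means the key already existed, i.e. a duplicate
}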

Scalability is a key challenge for Trident. To achieve high performance under massive event volume, we needed to scale on both the server level and data store level. The following mind map shows an outline of our strategies.

Outline of Trident’s scale strategy

Scale servers

Our events come from Kafka streams. There are mainly three factors that could affect the load on our system:

  1. Number of events produced in the streams (more rides, food orders, etc. results in more events for us to process).
  2. Number of campaigns running.
  3. Nature of campaigns running. The campaigns that trigger actions for more users cause higher load on our system.

There are naturally two types of approaches to scale up server capacity:

  • Distribute workload among server instances.
  • Reduce load (i.e. reduce the amount of work required to process each event).

Distribute load

Distributing workload seems trivial with the load balancing and automatic horizontal scaling based on CPU usage that cloud providers offer. However, an additional server sits idle unless there is a Kafka partition for it to consume from.

Each Kafka partition can only be consumed by one consumer within the same consumer group (our auto-scaling server group in this case). Therefore, any scaling in or out requires matching the Kafka partition configuration with the server auto-scaling configuration.

Here’s an example of a bad case of load distribution:

Kafka partitions config mismatches server auto-scaling config

And here’s an example of a good load distribution where the configurations for the Kafka partitions and the server auto-scaling match:

Kafka partitions config matches server auto-scaling config

Within each server instance, we also tried to increase processing throughput while keeping the resource utilization rate in check. Each Kafka partition consumer has multiple goroutines processing events, and the number of active goroutines is dynamically adjusted according to the event volume from the partition and time of the day (peak/off-peak).

Reduce load

You may ask how we reduced the amount of processing work for each event. First, we needed to see where we spent most of the processing time. After performing some profiling, we identified that the rule evaluation logic was the major time consumer.

What is rule evaluation?

Recall that Trident needs to operate thousands of campaigns daily. Each campaign has a set of rules defined. When Trident receives an event, it needs to check through the rules for all the campaigns to see whether there is any match. This checking process is called rule evaluation.

More specifically, a rule consists of one or more conditions combined by AND/OR Boolean operators. A condition consists of an operator with a left-hand side (LHS) and a right-hand side (RHS). The left-hand side is the name of a variable, and the right-hand side a value. A sample rule in JSON:

Country is Singapore and taxi type is either JustGrab or GrabCar.
  {
    "operator": "and",
    "conditions": [
      {
        "operator": "eq",
        "lhs": "var.country",
        "rhs": "sg"
      },
      {
        "operator": "or",
        "conditions": [
          {
            "operator": "eq",
            "lhs": "var.taxi",
            "rhs": <taxi-type-id-for-justgrab>
          },
          {
            "operator": "eq",
            "lhs": "var.taxi",
            "rhs": <taxi-type-id-for-grabcar>
          }
        ]
      }
    ]
  }

When evaluating a rule, our system loads the value of each LHS variable, evaluates it against the RHS value, and returns a boolean result (true/false) indicating whether the rule passed or not.
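The condition tree above maps naturally onto a small recursive evaluator. The Go sketch below is illustrative only and assumes all variable values are already loaded into a map; it ignores the lazy loading and weighting described next.

package trident

// Condition models one node of the rule tree: either a leaf comparison
// ("eq" with an LHS variable and an RHS value) or a boolean combination
// ("and"/"or") of child conditions.
type Condition struct {
    Operator   string      // "and", "or", "eq", ...
    LHS        string      // variable name, e.g. "var.country"
    RHS        interface{} // literal value, e.g. "sg"
    Conditions []Condition // children, for "and"/"or"
}

// Evaluate resolves the rule against a map of already-loaded variable values.
func Evaluate(c Condition, vars map[string]interface{}) bool {
    switch c.Operator {
    case "and":
        for _, child := range c.Conditions {
            if !Evaluate(child, vars) {
                return false
            }
        }
        return true
    case "or":
        for _, child := range c.Conditions {
            if Evaluate(child, vars) {
                return true
            }
        }
        return false
    case "eq":
        return vars[c.LHS] == c.RHS
    default:
        return false
    }
}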

To reduce the resources spent on rule evaluation, there are two types of strategies:

  • Avoid unnecessary rule evaluation
  • Evaluate “cheap” rules first

We implemented these two strategies with event prefiltering and weighted rule evaluation.

Event prefiltering

Just like a DB index helps speed up data look-up, having a pre-built map helped us narrow down the range of campaigns to evaluate. We loaded active campaigns from the DB every few minutes and organized them into an in-memory hash map, with the event type as key and the list of corresponding campaigns as the value. We picked event type as the key because it is very fast to determine (most of the time just a type assertion) and it distributes events in a reasonably even way.

When processing events, we just looked up the map, and only ran rule evaluation on the campaigns in the matching hash bucket. This saved us at least 90% of the processing time.
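In Go, the prefilter is little more than a periodically rebuilt map guarded for concurrent reads. The sketch below is illustrative; the Campaign type and method names are assumptions, not the actual Trident code.

package trident

import "sync"

// Campaign is a trimmed-down stand-in for a Trident campaign definition.
type Campaign struct {
    ID        int64
    EventType string
}

// prefilter indexes active campaigns by event type so that each incoming event
// only triggers rule evaluation for the campaigns in its bucket.
type prefilter struct {
    mu      sync.RWMutex
    byEvent map[string][]Campaign
}

// rebuild is called every few minutes with the active campaigns loaded from the DB.
func (p *prefilter) rebuild(campaigns []Campaign) {
    m := make(map[string][]Campaign)
    for _, c := range campaigns {
        m[c.EventType] = append(m[c.EventType], c)
    }
    p.mu.Lock()
    p.byEvent = m
    p.mu.Unlock()
}

// candidates returns only the campaigns that could possibly match this event type.
func (p *prefilter) candidates(eventType string) []Campaign {
    p.mu.RLock()
    defer p.mu.RUnlock()
    return p.byEvent[eventType]
}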

Event prefiltering
Weighted rule evaluation

Evaluating different rules comes with different costs. This is because different variables (i.e. LHS) in the rule can have different sources of values:

  1. The value is already available in memory (already consumed from the event stream).
  2. The value is the result of a database query.
  3. The value is the result of a call to an external service.

These three sources are ranked by cost:

In-memory < database < external service

We aimed to avoid evaluating expensive rules (i.e. those that require calling an external service or querying a DB) as much as possible, while ensuring the correctness of the evaluation results.

First optimization – Lazy loading

Lazy loading is a common performance optimization technique, which literally means “don’t do it until it’s necessary”.

Take the following rule as an example:

A & B

If we load the variable values for both A and B before evaluation, then we load B unnecessarily whenever A is false. Since rule evaluation usually fails early (for example, the transaction amount is less than the given minimum amount), there is no point in loading all the data beforehand. So we do lazy loading, i.e. we load data only when evaluating that part of the rule.

Second optimization – Add weight

Let’s take the same example as above, but in a different order.

B & A
Source of data for A is memory and B is external service

Now, even with lazy loading, in this ordering we always load the external data first, even though the rule may fail at the very next condition whose data is already in memory.

Since most of our campaigns are targeted, a popular condition is to check if a user is in a certain segment, which is usually the first condition that a campaign creator sets. This data resides in another service. So it becomes quite expensive to evaluate this condition first even though the next condition’s data can be already in memory (e.g. if the taxi type is JustGrab).

So we did the next phase of optimization: sorting the conditions based on the weight of their data source (low weight if the data is in memory, higher if it's in our database, and highest if it's in an external system). If AND were the only logical operator we supported, this would have been quite simple, but the presence of OR made it complex. We came up with an algorithm that sorts the evaluation order based on weight while respecting the AND/OR structure. Here's what the flowchart looks like:

Event flowchart

An example:

Conditions: A & ( B | C ) & ( D | E )

Actual result: true & ( false | false ) & ( true | true ) --> false

Weight: B < D < E < C < A

Expected check order: B, D, C

Firstly, we validate B, which is false. We cannot skip its sibling conditions here since B and C are connected by |. Next, we check D. D is true and its only sibling E is connected by |, so we mark E as "skip". Then we reach E, but since it has been marked "skip", we skip it. We still cannot determine the final result, so we continue and validate C, which is false. Now we know (B | C) is false, so the whole condition is false too, and we can stop.
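The full algorithm handles arbitrary AND/OR nesting together with the "skip" marking walked through above. The sketch below shows only the core idea of sorting conditions by the cost of loading their data and short-circuiting, so that expensive lookups are deferred; the weights and types are illustrative, not the production code.

package trident

import "sort"

// weightedCondition pairs a condition with the cost of loading its LHS value:
// 0 for in-memory, 1 for database, 2 for external service (illustrative weights).
type weightedCondition struct {
    Weight int
    Eval   func() bool // lazily loads the value and evaluates the condition
}

// evalAnd evaluates an AND group cheapest-first, so an in-memory check that
// fails spares us a database query or an external service call.
func evalAnd(conds []weightedCondition) bool {
    sort.Slice(conds, func(i, j int) bool { return conds[i].Weight < conds[j].Weight })
    for _, c := range conds {
        if !c.Eval() {
            return false // short-circuit: the remaining (more expensive) conditions never load
        }
    }
    return true
}

// evalOr is the dual: cheapest-first, stop at the first condition that passes.
func evalOr(conds []weightedCondition) bool {
    sort.Slice(conds, func(i, j int) bool { return conds[i].Weight < conds[j].Weight })
    for _, c := range conds {
        if c.Eval() {
            return true
        }
    }
    return false
}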

Sub-streams

After investigation, we learned that one particular stream we consumed produced terabytes of data per hour, which caused our CPU usage to shoot up by 30%. We found that we process only a handful of event types from that stream, so we introduced a sub-stream in between, containing only the event types we want to support. This sub-stream is populated from the main stream by another server, thereby reducing the load on Trident.

Protect downstream

While we scaled up our servers wildly, we needed to keep in mind that there were many downstream services that received more traffic. For example, we call the GrabRewards service for awarding rewards or the LocaleService for checking the user’s locale. It is crucial for us to have control over our outbound traffic to avoid causing any stability issues in Grab.

Therefore, we implemented rate limiting. There is a total rate limit configured for calling each downstream service, and the limit varies across time ranges (e.g. a tighter limit for calling a critical service during peak hours).
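In Go, this kind of outbound throttling can be done with a token-bucket limiter such as golang.org/x/time/rate. The limits and the wrapped call below are placeholders, not our production configuration.

package trident

import (
    "context"
    "time"

    "golang.org/x/time/rate"
)

// downstreamLimiter throttles calls to a single downstream service.
// 500 requests/second with a burst of 50 is purely illustrative.
var downstreamLimiter = rate.NewLimiter(rate.Limit(500), 50)

// callRewards blocks until the limiter grants a token (or the context expires)
// before calling the downstream service.
func callRewards(ctx context.Context, award func(ctx context.Context) error) error {
    ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
    defer cancel()
    if err := downstreamLimiter.Wait(ctx); err != nil {
        return err // rate limit not granted in time; the caller can retry or drop
    }
    return award(ctx)
}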

Scale data store

We have two types of storage in Trident: cache storage (Redis) and persistent storage (MySQL and others).

Scaling cache storage is straightforward, since Redis Cluster already offers everything we need:

  • High performance: Known to be fast and efficient.
  • Scaling capability: New shards can be added at any time to spread out the load.
  • Fault tolerance: Data replication makes sure that data does not get lost when any single Redis instance fails, and auto election mechanism makes sure the cluster can always auto restore itself in case of any single instance failure.

All we needed to ensure was that our cache keys hash evenly across the different shards.

As for scaling persistent data storage, we tackled it in two ways just like we did for servers:

  • Distribute load
  • Reduce load (both overall and per query)

Distribute load

There are two levels of load distribution for persistent storage: infra level and DB level. On the infra level, we split data with different access patterns into different types of storage. Then on the DB level, we further distributed read/write load onto different DB instances.

Infra level

Just like any typical online service, Trident has two types of data in terms of access pattern:

  • Online data: Frequent access. Requires quick access. Medium size.
  • Offline data: Infrequent access. Tolerates slow access. Large size.

For online data, we need to use a high-performance database, while for offline data, we can  just use cheap storage. The following table shows Trident’s online/offline data and the corresponding storage.

Trident’s online/offline data and storage

Writing of offline data is done asynchronously to minimize performance impact as shown below.

Online/offline data split

APIs that retrieve this offline data for users are configured with higher timeouts.

DB level

We further distributed load on the MySQL DB level, mainly by introducing replicas, and redirecting all read queries that can tolerate slightly outdated data to the replicas. This relieved more than 30% of the load from the master instance.

Going forward, we plan to segregate the single MySQL database into multiple databases, based on table usage, to further distribute load if necessary.

Reduce load

To reduce the load on the DB, we reduced the overall number of queries and removed unnecessary queries. We also optimized the schema and query, so that query completes faster.

Query reduction

We needed to track the usage of a campaign. Tracking is just incrementing a value against a unique key in the MySQL database. For a popular campaign, it's possible that multiple increment queries (writes) are made to the database for the same key at once, which can cause an IOPS burst. So we came up with the following algorithm to reduce the number of queries (a simplified sketch follows the list):

  • Have a fixed number of threads per instance that are allowed to make such queries to the DB.
  • The increment queries are queued onto these threads.
  • If a thread is idle (not busy querying the database), it writes the increment to the database immediately.
  • If the thread is busy, the increment is accumulated in memory.
  • When the thread becomes free, it increments the database value by the accumulated sum.
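Here is a simplified Go sketch of the in-memory accumulation: increments for a key are buffered while the writer is busy and flushed as a single query when it frees up. The table name, SQL, and structure are illustrative, not the actual Trident code.

package trident

import (
    "database/sql"
    "sync"
)

// usageWriter flushes campaign usage counters to MySQL from a single goroutine,
// accumulating increments in memory while a flush is in flight.
type usageWriter struct {
    db      *sql.DB
    mu      sync.Mutex
    pending map[string]int64 // key -> accumulated increment
    kick    chan struct{}
}

func newUsageWriter(db *sql.DB) *usageWriter {
    w := &usageWriter{db: db, pending: map[string]int64{}, kick: make(chan struct{}, 1)}
    go w.loop()
    return w
}

// Incr records one usage; it never blocks on the database.
func (w *usageWriter) Incr(key string) {
    w.mu.Lock()
    w.pending[key]++
    w.mu.Unlock()
    select {
    case w.kick <- struct{}{}:
    default: // a flush is already pending
    }
}

func (w *usageWriter) loop() {
    for range w.kick {
        w.mu.Lock()
        batch := w.pending
        w.pending = map[string]int64{}
        w.mu.Unlock()
        for key, inc := range batch {
            // Illustrative query; the real table and columns differ. Error handling elided.
            w.db.Exec("INSERT INTO campaign_usage (usage_key, value) VALUES (?, ?) ON DUPLICATE KEY UPDATE value = value + ?", key, inc, inc)
        }
    }
}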

To prevent accidental over-awarding of benefits (rewards, points, etc.), we require campaign creators to set limits. However, some campaigns don't need a limit, so the campaign creators just specify a very large number. Such popular campaigns can cause very high QPS on our database. We had a brilliant trick to address this issue: we just don't track usage if the limit is very high. Do you think people really want to limit usage when they set the per-user limit to 100,000? 😉

Query optimization

One of our requirements was to track the usage of a campaign: overall as well as per user (and more, like daily overall, daily per user, etc.). We used the following query for this purpose:

INSERT INTO … ON DUPLICATE KEY UPDATE value = value + inc

The table had a unique key index (combining multiple columns) along with a usual auto-increment integer primary key. We encountered performance issues arising from MySQL gap locks when high write QPS hit this table (i.e. when popular campaigns ran). After testing out a few approaches, we ended up making the following changes to solve the problem:

  1. Removed the auto-increment integer primary key.
  2. Converted the secondary unique key to the primary key.

Conclusion

Trident is Grab’s in-house real-time IFTTT engine, which processes events and operates business mechanisms on a massive scale. In this article, we discussed the strategies we implemented to achieve large-scale high-performance event processing. The overall ideas of distributing and reducing load may be straightforward, but there were lots of thoughts and learnings shared in detail. If you have any comments or questions about Trident, feel free to leave a comment below.

All the examples of campaigns given in this article are for demonstration purposes only; they are not real live campaigns.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub Availability Report: December 2020

Post Syndicated from Keith Ballinger original https://github.blog/2021-01-06-github-availability-report-december-2020/

Introduction

In December, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will provide a summary and follow-up details on how we addressed an incident mentioned in November’s report.

Follow-up to November 27 16:04 UTC (lasting one hour and one minute)

Upon further investigation around one of the incidents mentioned in November’s Availability Report, we discovered an edge case that triggered a large number of GitHub App token requests. This caused abnormal levels of replication lag within one of our MySQL clusters, specifically affecting the GitHub Actions service. This particular scenario resulted in amplified queries and increased the database lag, which impacted the database nodes that process GitHub App token requests.

When a GitHub Action is invoked, the Action is passed a GitHub App token to perform tasks on GitHub. In this case, the database lag resulted in the failure of some of those token requests because the database replicas did not have up to date information.

To help avoid this class of failure, we are updating the queries to prevent large quantities of token requests from overloading the database servers in the future.

In summary

Whether we’re introducing a system to manage flaky tests or improving our CI workflow, we’ve continued to invest in our engineering systems and overall reliability. To learn more about what we’re working on, visit GitHub’s engineering blog.

Building On-Call Culture at GitHub

Post Syndicated from Mary Moore-Simmons original https://github.blog/2021-01-06-building-on-call-culture-at-github/

As GitHub grows in size and our product offerings grow in number and complexity, we need to constantly evolve our on-call strategy so we can continue to be the trusted home for all developers. Expanding upon our Building GitHub blog series, this post gives you a window into one of the major steps along our continuous journey for operational excellence at GitHub.

Monolithic On-Call

Most of the GitHub products you interact with are in a large Ruby on Rails monolith. Monolithic codebases are common for many high-growth startups, and it’s a difficult situation to detangle yourself from. One of the pain points we had was problems with the on-call system for our monolith.

The biggest pain points with our monolithic on-call structure were:

  • The main GitHub monolith spans a huge number of products and features. Most engineers (understandably) weren’t familiar enough with enough of the codebase to feel confident in their ability to respond to incidents when on-call. Often, the engineer who got paged would escalate to another team, which made them feel more like a switchboard operator than an engineer.
  • The on-call rotation was large and engineers were on-call for 24 hours at a time. As a result, engineers were only on-call ~4 times per year for one day, and most never gained the context they needed to feel confident while on-call.
  • Since the monitoring and alerting for this on-call rotation was spread across most engineering teams at GitHub and people only had to experience on-call for 24 hours at a time, the monitoring and documentation was not well maintained. As a result, we had noisy alerts and poor runbooks.
  • Because most engineers didn’t feel confident with the monolithic on-call shift, the same 5-10 people who knew the platform best were involved for every production incident, which caused an imbalance in on-call responsibilities.

New On-Call Culture

To address these pain points, we made large changes to our on-call structure. We split up the monolith on-call rotation so every engineering team was on-call for the code they maintain. We had a number of logistical difficulties to overcome, but most of the hurdles we faced were cultural and educational.

Logistical Hurdles

File Ownership

The GitHub monolith contains over 16,000 files, and file ownership was often unclear or undocumented. To solve this, we rolled out a new system that associates files with services and then assigns services to teams, making it much easier to change ownership as teams change. Here's a snippet of the file-to-service mappings code we use in the monolith:

## Apps
### API
app/api/github_apps.rb     :apps
app/api/grants.rb          :apps
### Components             :apps
app/component/apps*        :apps
test/app/component/apps*   :apps

## Authzd
**/*authzd*                :authzd
app/models/permissions.rb  :authzd

You’ll see that each file in the monolith maps to a service, such as “apps” or “authzd”. We then have another file in the monolith that lists every service and maps them to teams. Here’s a snippet of that code for the apps service:

apps:
  name: GitHub Apps
  description: Allows users to build 'GitHub Apps' on top of GitHub.
  team: ecosystem-apps

This information is automatically pulled into a service we developed in-house, called the Service Catalog, which then displays the information to GitHub employees. You can search for services in the Service Catalog, which allows GitHub engineers, support, and product to track down which engineering team owns which services. Here’s an example of what it looks like for the apps service:

We asked all teams to move to this new service ownership model, implemented a linter that doesn't allow files in the monolith to be added or updated unless the ownership information is complete, and worked with engineering and leadership to assign ownership for major services that were still unowned.

Monitoring and Alerting

Monitoring and alerting was set up for the monolith as a whole, so we asked teams to create monitoring specific to their areas of responsibility. Once this was mostly completed, a group of more senior engineers researched all remaining monolith-wide alerts and then split up alerts, assigned alerts out to teams, and decommissioned alerts that were no longer necessary.

Large Scope of Teams Involved

There were over 50 engineering teams that needed to make changes to their processes in order to complete this effort. To coordinate this, we opened a GitHub issue for every team with clear checklists for the work needed. From there we did regular check-ins with teams and provided help and information so we didn’t leave anyone behind.

Cultural and Educational Hurdles

  • Adjusting with the pandemic: This was a huge change for our engineers, and seven months into the project a global pandemic hit. This caused significant compounding anxiety that negatively impacted people's ability to think critically. As a result, the project required a more high-touch, empathy-first approach than was originally expected.
  • Training touchpoints: Many of our engineers had never been on-call before and didn’t have experience with operational best practices. We designed and delivered three rounds of training, set up office hours with on-call experts, created significant tooling and documentation for engineering teams, and opened Slack channels where people could ask questions and get help.
  • Instilling work-life balance while on-call: Several engineers were anxious about the impact on-call would have on their lives. Once you have been on-call for months or years, going to the grocery store or going for a bike ride while on-call is a normal activity that you can plan for. However, for those who haven’t been on-call before, it can be daunting to design personal strategies that allow you to respond to a page within minutes and try to do everyday tasks like shower or go grocery shopping. We worked with teams to understand their anxieties, document tips and tricks from people who had experience being on-call, and work with people 1:1 who had further concerns. We also reinforced that team members are there to support each other by taking on-call for a couple hours if someone wants to go for a run or handle childcare, by being a back-up if someone misses a page, or designing a follow-the-sun rotation if their team is in different time zones. This is yet another situation where GitHub’s global remote team is a major asset – we can lean on our team members in other parts of the world to take on-call when we’re not working. Many people will not become comfortable with how to balance on-call with having a life until they spend months or even years being on-call regularly, so some of this comfort is a matter of time and not training.
  • Cultivating a blameless culture: Many engineers were anxious about performing well when they are on-call. They had concerns that they would miss a page or make a mistake and let down their team. We worked with the organization to reinforce the message that mistakes are okay and outages happen no matter how well you perform your on-call duties, but we still have a lot of work to do in this area. We are undergoing a long-term effort to continue fostering a blameless and supportive on-call culture at GitHub, which includes nurturing safe spaces to learn about on-call and celebrating people publicly who bravely worked on something they weren’t familiar with while on-call. It will be a long journey to create a sense of safety about making mistakes on the engineering teams at GitHub, but we can improve over time.
  • Meeting the criticality needs for each team: Different GitHub products are at different levels of criticality, which means some engineering teams have to respond within 5 minutes if they are paged, and some don’t have to respond until the next business day if their service breaks. Some engineers are concerned that this causes an unfair imbalance between engineering teams. However, different engineers have different interests. Some would prefer to work on more business-critical, technically complex systems with stricter uptime requirements, while others value work/life balance more than the technical complexity needed for products with strict operational requirements. Over time, engineers will select teams that have a level of operational rigor they identify most with, and this will correct itself.
  • Dedicating time to focus on long-term success: As we rolled out the change, several teams expressed concerns that they are not able to spend enough time making their on-call experience better. We provided clear documentation reinforcing that the person on-call should be focused on improving the on-call experience when they are not responding to pages. This includes updating runbooks, tuning noisy alerts, scripting/automating on-call tasks to eliminate or streamline on-call tasks for sleep-deprived engineers, and fixing underlying technical debt that makes the on-call experience worse. We communicated that teams might spend ~20% of their time on technical debt and ~20% of their time on improving the on-call experience if needed for the stability of the product, customer experience, and engineering experience. These guidelines require long-term focus from leadership. We need to continuously spread the message from Senior VPs all the way to line managers that a sustainable engineering on-call experience is critical to the success of GitHub.

Continuing the Journey

GitHub’s incident resolution time improved after the bulk of this initiative was completed, but our journey is never done. Organizations need to constantly improve their operational best practices or they’ll fall behind.

There are several long-term cultural changes discussed above that we must continue to promote for years at GitHub. We are conducting a retrospective in January to learn how we can improve rolling out large changes like these in the future, and how we can continue to improve the on-call experience for our engineers and GitHub’s stability for our customers. In addition, we are sending out regular surveys to engineers about their on-call experience and we continue to monitor GitHub’s uptime. We will continue to meet with engineering teams to discuss their on-call pain points and how to improve so that we can encourage a growth mindset in our engineering teams and help each other learn operational best practices. Everyone at GitHub is in this journey together, and we need to support each other in our drive for excellence so we can continue to be the trusted home for all developers.

Git clone: a data-driven study on cloning behaviors

Post Syndicated from Solmaz Abbaspoursani original https://github.blog/2020-12-22-git-clone-a-data-driven-study-on-cloning-behaviors/

@derrickstolee recently discussed several different git clone options, but how do those options actually affect your Git performance? Which option is fastest for your client experience? Which option is fastest for your build machines? How can these options impact server performance? If you are a GitHub Enterprise Server administrator it’s important that you understand how the server responds to these options under the load of multiple simultaneous requests.

Here at GitHub, we use a data-driven approach to answer these questions. We ran an experiment to compare these different clone options and measured the client and server behavior. It is not enough to just compare git clone times, because that is only the start of your interaction with a Git repository. In particular, we wanted to determine how these clone options change the behavior of future Git operations such as git fetch.

In this experiment, we aimed to answer the following questions:

  1. How fast are the various git clone commands?
  2. Once we have cloned a repository, what kind of impact do future git fetch commands have on the server and client?
  3. What impact do full, shallow and partial clones have on a Git server? This is mostly relevant to GitHub Enterprise Server administrators.
  4. Will the repository shape and size make any difference in the overall performance?

It is worth special emphasis that these results come from simulations performed in our controlled environments and do not capture the complex workflows used by many Git users. Depending on your workflows and repository characteristics, these results may change. Perhaps this experiment provides a framework that you could follow to measure how your workflows are affected by these options. If you would like help analyzing your workflows, feel free to engage with GitHub’s Professional Services team.

For a summary of our findings, feel free to jump to our conclusions and recommendations.

Experiment design

To maximize the repeatability of our experiment, we use open source repositories for our sample data. This way, you can compare your repository shape to the tested repositories to see which is most applicable to your scenario.

We chose to use the jquery/jquery, apple/swift, and torvalds/linux repositories. These three repositories vary in size and number of commits, blobs, and trees.

These repositories were mirrored to a GitHub Enterprise Server instance running version 2.22 on an 8-core cloud machine. We use an internal load-testing tool based on Gatling to generate Git requests against the test instance. We ran each test with a specific number of users across 5 different load generators for 30 minutes. All of our load generators use Git version 2.28.0, which uses protocol version 1 by default. Note that protocol version 2 only improves ref advertisement, so we do not expect it to make a difference in our tests.

Once a test is complete, we use a combination of Gatling results, ghe-governor and server health metrics to analyze the test.

Test repository characteristics

The git-sizer tool measures the size of Git repositories along many dimensions. In particular, we care about the total size on disk along with the count of each object type. The table below contains this information for our three test repositories.

Repo blobs trees commits Size (MB)
jquery/jquery 14,779 22,579 7,873 40
apple/swift 390,116 649,322 132,316 750
torvalds/linux 2,174,798 4,619,864 968,500 4,000

The jquery/jquery repository is a fairly small repository with only 40MB of disk space. apple/swift is a medium-sized repository with around 130 thousand commits and 750MB on disk. The torvalds/linux repository is typically the gold standard for Git performance tests on open source repositories. It uses 4 gigabytes of disk space and has close to a million commits.

Test scenarios

We care about the following clone options:

  1. Full clones.
  2. Shallow clones (--depth=1).
  3. Treeless clones (--filter=tree:0).
  4. Blobless clones (--filter=blob:none).

In addition to these options at clone time, we can also choose to fetch in a shallow way using --depth=1. Since treeless and blobless clones have their own way to reduce the objects downloaded during git fetch, we only test shallow fetches on full and shallow clones.

We organized our test scenarios into the following ten categories, labeled T1 through T10. T1 to T4 simulate four different git clone types, while T5 to T10 simulate various git fetch operations into these cloned repositories.

Test# git command Description
T1 git clone Full clone
T2 git clone --depth 1 Shallow clone
T3 git clone --filter=tree:0 Treeless partial clone
T4 git clone --filter=blob:none Blobless partial clone
T5 T1 + git fetch Full fetch in a fully cloned repository
T6 T1 + git fetch --depth 1 Shallow fetch in a fully cloned repository
T7 T2 + git fetch Full fetch in a shallow cloned repository
T8 T2 + git fetch --depth 1 Shallow fetch in a shallow cloned repository
T9 T3 + git fetch Full fetch in a treeless partially cloned repository
T10 T4 + git fetch Full fetch in a blobless partially cloned repository

In partial clones, the new blobs at the new ref tip are not downloaded until we navigate to that position and populate our working directory with those blob contents. To be a fair comparison with the full and shallow clone cases, we also have our simulation run git reset --hard origin/$branch in all T5 to T10 tests. In T5 to T8 this extra step will not have a huge impact, but in T9 and T10 it will ensure the blob downloads are included in the cost.

In all the scenarios above, a single user was also set to repeatedly change 3 random files in the repository and push them to the same branch that the other users were cloning and fetching. This simulates repository growth so the git fetch commands actually have new data to download.

Test Results

Let’s dig into the numbers to see what our experiment says.

git clone performance

The full numbers are provided in the tables below.

Unsurprisingly, shallow clone is the fastest clone for the client, followed by treeless and then blobless partial clones, and finally full clones. This performance is directly proportional to the amount of data required to satisfy the clone request. Recall that full clones need all reachable objects, blobless clones need all reachable commits and trees, and treeless clones need only all reachable commits. A shallow clone is the only clone type that does not grow at all along with the history of your repository.

The performance impact of these clone types grows in proportion to the repository size, especially the number of commits. For example, a shallow clone of torvalds/linux is four times faster than a full clone, while a treeless clone is only twice as fast and a blobless clone is only 1.5 times as fast. It is worth noting that the development culture of the Linux project promotes very small blobs that compress extremely well. We expect the performance difference to be greater for most other projects with a higher blob count or size.

As for server performance, we see that the Git CPU time per clone is higher for blobless partial clones (T4). Looking a bit closer with ghe-governor, we observe that the higher Git CPU is mainly due to the larger amount of pack-objects work in the partial clone scenarios (T3 and T4). In the torvalds/linux repository, the Git CPU time spent on pack-objects is four times higher for a treeless partial clone (T3) than for a full clone (T1). In contrast, in the smaller jquery/jquery repository, a full clone consumes more CPU per clone than the partial clones (T3 and T4). Shallow clones of all three repositories consume the lowest amount of total and Git CPU per clone.

If a full clone sends more data, then why does the server spend more CPU on a partial clone? When Git sends all reachable objects to the client, it mostly transfers the data it has on disk without decompressing or converting it. However, partial and shallow clones need to extract a subset of that data and repackage it to send to the client. We are investigating ways to reduce this CPU cost in partial clones.

The real fun starts after cloning the repository and users start developing and pushing code back up to the server. In the next section we analyze scenarios T5 through T10, which focus on the git fetch and git reset --hard origin/$branch commands.

jquery/jquery clone performance

Test# description clone avgRT (milliseconds) git CPU spent per clone
T1 full clone 2,000ms (Slowest) 450ms (Highest)
T2 shallow clone 300ms (6x faster than T1) 15ms (30x less than T1)
T3 treeless partial clone 900ms (2.5x faster than T1) 270ms (1.7x less than T1)
T4 blobless partial clone 900ms (2.5x faster than T1) 300ms (1.5x less than T1)

apple/swift clone performance

Test# description clone avgRT (seconds) git CPU spent per clone
T1 full clone 50s (Slowest) 8s (2x less than T4)
T2 shallow clone 8s (6x faster than T1) 3s (6x less than T4)
T3 treeless partial clone 16s (3x faster than T1) 13s (Similar to T4)
T4 blobless partial clone 22s (2x faster than T1) 15s (Highest)

torvalds/linux clone performance

Test# description clone avgRT (minutes) git CPU spent per clone
T1 full clone 5m (Slowest) 60s (2x less than T4)
T2 shallow clone 1.2m (4x faster than T1) 40s (3.5x less than T4)
T3 treeless partial clone 2.4m (2x faster than T1) 120s (Similar to T4)
T4 blobless partial clone 3m (1.5x faster than T1) 130s (Highest)

git fetch performance

The full fetch performance numbers are provided in the tables below, but let’s first summarize our findings.

The biggest finding is that shallow fetches are the worst possible option, in particular from full clones. The technical reason is that the existence of a "shallow boundary" disables an important performance optimization on the server. This causes the server to walk commits and trees to find what is reachable from the client's perspective. This is more expensive in the full clone case because there are more commits and trees on the client that the server must check to avoid sending duplicates. Also, as more shallow commits accumulate, the client needs to send more data to the server to describe those shallow boundaries.

The two partial clone options have drastically different behavior in the git fetch and git reset --hard origin/$branch sequence. Fetching into a blobless partial clone slows the reset command in a small but measurable way, not enough to make a big difference from the user's perspective. In contrast, fetching into a treeless partial clone adds significantly more time, because the server needs to send all trees and blobs reachable from a commit's root tree in order to satisfy the git reset --hard origin/$branch command.

Due to these extra costs as a repository grows, we strongly recommend against shallow fetches and fetching from treeless partial clones. The only recommended scenario for a treeless partial clone is for quickly cloning on a build machine that needs access to the commit history, but will delete the repository at the end of the build.

Blobless partial clones do increase the Git CPU costs on the server somewhat, but the network data transfer is much less than a full clone or a full fetch from a shallow clone. The extra CPU cost is likely to become less important if your repository has larger blobs than our test repositories. In addition, you have access to the full commit history, which might be valuable to real users interacting with these repositories.

It is also worth noting a surprising result we hit during testing. During the T9 and T10 tests for the Linux repository, our load generators ran into memory issues: under the heavy load we were generating, these scenarios triggered more automatic garbage collection (GC) runs. GC in the Linux repository is expensive and involves a full repack of all Git data. Since we were testing on a Linux client, the GC processes were launched in the background to avoid blocking our foreground commands. However, as we kept fetching we ended up with several concurrent background processes; this is not a realistic scenario but an artifact of our synthetic load testing. We ran git config gc.auto false to prevent this from affecting our test results.

It is worth noting that blobless partial clones might trigger automatic garbage collection more often than a full clone. This is a natural byproduct of splitting the data into a larger number of small requests. We have work in progress to make Git’s repository maintenance be more flexible, especially for large repositories where a full repack is too time-consuming. Look forward to more updates about that feature here on the GitHub blog.

jquery/jquery fetch performance

Test# scenario git fetch avgRT git reset --hard git CPU spent per fetch
T5 full fetch in a fully cloned repository 200ms 5ms 4ms (Lowest)
T6 shallow fetch in a fully cloned repository 300ms 5ms 18ms (4x more than T5)
T7 full fetch in a shallow cloned repository 200ms 5ms 4ms (Similar to T5)
T8 shallow fetch in a shallow cloned repository 250ms 5ms 14ms (3x more than T5)
T9 full fetch in a treeless partially cloned repository 200ms 85ms 9ms (2x more than T5)
T10 full fetch in a blobless partially cloned repository 200ms 40ms 6ms (1.5x more than T5)

apple/swift fetch performance

Test# scenario git fetch avgRT git reset --hard git CPU spent per fetch
T5 full fetch in a fully cloned repository 350ms 80ms 20ms (Lowest)
T6 shallow fetch in a fully cloned repository 1,500ms 80ms 300ms (13x more than T5)
T7 full fetch in a shallow cloned repository 300ms 80ms 20ms (Similar to T5)
T8 shallow fetch in a shallow cloned repository 350ms 80ms 45ms (2x more than T5)
T9 full fetch in a treeless partially cloned repository 300ms 300ms 70ms (3x more than T5)
T10 full fetch in a blobless partially cloned repository 300ms 150ms 35ms (1.5x more than T5)

torvalds/linux fetch performance

Test# scenario git fetch avgRT git reset --hard git CPU spent per fetch
T5 full fetch in a fully cloned repository 350ms 250ms 40ms (Lowest)
T6 shallow fetch in a fully cloned repository 6,000ms 250ms 1,000ms (25x more than T5)
T7 full fetch in a shallow cloned repository 300ms 250ms 50ms (1.3x more than T5)
T8 shallow fetch in a shallow cloned repository 400ms 250ms 80ms (2x more than T5)
T9 full fetch in a treeless partially cloned repository 350ms 1,250ms 400ms (10x more than T5)
T10 full fetch in a blobless partially cloned repository 300ms 500ms 140ms (3.5x more than T5)

What does this mean for you?

Our experiment demonstrated some performance changes between these different clone and fetch options. Your mileage may vary! Our experimental load was synthetic, and your repository shape can differ greatly from these repositories.

Here are some common themes we identified that could help you choose the right scenario for your own usage:

If you are a developer focused on a single repository, the best approach is to do a full clone and then always perform a full fetch into that clone. You might deviate to a blobless partial clone if that repository is very large due to many large blobs, as that clone will help you get started more quickly. The trade-off is that some commands such as git checkout or git blame will require downloading new blob data when necessary.

In general, calculating a shallow fetch is computationally more expensive compared to a full fetch. Always use a full fetch instead of a shallow fetch both in fully and shallow cloned repositories.

In workflows such as CI builds when there is a need to do a single clone and delete the repository immediately, shallow clones are a good option. Shallow clones are the fastest way to get a copy of the working directory at the tip commit. If you need the commit history for your build, then a treeless partial clone might work better for you than a full clone. Bear in mind that in larger repositories such as the torvalds/linux repository, it will save time on the client but it’s a bit heavier on your git server when compared to a full clone.

Blobless partial clones are particularly effective if you are using Git’s sparse-checkout feature to reduce the size of your working directory. The combination greatly reduces the number of blobs you need to do your work. Sparse-checkout does not reduce the required data transfer for shallow clones.

Notably, we did not test repositories that are significantly larger than the torvalds/linux repository. Such repositories are not really available in the open, but are becoming increasingly common for private repositories. If you feel your repository is not represented by our test repositories, then we recommend trying to replicate our experiments yourself.

As noted, our test process does not simulate a real-life situation in which users have different workflows and work on different branches. The set of Git commands analyzed in this study is also small and not representative of a user's daily Git usage. We are continuing to study these options to get a holistic view of how they change the user experience.

Acknowledgements

A special thanks to Derrick Stolee and our Professional Services team for their efforts and sponsorship of this study!

Pharos – Searching Nearby Drivers on Road Network at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/pharos-searching-nearby-drivers-on-road-network-at-scale

Have you ever wondered what happens when you click the book button when arranging a ride home? Actually, many things happen behind this simple action, and it would take days and nights to talk about all of them. So let's rephrase the question to be more precise: have you ever thought about how Grab stores and uses driver locations to allocate a driver to you? If so, you will surely find this blog post interesting as we cover how it all works in the backend.

What problems are we going to solve?

One of the fundamental problems of the ride-hailing and delivery industry is locating the nearest moving drivers in real time. Serving this request in real time poses two challenges.

Fast-moving vehicles

Vehicles are constantly moving, and drivers sometimes travel at speeds of over 20 meters per second. As shown in Figure 1a and Figure 1b, the two nearest drivers to the pick-up point (blue dot) change as time passes. To provide a high-quality allocation service, it is important to constantly track objects and update their locations at high frequency (e.g. every second).

Figure 1: Fast-moving drivers

Routing distance calculation

To satisfy business requirements, the K nearest objects need to be calculated based on routing distance instead of straight-line distance. Due to the complexity of the road network, the driver with the shortest straight-line distance may not be the optimal driver, as they could need a longer routing distance to reach the pick-up point because of detours.

Figure 2: Straight line vs routing

As shown in Figure 2, the driver at the top is deemed the nearest to the pick-up point by straight-line distance. However, the driver at the bottom is the true nearest driver by routing distance. Moreover, routing distance helps to infer the estimated time of arrival (ETA), which is an important factor for allocation: a shorter ETA reduces passenger waiting time, which in turn reduces the order cancellation rate and improves the order completion rate.

Searching for the K nearest drivers with respect to a given POI is a well-studied topic for all ride-hailing companies and can be treated as a K Nearest Neighbour (KNN) problem. Our predecessor, Sextant, searches nearby drivers using the haversine distance from driver locations to the pick-up point. By partitioning the region into grids and storing them in a distributed manner, Sextant can handle large volumes of requests with low latency. However, the nearest drivers found by haversine distance may incur long driving distances and ETAs, as illustrated in Figure 2. For more information about Sextant, refer to the paper, Sextant: Grab's Scalable In-Memory Spatial Data Store for Real-Time K-Nearest Neighbour Search.

To better address the challenges mentioned above, we present the next-generation solution, Pharos.

Figure 3: Lighthouse of Alexandria

What is Pharos?

Pharos means lighthouse in Greek. At Grab, it is a scalable in-memory solution that supports large-volume, real-time K nearest search by driving distance or ETA with high object update frequency.

In Pharos, we use OpenStreetMap (OSM) graphs to represent road networks. To support hyper-localized business requirements, the graph is partitioned by cities and verticals (e.g. the road network for a four-wheel vehicle is quite different from that for a motorbike or a pedestrian). We denote this partition key as the map ID.

Pharos loads the graph partitions at service start and stores drivers' spatial data in memory in a distributed manner to alleviate scalability issues when the graph or the number of drivers grows. This data is distributed across multiple instances (i.e. machines) with replicas for high stability. Pharos exploits Adaptive Radix Trees (ART) to store objects' locations along with their metadata.

To answer the KNN query by routing distance or ETA, Pharos uses Incremental Network Expansion (INE) starting from the road segment of the query point. During the expansion, drivers stored along the road segments are incrementally retrieved as candidates and put into the results. As the expansion actually generates an isochrone map, it can be terminated by reaching a predefined radius of distance or ETA, or even simply a maximum number of candidates.

Now that you have an overview of Pharos, let's go into its design details, starting with the architecture.

Pharos architecture

As a microservice, Pharos receives requests from the upstream, performs corresponding actions and then returns the result back. As shown in Figure 4, the Pharos architecture can be broken down into three layers: Proxy, Node, and Model.

  • Proxy layer. This layer helps to pass down the request to the right node, especially when the Node is on another machine.
  • Node layer. This layer stores the index of map IDs to models and distributes the request to the right model for execution.
  • Model layer. This layer is where the business logic is implemented; it executes the operations and returns the results.

As a distributed in-memory driver storage, Pharos is designed to handle load balancing, fault tolerance, and fast recovery.

Taking Figure 4 as an example, Pharos consists of three instances. Each instance is able to handle any request from the upstream. Whenever a request comes from the upstream, it is distributed to one of the three instances, which achieves load balancing.

Figure 4: Pharos architecture

In Pharos, each model has two replicas, stored on different instances and in different availability zones. If one instance goes down, the other two instances are still up to serve requests. The fault-tolerance module in Pharos automatically detects the reduction in replicas and creates new instances to load the graphs and rebuild the models of the missing replicas. This keeps Pharos reliable even in extreme situations.

With the architecture of Pharos in mind, let’s take a look at how it stores driver information.

Driver storage

Pharos acts as the driver store. Rather than relying on external storage, it adopts in-memory storage, which is faster and better suited to handling frequent driver position updates and retrieving driver locations for nearby-driver queries. Without loss of generality, drivers are assumed to be located on the vertices, i.e. Edge Based Nodes (EBN), of an edge-based graph.

The model is in charge of driver storage in Pharos. Driver objects are passed down from the upper layers to the model layer for storage. Each driver object contains several fields, such as the driver ID and metadata holding the driver's business-related information, e.g. driver status and particular allocation preferences.

The object also contains a Latitude and Longitude (LatLon) pair indicating the driver's current location. Very often, the LatLon pair sent by the driver is off the road (not on any existing road). Since the routing distance between the query point and drivers is computed on the road network, we need to infer which road segment (EBN) the driver is most probably on.

Converting a LatLon pair to an exact location on a road is called snapping. The model begins by finding EBNs close to the driver's location. After that, as illustrated in Figure 5, the driver's location is projected onto those EBNs by drawing perpendicular lines from the location to the EBNs. Each projected point is denoted a phantom node. As the name suggests, these nodes do not exist in the graph; they are merely in-memory representations of the snapped driver.

Each phantom node contains information about its projected location, such as the ID of the EBN it is projected onto, the projected LatLon, the projection ratio, etc. Snapping returns a list of phantom nodes in ascending order of haversine distance from the driver's LatLon. The nearest phantom node is bound to the original driver object to provide information about the driver's snapped location.

Figure 5: Snapping and phantom nodes
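To make the snapping output more concrete, here is a minimal Go sketch of what a phantom node might carry. The type and field names are illustrative assumptions based on the description above, not Pharos's actual code.

package pharos

// EBN is a minimal stand-in for an edge-based node, i.e. a directed road
// segment vertex in the edge-based graph. (Assumed shape, for illustration.)
type EBN struct {
	ID                 uint64
	StartLat, StartLon float64
	EndLat, EndLon     float64
}

// PhantomNode represents a driver location projected onto an EBN. It does not
// exist in the graph itself; it only lives in memory alongside the driver object.
type PhantomNode struct {
	EBNID           uint64  // ID of the EBN the location is projected onto
	ProjectedLat    float64 // latitude of the projected point on the road
	ProjectedLon    float64 // longitude of the projected point on the road
	ProjectionRatio float64 // relative position (0..1) of the projection along the EBN
	HaversineMeters float64 // straight-line distance from the raw LatLon to the projection
}

Snapping would return a slice of such phantom nodes sorted by HaversineMeters, and the first element would be bound to the driver object.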

To efficiently index drivers from the graph, Pharos uses ART for driver storage. Two ARTs are maintained by each model: Driver ART and EBN ART.

Driver ART is used to store the index of driver IDs to corresponding driver objects, while EBN ART is used to store the index of EBN IDs to the root of an ART, which stores the drivers on that EBN.

Bi-directional indexing between EBNs and drivers is built because efficient retrieval from driver to EBN is needed: driver locations are constantly updated. In practice, the index keys (driver IDs and EBN IDs) are both numerical. ART has better throughput for dense keys (e.g. numerical keys) than for sparse keys such as alphabetical keys, and compares well against other in-memory look-up tables (e.g. hash tables). It also uses less memory than other tree-based methods.

Figure 6 gives an example of the driver ART, assuming that driver IDs have only three digits.

Figure 6: Driver ART

After snapping, the new driver object is wrapped into an update task for execution. During execution, the model first checks whether this driver already exists, using its driver ID. If it does not exist, the model directly adds it to the driver ART and the EBN ART. If the driver already exists, the new driver object replaces the old one in the driver ART; for the EBN ART, the old driver object on the previous EBN is deleted before the new driver object is added to the current EBN.

Every insertion or deletion modifies both ARTs, which may change their roots. The model only stores the roots of the ARTs, and to prevent race conditions, a lock stops other read or write operations from accessing the ARTs while the roots are being changed.

Whenever a driver nearby request comes in, it needs to get a snapshot of driver storage, i.e. the roots of two ARTs. A simple example (Figure 7a and 7b) is used to explain how synchronization is achieved during concurrent driver update and nearby requests.

Figure 7: How ARTs change roots for synchronization

In this example, two drivers A and B are stored, and both reside on the same EBN. When a nearby request arrives, the current roots of the two ARTs are returned. While this nearby request is being processed, driver updates may come in and modify the ARTs, e.g. a new root results from an update of driver C. This driver update has no impact on ongoing nearby requests, as they are using different roots. Subsequent nearby requests will use the new ART roots to find the nearby drivers. Once the old roots are no longer used by any nearby request, they and their child nodes are ready to be garbage collected.

Pharos does not delete drivers actively. Expired drivers are removed every midnight: two new ARTs are populated with the same driver update requests over one driver Time To Live (TTL) window, and then the roots are switched at the end. Drivers with expired TTLs are no longer referenced and are ready to be garbage collected. In this way, expired drivers are removed from the driver storage.
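As a rough illustration of this copy-on-write style of synchronization, here is a minimal Go sketch of a model that holds only the two ART roots, with readers taking a snapshot and writers swapping in new roots under a lock. The types and method names are assumptions for illustration; the real ART implementation is not shown.

package pharos

import "sync"

// artNode stands in for the root node of an adaptive radix tree; the actual
// tree implementation is omitted.
type artNode struct{}

// model stores only the roots of the driver ART and the EBN ART.
type model struct {
	mu        sync.Mutex
	driverART *artNode // driver ID -> driver object
	ebnART    *artNode // EBN ID -> ART of drivers currently on that EBN
}

// snapshot returns the current roots. An in-flight nearby request keeps using
// these roots even if later updates install new ones; old roots are garbage
// collected once no request references them.
func (m *model) snapshot() (driverRoot, ebnRoot *artNode) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.driverART, m.ebnART
}

// swapRoots installs the roots produced by an insertion or deletion. The lock
// keeps concurrent readers and writers from observing a half-updated pair.
func (m *model) swapRoots(newDriverRoot, newEBNRoot *artNode) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.driverART = newDriverRoot
	m.ebnART = newEBNRoot
}

The midnight TTL cleanup described above would reuse swapRoots: two freshly built ARTs replace the old roots in a single switch.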

Driver update and nearby

Pharos mainly has two external endpoints: Driver Update and Driver Nearby. The following describes how the business logic is implemented in these two operations.

Driver update

Figure 8 demonstrates the life cycle of a driver update request from upstream. Driver update requests from upstream are distributed to the proxies by a load balancer. The chosen proxy first constructs a driver object from the request body.

RouteTable, a structure in the proxy, stores the index from map IDs to replica addresses. The proxy uses the map ID in the request as the key to check its RouteTable and gets the IP addresses of all the instances containing the model for that map ID.

The proxy then forwards the update to the replicas residing on other instances. Upon receiving the message, those instances know that the update was forwarded from another proxy, so they pass the driver object straight down to the node.

After receiving the driver object, the node sends it to the right model by checking the index between map ID and model. The rest of the update flow is the same as described in Driver Storage. Sometimes driver updates to replicas fail, e.g. the request is lost or the model does not exist; Pharos does not react to such scenarios.

As a result, data storage in Pharos does not guarantee strong consistency. In practice, Pharos favors high throughput over strong consistency of KNN query results: the update frequency is high, and slight inconsistency does not significantly affect allocation performance.

Figure 8: Driver update flow
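The RouteTable lookup and fan-out can be pictured with a small Go sketch; the structure and method below are assumptions based on the description, not Pharos's actual types.

package pharos

// routeTable is the proxy-side index from map ID (city + vertical) to the
// addresses of the instances holding that model's replicas.
type routeTable struct {
	replicas map[string][]string // map ID -> instance addresses
}

// instancesFor returns every instance hosting the model for mapID. For a
// driver update, the proxy forwards the driver object to each of them; for a
// nearby query, it picks one of them (round-robin, as described later).
func (rt *routeTable) instancesFor(mapID string) []string {
	return rt.replicas[mapID]
}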

Driver nearby

Similar to driver updates, a driver nearby request from upstream is distributed to one of the machines by the load balancer. A nearby request carries a set of filter parameters that are matched against driver metadata, so that KNN queries can support various business requirements. Note that driver metadata also carries an update timestamp; during the nearby search, drivers with an expired timestamp are filtered out.

As illustrated in Figure 9, upon receiving the nearby request, a nearby object is built and passed to the proxy layer. The proxy first checks RouteTable by map ID to see if this request can be served on the current instance. If so, the nearby object is passed to the Node layer. Otherwise, this nearby request needs to be forwarded to the instances that contain this map ID.

In this situation, one of the instances containing the map ID is selected in a round-robin fashion for load balancing. After receiving the request, the proxy of the chosen instance directly passes the nearby object to the node. Once the node layer receives the nearby object, it looks for the right model using the map ID as the key. Eventually, the nearby object reaches the model layer, where the K-nearest-driver computation takes place. The model snaps the location of the request to some phantom nodes, as described previously; these nodes are used as start nodes for the expansion later.

Figure 9: Driver nearby flow

Starting from the phantom nodes found in the Driver Nearby flow, the K nearest driver search begins. Two priority queues are used during the search: EBNPQ is used to keep track of the nearby EBNs, while driverPQ keeps track of drivers found during expansion by their driving distance to the query point.

First, a snapshot of the current driver storage is taken (the roots of the current ARTs); it reflects the driver locations on the road network at the time the nearby request arrives. From each start node, the parent EBN is found and the drivers on these EBNs are appended to driverPQ. The KNN search then expands to adjacent EBNs and appends them to EBNPQ. After iterating over all start nodes, there are some initial drivers in driverPQ and adjacent EBNs waiting to be expanded in EBNPQ.

Each time the nearest EBN is removed from EBNPQ, drivers located on this EBN are appended to driverPQ. After that, the closest driver is removed from driverPQ. If the driver satisfies all filtering requirements, it is appended to the array of qualified drivers. This step repeats until driverPQ becomes empty. During this process, if the size of qualified drivers reaches the maximum driver limit, the KNN search stops right away and qualified drivers are returned.

After driverPQ becomes empty, the adjacent EBNs of the current one are expanded, and those within the predefined range (e.g. three kilometers) are appended to EBNPQ. Then the nearest EBN is removed from EBNPQ and the drivers on that EBN are appended to driverPQ again. The whole process continues until EBNPQ becomes empty. The driver array is returned as the result of the nearby query.

Figure 10 shows the pseudo code of this KNN algorithm.

Figure 10: KNN search algorithm
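Since Figure 10's pseudocode is an image, here is a hedged Go sketch of the expansion loop as described above. The function signature, the helper callbacks (driversOn, adjacent), and the sorted-slice stand-ins for the two priority queues are all simplifying assumptions, not Pharos's real implementation.

package pharos

import "sort"

type ebnEntry struct {
	id   uint64
	dist float64 // driving distance from the query point to this EBN
}

type driverEntry struct {
	id   uint64
	dist float64 // driving distance from the query point to this driver
	ok   bool    // whether the driver passes the business filters and TTL check
}

// knnSearch drains driverPQ between EBN expansions, stops early once
// maxDrivers qualified drivers are found, and stops entirely when no EBN
// within rangeLimit remains. startEBNs are the parent EBNs of the snapped
// phantom nodes (the initial seeding is folded into the main loop here).
func knnSearch(
	startEBNs []ebnEntry,
	driversOn func(ebnID uint64) []driverEntry,
	adjacent func(ebnID uint64) []ebnEntry,
	rangeLimit float64,
	maxDrivers int,
) []uint64 {
	ebnPQ := append([]ebnEntry(nil), startEBNs...)
	var driverPQ []driverEntry
	var qualified []uint64

	for len(ebnPQ) > 0 {
		// Pop the nearest EBN (a true min-heap in the real implementation).
		sort.Slice(ebnPQ, func(i, j int) bool { return ebnPQ[i].dist < ebnPQ[j].dist })
		ebn := ebnPQ[0]
		ebnPQ = ebnPQ[1:]

		// Collect the drivers stored on this EBN.
		driverPQ = append(driverPQ, driversOn(ebn.id)...)

		// Drain driverPQ in order of driving distance.
		sort.Slice(driverPQ, func(i, j int) bool { return driverPQ[i].dist < driverPQ[j].dist })
		for _, d := range driverPQ {
			if d.ok {
				qualified = append(qualified, d.id)
				if len(qualified) >= maxDrivers {
					return qualified
				}
			}
		}
		driverPQ = driverPQ[:0]

		// Expand to adjacent EBNs that are still within the search range.
		for _, next := range adjacent(ebn.id) {
			if next.dist <= rangeLimit {
				ebnPQ = append(ebnPQ, next)
			}
		}
	}
	return qualified
}

In the real service, both queues are genuine priority queues and the distances come from the incremental network expansion over the edge-based graph, using the ART snapshot taken at the start of the request.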

What’s next?

Pharos is currently running in the production environment, where it handles requests with P99 latencies of 10ms for driver update and 50ms for driver nearby. Even though Pharos performs well, we still see some potential areas for improvement:

  • Pharos uses ART for driver storage. Even though ART has proven able to handle large volumes of driver update and driver nearby requests, the write operations (driver updates) are not carried out in parallel. Hence, we plan to explore other data structures that allow highly concurrent reads and writes, e.g. a concurrent hash table.
  • Pharos uses OSM Multi-level Dijkstra (MLD) graphs to find the K nearest drivers. As the predefined range of a nearby driver search is often a few kilometers, Pharos does not currently make use of MLD partitions or support long-distance queries. We are interested in exploiting MLD graph partitions to enable Pharos to support long-distance queries.
  • In Pharos, maps are partitioned by cities and we assume that drivers of a city operate within that city. When finding the nearby drivers, Pharos only allocates drivers of that city to the passenger. Hence, in the future, we want to enable Pharos to support cross city allocation.

We hope this blog helps you to have a closer look at how we store driver locations and how we use these locations to find nearby drivers around you.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Acknowledgements

We would like to thank Chunda Ding, Zerun Dong, and Jiang Liu for their contributions to the distributed layer used in Pharos. Their efforts make Pharos reliable and fault tolerant.

Figure 3, Lighthouse of Alexandria is taken from https://www.britannica.com/topic/lighthouse-of-Alexandria#/media/1/455210/187239 authored by Sergey Kamshylin.

Figure 5, Snapping and Phantom Nodes, is created by Minbo Qiu. We would like to thank him for the insightful elaboration of the snapping mechanism.

Cover Photo by Kevin Huang on Unsplash

Get up to speed with partial clone and shallow clone

Post Syndicated from Derrick Stolee original https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/

As your Git repositories grow, it becomes harder and harder for new developers to clone and start working on them. Git is designed as a distributed version control system. This means that you can work on your machine without needing a connection to a central server that controls how you interact with the repository. This is only fully realizable if you have all reachable data in your local repository.

What if there was a better way? Could you get started working in the repository without downloading every version of every file in the entire Git history? Git’s partial clone and shallow clone features are options that can help here, but they come with their own tradeoffs. Each option breaks at least one expectation from the normal distributed nature of Git, and you might not be willing to make those tradeoffs.

If you are working with an extremely large monorepo, then these tradeoffs are more likely to be worthwhile or even necessary to interact with Git at that scale!

Before digging in on this topic, be sure you are familiar with how Git stores your data, including commits, trees, and blob objects. I presented some of these ideas and other helpful tips at GitHub Universe in my talk, Optimize your monorepo experience.

Quick Summary

There are three ways to reduce clone sizes for repositories hosted by GitHub.

  • git clone --filter=blob:none <url> creates a blobless clone. These clones download all reachable commits and trees while fetching blobs on-demand. These clones are best for developers and build environments that span multiple builds.
  • git clone --filter=tree:0 <url> creates a treeless clone. These clones download all reachable commits while fetching trees and blobs on-demand. These clones are best for build environments where the repository will be deleted after a single build, but you still need access to commit history.
  • git clone --depth=1 <url> creates a shallow clone. These clones truncate the commit history to reduce the clone size. This creates some unexpected behavior issues, limiting which Git commands are possible. These clones also put undue stress on later fetches, so they are strongly discouraged for developer use. They are helpful for some build environments where the repository will be deleted after a single build.

Full clones

As we discuss the different clone types, we will use a common representation of Git objects:

  • Boxes are blobs. These represent file contents.
  • Triangles are trees. These represent directories.
  • Circles are commits. These are snapshots in time.

We use arrows to represent a relationship between objects. Basically, if an OID B appears inside a commit or tree A, then the object A has an arrow to the object B. If we can follow a list of arrows from an object A to another object C, then we say C is reachable from A. The process of following these arrows is sometimes referred to as walking objects.

We can now describe the data downloaded by a git clone command! The client asks the server for the latest commits, then the server provides those objects and every other reachable object. This includes every tree and blob in the entire commit history!

In this diagram, time moves from left to right. The arrows between a commit and its parents therefore go from right to left. Each commit has a single root tree. The root tree at the HEAD commit is fully expanded underneath, while the rest of the trees have arrows pointing towards these objects.

This diagram is purposefully simple, but if your repository is very large you will have many commits, trees, and blobs in your history. Likely, the historical data forms a majority of your data. Do you actually need all of it?

These days, many developers always have a network connection available as they work, so asking the server for a little more data when necessary might be an acceptable trade-off.

This is the critical design change presented by partial clone.

Partial clone

Git’s partial clone feature is enabled by specifying the --filter option in your git clone command. The full list of filter options exist in the git rev-list documentation, since you can use git rev-list --filter=<filter> --all to see which objects in your repository match the filter. There are several filters available, but the server can choose to deny your filter and revert to a full clone.

On github.com and GitHub Enterprise Server 2.22+, there are two options available:

  1. Blobless clones: git clone --filter=blob:none <url>
  2. Treeless clones: git clone --filter=tree:0 <url>

Let’s investigate each of these options.

Blobless clones

When using the --filter=blob:none option, the initial git clone will download all reachable commits and trees, and only download the blobs for commits when you do a git checkout. This includes the first checkout inside the git clone operation. The resulting object model is shown here:

The important thing to notice is that we have a copy of every blob at HEAD but the blobs in the history are not present. If your repository has a deep history full of large blobs, then this option can significantly reduce your git clone times. The commit and tree data is still present, so any subsequent git checkout only needs to download the missing blobs. The Git client knows how to batch these requests to ask the server only for the missing blobs.

Further, when running git fetch in a blobless clone, the server only sends the new commits and trees. The new blobs are downloaded only after a git checkout. Note that git pull runs git fetch and then git merge, so it will download the necessary blobs during the git merge command.

When using a blobless clone, you will trigger a blob download whenever you need the contents of a file, but you will not need one if you only need the OID of a file. This means that git log can detect which commits changed a given path without needing to download extra data.

This means that blobless clones can perform commands like git merge-base, git log, or even git log -- <path> with the same performance as a full clone.

Commands like git diff or git blame <path> require the contents of the paths to compute diffs, so these will trigger blob downloads the first time they are run. However, the good news is that after that you will have those blobs in your repository and do not need to download them a second time. Most developers only need to run git blame on a small number of files, so this tradeoff of a slightly slower git blame command is worth the faster clone and fetch times.

Blobless clones are the most widely-used partial clone option. I’ve been using them myself for months without issue.

Treeless clones

In some repositories, the tree data might be a significant portion of the history. Using --filter=tree:0, a treeless clone downloads all reachable commits, then downloads trees and blobs on demand. The resulting object model is shown here:

Note that we have all of the data at HEAD, but otherwise only have commit data. This means that the initial clone can be much faster in a treeless clone than in a blobless or full clone. Further, we can run git fetch to download only the latest commits. However, working in a treeless clone is more difficult because downloading a missing tree when needed is more expensive.

For example, a git checkout command changes the HEAD commit, usually to a commit where we do not have the root tree. The Git client then asks the server for that root tree by OID, but also for all reachable trees from that root tree. Currently, this request does not tell the server that the client already has some root trees, so the server might send many trees the client already has locally. After the trees are downloaded, the client can detect which blobs are missing and request those in a batch.

It is possible to work in a treeless clone without triggering too many requests for extra data, but it is much more restrictive than a blobless clone.

For example, history operations such as git merge-base or git log (without extra options) only use commit data. These will not trigger extra downloads.

However, if you run a file history request such as git log -- <path>, then a treeless clone will start downloading root trees for almost every commit in the history!

We strongly recommend that developers do not use treeless clones for their daily work. Treeless clones are really only helpful for automated builds when you want to quickly clone, compile a project, then throw away the repository. In environments like GitHub Actions using public runners, you want to minimize your clone time so you can spend your machine time actually building your software! Treeless clones might be an excellent option for those environments.

⚠ Warning: While writing this article, we were putting treeless clones to the test beyond the typical limits. We noticed that repositories that contain submodules behave very poorly with treeless clones. Specifically, if you run git fetch in a treeless clone, then the logic in Git that looks for changed submodules will trigger a tree request for every new commit! This behavior can be avoided by running git config fetch.recurseSubmodules false in your treeless clones. We are working on a more robust fix in the Git client.

Shallow clones

Partial clones are relatively new to Git, but there is an older feature that does something very similar to a treeless clone: shallow clones. Shallow clones use the --depth=<N> parameter in git clone to truncate the commit history. Typically, --depth=1 signifies that we only care about the most recent commits. Shallow clones are best combined with the --single-branch --branch=<branch> options as well, to ensure we only download the data for the commit we plan to use immediately.

The object model for a shallow clone is shown in this diagram:

Here, the commit at HEAD exists, but its connection to its parents and the rest of the history is severed. The commits whose parents are removed are called shallow commits and together form the shallow boundary. The commit objects themselves have not changed, but there is some metadata in the client repository directing the Git client to ignore those parent connections. All trees and blobs are downloaded for any commit that exists on the client.

Since the commit history is truncated, commands such as git merge-base or git log show different results than they would in a full clone! In general, you cannot count on them to work as expected. Recall that these commands work as expected in partial clones. Even in blobless clones, commands like git blame -- <path> will work correctly, if only a little slower than in full clones. Shallow clones don't even make that a possibility!

The other major difference is how git fetch behaves in a shallow clone. When fetching new commits, the server must provide every tree and blob that is "new" to these commits, relative to the shallow commits. This computation can be more expensive than a typical fetch, partly because the shallow boundary prevents the server from using performance features such as reachability bitmaps. Depending on how others are contributing to your remote repository, a git fetch operation in a shallow clone might end up downloading an almost-full commit history!

Here are some descriptions of things that can go wrong with shallow clones and negate their supposed value. For these reasons, we do not recommend shallow clones except for builds that delete the repository immediately afterwards. Fetching from shallow clones can cause more harm than good!

Remember the “shallow boundary” mentioned earlier? The client sends that boundary to the server during a git fetch command, telling the server that it doesn’t have all of the reachable commits behind that boundary. The client then asks for the latest commits and everything reachable from those until hitting a shallow commit in the boundary. If another user starts a topic branch below that boundary and then the shallow client fetches that topic (or worse, the topic is merged into the default branch), then the server needs to walk the full history and serve the client what amounts to almost a full clone! Further, the server needs to calculate that data without the advantage of performance features like reachability bitmaps.

Comparing Clone Options

Let's recall each of our clone options. Instead of looking at them at a pure object level, let's explore each category of object. The figures below group the data that is downloaded by each clone type. In addition to the data downloaded at clone time, let's consider the situation where some time passes and the client then runs git fetch followed by git checkout to move to a new commit. How much data is downloaded for each of these options?

Full clones download all reachable objects. Typically, blobs are responsible for most of this data.

In a partial clone, some data is not served immediately and is delayed until the client needs it. Blobless clones skip blobs except those needed at checkout time. Treeless clones skip all trees in the history in favor of downloading a full copy of the trees needed for each checkout.

[Figures compare the data downloaded in each case: blobless clone, treeless clone, and shallow clone (git clone --depth=1) followed by either git fetch or git fetch --depth=1.]

What do the numbers say?

Fellow GitHub engineer @solmazabbaspour designed and ran an experiment to compare these different clone options on a variety of open source repositories. She will post a blog post tomorrow giving full details and data for the experiment, but I’ll share the executive summary here. Here are some common themes we identified that could help you choose the right scenario for your own usage:

There are many different types of clones beyond the default full clone. If you truly need to have a distributed workflow and want all of the data in your local repository, then you should continue using full clones. If you are a developer focused on a single repository and your repository is reasonably-sized, the best approach is to do a full clone.

You might switch to a blobless partial clone if your repository is very large due to many large blobs, as that clone will help you get started more quickly. The trade-off is that some commands such as git checkout or git blame will require downloading new blob data when necessary.

In general, calculating a shallow fetch is computationally more expensive compared to a full fetch. Always use a full fetch instead of a shallow fetch both in fully and shallow cloned repositories.

In workflows such as CI builds when there is a need to do a single clone and delete the repository immediately, shallow clones are a good option. Shallow clones are the fastest way to get a copy of the working directory at the tip commit with the additional cost that fetching from these repositories is much more expensive, so we do not recommend shallow clones for developers. If you need the commit history for your build, then a treeless partial clone might work better for you than a full clone.

In general, your mileage may vary. Now that you are armed with these different options and the object model behind them, you can go and play with these kinds of clones. You should also be aware of some pitfalls of these non-full clone options:

  • Shallow clones skip the commit history. This makes commands such as git log or git merge-base unavailable. Never fetch from a shallow clone!
  • Treeless clones contain commit history, but it is very expensive to download missing trees. Thus, git log (without a path) and git merge-base are available, but commands like git log -- <path> and git blame are extremely slow and not recommended in these clones.
  • Blobless clones contain all reachable commits and trees, so Git downloads blobs when it needs access to file contents. This means that commands like git log -- <path> are available but commands like git blame are a bit slower on their first run. However, this can be a great way to get started on a very large repository with a lot of old, large blobs.
  • Full clones work as expected. The only downside is the time required to download all of that data, plus the extra disk space for all those files.

Be sure to upgrade to the latest Git version so you have all the latest performance improvements!

Visualizing GitHub’s global community

Post Syndicated from Tal Safran original https://github.blog/2020-12-21-visualizing-githubs-global-community/

This is the second post in a series about how we built our new homepage.

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How We Illustrate at GitHub
  5. How we designed the homepage and wrote the narrative

In the first post, my teammate Tobias shared how we made the 3D globe come to life, with lots of nitty gritty details about Three.js, performance optimization, and delightful touches.

But there’s another side to the story—the data! We hope you enjoy the read. ✨

Data goals

When we kicked off the project, we knew that we didn’t want to make just another animated globe. We wanted the data to be interesting and engaging. We wanted it to be real, and most importantly, we wanted it to be live.

Luckily, the data was there.

The challenge then became designing a data service that addressed the following challenges:

  1. How do we query our massive volume of data?
  2. How do we show you the most interesting bits?
  3. How do we geocode user locations in a way that respects privacy?
  4. How do we expose the computed data back to the monolith?
  5. How do we not break GitHub? 😊

Let’s begin, shall we?

Querying GitHub

So, how hard could it be to show you some recent pull requests? It turns out it’s actually very simple:

class GlobeController < ApplicationController
  def data
    pull_requests = PullRequest
      .where(open: true)
      .joins(:repositories)
      .where("repository.is_open_source = true")
      .last(10_000)

    render json: pull_requests
  end
end

Just kidding 😛

Because of the volume of data generated on GitHub every day, the size of our databases, as well as the importance of keeping GitHub fast and reliable, we knew we couldn’t query our production databases directly.

Luckily, we have a data warehouse and a fantastic team that maintains it. Data from production is fetched, sanitized, and packaged nicely into the data warehouse on a regular schedule. The data can then be queried using Presto, a flavor of SQL meant for querying large sets of data.

We also wanted the data to be as fresh as possible. So instead of querying snapshots of our MySQL tables that are only copied over once a day, we were able to query data coming from our Apache Kafka event stream that makes it into the data warehouse much more regularly.

As an example, we have an event that is reported every time a pull request is merged. The event is defined in a format called protobuf, which stands for “protocol buffer.”

Here’s what the protobuf for a merged pull request event might look like:

message PullRequestMerge {
  github.v1.entities.User actor = 1;
  github.v1.entities.Repository repository = 2;
  github.v1.entities.User repository_owner = 3;
  github.v1.entities.PullRequest pull_request = 4;
  github.v1.entities.Issue issue = 5;
}

Each field corresponds to an “entity,” each of which is defined in its own protobuf file. Here’s a snippet from the definition of a pull request entity:

message PullRequest {
  uint64 id = 1;
  string global_relay_id = 2;
  uint64 author_id = 3;

  enum PullRequestState {
    UNKNOWN = 0;
    OPEN = 1;
    CLOSED = 2;
    MERGED = 3;
  }
  PullRequestState pull_request_state = 4;

  google.protobuf.Timestamp created_at = 5;
  google.protobuf.Timestamp updated_at = 6;
}

Including an entity in an event will pass along all of the attributes defined for it. All of that data gets copied into our data warehouse for every pull request that is merged.

This means that a Presto query for pull requests merged in the past day could look like:

SELECT
  pull_request.created_at,
  pull_request.updated_at,
  pull_request.id,
  issue.number,
  repository.id
FROM kafka.github.pull_request_merge
WHERE
  day >= CAST((CURRENT_DATE - INTERVAL '1' DAY) AS VARCHAR)

There are a few other queries we make to pull in all the data we need. But as you can see, this is pretty much standard SQL that pulls in merged pull requests from the last day in the event stream.

Surfacing interesting data

We wanted to make sure that whatever data we showed was interesting, engaging, and appropriate to be spotlighted on the GitHub homepage. If the data was good, visitors would be enticed to explore the vast ecosystem of open source being built on GitHub at that given moment. Maybe they’d even make a contribution!

So how do we find good data?

Luckily our data team came to the rescue yet again. A few years ago, the Data Science Team put together a model to rank the “health” of repositories based on 30-plus features weighted by importance. A healthy repository doesn’t necessarily mean having a lot of stars. It also takes into account how much current activity is happening and how easy it is to contribute to the project, to name a few.

The end result is a numerical health score that we can query against in the data warehouse.

SELECT repository_id
FROM data_science.github.repository_health_scores
WHERE 
  score > 0.75

Combining this query with the above, we can now pull in merged pull requests from repositories with health scores above a certain threshold:

WITH
healthy_repositories AS (
  SELECT repository_id
  FROM data_science.github.repository_health_scores
  WHERE 
    score > 0.75
)

SELECT
  a.pull_request.created_at,
  a.pull_request.updated_at,
  a.pull_request.id,
  a.issue.number,
  a.repository.id
FROM kafka.github.pull_request_merge a
JOIN healthy_repositories b
ON a.repository.id = b.repository_id
WHERE
  day >= CAST((CURRENT_DATE - INTERVAL '1' DAY) AS VARCHAR)

We do some other things to ensure the data is good, like filtering out accounts with spammy behavior. But repository health scores are definitely a key ingredient.

Geocoding user-provided locations

Your GitHub profile has an optional free text field for providing your location. Some people fill it out with their actual location (mine says “San Francisco”), while others use fake or funny locations (42 users have “Middle Earth” listed as theirs). Many others choose to not list a location. In fact, two-thirds of users don’t enter anything and that’s perfectly fine with us.

For users that do enter something, we try to map the text to a real location. This is a little harder to do than using IP addresses as proxies for locations, but it was important to us to only include data that users felt comfortable making public in the first place.

In order to map the free text locations to latitude and longitude pairs, we use Mapbox’s forward geocoding API and their Ruby SDK. Here’s an example of a forward geocoding of “New York City”:

MAPBOX_OPTIONS = {
  limit: 1,
  types: %w(region place country),
  language: "en"
}

Mapbox::Geocoder.geocode_forward("New York City", MAPBOX_OPTIONS)

=> [{
  "type" => "FeatureCollection",
  "query" => ["new", "york", "city"],
  "features" => [{
    "id" => "place.15278078705964500",
    "type" => "Feature",
    "place_type" => ["place"],
    "relevance" => 1,
    "properties" => {
      "wikidata" => "Q60"
    },
    "text_en" => "New York City",
    "language_en" => "en",
    "place_name_en" => "New York City, New York, United States",
    "text" => "New York City",
    "language" => "en",
    "place_name" => "New York City, New York, United States",
    "bbox" => [-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307],
    "center" => [-73.9808, 40.7648],
    "geometry" => {
      "type" => "Point", "coordinates" => [-73.9808, 40.7648]
    },
    "context" => [{
      "id" => "region.17349986251855570",
      "wikidata" => "Q1384",
      "short_code" => "US-NY",
      "text_en" => "New York",
      "language_en" => "en",
      "text" => "New York",
      "language" => "en"
    }, {
      "id" => "country.19678805456372290",
      "wikidata" => "Q30",
      "short_code" => "us",
      "text_en" => "United States",
      "language_en" => "en",
      "text" => "United States",
      "language" => "en"
    }]
  }],
  "attribution" => "NOTICE: (c) 2020 Mapbox and its suppliers. All rights reserved. Use of this data is subject to the Mapbox Terms of Service (https://www.mapbox.com/about/maps/). This response and the information it contains may not be retained. POI(s) provided by Foursquare."
}, {}]

There is a lot of data there, but let’s focus on text, relevance, and center for now. Here are those fields for “New York City”:

result = Mapbox::Geocoder.geocode_forward("New York City", MAPBOX_OPTIONS)
result[0]["features"][0].slice("text", "relevance", "center")

=> {"text"=>"New York City", "relevance"=>1, "center"=>[-73.9808, 40.7648]}

If you use “NYC” as the query string, you get the exact same result:

result = Mapbox::Geocoder.geocode_forward("NYC", MAPBOX_OPTIONS)
result[0]["features"][0].slice("text", "relevance", "center")

=> {"text"=>"New York City", "relevance"=>1, "center"=>[-73.9808, 40.7648]}

Notice that the text is still “New York City” in this second example? That is because Mapbox is normalizing the results. We use the normalized text on the globe so viewers get a consistent experience. This also takes care of capitalization and misspellings.

The center field is an array containing the longitude and latitude of the location.

And finally, the relevance score is an indicator of Mapbox’s confidence in the result. A relevance score of 1 is the highest, but sometimes users enter locations that Mapbox is less sure about:

result = Mapbox::Geocoder.geocode_forward("Middle Earth", MAPBOX_OPTIONS)
result[0]["features"][0].slice("text", "relevance", "center")

=> {"text"=>"Earth City", "relevance"=>0.5, "center"=>[-90.4682, 38.7689]}

We discard anything with a score of less than 1, just to be confident that the locations we show are accurate.
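A minimal sketch of that filter in Ruby might look like this (the helper name and return shape are our own illustration, not the production code):

def geocode_user_location(raw_location)
  result = Mapbox::Geocoder.geocode_forward(raw_location, MAPBOX_OPTIONS)
  feature = result[0]["features"].first
  return nil if feature.nil?

  # Only keep results Mapbox is fully confident about
  return nil if feature["relevance"] < 1

  {
    name: feature["text"],         # normalized place name, e.g. "New York City"
    lon: feature["center"][0],
    lat: feature["center"][1]
  }
end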

Mapbox also provides a batch geocoding endpoint. This allows us to query multiple locations in one request:

MAPBOX_ENDPOINT = "mapbox.places-permanent"

query_string = "{San Francisco};{Berlin};{Dakar};{Tokyo};{Lima}"

Mapbox::Geocoder.geocode_forward(query_string, MAPBOX_OPTIONS, MAPBOX_ENDPOINT)

After we’ve geocoded and normalized all of the results, we create a JSON representation of the pull request and its locations so our globe JavaScript client knows how to parse it.

Here’s a pull request we recently featured that was opened in San Francisco and merged in Tokyo:

{
   "uml":"Tokyo",
   "gm":{
      "lat":35.68,
      "lon":139.77
   },
   "uol":"San Francisco",
   "gop":{
      "lat":37.7648,
      "lon":-122.463
   },
   "l":"JavaScript",
   "nwo":"mdn/browser-compat-data",
   "pr":7937,
   "ma":"2020-12-17 04:00:48.000",
   "oa":"2020-12-16 10:02:31.000"
}

We use short keys to shave off some bytes from the JSON we end up serving so the globe loads faster.
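The serialization itself isn’t shown in this post, but conceptually it’s just a key-shortening step, roughly like the sketch below (the long field names are our own reading of what the short keys stand for):

def compact_globe_entry(pr)
  {
    "uml" => pr[:merge_location],                                  # user merge location
    "gm"  => { "lat" => pr[:merge_lat], "lon" => pr[:merge_lon] }, # geocoded merge point
    "uol" => pr[:open_location],                                   # user open location
    "gop" => { "lat" => pr[:open_lat], "lon" => pr[:open_lon] },   # geocoded open point
    "l"   => pr[:language],
    "nwo" => pr[:name_with_owner],
    "pr"  => pr[:number],
    "ma"  => pr[:merged_at],
    "oa"  => pr[:opened_at]
  }
end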

Airflow, HDFS, and Munger

We run our data warehouse queries and geocoding throughout the day to ensure that the data on the homepage is always fresh.

For scheduling this work, we use another system from Apache called Airflow. Airflow lets you run scheduled jobs that contain a sequence of tasks. Airflow calls these workflows Directed Acyclic Graphs (or DAGs for short), a term borrowed from graph theory in computer science. In practice, this means tasks run one at a time in a defined order: a task is scheduled, executed, and once it finishes, the next task is scheduled and eventually executed. Tasks can also pass information along to each other.

At a high level, our DAG executes the following tasks:

  1. Query the data warehouse.
  2. Geocode locations from the results.
  3. Write the results to a file.
  4. Expose the results to the GitHub Rails app.

We covered the first two steps earlier. For writing the file, we use HDFS, which is a distributed file system that’s part of the Apache Hadoop project. The file is then uploaded to Munger, an internal service we use to expose results from the data science pipeline back to the GitHub Rails app that powers github.com.

Here’s what this might look like in the Airflow UI:

Each column in that screenshot represents a full DAG run of all of the tasks. The last column with the light green circle at the top indicates that the DAG is in the middle of a run. It’s completed the build_home_page_globe_table task (represented by a dark green box) and now has the next task write_to_hdfs scheduled (dark blue box).

Our Airflow instance runs more than just this one DAG throughout the day, so we may stay in this state for some time before the scheduler is ready to pick up the write_to_hdfs task. Eventually the remaining tasks should run. If everything ends up running smoothly, we should see all green:

Wrapping up

Hope that gives you a glimpse into how we built this!

Again, thank you to all the teams that made the GitHub homepage and globe possible. This project would not have been possible without years of investment in our data infrastructure and data science capabilities, so a special shout out to Kim, Jeff, Preston, Ike, Scott, Jamison, Rowan, and Omoju.

More importantly, we could not have done it without you, the GitHub community, and your daily contributions and projects that truly bring the globe to life. Stay tuned—we have even more in store for this project coming soon.

In the meantime, I hope to see you on the homepage soon. 😉

How we built the GitHub globe

Post Syndicated from Tobias Ahlin original https://github.blog/2020-12-21-how-we-built-the-github-globe/

GitHub is where the world builds software. More than 56 million developers around the world build and work together on GitHub. With our new homepage, we wanted to show how open source development transcends the borders we’re living in and to tell our product story through the lens of a developer’s journey.

Now that it’s live, we would love to share how we built the homepage, directly from the voices of our designers and developers. In this five-part series, we’ll discuss:

  1. How our globe is built
  2. How we collect and use the data behind the globe
  3. How we made the page fast and performant
  4. How our illustrators work with designers and engineers
  5. How we designed the homepage and wrote the narrative

At Satellite in 2019, our CEO Nat showed off a visualization of open source activity on GitHub over a 30-day span. The sheer volume and global reach was astonishing, and we knew we wanted to build on that story.

 

 The main goals we set out to achieve in the design and development of the globe were:

  • An interconnected community. We explored many different options, but ultimately landed on pull requests. It turned out to be a beautiful visualization of pull requests being opened in one part of the world and closed in another.
  • A showcase of real work happening now. We started by simply showing the pull requests’ arcs and spires, but quickly realized that we needed “proof of life.” The arcs could just as easily have been design animations instead of real work. We iterated on ways to provide more detail and found the most resonance with clear hover states that showed the pull request, repo, timestamp, language, and locations. Nat had the idea of making each line clickable, which really upleveled the experience and made it much more immersive. Read more here.
  • Attention to detail and performance. It was extremely important to us that the globe not only looked inspiring and beautiful, but that it performed well on all devices. We went through many, many iterations of refinement, and there’s still more work to be done.

Rendering the globe with WebGL

At the most fundamental level, the globe runs in a WebGL context powered by three.js. We feed it data of recent pull requests that have been created and merged around the world through a JSON file. The scene is made up of five layers: a halo, a globe, the Earth’s regions, blue spikes for open pull requests, and pink arcs for merged pull requests. We don’t use any textures: we point four lights at a sphere, use about 12,000 five-sided circles to render the Earth’s regions, and draw a halo with a simple custom shader on the backside of a sphere.

To draw the Earth’s regions, we start by defining the desired density of circles (this will vary depending on the performance of your machine—more on that later), and loop through longitudes and latitudes in a nested for-loop. We start at the south pole and go upwards, calculate the circumference for each latitude, and distribute circles evenly along that line, wrapping around the sphere:

for (let lat = -90; lat <= 90; lat += 180/rows) {
  const radius = Math.cos(Math.abs(lat) * DEG2RAD) * GLOBE_RADIUS;
  const circumference = radius * Math.PI * 2;
  const dotsForLat = circumference * dotDensity;
  for (let x = 0; x < dotsForLat; x++) {
    const long = -180 + x*360/dotsForLat;
    if (!this.visibilityForCoordinate(long, lat)) continue;

    // Setup and save circle matrix data
  }
}

To determine if a circle should be visible or not (is it water or land?) we load a small PNG containing a map of the world, parse its image data through canvas’s context.getImageData(), and map each circle to a pixel on the map through the visibilityForCoordinate(long, lat) method. If that pixel’s alpha is at least 90 (out of 255), we draw the circle; if not, we skip to the next one.
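A minimal sketch of how that lookup might work, assuming the map image has already been drawn to an offscreen canvas and its pixels read with context.getImageData() (the property names here are our own):

// Map a longitude/latitude pair to a pixel in the world-map image data
// and check its alpha channel (RGBA means 4 bytes per pixel).
visibilityForCoordinate(long, lat) {
  const { width, height, data } = this.worldMapImageData;
  const x = Math.min(width - 1, Math.floor(((long + 180) / 360) * width));
  const y = Math.min(height - 1, Math.floor(((90 - lat) / 180) * height));
  const alphaIndex = (y * width + x) * 4 + 3;
  return data[alphaIndex] >= 90;
}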

After collecting all the data we need to visualize the Earth’s regions through these small circles, we create an instance of CircleBufferGeometry and use an InstancedMesh to render all the geometry.

Making sure that you can see your own location

As you enter the new GitHub homepage, we want to make sure that you can see your own location as the globe appears, which means that we need to figure out where on Earth you are. We wanted to achieve this effect without delaying the first render behind an IP look-up, so we set the globe’s starting angle to center over Greenwich, look at your device’s timezone offset, and convert that offset to a rotation around the globe’s own axis (in radians):

const date = new Date();
const timeZoneOffset = date.getTimezoneOffset() || 0;
const timeZoneMaxOffset = 60*12;
rotationOffset.y = ROTATION_OFFSET.y + Math.PI * (timeZoneOffset / timeZoneMaxOffset);

It’s not an exact measurement of your location, but it’s quick, and does the job.

Visualizing pull requests

The main act of the globe is, of course, visualizing all of the pull requests that are being opened and merged around the world. The data engineering that makes this possible is a different topic in and of itself, and we’ll be sharing how we make that happen in an upcoming post. Here we want to give you an overview of how we’re visualizing all your pull requests.

 

Let’s focus on pull requests being merged (the pink arcs), as they are a bit more interesting. Every merged pull request entry comes with two locations: where it was opened, and where it was merged. We map these locations to our globe, and draw a bezier curve between these two locations:

const curve = new CubicBezierCurve3(startLocation, ctrl1, ctrl2, endLocation);

We have three different orbits for these curves, and the farther apart the two points are, the further out into space we pull the arc. We then use instances of TubeBufferGeometry to generate geometry along these paths, so that we can use setDrawRange() to animate the lines as they appear and disappear.
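Here’s a rough sketch of that pattern; the material, segment counts, and animation speed are placeholders rather than the production values. We build the tube along the curve once, then grow its draw range each frame so the arc appears to travel toward its destination.

// Build tube geometry along the bezier curve and add it to the scene
const geometry = new TubeBufferGeometry(curve, 44, 0.5, 8, false);
const arc = new Mesh(geometry, arcMaterial);
scene.add(arc);

// Reveal the arc over time by extending how much of the index buffer is drawn
const indexCount = geometry.index.count;
let progress = 0;

function growArc() {
  progress = Math.min(progress + 0.02, 1);
  geometry.setDrawRange(0, Math.floor(indexCount * progress));
  if (progress < 1) requestAnimationFrame(growArc);
}
growArc();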

As each line animates in and reaches its merge location, we generate and animate in one solid circle that stays put while the line is present, and one ring that scales up and immediately fades out. The ease-out easing for these animations is created by multiplying a speed (here 0.06) with the difference between the target (1) and the current value (animated.dot.scale.x), and adding that to the existing scale value. In other words, for every frame we step 6% closer to the target, and as we’re coming closer to that target, the animation will naturally slow down.

// The solid circle
const scale = animated.dot.scale.x + (1 - animated.dot.scale.x) * 0.06;
animated.dot.scale.set(scale, scale, 1);

// The landing effect that fades out
const scaleUpFade = animated.dotFade.scale.x + (1 - animated.dotFade.scale.x) * 0.06;
animated.dotFade.scale.set(scaleUpFade, scaleUpFade, 1);
animated.dotFade.material.opacity = 1 - scaleUpFade;

Creative constraints from performance optimizations

The homepage and the globe need to perform well on a variety of devices and platforms, which early on created some creative restrictions for us, and made us focus extensively on creating a well-optimized page. Although some modern computers and tablets could render the globe at 60 FPS with antialias turned on, that’s not the case for all devices, and we decided early on to leave antialias turned off and optimize for performance. This left us with a sharp and pixelated line running along the top left edge of the globe, as the globe’s highlighted edge met the darker color of the background:

This encouraged us to explore a halo effect that could hide that pixelated edge. We created one by using a custom shader to draw a gradient on the backside of a sphere that’s slightly larger than the globe, placed it behind the globe, and tilted it slightly on its side to emphasize the effect in the top left corner:

const halo = new Mesh(haloGeometry, haloMaterial);
halo.scale.multiplyScalar(1.15);
halo.rotateX(Math.PI*0.03);
halo.rotateY(Math.PI*0.03);
this.haloContainer.add(halo);

This smoothed out the sharp edge, while being a much more performant operation than turning on antialias. Unfortunately, leaving antialias off also produced a fairly prominent moiré effect as the circles making up the world get closer and closer to each other near the edges of the globe. We reduced this effect and simulated the look of a thicker atmosphere by using a fragment shader for the circles, where each circle’s alpha is a function of its distance from the camera, fading out every individual circle as it moves further away:

if (gl_FragCoord.z > fadeThreshold) {
  gl_FragColor.a = 1.0 + (fadeThreshold - gl_FragCoord.z ) * alphaFallOff;
}

Improving perceived speed

We don’t know how quickly (or slowly) the globe is going to load on a particular device, but we wanted to make sure that the header composition on the homepage is always balanced, and that you get the impression that the globe loads quickly even if there’s a slight delay before we can render the first frame.

We created a bare version of the globe using only gradients in Figma and exported it as an SVG. Embedding this SVG in the HTML document adds little overhead, but makes sure that something is immediately visible as the page loads. As soon as we’re ready to render the first frame of the globe, we transition between the SVG and the canvas element by crossfading between the two and scaling them up using the Web Animations API. Using the Web Animations API means we don’t touch the DOM at all during the transition, ensuring that it’s as stutter-free as possible.

const keyframesIn = [
      { opacity: 0, transform: 'scale(0.8)' },
      { opacity: 1, transform: 'scale(1)' }
    ];
const keyframesOut = [
      { opacity: 1, transform: 'scale(0.8)' },
      { opacity: 0, transform: 'scale(1)' }
    ];
const options = { fill: 'both', duration: 600, easing: 'ease' };

this.renderer.domElement.animate(keyframesIn, options);
const placeHolderAnim = placeholder.animate(keyframesOut, options);
placeHolderAnim.addEventListener('finish', () => {
  placeholder.remove();
});

Graceful degradation with quality tiers

We aim to maintain 60 FPS while rendering as beautiful a globe as we can, but finding that balance is tricky—there are thousands of devices out there, all performing differently depending on the browser they’re running and their mood. We constantly monitor the achieved FPS, and if we fail to maintain 55.5 FPS over the last 50 frames, we start to degrade the quality of the scene.
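The monitoring code itself isn’t shown in the post, but the idea is simple enough to sketch (the names, structure, and degradeQualityTier() hook below are our own, not GitHub’s):

// Keep a rolling window of the last 50 frame rates and downgrade the
// quality tier whenever the average drops below the target.
const FPS_TARGET = 55.5;
const SAMPLE_SIZE = 50;
const samples = [];
let lastFrameTime = performance.now();

function checkFrameRate(now) {
  samples.push(1000 / (now - lastFrameTime));
  lastFrameTime = now;
  if (samples.length > SAMPLE_SIZE) samples.shift();

  const average = samples.reduce((sum, fps) => sum + fps, 0) / samples.length;
  if (samples.length === SAMPLE_SIZE && average < FPS_TARGET) {
    degradeQualityTier(); // hypothetical hook into the tier logic shown below
    samples.length = 0;   // start sampling again at the new tier
  }
  requestAnimationFrame(checkFrameRate);
}
requestAnimationFrame(checkFrameRate);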

 

There are four quality tiers, and for every degradation we reduce the amount of expensive calculations. This includes reducing the pixel density, how often we raycast (figure out what your cursor is hovering over in the scene), and the amount of geometry that’s drawn on screen—which brings us back to the circles that make up the Earth’s regions. As we traverse down the quality tiers, we reduce the desired circle density and rebuild the Earth’s regions, here going from the original ~12,000 circles to ~8,000:

// Reduce pixel density to 1.5 (down from 2.0)
this.renderer.setPixelRatio(Math.min(AppProps.pixelRatio, 1.5));
// Reduce the amount of PRs visualized at any given time
this.indexIncrementSpeed = VISIBLE_INCREMENT_SPEED / 3 * 2;
// Raycast less often (wait for 4 additional frames)
this.raycastTrigger = RAYCAST_TRIGGER + 4;
// Draw less geometry for the Earth’s regions
this.worldDotDensity = WORLD_DOT_DENSITY * 0.65;
// Remove the world
this.resetWorldMap();
// Generate world anew from new settings
this.buildWorldGeometry();

A small part of a wide-ranging effort

These are some of the techniques that we use to render the globe, but the creation of the globe and the new homepage is part of a longer story, spanning multiple teams, disciplines, and departments, including design, brand, engineering, product, and communications. We’ll continue the deep-dive in this 5-part series, so come back soon or follow us on Twitter @GitHub for all the latest updates on this project and more.

Next up: how we collect and use the data behind the globe.

In the meantime, don’t miss out on the new GitHub globe wallpapers from the GitHub Illustration Team to enjoy the globe from your desktop or mobile device:


Love the new GitHub homepage or any of the work you see here? Join our team

Commits are snapshots, not diffs

Post Syndicated from Derrick Stolee original https://github.blog/2020-12-17-commits-are-snapshots-not-diffs/

Git has a reputation for being confusing. Users stumble over terminology and phrasing that misguides their expectations. This is most apparent in commands that “rewrite history” such as git cherry-pick or git rebase. In my experience, the root cause of this confusion is an interpretation of commits as diffs that can be shuffled around. However, commits are snapshots, not diffs!

I believe that Git becomes understandable if we peel back the curtain and look at how Git stores your repository data. After we investigate this model, we’ll explore how this new perspective helps us understand commands like git cherry-pick and git rebase.

If you want to go really deep, you should read the Git Internals chapter of the Pro Git book.

I’ll be using the git/git repository checked out at v2.29.2 as an example. Follow along with my command-line examples for extra practice.

Object IDs are hashes

The most important part to know about Git objects is that Git references each by its object ID (OID for short), providing a unique name for the object. We will use the git rev-parse <ref> command to discover these OIDs. Each object is essentially a plain-text file and we can examine its contents using the git cat-file -p <oid> command.

You might also be used to seeing OIDs given as a shorter hex string. This string is given as something long enough that only one object in the repository has an OID that matches that abbreviation. If we request the type of an object using an abbreviated OID that is too short, then we will see the list of OIDs that match:

$ git cat-file -t e0c03
error: short SHA1 e0c03 is ambiguous
hint: The candidates are:
hint: e0c03f27484 commit 2016-10-26 - contrib/buildsystems: ignore irrelevant files in Generators/
hint: e0c03653e72 tree
hint: e0c03c3eecc blob
fatal: Not a valid object name e0c03

What are these types: blob, tree, and commit? Let’s start at the bottom and work our way up.

Blobs are file contents

At the bottom of the object model, blobs contain file contents. To discover the OID for a file at your current revision, run git rev-parse HEAD:<path>. Then, use git cat-file -p <oid> to find its contents.

$ git rev-parse HEAD:README.md
eb8115e6b04814f0c37146bbe3dbc35f3e8992e0

$ git cat-file -p eb8115e6b04814f0c37146bbe3dbc35f3e8992e0 | head -n 8
[![Build status](https://github.com/git/git/workflows/CI/PR/badge.png)](https://github.com/git/git/actions?query=branch%3Amaster+event%3Apush)

Git - fast, scalable, distributed revision control system
=========================================================

Git is a fast, scalable, distributed revision control system with an
unusually rich command set that provides both high-level operations
and full access to internals.

If I edit the README.md file on my disk, then git status notices that the file has a recent modified time and hashes the contents. If the contents don’t match the current OID at HEAD:README.md, then git status reports the file as “modified on disk.” In this way, we can see if the file contents in the current working directory match the expected contents at HEAD.
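You can run this check by hand: hash the file with git hash-object and compare the result to the blob OID recorded at HEAD (a rough illustration; your second hash will differ, which is exactly why git status flags the file):

$ git rev-parse HEAD:README.md
eb8115e6b04814f0c37146bbe3dbc35f3e8992e0

$ echo "an extra line" >>README.md

$ git hash-object README.md
(a different OID, so the contents no longer match HEAD:README.md)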

Trees are directory listings

Note that blobs contain file contents, but not the file names! The names come from Git’s representation of directories: trees. A tree is an ordered list of path entries, paired with object types, file modes, and the OID for the object at that path. Subdirectories are also represented as trees, so trees can point to other trees!

We will use diagrams to visualize how these objects are related. We use boxes for blobs and triangles for trees.

$ git rev-parse HEAD^{tree}
75130889f941eceb57c6ceb95c6f28dfc83b609c

$ git cat-file -p 75130889f941eceb57c6ceb95c6f28dfc83b609c  | head -n 15
100644 blob c2f5fe385af1bbc161f6c010bdcf0048ab6671ed    .cirrus.yml
100644 blob c592dda681fecfaa6bf64fb3f539eafaf4123ed8    .clang-format
100644 blob f9d819623d832113014dd5d5366e8ee44ac9666a    .editorconfig
100644 blob b08a1416d86012134f823fe51443f498f4911909    .gitattributes
040000 tree fbe854556a4ae3d5897e7b92a3eb8636bb08f031    .github
100644 blob 6232d339247fae5fdaeffed77ae0bbe4176ab2de    .gitignore
100644 blob cbeebdab7a5e2c6afec338c3534930f569c90f63    .gitmodules
100644 blob bde7aba756ea74c3af562874ab5c81a829e43c83    .mailmap
100644 blob 05f3e3f8d79117c1d32bf5e433d0fd49de93125c    .travis.yml
100644 blob 5ba86d68459e61f87dae1332c7f2402860b4280c    .tsan-suppressions
100644 blob fc4645d5c08bd005238fc72cfa709495d8722e6a    CODE_OF_CONDUCT.md
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING
040000 tree a58410edddbdd133cca6b3322bebe4fb37be93fa    Documentation
100755 blob ca6ccb49866c595c80718d167e40cfad1ee7f376    GIT-VERSION-GEN
100644 blob 9ba33e6a141a3906eb707dd11d1af4b0f8191a55    INSTALL

Trees provide names for each sub-item. Trees also include information such as Unix file permissions, object type (blob or tree), and OIDs for each entry. We cut the output to the top 15 entries, but we can use grep to discover that this tree has a README.md entry that points to our earlier blob OID.

$ git cat-file -p 75130889f941eceb57c6ceb95c6f28dfc83b609c | grep README.md
100644 blob eb8115e6b04814f0c37146bbe3dbc35f3e8992e0    README.md

Trees can point to blobs and other trees using these path entries. Keep in mind that those relationships are paired with path names, but we will not always show those names in our diagrams.

The tree itself doesn’t know where it exists within the repository; that is the role of the objects pointing to the tree. The tree referenced by <ref>^{tree} is a special tree: the root tree. This designation is based on a special link from your commits.

Commits are snapshots

A commit is a snapshot in time. Each commit contains a pointer to its root tree, representing the state of the working directory at that time. The commit has a list of parent commits corresponding to the previous snapshots. A commit with no parents is a root commit and a commit with multiple parents is a merge commit. Commits also contain metadata describing the snapshot such as author and committer (including name, email address, and date) and a commit message. The commit message is an opportunity for the commit author to describe the purpose of that commit with respect to the parents.

For example, the commit at v2.29.2 in the Git repository describes that release, and is authored and committed by the Git maintainer.

$ git rev-parse HEAD
898f80736c75878acc02dc55672317fcc0e0a5a6

/c/_git/git ((v2.29.2))
$ git cat-file -p 898f80736c75878acc02dc55672317fcc0e0a5a6
tree 75130889f941eceb57c6ceb95c6f28dfc83b609c
parent a94bce62b99be35f2ee2b4c98f97c222e7dd9d82
author Junio C Hamano <[email protected]> 1604006649 -0700
committer Junio C Hamano <[email protected]> 1604006649 -0700

Git 2.29.2

Signed-off-by: Junio C Hamano <[email protected]>

Looking a little farther in the history with git log, we can see a more descriptive commit message talking about the change between that commit and its parent.

$ git cat-file -p 16b0bb99eac5ebd02a5dcabdff2cfc390e9d92ef
tree d0e42501b1cf65395e91e22e74f75fc5caa0286e
parent 56706dba33f5d4457395c651cf1cd033c6c03c7a
author Jeff King <[email protected]> 1603436979 -0400
committer Junio C Hamano <[email protected]> 1603466719 -0700

am: fix broken email with --committer-date-is-author-date

Commit e8cbe2118a (am: stop exporting GIT_COMMITTER_DATE, 2020-08-17)
rewrote the code for setting the committer date to use fmt_ident(),
rather than setting an environment variable and letting commit_tree()
handle it. But it introduced two bugs:

- we use the author email string instead of the committer email

- when parsing the committer ident, we used the wrong variable to
compute the length of the email, resulting in it always being a
zero-length string

This commit fixes both, which causes our test of this option via the
rebase "apply" backend to now succeed.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>

In our diagrams, we will use circles to represent commits. Notice the alliteration? Let’s review:

  • Boxes are blobs. These represent file contents.
  • Triangles are trees. These represent directories.
  • Circles are commits. These are snapshots in time.

Branches are pointers

In Git, we move around the history and make changes without referring to OIDs most of the time. This is because branches provide pointers to the commits we care about. A branch with name main is actually a reference in Git called refs/heads/main. These files literally contain hex strings referencing the OID of a commit. As you work, these references change their contents to point to other commits.

This means branches are significantly different from our previous Git objects. Commits, trees, and blobs are immutable, meaning you can’t change their contents. If you change the contents, then you get a different hash and thus a new OID referring to the new object! Branches are named by users to provide meaning, such as trunk or my-special-project. We use branches to track and share work.

The special reference HEAD points to the current branch. When we add a commit to HEAD, it automatically updates that branch to the new commit.

We can create a new branch and update our HEAD using git switch -c:

$ git switch -c my-branch
Switched to a new branch 'my-branch'
$ cat .git/refs/heads/my-branch
1ec19b7757a1acb11332f06e8e812b505490afc6
$ cat .git/HEAD
ref: refs/heads/my-branch

Notice how creating my-branch created a file (.git/refs/heads/my-branch) containing the current commit OID and the .git/HEAD file was updated to point at this branch. Now, if we update HEAD by creating new commits, the branch my-branch will update to point to that new commit!

The big picture

Let’s put all of these new terms into one giant picture. Branches point to commits, commits point to other commits and their root trees, trees point to blobs and other trees, and blobs don’t point to anything. Here is a diagram containing all of our objects all at once:

In this diagram, time moves from left to right. The arrows between a commit and its parents go from right to left. Each commit has a single root tree. HEAD points to the main branch here, and main points to the most-recent commit. The root tree at this commit is fully expanded underneath, while the rest of the trees have arrows pointing towards these objects. The reason for that is that the same objects are reachable from multiple root trees! Since these trees reference those objects by their OID (their content) these snapshots do not need multiple copies of the same data. In this way, Git’s object model forms a Merkle tree.

When we view the object model in this way, we can see why commits are snapshots: they link directly to a full view of the expected working directory for that commit!

Computing diffs

Even though commits are snapshots, we frequently look at a commit in a history view or on GitHub as a diff. In fact, the commit message frequently refers to this diff. The diff is dynamically generated from the snapshot data by comparing the root trees of the commit and its parent. Git can compare any two snapshots in time, not just adjacent commits.

To compare two commits, start by looking at their root trees, which are almost always different. Then, perform a depth-first search on the subtrees by following pairs when paths for the current tree have different OIDs. In the example below, the root trees have different values for the docs entry, so we recurse into those two trees. Those trees have different values for M.md, so those two blobs are compared line-by-line and that diff is shown. Still within docs, N.md is the same, so that is skipped and we pop back to the root tree. The root tree then sees that the things directories have equal OIDs, as do the README.md entries.

In the diagram above, we notice that the things tree is never visited, and so none of its reachable objects are visited. This way, the cost of computing a diff is relative to the number of paths with different content.
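You can watch Git perform this tree comparison yourself with git diff-tree, which walks the two root trees recursively and lists only the paths whose OIDs differ. For example, in the git/git repository (output omitted here):

# Compare the snapshots of two commits; only changed paths are listed
$ git diff-tree -r v2.29.1 v2.29.2

# Restrict the comparison (and the tree walk) to a single subtree
$ git diff-tree -r v2.29.1 v2.29.2 -- Documentation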

Now we have the understanding that commits are snapshots and we can dynamically compute a diff between any two commits. Then why isn’t this common knowledge? Why do new users stumble over this idea that a commit is a diff?

One of my favorite analogies is to think of commits as having a wave/particle duality, where sometimes they are treated like snapshots and other times they are treated like diffs. The crux of the matter really comes down to a different kind of data that’s not actually a Git object: patches.

Wait, what’s a patch?

A patch is a text document that describes how to alter an existing codebase. Patches are how extremely distributed groups can share code without using Git commits directly. You can see these being shuffled around on the Git mailing list.

A patch contains a description of the change and why it is valuable, followed by a diff. The idea is that someone could use that reasoning as a justification to apply that diff to their copy of the code.

Git can convert a commit into a patch using git format-patch. A patch can then be applied to a Git repository using git apply. This was the dominant way to share code in the early days of open source, but most projects have moved to sharing Git commits directly through pull requests.

The biggest issue with sharing patches is that the patch loses the parent information, so the new commit gets your existing HEAD as its parent. Moreover, you get a different commit even if you use the same parent as before, because the commit time changes, and the committer can change as well! This is the fundamental reason why Git has both “author” and “committer” details in the commit object.

The biggest problem with using patches is that it is hard to apply a patch when your working directory does not match the sender’s previous commit. Losing the commit history makes it difficult to resolve conflicts.

This idea of “moving patches around” has transferred into several Git commands as “moving commits around.” Instead, what actually happens is that commit diffs are replayed, creating new commits.

If commits aren’t diffs, then what does git cherry-pick do?

The git cherry-pick <oid> command creates a new commit with an identical diff to <oid> whose parent is the current commit. Git is essentially following these steps:

  1. Compute the diff between the commit <oid> and its parent.
  2. Apply that diff to the current HEAD.
  3. Create a new commit whose root tree matches the new working directory and whose parent is the commit at HEAD.
  4. Move the ref at HEAD to that new commit.

After Git creates the new commit, the output of git log -1 -p HEAD should match the output of git log -1 -p <oid>.

It is important to recognize that we didn’t “move” the commit to be on top of our current HEAD, we created a new commit whose diff matches the old commit.
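You can check this for yourself after a cherry-pick by dumping both patches and comparing them; the commit OIDs will differ, but the patch text should be identical (the <oid> below is a placeholder for whichever commit you pick):

$ git cherry-pick <oid>
$ git log -1 -p <oid>  > original.txt
$ git log -1 -p HEAD   > replayed.txt
$ diff original.txt replayed.txt    # expect only the "commit ..." header lines to differ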

If commits aren’t diffs, then what does git rebase do?

The git rebase command presents itself as a way to move commits to have a new history. In its most basic form it is really just a series of git cherry-pick commands, replaying diffs on top of a different commit.

The most important thing is that git rebase <target> will discover the list of commits that are reachable from HEAD but not reachable from <target>. You can show these yourself using git log --oneline <target>..HEAD.

Then, the rebase command simply navigates to the <target> location and starts performing git cherry-pick commands on this commit range, starting from the oldest commits. At the end, we have a new set of commits with different OIDs but similar diffs to the original commit range.

For example, consider a sequence of three commits in the current HEAD since branching off of a target branch. When running git rebase target, the common base P is computed to determine the commit list A, B, and C. These are then cherry-picked on top of target in order to construct new commits A', B', and C'.

The commits A', B', and C' are brand new commits that share a lot of information with A, B, and C, but are distinct new objects. In fact, the old commits still exist in your repository until garbage collection runs.

We can even inspect how these two commit ranges are different using the git range-diff command! I’ll use some example commits in the Git repository to rebase onto the v2.29.2 tag, then modify the tip commit slightly.

$ git checkout -f 8e86cf65816
$ git rebase v2.29.2
$ echo extra line >>README.md
$ git commit -a --amend -m "replaced commit message"
$ git range-diff v2.29.2 8e86cf65816 HEAD
1:  17e7dbbcbc = 1:  2aa8919906 sideband: avoid reporting incomplete sideband messages
2:  8e86cf6581 ! 2:  e08fff1d8b sideband: report unhandled incomplete sideband messages as bugs
    @@ Metadata
     Author: Johannes Schindelin <[email protected]>
     
      ## Commit message ##
    -    sideband: report unhandled incomplete sideband messages as bugs
    +    replaced commit message
     
    -    It was pretty tricky to verify that incomplete sideband messages are
    -    handled correctly by the `recv_sideband()`/`demultiplex_sideband()`
    -    code: they have to be flushed out at the end of the loop in
    -    `recv_sideband()`, but the actual flushing is done by the
    -    `demultiplex_sideband()` function (which therefore has to know somehow
    -    that the loop will be done after it returns).
    -
    -    To catch future bugs where incomplete sideband messages might not be
    -    shown by mistake, let's catch that condition and report a bug.
    -
    -    Signed-off-by: Johannes Schindelin <[email protected]>
    -    Signed-off-by: Junio C Hamano <[email protected]>
    + ## README.md ##
     +@@ README.md: and the name as (depending on your mood):
    + [Documentation/giteveryday.txt]: Documentation/giteveryday.txt
    + [Documentation/gitcvs-migration.txt]: Documentation/gitcvs-migration.txt
    + [Documentation/SubmittingPatches]: Documentation/SubmittingPatches
    ++extra line
     
      ## pkt-line.c ##
     @@ pkt-line.c: int recv_sideband(const char *me, int in_stream, int out)

Notice that the resulting range-diff claims that commits 17e7dbbcbc and 2aa8919906 are “equal”, which means they would generate the same patch. The second pair of commits are different, showing that the commit message changed and there is an edit to the README.md that was not in the original commit.

If you are following along, you can also see how the commit history still exists for these two commit sets. The new commits have the v2.29.2 tag as the third commit in the history while the old commits have the (earlier) v2.28.0 tag as the third commit.

$ git log --oneline -3 HEAD
e08fff1d8b2 (HEAD) replaced commit message
2aa89199065 sideband: avoid reporting incomplete sideband messages
898f80736c7 (tag: v2.29.2) Git 2.29.2

$ git log --oneline -3 8e86cf65816
8e86cf65816 sideband: report unhandled incomplete sideband messages as bugs
17e7dbbcbce sideband: avoid reporting incomplete sideband messages
47ae905ffb9 (tag: v2.28.0) Git 2.28

Since commits aren’t diffs, how does Git track renames?

If you were looking carefully at the object model, you might have noticed that Git never tracks changes between commits in the stored object data. You might have wondered “how does Git know a rename happened?”

Git doesn’t track renames. There is no data structure inside Git that stores a record that a rename happened between a commit and its parent. Instead, Git tries to detect renames during the dynamic diff calculation. There are two stages to this rename detection: exact renames and edit-renames.

After first computing a diff, Git inspects the internal model of that diff to discover which paths were added or deleted. Naturally, a file that was moved from one location to another would appear as a deletion from the first location and an add in the second. Git attempts to match these adds and deletes to create a set of inferred renames.

The first stage of this matching algorithm looks at the OIDs of the paths that were added and deleted and sees if any are exact matches. Such exact matches are paired together.

The second stage is the expensive part: how can we detect files that were renamed and edited? Git iterates through each added file and compares that file against each deleted file to compute a similarity score as a percentage of lines in common. By default, anything larger than 50% of lines in common counts as a potential edit-rename. The algorithm continues comparing these pairs until finding the maximum match.

Did you notice a problem? This algorithm runs A * D diffs, where A is the number of adds and D is the number of deletes. This is quadratic! To avoid extra-long rename computations, Git will skip this portion of detecting edit-renames if A + D is larger than an internal limit. You can modify this limit using the diff.renameLimit config option. You can also avoid the algorithm altogether by disabling the diff.renames config option.
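Both knobs are ordinary configuration and command-line options. For example:

# Raise the ceiling on the quadratic edit-rename detection
$ git config diff.renameLimit 5000

# Turn rename detection off entirely
$ git config diff.renames false

# For a single diff, require a higher similarity score (70% instead of the default 50%)
$ git diff -M70% HEAD~1 HEAD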

I’ve used my awareness of the Git rename detection in my own projects. For example, I forked VFS for Git to create the Scalar project and wanted to re-use a lot of the code but also change the file structure significantly. I wanted to be able to follow the history of these files into the versions in the VFS for Git codebase, so I constructed my refactor in two steps:

  1. Rename all of the files without changing the blobs.
  2. Replace strings to modify the blobs without changing filenames.

These two steps ensured that I can quickly use git log --follow -- <path> to see the history of a file across this rename.

$ git log --oneline --follow -- Scalar/CommandLine/ScalarVerb.cs
4183579d console: remove progress spinners from all commands
5910f26c ScalarVerb: extract Git version check
...
9f402b5a Re-insert some important instances of GVFS
90e8c1bd [REPLACE] Replace old name in all files
fb3a2a36 [RENAME] Rename all files
cedeeaa3 Remove dead GVFSLock and GitStatusCache code
a67ca851 Remove more dead hooks code
...

I abbreviated the output, but these last two commits don’t actually have a path corresponding to Scalar/CommandLine/ScalarVerb.cs; instead, Git is tracking the previous path GVFS/GVFS/CommandLine/GVFSVerb.cs because it recognized the exact-content rename from the commit fb3a2a36 [RENAME] Rename all files.

Won’t be fooled again!

You now know that commits are snapshots, not diffs! This understanding will help you navigate your experience working with Git.

Now you are armed with deep knowledge of the Git object model. You can use this knowledge to expand your skills in using Git commands or deciding on workflows for your team. In a future blog post, we will use this knowledge to learn about different Git clone options and how to reduce the data you need to get things done!