Tag Archives: Engineering

Improving GitHub code search

2021-12-08 Pavel Avgustinov

Post Syndicated from Pavel Avgustinov original https://github.blog/2021-12-08-improving-github-code-search/

Today, we are rolling out a technology preview for substantial improvements to searching code on GitHub. We want to give you an early look at our efforts and get your feedback as we iterate on helping you explore and discover code—all while saving you time and keeping you focused. Sign up for the waitlist now, and give us your feedback!

Getting started

Once the technology preview is enabled for your account, you can try it out at https://cs.github.com. Initially, we’re creating a separate interface for the new code search as we build it out, but once we’re happy with the feedback and are ready for wider adoption, we will integrate it into the main github.com experience.

At the moment, the search index covers more than five million of the most popular public repositories; in addition, you can search private repositories you have access to. Here are some things to look out for:

Easily find what you’re looking for among the top results, with smart ranking and an index that is optimized for code.
Search for an exact string, with support for substring matches and special characters, or use regular expressions (enclosed in / separators).
Scope your searches with org: or repo: qualifiers, with auto-completion suggestions in the search box.
Refine your results using filters like language:, path:, extension:, and Boolean operators (OR, NOT). Search for definitions of a symbol with symbol:.
Get your bearings quickly with additional features, like a directory tree view, symbol information for the active scope, jump-to-definition, select-to-search, and more!

The syntax is documented here, and you can press ? on any page to view available keyboard shortcuts. You can also check out the FAQs.

What’s next?

We’re excited to share our work with you as a technology preview while we iterate, and to work with you to find unique, novel use cases and workflows. What radical new idea have you always wanted to try? What feature would make you most productive? Is support for your favorite language missing? Let us know, and let’s make it happen together.

We have no shortage of ideas for what to focus on next. We’ll be growing the index until it covers every repository you can access on GitHub. We’ll experiment with scoring and ranking heuristics to see what works best, and we’ll explore what APIs and integrations would be most impactful. We’ll keep adding support for more languages to the language-specific features. But most of all, we want to listen to your feedback and build the tools you didn’t even know you needed.

The bigger picture: developer productivity at GitHub

As a developer, staying in a flow state is hard. Whenever you look up how to use a library, or have a test fail because your developer environment has diverged from CI, or need to know how an error message can arise, you are interrupted. The longer it takes to resolve the interruption, the more context you lose.

Earlier this year, we launched GitHub Copilot as a technical preview, leveraging the power of AI to let you code confidently even in unfamiliar territory. We also released Codespaces and shared how adopting them internally boosted GitHub’s own productivity. We see our improvements to code search and navigation in the context of these broader initiatives around developer productivity, as part of a unified solution.

For code search, our vision is to help every developer search, discover, navigate, and understand code quickly and intuitively. GitHub code search puts the world’s code at your fingertips: everything is just a search away. It helps you maintain a flow state by showing you the most relevant results first and helping you with auto-completion at every step. And once you get to a result page, the rich browsing experience is optimized for reading and understanding code, allowing you to make sense of unfamiliar logic quickly, even for code outside your IDE.

We plan to share more updates on our progress soon, including deep-dives on the engineering work behind code search and the developers, open source projects, and communities we rely on (special shout-out to @BurntSushi and @lemire, whose work has been fundamental to ours). In the meantime, the number of spots for the technology preview is limited, so sign up today!

GitHub Availability Report: November 2021

2021-12-01 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-12-01-github-availability-report-november-2021/

In November, we experienced one incident resulting in significant impact and degraded state of availability for core GitHub services, including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.

November 27 20:40 UTC (lasting 2 hours and 50 minutes)

We encountered a novel failure mode when processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step in a migration is to perform a rename to move the updated table into the correct place. During the final step of this migration a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas for production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state causing an increased load on healthy read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas to handle production requests which impacted the availability of core GitHub services.

During the incident mitigation, in an effort to increase capacity, we promoted all available internal replicas that were in a healthy state into the production path; however, the shift was not sufficient for full recovery. We also observed that read replicas serving production traffic would temporarily recover from their crash-recovery state only to crash again due to load. Based on this crash-recovery loop, we chose to prioritize data integrity over site availability by proactively removing production traffic from broken replicas until they were able to successfully process the table rename. Once the replicas recovered, we were able to move them back into production and restore enough capacity to return to normal operations.

Throughout the incident, write operations remained healthy and we have verified there was no data corruption.

To address this class of failure and reduce time to recover in the future, we continue to prioritize our functional partitioning efforts. Partitioning the cluster adds resiliency given migrations can then be run in canary mode on a single shard—reducing the potential impact of this failure mode. Additionally, we are actively updating internal procedures to increase the amount each cluster is over-provisioned.

As next steps, we’re continuing to investigate the specific failure scenario, and have paused schema migrations until we know more on safeguarding against this issue. As we continue to test our migration tooling, we are classifying opportunities to improve it during such scenarios.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

Using ChatOps to help Actions on-call engineers

2021-12-01 Yaswanth Anantharaju

Post Syndicated from Yaswanth Anantharaju original https://github.blog/2021-12-01-using-chatops-to-help-actions-on-call-engineers/

At GitHub, we use ChatOps to help us collaborate seamlessly. They’re implemented using our favorite chatbot, Hubot. Running a ChatOps command is similar to running commands on your terminal, except that teammates can see what you ran and see the results if the commands are invoked from Slack. This enables real time collaboration. This is especially useful in handling incidents where the primary goal is to mitigate disruption to users and find the root cause quickly.

Keeping an application healthy isn’t the responsibility of some mysterious service delivery team anymore––the responsibility lies in the hands of the engineers who build it. Besides working on the awesome features you see at GitHub, our engineers also contribute their time to keeping our services healthy and helping customers. Let’s talk about a few days in the life of a new Actions on-call engineer: Mona. We’ll take a look at how Mona responds to a variety of incidents, and you’ll see how ChatOps empowers her to remediate incidents quickly and effectively.

Empowering support and on-call engineers

Mona receives a support ticket from a customer who has been having trouble with a few of their Actions runs.

She sees URLs for the runs in the support ticket, but since they belong to a private repository she can’t access them without customer consent, as a customer privacy measure. She needs to quickly identify ways to find the root cause of the issue and help the customer.

Luckily, we have an automated ChatOps tool called “Hubot” for that! Mona asks Hubot about a URL, and Hubot does the analysis for her.

chatops screenshot: Mona runs `.actions check plan` against a URL

She is able to identify the issue with no effort, and now she has a huge head start in her investigation.

Behind the scenes, Hubot spins up multiple parallel queries to our log aggregation tool, Kusto, performs complex analysis, and reports back on its findings. In this way, we’ve automated a common support procedure into a handy and secure ChatOps command. As a result, Mona was able to diagnose the issue in a fraction of the time it would have taken her to run the necessary queries herself, and she didn’t even need to provision additional resources or DB access for herself to do so—Hubot took care of all of that for her.

Let’s look at another example in which ChatOps commands helped Mona quickly diagnose the impact of some errors. On another fine day, GitHub Actions runs are failing with an error and Mona wants to quickly find the impact. This is a job for ChatOps!

chatops screenshot: Mona runs `.actions kusto annotation "Process completed with exit code 1."`

GitHub Actions telemetry and logs flow into Splunk and Azure Data Explorer (using Kusto Query Language), so Mona can use ChatOps to execute common Kusto queries that help her gain visibility into error logs. Thanks to this ChatOps command, Mona doesn’t have to write the Kusto query herself. All she has to do is execute the ChatOps command, and she’ll get a link to the right query so she can start her investigation.

On-call rotations typically involve analyzing the same things repeatedly—common problems arise with common investigation and remediation steps. This is where ChatOps can help. At GitHub, we take a set of well-defined tasks that we typically perform when incidents occur and codify them into ChatOps. On-call engineers can then reach for these ChatOps tools to greatly increase the speed with which they are able to diagnose and fix issues.

Finding the needle in the haystack

Mona is pretty new to the team, and she loves that there’s a lot of internal documentation on various topics. But soon she realizes there is too much documentation to sort through effectively, especially when remediating high-severity incidents. She is told many times by other engineers on the team, “I know it’s there, but I’m not sure where. It’s definitely somewhere though!” Who can relate?

At GitHub, we invest a lot in documentation. We maintain architecture decision records, or ADRs, across different teams. Besides our repository content, sometimes there are treasures hiding in GitHub Issues. It can be hard to find relevant documentation efficiently, especially information relevant to a specific team, like the GitHub Actions team. This is another place where ChatOps helps. We have a ChatOps command that searches only repositories that are relevant to the Actions team and shows the relevant docs.

chatops screenshot: Mona runs `.actions search GITHUB_TOKEN`

This ChatOps command helps her during on-call shifts as well. In another incident, automated failover of the database isn’t working, and Mona has to manually failover the database. She is sure the instructions she needs are somewhere, but where exactly? Mona types a quick .actions search failover database command into Slack.

chatops screenshot: Mona runs `.actions search failover database`

Voilà! The playbook has been served! A playbook is a set of manual steps for troubleshooting a production issue.

Automating playbooks

Mona has now found the playbooks, but she is overwhelmed by the information.

Just like every service within GitHub, GitHub Actions has its own well-defined SLOs (service level objectives).GitHub’s SLOs are behavior-based rather than resource-based.

However, we also use resource-based alerts, and these sometimes help in quickly root-causing a symptom-based alert. For example, one alert we might see is our “run start delay” SLO signal coupled with high CPU alerts on one of our databases. That is a very solid path toward mitigating and root-causing the incident.

Mona is paged due to a resource-based alert, “database is unhealthy.” There is a lengthy playbook on logical steps to execute for finding the root cause. There could potentially be a wide range of impact from this, and this is not the time to go through manual steps in the playbooks.

She tries to figure out if there is a ChatOps command that would help her by invoking .actions

chatops screeshot: Mona invokes `.actions`

One of the results that Hubot returns, .actions check, seems interesting. Mona wants to check the health of the database and learn more about the issues that are happening.

She invokes help for that command with .actions check help.

chatops screenshot: Mona invokes `.actions check help`

Near the bottom of the list is the sql command, which is exactly what she wants! She then executes the command for the specific database:

chatops screenshot: Mona executes `.actions check sql service_configuration service-ghub-eus2-3`

The root cause is served: worker percentage is high. A single session is blocking others and is waiting on PAGELATCH_SH.

This particular analysis is a simple anomaly-based analysis. Mona looks at the CPU, memory, and worker percentage data between the requested time period and compares that with historic data by using IQR to find outliers. Hubot then switches to analysis of that specific resource and digs in further to find the potential root cause. In the end, Hubot suggests possible mitigation steps and links all resources and queries to do any further analysis.

When there is an active production issue affecting customers, the focus of on-call is to quickly mitigate and root cause. There are always subject matter experts who know more about certain aspects of the code than anyone else, and their mitigation steps and deep insights into potential root causes are generally documented in playbooks. However, during an incident, following the playbook can be challenging. Playbooks can turn out to be huge, with various logical steps for investigation. More and more of GitHub’s incident response activities are moving from static playbooks to ChatOps.

It’s scenarios like these where ChatOps shines. Subject matter experts codify their knowledge by building their time-tested debugging practices into re-usable ChatOps that any on-call engineer can quickly and easily take advantage of. This gives a huge head start to on-call, and they can then take the analysis further and contribute back to the ChatOps, making it even better!

In short, these kinds of reports give a crucial jump-start for on-call and help prevent customer impact with early detection and mitigation.

Looking to the future

You can multiply the impact of your domain experts by building their common workflows into ChatOps. As more and more engineers take advantage of ChatOps during their on-call rotation, ChatOps will get iterated on again and again. In this way, ChatOps can become a sort of living incident playbook that speeds up incident investigation and remediation.

Besides helping the company grow, ChatOps also saves domain experts from doing the same tasks repeatedly. And it’s not just about root cause analysis—we also use ChatOps to deploy, perform mitigations, recycle VMs, upgrade databases, and much more!

If you share the same vision, join us in leveraging ChatOps as living incident playbooks and you’ll see the quality of life for on-call engineers go up, while the time it takes to remediate incidents goes down.

5 DevOps tips to speed up your developer workflow

2021-11-30 Damian Brady

Post Syndicated from Damian Brady original https://github.blog/2021-11-30-5-devops-tips-to-speed-up-your-developer-workflow/

TL;DR: From learning YAML to scripting with Bash, here are a few simple tips for developers who want to speed up their workflows.

From CI/CD to containerization management and server provisioning, DevOps gets a lot of buzz in tech today. You could even say that it’s a buzz … word.

As a developer, you might be part of a DevOps team, but you’re focused on building great software, not necessarily provisioning servers and managing containers.

Even still, a lot of what developers, DevOps engineers, and IT teams handle in today’s software development life cycle is focused on tools, testing, automations, and server orchestration. And, that’s even more true if you’re a team of one or engaging in a big open source project.

Here are five DevOps tips for any developer looking to work smarter and faster.

Tip #1: A little YAML can make frontend work easier

Initially released in 2001, YAML has become one of the defacto languages for a lot of declarative automation—and it’s commonly used in DevOps and development work for an array of frontend configurations, automations, and more.

YAML, which stands for Yet Another Markup Language, is a superset of JSON and is notable for being a human readable language. That means it focuses less on characters, like brackets, braces, and quotes ({}, [], “).

Here’s why this matters: Learning YAML (or even stepping up your YAML skills) makes it easier to store configurations for your own applications, like your settings in an easy-to-write and easy-to-read language.

For this reason, you’re likely to come across YAML files anywhere from enterprise development workflows to open source projects—and yes, you’ll see plenty of YAML files on GitHub (it powers a product we’re pretty fond of: GitHub Actions, but more on this later).

Whether you can apply YAML directly to your day-to-day dev workflows or leverage different tools that use YAML, there are some pretty big benefits to getting started with this language—or stepping up your YAML skills.

Looking to learn more about YAML? Try the Learn YAML in Y Minutes guide.

Tip #2: A few DevOps tools to keep you moving fast

Let’s clear up one thing first: “DevOps tools” is an umbrella term that covers everything from cloud platforms, server orchestration tools, code management, version control, and dozens of other things.

So when we talk about “DevOps tools,” we’re really talking about technologies that make it easier to write, test, host, and release software, as well as reduce any worries around unexpected failures.

Here are three “DevOps tools” that can speed up your workflows and let you focus on building great software.

Git

You’re on the GitHub Blog, so we’re pretty sure you’re familiar with Git as a version control system and distributed source code management tool. It’s a mainstay of developers and a popular DevOps tool.

Here’s why: Git makes version control easy and gives teams a straightforward way to collaborate, experiment with different branches, and merge new features into the main software branch.

Learn how Git works >

Cloud-hosted integrated development environments (IDE)

I know, I know, saying cloud-hosted integrated development environments, or cloud IDEs, out loud is a bit of a mouthful (thank you, marketing). But these platforms are something you should start exploring immediately, if you haven’t already.

Here’s why: Cloud IDEs are fully hosted developer environments that let you write, run, and debug code—and they make spinning up new, preconfigured environments fast. Do you need proof? We launched our own cloud IDE called Codespaces earlier this year and started using it internally to build GitHub. It used to take us up to 45 minutes to spin up new developer environments—now it takes 10 seconds :mindblown:.

Cloud IDEs give you a super simple way to quickly spin up new, pre-configured development environments (and disposable development environments). Also, since they’re hosted in the cloud, you don’t need to worry about how powerful the computer you’re coding on is (friendly shout out here goes to the intrepid folks who have started coding on tablets).

Picture this: Your laptop fries itself (which has happened to me once or twice). You might have versions of npm, tools for connecting to your cloud provider, and any number of other configurations that you just lost. If you use a cloud IDE, you can spin up an environment in the cloud with all of your configurations, and that’s a magical thing to see.

Learn how cloud IDEs work >

Containers

If you don’t want to use a cloud IDE, dev containers are something you can use locally or in the cloud. Containers have exploded in popularity over the past decade for their utility in microservices architectures, CI/CD, and cloud-native application development, among other things. By nature, containers are lightweight and efficient making it easy to build, test, stage, and deploy software.

Learning the basics of containerization can be really handy—especially when it comes to testing your code in a lightweight environment that imitates your production environment. If you need to upgrade a library or try using an application on the next version of Node, you can do that really easily with containers before you hit production.

This can be especially useful for ”shifting left,” which is an important DevOps strategy. Catching issues or problems before you ever hit production can save a lot of headaches. If you can find those issues while you’re writing the code, that’s even better. Any problems will eventually mean more work, so the earlier you can catch them the better. After all, catching a problem before you get to the compiling stage can save you a headache or two.

Learn how containers work >

Tip #3: Automated testing and continuous integration (CI) to stay one step ahead

In any conversation around DevOps, you’ll probably hear about automated testing and continuous integration (CI). Yet while automated testing is typically part of a good CI development practice, it’s not strictly a requirement (but it should be … or at least part of your continuous delivery phase).

Most teams have some basic unit testing as part of their CI process, but stop short of testing for security vulnerabilities, automated UI testing, integration testing, etc.

Even still, these are two things that can help you step up your workflows by: (A) making sure your code works with the main branch; and (B) catching things like security vulnerabilities and other problems, so you can lessen your DevOps team’s workload.

Here’s how:

Using GitHub Actions to run automated tests

From ordering pizza to triggering an alarm, there’s a lot you can do with GitHub Actions. It all comes down to workflow automations.When it comes to setting up automated tests with GitHub Actions, you can either build your own action or leverage pre-built actions in the GitHub Marketplace.

[Learn how to build your own GitHub Actions workflow automations.]> Pro tip: Using Actions workflows that run on pull requests is a great way to check for security vulnerabilities, problems in your code, or anything else before you merge to the main branch. Doing this means you’re one step ahead and helps keep your main branch clean.

[Want to learn more about GitHub Actions? Check out our guide.]You can also configure your workflows to deploy to ephemeral testing environments. This means you can run your tests and deploy your changes to an environment where you can test your application. You can even configure your workflow to automatically tear these testing environments down after you’re finished.

All this means you’re testing things as much as possible before it’s time to go to production.

Using GitHub Actions to create CI pipelines

CI, or continuous integration, is the process of automatically integrating code from multiple people for a given project. A good CI practice means you can work faster, make sure your code compiles correctly, merge code changes more efficiently, and be sure your code plays nice with everyone else’s work.

The most powerful CI workflows are the ones that test all of the things you care about every single time you push your code to the server.

If you’re working on GitHub, GitHub Actions can do this for you, too. There are plenty of pre-built CI workflows in the GitHub Marketplace (and you can always build your own), but there are a few things to keep in mind when you start incorporating CI into your development flow. These include:

Run the necessary tests: Think about what build, integration, and testing automations you ideally need. You’ll want to consider things that may have gone wrong with releases in the past, and see if you can add a test for that in your CI.
Balance the time it takes to test your code with how fast you’re pushing new code: Let’s say you have teams pushing new code every five minutes (hypothetically), but the tests you’re running take 10 minutes to execute … that’s not great. It’s always best to balance what you’re checking and when with how long it takes, which might mean trimming your ideal list of tests down to a more realistic number, at least for your CI builds.

Get a tutorial on creating a CI pipeline with GitHub Actions >

Tip #4: Server orchestration tips for flexibility and speed

If you’re building a cloud-native application (or really even just using a few different servers, VMs, containers, or hosting services), you’re probably dealing with a few environments. Being able to make sure your application and infrastructure play well together means you can rely a little less on an operations team trying to get your software to run on existing infrastructure at the last minute.

That’s where server orchestration comes in. Server orchestration—or infrastructure orchestration—is often the job of IT and DevOps teams and includes configuring, managing, provisioning, and coordinating systems, applications, and core infrastructure needed to run software.

Pro tip: There’s a suite of tools that allow you to define and update the infrastructure you need to use.

A big advantage of infrastructure automation is improved scalability—and defined environments means it’s easier to tear down and rebuild an environment when something goes wrong (instead of starting from scratch, but we’ve all been there).

There’s another big advantage: If you want to test something, you don’t have to worry about asking the operations team to go and set up a server for you. You can instead do that as part of a workflow. You don’t have to worry about manually provisioning hardware or system requirements.

How to get started: Don’t try to replace everything in your environment with automated infrastructure automation. Instead, look for a part that might be easy to automate and start there—then the next piece and the next piece after that.

And definitely never start in production. Instead, begin with your testing environment. Once that works, move to your staging environment (and if that works, you can trust it’s good for production).

Tip #5: Repeatable tasks? Try scripting them with Bash or PowerShell

Picture this: You have a bunch of repeatable tasks that you’re executing on a local basis, and you’re spending way too much time working through them every week. There’s a better—and more efficient—way to handle this. How? Scripting with either Bash or PowerShell.

Bash has deep roots in the Unix world, and it’s a mainstay of IT and DevOps teams, and more than a few developers too. PowerShell is comparatively newer. Designed by Microsoft and launched in 2006, PowerShell replaced the command shell and earlier scripting languages for task automation and configuration management in Windows environments.

Today, both Bash and PowerShell are cross-platform (though most people with a Windows background will use PowerShell, and most people familiar with Linux or macOS will use Bash out of habit).

Pro tip: Bash and PowerShell have different ways of working. Where PowerShell works with objects, Bash passes information around as strings. Even still, whatever you choose is largely up to personal preference.

One of the more useful things I’ve done with Bash and PowerShell, for example, is building a script that pulls down the latest version of the code, creates a new branch, switches to that branch, pushes a draft pull request up to GitHub, and then opens VSCode (sub in your editor of choice here) in that branch.

It’s a series of small steps to make your life much easier. It’s something you might do once or twice a week, and if you can script that—it gives you more time to focus on what matters: writing great code.

The bottom line

There’s a big difference between an IT pro, a DevOps engineer, and a developer. But in today’s world of software development, a lot of core DevOps practices are becoming everyone’s job. Plus, any developer that can learn a few DevOps tricks can have an easier time working independently (and more efficiently at that), and continue to focus on what matters most: building great software. That’s something we can all get behind.

Additional resources

GraphQL global ID migration update

2021-11-16 Andrew Hoglund

Post Syndicated from Andrew Hoglund original https://github.blog/2021-11-16-graphql-global-id-migration-update/

We are pleased to announce that we have now completed the first phase of rolling out the new GraphQL global ID format. This means that all newly created GraphQL objects have IDs that conform to a new format, which we refer to as next IDs. It also means we’ve hit a major milestone as we work towards improving our scalability and speed. In this post, we’d like to give you some details as to how you can begin migrating to the next format for older IDs.

Why is this necessary?

The current format of Global IDs in our GraphQL API will not support our projected growth over the coming years due to limitations with the data encoded in the IDs. The next format gives us the ability to handle your requests even faster by being able to build queries that will be optimized for our database clusters. We will continue to support the legacy IDs for the short term, after which we will sunset them. We are asking that you use the provided tools (more on that below) to migrate your implementations, caches, and data records to reference a next ID for older objects. Doing so will ensure that the response times of your requests will remain consistent and small. It will also ensure that nothing in your application will break once we finally sunset usage of the legacy IDs.

Do I need to do anything?

You only need to react to this announcement if you store references to GraphQL IDs, which always correspond to the id field for any object in the schema. If you don’t store these, then you can continue to interact with the API with no effect on your service. If you currently decode IDs, your service may break as the underlying data format of the IDs has changed. We suggest you migrate your service to treat these IDs as opaque strings. We guarantee the IDs will be unique, therefore you can rely on them directly as references.

How do I migrate my service?

If you have determined that you do need to migrate your service to the next IDs, we have introduced new functionality to help you do so. You can now pass a header in your API requests to the GraphQL API to receive updated IDs. This header works by forcing the response payload to always return the next ID for any object in which you’ve requested the id field. The name of the header is:

X-Github-Next-Global-ID

This header can be set to two values, 1 or 0. Setting the value to 1 will force the value for all id fields in your query to return the next ID format. Setting the value to 0 will revert to default behavior, which is to show legacy or next IDs depending on their creation date.

Here is an example request using curl:

$ curl \
  -H "Authorization: token $GITHUB_TOKEN" \
  -H "X-Github-Next-Global-ID: 1" \
  https://api.github.com/graphql \
  -d '{ "query": "{ node(id: \"MDQ6VXNlcjM0MDczMDM=\") { id } }" }'

And the response will contain the next ID:

{"data":{"node":{"id":"U_kgDOADP9xw"}}}

The legacy ID of MDQ6VXNlcjM0MDczMDM= was used in the node query, and the response contains the ID in the next format. Using this mechanism, you will be able to call the API with the legacy IDs you have referenced in your application. The next ID received in the response can then be used to update those references. We suggest that you update all references to legacy IDs and use them for any subsequent requests to the API. Remember that you can submit multiple node queries in one API call (using aliases) to perform bulk operations.

Another option for migrating IDs would be to use the ids returned in the nodes field for a collection of items. For example, if you wanted to convert all the repositories in your organization, you could do something like the following:

{
  organization(login: "github") {
    repositories(last: 10) {
      edges {
        cursor
        node {
          name
          id
        }
      }
    }
  }
}

As long as you have a reference to the name of a repository (or some other unique field on an object), you could use this method to update your references in bulk.

Please also note that setting the X-Github-Next-Global-ID to 1 will affect the return value of every id field in your query. This means that even when you submit a non-node query, you will get back the new format ID if you requested the id field.

Tell us what you think

If you have any concerns about the rollout of this change impacting your app, please contact us and include information, such as your app name so that we can better assist you.

Highlights from Git 2.34

2021-11-15 Taylor Blau

Post Syndicated from Taylor Blau original https://github.blog/2021-11-15-highlights-from-git-2-34/

The open source Git project just released Git 2.34 with features and bug fixes from over 109 contributors, 29 of them new. We last caught up with you on the latest in Git back when 2.33 was released. To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Sparse index

In the past, we’ve talked about new Git features to make it possible to work with large repositories, like partial clones and sparse-checkout. For a complete description, check out the linked blog posts. But as a refresher, these two features work together to allow you to:

Fetch or clone only part of a repository’s objects, and
Only populate part of your working copy, typically scoped to a set of
sub-directories.

This pair of features is designed to create the illusion that you are working in a much smaller repository than you actually are. For instance, if your work takes place in an all-encompassing monorepo, your local copy only needs to contain the parts of the repository that you frequently work in.

But often, this illusion falls short. Why? The answer is the index. The index is the data structure Git uses to track what will be written the next time you run git commit, as well as to track the state of every file in your repository at the current point in history.

As you can imagine, even if you are working in a small corner of a large repository, the index still has to keep track of the repository’s entire contents, not just the parts that you are working in. Unfortunately, that overhead adds up: every time Git needs work with the index, it needs to parse and write out a lot of data that doesn’t affect the parts of your repository outside of your sparse checkout.

That’s changing in this release with the addition of a sparse-enabled index. Unlike the index of previous versions, this release enables the index to only track the parts of your repository that you care about. Specifically, it only contains entries for parts of your repository that are either in your sparse checkout, or at the boundary between your sparse checkout and the rest of the repository.

Triangles represent trees and boxes represent blobs. Left: a representation of a non-sparse index’s contents. Right: a sparse-ified index.

The high-level details here are that the index format now understands that specially marked directories indicate the boundary between the contents of your sparse checkout and the parts of your repository that you don’t have checked out. But the process of implementing this new format, teaching sub-commands how to use it, and making sure that the sparse index can be expanded to a full index is much more detailed.

For all of the details behind this exciting new feature, check out a comprehensive blog post published by Derrick Stolee last week: Making your monorepo feel small with Git’s sparse index.

[source, source, source, source, source, source, source, source]

Multi-pack reachability bitmaps

In a previous blog post, we talked about a new feature to enable reachability bitmaps to keep track of objects stored in multiple packs within your object store.

This release of Git contains the remaining components described in that blog post. If you haven’t read it, here’s a summary. When serving a fetch, a Git server needs to send the client everything reachable from the set of objects they want, less anything reachable from the set that they already have. (You can think of a clone as a “special case” fetch where the client wants everything and has nothing).

In order to compute this set efficiently, Git can use reachability bitmaps. One of these .bitmap files stores a set of bitmaps, each corresponding to some commit. The contents of an individual bitmap is a string of bits, one per object, indicating which objects are reachable from each commit.

In the past, the contents of a reachability bitmap were tied to the order of objects within a single packfile. This meant that a bitmap could only cover objects in one packfile. In other words, bitmaps were only useful if you could efficiently pack the entire contents of your repository down into a single packfile.

For many repositories, writing all objects into the same pack is completely feasible. But the effort it takes to write a pack (including searching for deltas between objects, compressing individual objects, and I/O cost) scales with the size of the pack you’re writing.

Git 2.34 introduces a new bitmap format that is instead tied to the contents of the multi-pack index file. This means that a bitmap can now flexibly represent objects in multiple packs, and server operators no longer need to repack their biggest repositories into a single pack in order to take full advantage of reachability bitmaps.

For more details, including some of the steps required to make this new feature work, see the aforementioned blog post.

[source, source, source]

A new default merge strategy

In an earlier blog post, we explained Git’s newest merge strategy: ort. Here are some of the basics:

When Git needs to merge two branches, it uses one of several “strategy” backends in order to resolve the changes or emit conflicts when two changes cannot be reconciled.

For years, Git has used a strategy called “recursive”. If you have ever done a merge in Git without passing -s <strategy>, then you have almost certainly used the recursive engine. Recursive behaves mostly like a standard three-way merge, with one exception. In the case of “criss-cross” merges (where there isn’t a single merge base), recursive merges multiple bases together in pairs (recursively) in order to produce a single tree which is then treated as the new merge base. This makes it possible to resolve cases where a traditional three-way merge might produce a conflict.

In recent versions of Git, there has been an ongoing effort to replace the recursive strategy with a new one called ort (short for “ostensibly recursive‘s twin”). Why do this? There are a few reasons, but perhaps the most compelling is that a rewrite allowed Git to implement a merge strategy that doesn’t operate on the index (that same one we talked about a couple of sections ago)!

ort does just that: it’s a full-blown rewrite of the merge strategy that aims to emulate the same concepts behind recursive while avoiding many of its long-standing performance and correctness problems. In a merge containing many renames, ort outperforms recursive by 500x. For a series of similar merges (like in a rebase operation), the speedup is over 9000x, in part due to ort’s ability to cache and reuse results from previous merges.

These numbers show off some of the worst-case scenarios for recursive, but in testing, ort consistently outperforms recursive with much less variance. In Git 2.34, ort is now the default merge strategy, so you should notice faster merges with fewer bugs just by upgrading.

For more details about the ort merge strategy, see our earlier blog post, or any one of a six-part series of posts written by ort‘s creator, Elijah Newren: part one, part two, part three, part four, part five, and part six.

[source]

Tidbits

Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

You might be aware that Git allows you to sign your work by attaching your PGP signature to certain objects. For example, the Git project itself publishes tags signed by the maintainer in order to verify that each release comes from someone trustworthy.
But the experience of using GPG and maintaining keys can be somewhat
cumbersome. One alternative is to use a new feature of OpenSSH (released
back in OpenSSH 8.0) that allows using the SSH key you likely already have as a signing key.

Git 2.34 includes support to take advantage of this feature and allows you to sign your work using SSH keys. To try it out, you can either set user.signingKey to the SSH key you want to use (for example, by asking your ssh-agent for a list with ssh-add -L), or set gpg.format to ssh and gpg.ssh.defaultKeyCommand to ssh-add -L in order to automatically use the first SSH key available.

After configuring Git to sign objects using your SSH keys, you can use git commit -S, git merge -S, and git tag -s as usual, and they will automatically use your SSH key.

For more information about the new configuration options, including information about how to verify SSH signatures with an “allowed signers” file, check out the documentation.

[source, source, source]
If you’ve ever accidentally typed git psuh when you meant push, you
might have seen this message:
```
$ git psuh
git: 'psuh' is not a git command. See 'git --help'.

The most similar command is
  push
```
You have always been able to control this behavior by setting the
help.autoCorrect configuration. You can hide this advice by setting that
configuration to never, or let Git automatically rerun the most similar
command for you immediately or with a delay (by setting immediate, or a
real number of seconds to wait before rerunning your command).

In Git 2.34, you can now configure Git to ask you interactively whether you
want to rerun your last operation with the suggested command by setting
help.autoCorrect to prompt.

[source]

In Git 2.34, a handful of patch series were focused on improving the performance of interacting with other repositories. Here’s a pair of tidbits that improves the performance of git fetch and git push:

When fetching from a remote, your client needs to do some bookkeeping before
and after it receives a set of objects from the remote.

Before anything happens, your client needs to figure out what it has in common with the remote it’s fetching from, and what commits it wants as a result. Previously, this process was somewhat wasteful: Git used to load commit objects directly when they could instead have been read from the commit-graph, resulting in much improved performance. In Git 2.34, commits loaded in this code path use the commit-graph when possible. The effect of this scales with the number of references in your repository: in an example repository with over 2 million references, it cuts the time it takes to fetch a single commit by more than half.

[source]
Another patch series made a handful of improvements to updating local references when fetching, along with some changes to improve fetch negotiation, as well as skipping the connectivity check (which I’ll talk about in more detail in the next tidbit) when the receiving end had already verified the connectedness of the new objects. These changes together contributed similarly impressive performance improvements to the git fetch command.

[source]

You might have heard of “submodules,” the Git feature that allows combining multiple repositories by storing links to other repositories. Submodules have been somewhat neglected over the years, but this release brought renewed attention to the feature. Here are just some of the changes that enhance submodules:

It might be a surprise to learn that, though the majority of Git is written in C, the original git submodule command is actually a shell script!
The Git project has been converting many of its subcommands written in other languages into C. Reimplementing subcommands as C programs means that
they can be read and written more easily, take advantage of Git’s comprehensive libraries, and avoid the overhead of spawning many processes, especially on platforms where the new process overhead is rather costly.

In Git 2.34, many parts of the git submodule command were rewritten in C.
This project was completed by Atharva Raykar, who is a Google Summer of Code
student. You can check out their final report here, along with Git’s other GSoC participant ZheNing Hu’s report here.

[source, source, source]
While we’re on the topic of submodules, one thing you might not know is that
when using commands that deal with objects from both the submodule and the
repository containing it, the submodule is temporarily added as an alternate
object store of the other repository!

Alternates are Git’s object borrowing mechanism, which allow you to in effect link multiple object stores together. When using a repository with alternates, any object lookups that fail to find an object are retried in that repository’s alternate.

In order to make both the objects in a submodule and the objects in the repository that contains that submodule available to git grep (among a select set of
other commands), the submodule would temporarily be added as an alternate for the duration of that command.

If you’re thinking to yourself, “this is a hack”, then you’re not alone. Git has made internal changes to parameterize many functions in terms of a repository (which is usually the global the_repository). This allowed Git to avoid combining multiple repositories via alternates and instead make function calls by passing two (or more) separate repository instances. This enables Git to avoid hackily relying on the alternates mechanism, which produced less confusing and error-prone code as a result.

[source, source, source]
One last submodule-related topic (though there are more we couldn’t fit here!). If you are cloning a repository that you know to contain submodules, it is often useful to pass the --recurse-submodules, which will cause that repository’s submodules to be cloned and initialized, too.

But other commands that can optionally recurse into submodules (like git diff, for example) don’t themselves recurse into submodules by default, even when you cloned with --recurse-submodules. In Git 2.34, this is no longer the case, with one caveat: when cloning with --recurse-submodules, other commands only recurse into submodules if the submodule.stickyRecursiveClone configuration is set, to prevent commands from unintentionally running in submodules.

[source]

Now that I’ve listed out a few of the submodule-related changes, let’s get back
to the rest of the tidbits:

If you’ve ever scripted around Git, you have almost certainly run into Git’s cat-file plumbing command. This tool can be used to print out a single object (by providing the object name as an argument), a stream of objects (by providing line-delimited object names over stdin), or all objects in your repository (with --batch-all-objects).
This low-level command accidentally took into account replace refs, which produced confusing results when combined with --batch-all-objects, resulting in it not actually showing all objects in your repository if some were hidden by refs/replace.

Dropping support for replacement refs made it possible for cat-file to reuse some information when it is given --batch-all-objects. Namely, to populate the list of objects, it iterates each object in each pack and therefore knows the byte offset within each pack where each object can be found. Previous versions of Git did not reuse this information when looking up objects to parse them, but Git 2.34 retains this information.

This makes it possible to process an object’s metadata much more quickly by avoiding having to locate it twice. In a copy of torvalds/linux, the time it takes to print the name and type of each object (for the curious, that’s git cat-file --batch-check='%(objectname) %(objecttype)' --batch-all-objects --unordered) dropped from 8.1 seconds to just 4.3 seconds.

[source]
There has been a recent concerted effort to remove some memory leaks from Git’s code. Unlike library code, Git typically has a very short runtime. This makes the need to free allocated memory much less urgent, since if a process is about to exit, all memory allocated to it will be “freed” by the operating system.

A recent patch has made it so that Git’s integration tests can be run in a mode that ensures no memory is leaked (by setting GIT_TEST_PASSING_SANITIZE_LEAK=true in the environment). Since Git’s test suite still contains memory leaks in some tests, a new mode was added to run only tests that have been specifically marked as being leak-free. That way, when Git is compiled with leak detection (by running make SANITIZE=leak), you can easily spot regressions in tests that were supposedly leak-free.

Building off this new infrastructure, there have been many patch series that remove leaks from the code in various places.

[source, source, source, source, source, source, source, source, source, source, source]
When you need to get some debugging information out of a Git process, like what version you’re running, or how much time it spent in a particular region, the trace2 mechanism is a good choice. Often, looking at these logs is like looking at a piece of a puzzle. For example, when you run git fetch, you actually run git fetch-pack, which then invokes git upload-pack on the remote, which itself invokes git pack-objects.

Trace2 output includes information about when child processes are started and stopped (and consequently, how long they took to run), but what if you’re trying to figure out something more basic than that, like what process you were started by? In other words, if you’re stuck looking at output from a slow git pack-objects, how do you figure out whether it was a fetch (in which case it would have been started by upload-pack) or part of a repository repack (which here would be started by git repack)?

Git 2.34 includes additional debugging information in trace2 output to indicate the full ancestry of a process, so you can easily read out the name of the program a process was started by, like so:
```
$ cat trace2.log
21:14:38.170730 common-main.c:48                  version 2.34.0.rc1.14.g88d915a634
21:14:38.170810 common-main.c:49                  start /home/ttaylorr/src/git/git pack-objects git pack-objects --revs --thin --stdout --progress --delta-base-offset
21:14:38.174325 compat/linux/procinfo.c:170       cmd_ancestry sh <- git-upload-pack <- sh <- git <- zsh <- sshd <- systemd
```
(Above, you can see that pack-objects was run by git upload-pack, which was run by sh–that’s where we inserted the trace point via uploadpack.packObjectsHook, which was run by git, in my shell, over sshd, which was started by systemd.)

[source, source]
In a previous post, we talked about the background maintenance daemon, which can be used to perform routine repository maintenance in the background (like pre-fetching, or repacking the objects in your repository).

When this feature was first released back in Git 2.31, it had support for cron on Linux, launchctl on macOS, and schtasks on Windows. Git 2.34 brings support for systemd-based timers on Linux. This has a few benefits over cron: cron may not be available everywhere, and using systemd isolates each service into its own cgroup and writes its logs separately.

If you want to use systemd instead of the default scheduler, you can run:
```
$ git maintenance start --scheduler=systemd
```
[source]
In a previous blog post, we talked about how git rebase works, and how to move a complicated branching structure elsewhere in your repository’s history.

The brief history is that this used to be done with the --preserve-merges option, which attempted to replay merges elsewhere in history. Confusingly, this mode uses rebase’s interactive machinery internally, so attempting to manually edit the rebase sequence (with git rebase -i) often produced counterintuitive results.

The --rebase-merges option fixed many of these issues and has been the recommended replacement of --preserve-merges for some time now. In Git 2.34, the --preserve-merges option is now gone for good.

[source]
You might have used git grep to quickly search through your code. But you might not have known that git log has a --grep=<expression> option, which allows you to filter through commits produced by git log to only show ones whose commit messages match the provided expression.

In previous versions, the --grep option only filtered down which results were presented in the output of git log. But in Git 2.34, git log now knows how to colorize the parts of its output that matched the provided expression, like so:

[source]
Last but not least, if you’re using Git in a terminal on Windows, you might have noticed that your terminal is left in a weird state after running git commit, or git rebase, like in this issue.

This was because Git shares its terminal with any child processes it spawns, including your $EDITOR. If your editor sets special terminal settings but does not clear them upon exiting, it can leave your terminal in a broken state.

Git 2.34 introduces functionality to save and restore the terminal settings before and after launching your editor. That means that even misbehaving editors cannot corrupt your terminal since it will always be restored to the state it was in before launching the editor.

[source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.34, or any previous version in the Git repository.

Make your monorepo feel small with Git’s sparse index

2021-11-10 Derrick Stolee

Post Syndicated from Derrick Stolee original https://github.blog/2021-11-10-make-your-monorepo-feel-small-with-gits-sparse-index/

One way that Git scales to the largest monorepos is the sparse-checkout feature, which allows you to focus on a subset of the files. This is supposed to make it feel like you are actually in a small repository, even though you are contributing to a large repository.

There’s only one problem: the Git index is still large in a monorepo, and users can feel it. Until now.

The Git index is a critical data structure in Git. It serves as the “staging area” between the files you have on your filesystem and your commit history. When you run git add, the files from your working directory are hashed and stored as objects in the index, leading them to be “staged changes”. When you run git commit, the staged changes as stored in the index are used to create that new commit. When you run git checkout, Git takes the data from a commit and writes it to the working directory and the index.

The working directory, index, and commit history

In addition to storing your staged changes, the index also stores filesystem information about your working directory. This helps Git report changed files more quickly. One problem is that the index stores this information for every file at HEAD, even if those files are outside of the sparse-checkout definition. This means that the index can be much larger in a monorepo than it would be if your important subset of files was in its own repository.

Throughout the past year, the Git Fundamentals team at GitHub contributed a new feature to Git called the sparse index, which allows the index to focus on the files within the sparse-checkout cone. If you are in a repository that can use sparse-checkout, then you can enable the sparse index using these commands:

git sparse-checkout init --cone --sparse-index
git sparse-checkout set <dir1> <dir2> ... <dirN>

The size of the sparse index will scale with the number of files within your chosen directories, instead of the full size of your repository. When enabled with a number of other performance features, this can have a dramatic performance impact.

As we built the sparse index, we tested its performance against a large monorepo that has over two million files at HEAD, and with a sparse-checkout definition that populated the working directory with about 100,000 of those files. We then compared the performance between having a normal “full” index, a sparse index, and a repository that only contained the files matching the sparse-checkout definition.

Command performance by index and repository type

The chart above demonstrates the significant performance improvements enabled by the sparse index. The bottom bars for each Git command show the runtime without the sparse index. The middle bars show the runtime of the same commands with the sparse index enabled. The top bars show the runtime of these commands, except in a repository that only contained the files within the sparse-checkout cone, representing the theoretical optimum. Since the sparse index still contains pointers to the rest of the monorepo, there is still some overhead. This overhead is hardly noticeable, since the difference is at most 60 milliseconds, even in the worst case above.

Today, I will go deep into the design and implementation of the sparse index. In particular, I’ll focus on how the Git community made such a significant change to a critical data structure in a safe way. I will include links to the actual changes in the Git codebase as I go.

This post is going to go deep into the guts of Git. If you are unfamiliar with Git’s object model, then learn about commits, trees, and blobs before continuing.

You can also get an overview of the sparse index alongside several other advanced Git features in this presentation I gave with colleague Lessley Dennington at the GitHub Nova event.

First, let’s dig into the Git index to understand its structure and purpose. I’ll use the derrickstolee/trace2-flamechart repository as a concrete example (and to generate some of the figures). If you want to follow along with the Git commands shown, then clone that repository.

The Git index

The index file stores a list of every file at HEAD, along with the object ID for its blob and some metadata. This list of files is stored as a flat list and Git parses the index into an array.

You can expose the list of files in the index using the git ls-files command:

$ git ls-files
LICENSE
README.md
bin/index.js
examples/fetch/git-fetch-after.png
examples/fetch/git-fetch-after.svg
examples/fetch/git-fetch-after.txt
examples/fetch/git-fetch-before.png
examples/fetch/git-fetch-before.svg
examples/fetch/git-fetch-before.txt
examples/fetch/git-fetch-combined.png
examples/fetch/git-fetch-combined.svg
examples/maintenance/trace.png
examples/maintenance/trace.svg
examples/maintenance/trace.txt
package.json

In this repository, many of the files live in an examples/ directory, but you don’t actually need them for the functionality of the code, which lives in the bin/ directory. You can focus the repository only on the necessary files using the git sparse-checkout command:

$ ls
LICENSE     README.md     bin       examples   package.json
$ git sparse-checkout init --cone --sparse-index
$ git sparse-checkout set bin
$ ls
LICENSE     README.md     bin       package.json

Even though I used the sparse-checkout command to reduce the size of the working directory, my git ls-files command will return the same set of files as before. In fact, I can dig in a little more and expose some more information using git ls-files --debug.

$ git ls-files --debug
LICENSE
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 0
README.md
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 0
bin/index.js
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 0
examples/fetch/git-fetch-after.png
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
(...)

The above output is truncated, but it shows that each index entry contains additional filesystem information for each path. The last entry listed shows what happens for a file that is outside of the sparse-checkout definition: all of the filesystem information is removed and the flags entry has some bits enabled. These bits include a SKIP_WORKTREE bit that signifies that Git will not write that file to the working directory.

If these files are not written to disk, then why are they listed in the index at all? The reason is that Git still needs to understand the content that would be there if the index was expanded. Further, that information is used to generate a commit with the git commit command.

In the Git codebase, there is a test helper that can show additional information from the index: test-tool read-cache. (You won’t have this command if you just have normal Git installed.) Running it here, you can see that the index also stores the object IDs for every file:

$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
100644 blob e0f88531b916b92821476760672e8161b9954898    examples/fetch/git-fetch-after.png
100644 blob f4a523cd1acb0a9d2620970ad7a43405d6e305dc    examples/fetch/git-fetch-after.svg
100644 blob fc4e30dca5fcb0c3d2031dc82a43d5d644e26b41    examples/fetch/git-fetch-after.txt
100644 blob 15dc889965617df3b5a30cf01e52c491e41c59c1    examples/fetch/git-fetch-before.png
100644 blob 602bd5bcbd815914a035d0d4f0d2a3896f600de2    examples/fetch/git-fetch-before.svg
100644 blob bc40a8e4658d17c35de996f3655e737b85ce7ad9    examples/fetch/git-fetch-before.txt
100644 blob 356cdd36e0d78a62af8b010d25d658054bb6fdc7    examples/fetch/git-fetch-combined.png
100644 blob cc0c23f2c8a822c51a17c46268f38c2268b400ae    examples/fetch/git-fetch-combined.svg
100644 blob dfc0893d172d841d971e206461466db935b7c192    examples/maintenance/trace.png
100644 blob a10a876472e46c6ae58e6fc6e2adc64d4dae809b    examples/maintenance/trace.svg
100644 blob 8f5e8bfbc44674feb3aa96e0b7bf1bf717495658    examples/maintenance/trace.txt
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Notice in particular how every single row lists the object type as blob, meaning a file. Also, the bin/index.js file has executable permissions, so the file permission column shows 100755 instead of 100644 like the others. This is all important metadata to store for each index entry.

To visualize the index, the diagram below displays our blobs as boxes in a line in the order given above, but it represents the trees that connect the root tree to those blobs as triangles. Thus, the root tree has two subtrees for bin/ and examples/, and the examples/ subtree has two subtrees for fetch/ and maintenance/.

full index

This figure represents all of the links between the trees and blobs. However, the core index data structure stores only the list of blobs as a flat list. The nesting tree structure does not exist in the core of the file.

However, there is an extension to the format that includes the information of the nesting directories: the cache-tree extension. Each node of the cache-tree stores a list of sub-nodes and a range of index entries that are covered by the current node. Each node stores the object ID for the tree it represents.

The index and the cache tree

The root node always covers all of the blobs in the index. The contained nodes have ranges contained within that range.

Git commands such as git add update the cache-tree extension in order to make the next git commit command very fast. To create the new commit, Git can use the tree from the root of the cache-tree extension.

Many Git commands use the index in many different ways. Some commands compare the working directory to the index and update one or the other when there is a difference. Others compare the index and a commit. Some compare multiple indexes together. Some use the cache-tree extension to navigate the nested tree structure, but mostly they scan the flat list of files in the form of an array.

The index affects Git performance at scale

The index can be very large in monorepos. I will show Git performance data from an example monorepo that has over two million files at HEAD. Even using the latest compression techniques available, the index file is over 180 MB in this monorepo.

This has a significant effect on normal Git commands. Presented below is an annotated flamechart of a git status command with one of these large indexes. The x-axis represents time since the start of the command, and each rectangle represents a region of Git’s execution that is marked by its trace2 logging library.

`git status` with full index

Three regions are annotated here:

The index is read from disk and parsed into memory.
The working directory is compared to the index. This triggers a lazy initialization of some hash tables that are required for this effort.
The modified index is written to disk.

Parsing is multi-threaded, but writes are not. This explains some of the differences in how long those actions take.

Clearly, the amount of data in the index is a significant portion of this command. This also affects other commands such as git add and git commit, which are expected to be fast.

This performance concern became abundantly clear when our monorepo customer wanted to group more dependencies within the monorepo. Some teams had isolated Git repositories that were hundreds of times smaller than the monorepo, but these repositories created packages that were consumed by the monorepo, causing complications in tracking dependencies. The hope was that they could merge into the monorepo and rely on sparse-checkout to make it feel like they were still working in a small repository. The user experience was actually much worse, and the root cause was the time it took to read and write the index.

The main culprit is that there are millions of index entries corresponding to files these users do not care about for their daily work. When they push to the server and create a pull request, the build machines can handle the massive scale of building the entire tree and verifying that the small change works within the larger whole of the monorepo. Users should not need to pay that cost.

As members of the Git Fundamentals team, we are very focused on Git performance, and the index has been on our minds for years. For example, the index is a bottleneck for the VFS for Git environment, but that environment has particular needs that prevent improvements in this area.

The biggest thing that has changed recently is the creation of “cone mode” sparse-checkout patterns. These use directory-based pattern matching instead of file-based pattern matching. While cone mode was originally designed to speed up pattern matching in the sparse-checkout feature, it has now unlocked a new way to shrink the index.

The sparse index

The sparse index differs from a normal “full” index in one aspect: it can store directory paths with the object ID for its tree object. This is in addition to the file paths which are paired with blob objects. Since the cone mode sparse-checkout patterns match on a directory level, we can determine that an entire directory is out of the sparse-checkout cone and replace all of its contained file paths with a single directory path.

Back in my example derrickstolee/trace2-flamegraph repository, you can enable the sparse index command and then use the test-tool read-cache tool to show the contents of the index.

$ git sparse-checkout init --cone --sparse-index
$ test-tool read-cache --table
100644 blob 646521d0d6c070e6f15e0f5828be1127d3b75503    LICENSE
100644 blob b230b3a6e2d81d50dc00177e970a10726b5baf08    README.md
100755 blob 918533d51c7a5f91622311893dcfd40bfd4f43d7    bin/index.js
040000 tree b395192a7adbf21793f9489f3623c117802b2043    examples/
100644 blob a4599a9e0a01c28a2c0a622457664fc8c55bfdf9    package.json

Similar to my previous visualizations, you can now see how the index contains a directory entry in addition to the four blobs.

sparse index

The sparse directory entries correspond to directories that are just outside of the sparse-checkout definition. These directories also have a cache-tree node whose range is only one entry: that sparse directory entry.

I can even display the full details of the --debug output for git ls-files. This currently requires a --sparse flag that I have implemented in my personal fork of Git, but a similar feature will eventually be available in the core Git client.

$ git ls-files --debug --sparse
LICENSE
  ctime: 1634910503:287405820
  mtime: 1634910503:287405820
  dev: 16777220 ino: 119325319
  uid: 501  gid: 20
  size: 1098    flags: 200000
README.md
  ctime: 1634910503:288090279
  mtime: 1634910503:288090279
  dev: 16777220 ino: 119325320
  uid: 501  gid: 20
  size: 934 flags: 200000
bin/index.js
  ctime: 1634910767:828434033
  mtime: 1634910767:828434033
  dev: 16777220 ino: 119325520
  uid: 501  gid: 20
  size: 7292    flags: 200000
examples/
  ctime: 0:0
  mtime: 0:0
  dev: 0    ino: 0
  uid: 0    gid: 0
  size: 0   flags: 40004000
package.json
  ctime: 1634910503:288676330
  mtime: 1634910503:288676330
  dev: 16777220 ino: 119325321
  uid: 501  gid: 20
  size: 680 flags: 200000

This output is not truncated as it was before, and you can see that the sparse directory entry for examples/ is the only one with blank filesystem data. It also has the same flags value as the sparse file entries did before.

By removing the number of index entries as well as reducing the average path length, you can shrink the index size significantly. In our example monorepo, most users will reduce their index size from 180 MB to less than 10 MB!

Back to our monorepo, let’s try that git status example again and create a new flamechart. Here, I compare the flamechart for git status with a full index followed by one with a sparse index.

Annotated `git status` flame chart

With the sparse index, the git status command drops from 1.3 seconds to under 200 milliseconds! In the flame chart above I highlighted some regions that have similar appearance in each run. These represent the parts of the git status command that are actually walking the working directory and doing work independent of the index size. Everything else is slower in the full index case entirely because of the size of the index!

Building the sparse index safely

Pruning the index at the directory level is a relatively simple idea with a rather complicated result: our flat list of paths now contains two types of Git objects! There are dozens of places in the Git codebase that interact directly with the index in subtly different ways, and all of them are expecting every index entry to point to a blob object.

In order to make such a change to a critical data structure, we needed to first create a compatibility layer. To safely interact with a sparse index, we needed a way to expand a sparse index to an equivalent full index. This way, code paths that have not been integrated and tested with a sparse index can still be used, even if the on-disk format is sparse.

In the Git codebase, we started by creating the ensure_full_index() method, which converts a sparse index into a full one. This method inspects the list for directory entries and replaces them with its contained file entries. Since the directory is outside of the sparse-checkout cone, we could ignore all of the filesystem metadata information and populate the list by traversing the tree objects under that directory. Before anything else happens, the ensure_full_index() method is called immediately after parsing the index so no interactions with the index happen until the sparse directories are removed.

Expanding to a full index

When Git expands a sparse index to a full one, it scans the entries in lexicographic order. If the entry is a file, then Git copies it to the new list. If the entry is a directory, then the tree at that location is passed to the read_tree_at() method to iterate over all contained blobs. For each contained blob, Git generates an index entry for the corresponding file. Finally, the index entry list is copied back into the index and the index is no longer sparse.

Once that protection was in place, we extended Git to write the sparse index format. When writing the index, a full index is converted to a sparse one in-memory using the convert_to_sparse() method.

Collapsing to a sparse index

To convert from a full index to a sparse one, Git uses the cache-tree extension to find the object ID for our new sparse directory entries. The existing file entries are copied, and Git inserts the directory entries as needed.

Once these steps were built, we could verify that the index size was shrinking to the scale we expected. Both of these steps were included in a single series that introduced the format and implemented the conversions.

While it is nice that the index size has shrunk, we couldn’t stop there. The index is a very compact data structure, so it is more efficient to read it from disk than to recreate it by parsing trees. The ensure_full_index() method takes noticeably longer to expand a sparse index to a full one than it would take to read a full one from disk. In order to gain the performance benefits of a sparse index, we needed to teach Git what to do when encountering sparse directory entries.

Before embarking on these integrations, we first set up more guardrails. A new setting, command_requires_full_index, was created which is enabled by default. This setting is used to trigger ensure_full_index() upon parsing a sparse index unless the Git command being run has explicitly disabled the setting. This allowed us to integrate with Git commands one-by-one without disrupting the behavior of other Git commands. In addition to these settings, we inserted calls to ensure_full_index() before most index interactions. to make sure that we were operating on a full index in any code path that might iterate over the index entries. This allowed us to find which code paths needed integration: we could debug a Git command with a breakpoint on ensure_full_index() and see the call stack that led to that expansion.

The first command to integrate with the sparse index was git status. In hindsight, this was a challenging command to use as a starting point, because it performs multiple index operations that are common to other commands. This became clear when integrating with git checkout and git commit because most of the work was already done in the git status integration.

Let’s explore some of the smaller interactions that needed special care with directory entries.

Example implementation detail: `git diff`

The git diff command can show what is different between different representations of a working directory. There are two interesting cases that involve the index: comparing the working directory to the index, and comparing the index to a commit.

With no other arguments, git diff shows the differences in the working directory compared to what is staged in the index. This algorithm is mostly simple to integrate with the sparse index: while walking the working directory, Git drills into a directory only if it exists. If the sparse directory entries in the sparse index do not appear as directories in the working directory, then it never tries to drill into the sparse directory entries. If Git finds that a sparse directory entry does exist in the filesystem as a real directory, then the ensure_full_index() method expands the index and Git continues as normal. This is not desired, so we did everything possible to make sure that these directories do not exist, including updating the sparse-checkout feature to delete ignored files outside the cone.

The git diff --cached command compares the files staged in the index with the commit at HEAD. Here, it is easier to have differences outside the sparse-checkout cone, such as when using git reset --soft to change the HEAD commit without changing the working directory or index. In this case, the git diff --cached command wants to compare the root tree for the HEAD commit to the files in the index. This can proceed normally for the files that exist in the sparse index, but when we reach a sparse directory entry, we see the tree object staged in the index as well as a tree object from the tree walk. At this point, we shift from a tree-vs-index comparison to a tree-vs-tree comparison of those subtrees. When that diff is complete, we can continue with the larger comparison.

One major benefit to the tree-vs-tree comparison is that it is easier to compare the same type of objects. The recursive comparison can also prune the walk when it finds two subtrees with the same object ID, as all of their content is the same at that point.

This change allows us to report on these differences not only in the git diff command, but also such diffs as are written during git status, git checkout, and git commit.

Implementation detail: ORT merge strategy

When beginning the sparse index work, there was a huge question that we did not know how to tackle: three-way merges. The default merge strategy, recursive, uses the index as a data structure during its computation. It was going to be difficult to reconcile that algorithm to work within the confines of a sparse index. In fact, many merges that a monorepo user runs would need to resolve merges outside of the sparse-checkout cone.

Luckily, another contributor, Elijah Newren, announced that he was creating a new merge strategy, named the ORT strategy, that did not use the index. We prioritized reviewing and testing that strategy so that we could take advantage of it. It turns out that it is also a better algorithm in general, so it will become the new default strategy with Git 2.34.

The critical feature of the ORT strategy was its replacement of the index with a recursive tree-like structure. That structure is built from the root tree and only creates subtrees for paths that have changed since the merge base. At the end, the ORT strategy creates an index to match its representation of the resulting merge commit.

Because of the ORT merge strategy, integrating the sparse index into git merge, git rebase, git cherry-pick, and git revert was very simple. We just needed to make sure the index that was created at the end of the merge was sparse from its original creation.

Different merge strategies and their performance

As we reported earlier, the ORT strategy improves over the recursive strategy in the typical case, but also the recursive strategy has significant outliers as shown in the box plot above. Enabling the sparse index on top of the ORT strategy provides even more improvements.

Without the ORT merge strategy, the sparse index work could have easily doubled in scope. For a detailed look at the ORT strategy and its many optimizations, take a look at Elijah’s six-part blog series:

Testing the sparse index

The sparse index was touching critical code and doing so in interesting ways. We needed a way to carefully test that these changes were as correct as possible. The Git test suite is substantial and has excellent coverage of most index operations. However, almost all of those tests do not use sparse-checkout, so we couldn’t immediately gain value in checking the sparse index by enabling it globally.

We created a test script that focused on testing the sparse index in a new way. The test starts by creating a repository with some interesting data shapes in it. Then, each test case starts by copying that repository into three new repositories. Those three repositories have different configurations:

The repository as-is, without sparse-checkout.
The repository with cone mode sparse-checkout enabled.
The repository with cone mode sparse-checkout and sparse index enabled.

Then, each test case runs a number of Git commands against all three repositories, expecting the same output and results in the working directory. This allowed us to be confident that the changes we were making to enable the sparse index would have identical behavior with the other two cases.

Along the way, we found some interesting differences between sparse-checkout and full repositories. Several of these bugs have been fixed since. Sometimes, it was unclear whether the sparse-checkout feature should do the same thing as a normal repository, specifically when interacting with files outside of the sparse-checkout cone. This led to changing how some commands interact with sparse entries. In Git 2.34, some commands will need a --sparse flag in order to modify paths outside of the sparse-checkout cone.

In addition to these test scripts, we routinely ran the Scalar functional tests against our development branches, since many of those tests focus on special circumstances around the sparse-checkout feature. If a Scalar test would fail when the Git tests did not, then we would create a similar test in Git to prevent such a bug in the future.

Once we had integrated with a core set of Git commands, we also created an experimental release that contained early versions of these integrations. We provided this version to a subset of monorepo users to evaluate the performance. We found some interesting data from some of the users, but overall the results confirmed that the sparse index was going to significantly improve the user experience in the monorepo. The most important thing we discovered is that the sparse-checkout feature should remove ignored files outside of the cone, as those files cause the sparse index to expand to a full one, negating the performance benefits. There is an additional benefit in that the working directory shrinks even more by deleting these files.

The current state of the sparse index

Not all Git commands understand the sparse index. Those that have not been integrated trigger a compatibility check that converts a sparse index into a full one during the first index read. The integrated commands are ones that have been carefully tested with the sparse index. They likely received some code updates in order to properly handle a sparse index. These integrations were dispersed across the last few Git versions, and some only exist in the microsoft/git fork until we can complete contributing them to the core Git project.

scope

In June, Git 2.32.0 was released with an understanding of the sparse index format. In August, Git 2.33.0 included integrations with git status, git commit, and git checkout. Git 2.34.0 is slated for a November release with integrations for git add, git merge, git rebase, git cherry-pick, and git reset. In order to serve our monorepo users, we fast-tracked some integrations into a pre-release in our fork including integrations with git diff, git blame, git clean, git sparse-checkout, and git stash.

Based on our understanding of how users interact with a monorepo, the commands that are listed here are sufficient to cover almost all users’ needs. We expect that users who adopt the sparse index with these integrations will have a significantly improved experience from before. This all depends on the size of their monorepo and on the size of their sparse-checkout cone.

We will release this feature widely to our monorepo users with the sparse index on by default in the 2.34 release of our Git fork. The core Git project will keep the sparse index off by default until it has all of these features and the implementation has been stable for a few versions.

Looking to the future

At this point, we have covered all of the integrations we need to have a successful monorepo experience. There is more work to be done. In particular, we need to finish contributing the final integrations in our list. Upstream progress takes time and we are grateful for all of the feedback we have received from the community so far. There are more commands that could use integrations, as well.

For now, we are focused on ensuring that monorepo users transition to the sparse index without incident. If you are interested in the sparse index, then we believe it is in an excellent state to start using it. If you have any issues with its performance or stability, then please do not hesitate to create an issue or start a discussion. We will be there to help!

Through the sparse index, we broke through a significant barrier to monorepo scale. We continue to seek the next innovation that will help Git scale to the largest repos out there.

GitHub Availability Report: October 2021

2021-11-04 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-11-04-github-availability-report-october-2021/

In October, we experienced one incident resulting in significant impact and degraded state of availability for the GitHub Codespaces service.

October 8 17:16 UTC (lasting 1 hour and 36 minutes)

A core Codespaces API response was inadvertently restructured as part of our Codespaces public API launch, impacting existing API clients dependent on a stable schema.

For the duration of the incident, new Codespaces could not be initiated from the Visual Studio Code Desktop client. Connections to the web editor and pre-existing desktop sessions were not impacted, but degraded, with the extension displaying an error message while omitting Codespaces metadata from the Remote Explorer view.

The incident was mitigated once we rolled back the regression, at which point all clients could connect again, including with new Codespaces created during the incident. As our monitoring systems did not initially detect the impact of the regression, a subsequent and unrelated deployment was initiated, delaying our ability to revert the change. To ensure similar breaking changes are not introduced in the future, we are investing in tooling to support more rigorous end-to-end testing with the extension’s use of our API. Additionally, we are expanding our monitoring to better align with the user experience across the relevant internal service boundaries.

In summary

GitHub Availability Report: September 2021

2021-10-06 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-10-06-github-availability-report-september-2021/

In September, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

Partitioning GitHub’s relational databases to handle scale

2021-09-27 Thomas Maurer

Post Syndicated from Thomas Maurer original https://github.blog/2021-09-27-partitioning-githubs-relational-databases-scale/

More than 10 years ago, GitHub.com started out like many other web applications of that time—built on Ruby on Rails, with a single MySQL database to store most of its data.

Over the years, this architecture went through many iterations to support GitHub’s growth and ever-evolving resiliency requirements. For example, we started storing data for some features (like statuses) in separate MySQL databases, we added read replicas to spread the load across multiple machines, and we started using ProxySQL to reduce the number of connections opened against our primary MySQL instances.

Yet at its core, GitHub.com remained built around one main database cluster (called mysql1) that housed a large portion of the data used by core GitHub features, like user profiles, repositories, issues, and pull requests.

With GitHub’s growth, this inevitably led to challenges. We struggled to keep our database systems adequately sized, always moving to newer and bigger machines to scale up. Any sort of incident negatively affecting mysql1 would affect all features that stored their data on this cluster.

In 2019, in order to meet the growth and availability challenges we faced, we set a plan in motion to improve our tooling and our ability to partition relational databases. As you can imagine, this was a complex challenge necessitating the introduction and creation of various tools as outlined below.

The result, we see in 2021, is a 50% load reduction on database hosts housing the data that once was on mysql1. This contributed significantly to reducing the number of database-related incidents and improved GitHub.com’s reliability for all our users.

Virtual partitions

The first concept we introduced was virtual partitions of database schemas. Before database tables can be moved physically, we have to make sure they are separated virtually in the application layer, and this has to happen without impacting teams working on new or existing features.

To do that, we group database tables that belong together into schema domains and enforce boundaries between the domains with SQL linters. This allows us to safely partition data later without ending up with queries and transactions that span partitions.

Schema domains

Schema domains are a tool we came up with to implement virtual partitions. A schema domain describes a tightly coupled set of database tables that are frequently used together in queries (such as, when using table joins or subqueries) and transactions. For example, the gists schema domain contains all of the tables supporting the GitHub Gist feature–like the gists, gist_comments and starred_gists tables. Since they belong together, they should stay together. A schema domain is the first step to codify that.

Schema domains put clear boundaries in place and expose sometimes-hidden dependencies between features. In the Rails application, the information is stored in a simple YAML configuration located at db/schema-domains.yml. Here’s an example illustrating the contents of that file:

gists:
  - gist_comments
  - gists
  - starred_gists
repositories:
  - issues
  - pull_requests
  - repositories
users:
  - avatars
  - gpg_keys
  - public_keys
  - users

A linter makes sure that the list of tables in this file matches our database schema. In turn, the same linter enforces the assignment of a schema domain to every table.

SQL linters

Building on top of schema domains, two new SQL linters enforce virtual boundaries between domains. They identify any violating queries and transactions that span schema domains by adding a query annotation and treating them as exemptions. If a domain has no violations, it is virtually partitioned and ready to be physically moved to another database cluster.

Query linter

The query linter verifies that only tables belonging to the same schema domain can be referenced in the same database query. If it detects tables from different domains, it throws an exception with a helpful message for the developer to avoid the issue.

Since the linter is only enabled in development and test environments, developers encounter violation errors early in the development process. In addition, during CI runs, the linter ensures that no new violations are introduced by accident.

The linter has a way to suppress the exception by annotating the SQL query with a special comment: /* cross-schema-domain-query-exempted */

We even built and upstreamed a new method to ActiveRecord to make adding such a comment easier:

Repository.joins(:owner).annotate("cross-schema-domain-query-exempted")
# => SELECT * FROM `repositories` INNER JOIN `users` ON `users`.`id` = `repositories.owner_id` /* cross-schema-domain-query-exempted */

By annotating all queries that cause failures, a backlog of queries needing modification can be built. Here’s a couple approaches we often use to eliminate exemptions:

Sometimes, an exemption can easily be addressed by triggering separate queries instead of joining tables. One example is using ActiveRecord‘s preload method instead of includes.
Another challenge is has_many :through relations that lead to JOINs across tables from different schema domains. For that, we worked on a generic solution that got upstreamed to Rails as well: has_many now has a disable_joins option that tells Active Record not to do any JOIN queries across the underlying tables. Instead, it runs several queries passing primary key values.
Joining data in the application instead of in the database is another common solution. For example, replacing INNER JOIN statements with two separate queries and instead performing the “union” operation in Ruby (for example, A.pluck(:b_id) & B.where(id: ...)).

In some cases, this leads to surprising performance improvements. Depending on the data structure and cardinality, MySQL’s query planner can sometimes create suboptimal query execution plans, whereas an application-side join has a more stable performance cost.

As with almost all reliability and performance-related changes, we ship them behind Scientist experiments that execute both the old and new implementations for a subset of requests, allowing us to assess the performance impact of each change.

Transaction linter

In addition to queries, transactions are a concern as well. Existing application code was written with a certain database schema in mind. MySQL transactions guarantee consistency across tables within a database. If a transaction includes queries to tables that will move to separate databases, it will no longer be able to guarantee consistency.

To understand which transactions need to be reviewed, we introduced a transaction linter. Similar to the query linter, it verifies that all tables which are used together in a given transaction belong to the same schema domain.

This linter runs in production with heavy sampling to keep the performance impact at a minimum. The linting results are collected and analyzed to understand where most cross-domain transactions happen, allowing us to decide to either update certain code paths or adapt our data model.

In cases where transactional consistency guarantees are crucial, we extract data into new tables that belong to the same schema domain. This ensures they stay on the same database cluster and therefore continue to have transactional consistency. This often happens with polymorphic tables that house data from different schema domains (for example, a reactions table storing records for different features like issues, pull requests, discussions, etc.)

Moving data without downtime

A schema domain that is virtually isolated is ready to be physically moved to another database cluster. To move tables on the fly, we use two different approaches: Vitess, and a custom write-cutover process.

Vitess

Vitess is a scaling layer on top of MySQL that helps with sharding needs. We use its vertical sharding feature to move sets of tables together in production without downtime.

To do that, we deploy Vitess’ VTGate in our Kubernetes clusters. These VTGate processes become the endpoint for the application to connect to, instead of direct connections to MySQL. They implement the same MySQL protocol and are indistinguishable from the application side.

The VTGate processes know the current state of the Vitess setup and talk to the MySQL instances via another Vitess component: VTTablet. Behind the scenes, Vitess’ table moving feature is powered by VReplication, which replicates data between database clusters.

Write-cutover process

Because Vitess’ adoption was still in its early stages at the beginning of 2020, we developed an alternative approach to move large sets of tables at once. This mitigated the risk of relying on a single solution to ensure the continued availability of GitHub.com.

We use MySQL’s regular replication feature to feed data to another cluster. Initially, the new cluster is added to the replication tree of the old cluster. Then a script quickly executes a series of changes to effect the cutover.

The MySQL database cluster setup before executing the write-cutover process

Before running the script, we prepare the application and database replication so that a destination cluster called cluster_b is a sub-cluster of the existing cluster_a. ProxySQL is used for multiplexing client connections to MySQL primaries. The ProxySQL instance on cluster_b is configured to route all traffic to the cluster_a primary. The use of ProxySQL allows us to change database traffic routing quickly and with minimal impact on database clients—in our case, the Rails application.

With this setup, we can move database connections to cluster_b without splitting anything. All read traffic still goes to hosts replicating from the cluster_a primary. All write traffic remains with the cluster_a primary too.

In this situation, we run a cutover script executing the following:

Enable read-only mode for the cluster_a primary. At this point, all writes to cluster_a and cluster_b are prevented. All web requests that try to write to these database primaries fail and result in 500s.
Read the last executed MySQL GTID from the cluster_a primary.
Poll the cluster_b primary to verify the last executed GTID has arrived.
Stop replication on the cluster_b primary from cluster_a.
Update ProxySQL routing configuration on cluster_b to direct traffic to the cluster_b primary.
Disable read-only mode for the cluster_a and cluster_b primaries.
Celebrate!

After thorough preparation and exercising, we learned that these six steps execute in only a few tens of milliseconds for our busiest database tables. Since we execute such cutovers during our lowest traffic time of day, we only cause a handful of user-facing errors because of failed writes. The results of this approach were better than we expected.

Learnings

The write-cutover process was used to split up mysql1, our original main database cluster. We moved 130 of our busiest tables at once––those that power GitHub’s core features: repositories, issues, and pull requests. This process was created as a risk-mitigation strategy to have multiple, independent tools at our disposal. In addition, because of factors like deployment topology and read-your-writes support, we didn’t choose Vitess as the tool to move database tables in every case. We anticipate the opportunity to use it for the majority of data migrations in the future though.

Results

The main database cluster mysql1, mentioned in the introduction, housed a large portion of the data used by many of GitHub’s most important features, like users, repositories, issues, and pull requests. Since 2019 we achieved the ability to scale this relational database with the following results:

In 2019, mysql1 answered 950,000 queries/s on average, 900,000 queries/s on replicas, and 50,000 queries/s on the primary.
Today, in 2021, the same database tables are spread across several clusters. In two years, they saw continued growth, accelerating year-over-year. All hosts of these clusters combined answer 1,200,000 queries/s on average (1,125,000 queries/s on replicas, 75,000 queries/s on the primaries). At the same time, the average load on each host halved.

The load reduction contributed significantly to reducing the number of database-related incidents and improved GitHub.com’s reliability for all our users.

More partitioning

In addition to vertical partitioning to move database tables, we also use horizontal partitioning (aka sharding). This allows us to split database tables across multiple clusters, enabling more sustainable growth. We’ll detail the tooling, linters, and Rails improvements related to this in a future blog post.

Conclusion

Over the last 10 years, GitHub has been learning to scale according to its needs. We often choose to leverage “boring” technology that has been proven to work at our scale, as reliability remains the primary concern. But the combination of industry-proven tools with simple changes to our production code and its dependencies has provided us with a path for the continued growth of our databases into the future.

Search indexing optimisation

2021-09-27 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/search-indexing-optimisation

Modern applications commonly utilise various database engines, with each serving a specific need. At Grab Deliveries, MySQL database (DB) is utilised to store canonical forms of data, and ElasticSearch (ES) to provide advanced search capabilities. MySQL serves as the primary data storage for raw data, and ES as the derived storage.

Efforts have been made to synchronise data between MySQL and ES. In this post, a series of techniques will be introduced on how to optimise incremental search data indexing.

Background

The synchronisation of data from the primary data storage to the derived data storage is handled by Food-Puxian, a Data Synchronisation Platform (DSP). In a search service context, it is the synchronisation of data between MySQL and ES.
The data synchronisation process is triggered on every real-time data update to MySQL, which will streamline the updated data to Kafka. DSP consumes the list of Kafka streams and incrementally updates the respective search indexes in ES. This process is also known as Incremental Sync.

Kafka to DSP

DSP uses Kafka streams to implement Incremental Sync. A stream represents an unbounded, continuously updating data set. A stream is ordered, replayable and is a fault-tolerant sequence of immutable.

*Data synchronisation process using Kafka*

The above diagram depicts the process of data synchronisation using Kafka. The Data Producer creates a Kafka stream for every operation done on MySQL and sends it to Kafka in real-time. DSP creates a stream consumer for each Kafka stream and the consumer reads data updates from respective Kafka streams and synchronises them to ES.

MySQL to ES

Indexes in ES correspond to tables in MySQL. MySQL data is stored in tables, while ES data is stored in indexes. Multiple MySQL tables are joined to form an ES index. The below snippet shows the Entity-Relationship mapping in MySQL and ES. Entity A has a one-to-many relationship with entity B. Entity A has multiple associated tables in MySQL, table A1 and A2, and they are joined into a single ES index A.

Sometimes a search index contains both entity A and entity B. In a keyword search query on this index, e.g. “Burger”, objects from both entity A and entity B whose name contains “Burger” are returned in the search response.

Original Incremental Sync

Original Kafka streams

The Data Producers create a Kafka stream for every MySQL table in the ER diagram above. Every time there is an insert/update/delete operation on the MySQL tables, a copy of the data after the operation executes is sent to its Kafka stream. DSP creates different stream consumers for every Kafka stream since their data structures are different.

Stream Consumer infrastructure

Stream Consumer consists of 3 components.

Event Dispatcher: Listens and fetches events from the Kafka stream, pushes them to the Event Buffer and starts a goroutine to run Event Handler for every event whose ID does not exist in the Event Buffer
Event Buffer: Caches events in memory by the primary key (aID, bID, etc). An event is cached in the Buffer until it is picked by a goroutine or replaced when a new event with the same primary key is pushed into the Buffer.
Event Handler: Reads an event from the Event Buffer and the goroutine started by the Event Dispatcher handles it.

Event Buffer procedure

Event Buffer consists of many sub buffers, each with a unique ID which is the primary key of the event cached in it. The maximum size of a sub buffer is 1. This allows Event Buffer to deduplicate events having the same ID in the buffer.
The below diagram shows the procedure of pushing an event to Event Buffer. When a new event is pushed to the buffer, the old event sharing the same ID will be replaced. The replaced event is therefore not handled.

Event Handler procedure

The below flowchart shows the procedures executed by the Event Handler. It consists of the common handler flow (in white), and additional procedures for object B events (in green). After creating a new ES document by data loaded from the database, it will get the original document from ES to compare if any field is changed and decide whether it is necessary to send the new document to ES.
When object B event is being handled, on top of the common handler flow, it also cascades the update to the related object A in the ES index. We name this kind of operation Cascade Update.

Issues in the original infrastructure

Data in an ES index can come from multiple MySQL tables as shown below.

The original infrastructure came with a few issues.

Heavy DB load: Consumers read from Kafka streams, treat stream events as notifications then use IDs to load data from the DB to create a new ES document. Data in the stream events are not well utilised. Loading data from the DB every time to create a new ES document results in heavy traffic to the DB. The DB becomes a bottleneck.
Data loss: Producers send data copies to Kafka in application code. Data changes made via MySQL command-line tool (CLT) or other DB management tools are lost.
Tight coupling with MySQL table structure: If producers add a new column to an existing table in MySQL and this column needs to be synchronised to ES, DSP is not able to capture the data changes of this column until the producers make the code change and add the column to the related Kafka Stream.
Redundant ES updates: ES data is a subset of MySQL data. Producers publish data to Kafka streams even if changes are made on fields that are not relevant to ES. These stream events that are irrelevant to ES would still be picked up.
Duplicate cascade updates: Consider a case where the search index contains both object A and object B. A large number of updates to object B are created within a short span of time. All the updates will be cascaded to the index containing both objects A and B. This will bring heavy traffic to the DB.

Optimised Incremental Sync

MySQL Binlog

MySQL binary log (Binlog) is a set of log files that contain information about data modifications made to a MySQL server instance. It contains all statements that update data. There are two types of binary logging:

Statement-based logging: Events contain SQL statements that produce data changes (inserts, updates, deletes).
Row-based logging: Events describe changes to individual rows.

The Grab Caspian team (Data Tech) has built a Change Data Capture (CDC) system based on MySQL row-based Binlog. It captures all the data modifications made to MySQL tables.

Current Kafka streams

The Binlog stream event definition is a common data structure with three main fields: Operation, PayloadBefore and PayloadAfter. The Operation enums are Create, Delete, and Update. Payloads are the data in JSON string format. All Binlog streams follow the same stream event definition. Leveraging PayloadBefore and PayloadAfter in the Binlog event, optimisations of incremental sync on DSP becomes possible.

Stream Consumer optimisations

Event Handler optimisations

Optimisation 1

Remember that there was a redundant ES updates issue mentioned above where the ES data is a subset of the MySQL data. The first optimisation is to filter out irrelevant stream events by checking if the fields that are different between PayloadBefore and PayloadAfter are in the ES data subset.
Since the payloads in the Binlog event are JSON strings, a data structure only with fields that are present in ES data is defined to parse PayloadBefore and PayloadAfter. By comparing the parsed payloads, it is easy to know whether the change is relevant to ES.
The below diagram shows the optimised Event Handler flows. As shown in the blue flow, when an event is handled, PayloadBefore and PayloadAfter are compared first. An event will be processed only if there is a difference between PayloadBefore and PayloadAfter. Since the irrelevant events are filtered, it is unnecessary to get the original document from ES.

Achievements

No data loss: changes made via MySQL CLT or other DB manage tools can be captured.
No dependency on MySQL table definition: All the data is in JSON string format.
No redundant ES updates and DB reads.
ES reads traffic reduced by 90%: Not a need to get the original document from ES to compare with the newly created document anymore.
55% irrelevant stream events filtered out
DB load reduced by 55%

Optimisation 2

The PayloadAfter in the event provides updated data. This makes us think about whether a completely new ES document is needed each time, with its data read from several MySQL tables. The second optimisation is to change to a partial update using data differences from the Binlog event.
The below diagram shows the Event Handler procedure flow with a partial update. As shown in the red flow, instead of creating a new ES document for each event, a check on whether the document exists will be performed first. If the document exists, which happens for the majority of the time, the data is changed in this event, provided the comparison between PayloadBefore and PayloadAfter is updated to the existing ES document.

Achievements

Change most ES relevant events to partial update: Use data in stream events to update ES.
ES load reduced: Only fields that have been changed will be sent to ES.
DB load reduced: DB load reduced by 80% based on Optimisation 1.

Event Buffer optimisation

Instead of replacing the old event, we merge the new event with the old event when the new event is pushed to the Event Buffer.
The size of each sub buffer in Event Buffer is 1. In this optimisation, the stream event is not treated as a notification anymore. We use the Payloads in the event to perform Partial Updates. The old procedure of replacing old events is no longer suitable for the Binlog stream.
When the Event Dispatcher pushes a new event to a non-empty sub buffer in the Event Buffer, it will merge event A in the sub buffer and the new event B into a new Binlog event C, whose PayloadBefore is from Event A and PayloadAfter is from Event B.

Cascade Update optimisation

Optimisation

Use a new stream to handle cascade update events.
When the producer sends data to the Kafka stream, data sharing the same ID will be stored at the same partition. Every DSP service instance has only one stream consumer. When Kafka streams are consumed by consumers, one partition will be consumed by only one consumer. So the Cascade Update events sharing the same ID will be consumed by one stream consumer on the same EC2 instance. With this special mechanism, the in-memory Event Buffer is able to deduplicate most of the Cascade Update events sharing the same ID.
The flowchart below shows the optimised Event Handler procedure. Highlighted in green is the original flow while purple highlights the current flow with Cascade Update events.
When handling an object B event, instead of cascading update the related object A directly, the Event Handler will send a Cascade Update event to the new stream. The consumer of the new stream will handle the Cascade Update event and synchronise the data of object A to the ES.

*Event Handler with Cascade Update events*

Achievements

Cascade Update events deduplicated by 80%
DB load introduced by cascade update reduced

Summary

In this article four different DSP optimisations are explained. After switching to MySQL Binlog streams provided by the Coban team and optimising Stream Consumer, DSP has saved about 91% DB reads and 90% ES reads, and the average queries per second (QPS) of stream traffic processed by Stream Consumer increased from 200 to 800. The max QPS at peak hours could go up to 1000+. With a higher QPS, the duration of processing data and the latency of synchronising data from MySQL to ES was reduced. The data synchronisation ability of DSP has greatly improved after optimisation.

Special thanks to Jun Ying Lim and Amira Khazali for proofreading this article.

Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Increasing developer happiness with GitHub code scanning

2021-09-07 Sam Partington

Post Syndicated from Sam Partington original https://github.blog/2021-09-07-increasing-developer-happiness-github-code-scanning/

You probably already know about using GitHub code scanning to secure your code. But how about using it to make your day-to-day coding easier? We’ve been making internal use of CodeQL, our code analysis engine for code scanning, to keep code quality high by protecting ourselves from those annoying coding mistakes that are easy to make but hard to spot! Read on for some examples of what we’ve done so far and how you can make the most of CodeQL for yourself.

Plugging a memory leak

Go’s defer statement defers the execution of a function until the surrounding function returns. This is useful for cleaning up: For example, closing resources like file handles or completing database transactions.

When changing existing code, you can end up moving a defer statement inside a loop. If you do so, you’ll still have to wait until the end of the function for cleanup; it won’t happen at the end of the iteration. We’ve seen this mistake lead to memory leaks in production.

Wouldn’t it be great if this mistake could be pointed out to you? We wanted to live in that happy world, and all it took was four lines of CodeQL.

A nice postscript to this story is that seeing this query led another team at GitHub to add CodeQL to their repository. They’d been bitten by a defer-in-loop memory leak before and didn’t want it to happen again. Once code scanning was set up for them, CodeQL discovered another problem in their codebase, which was similar to the one we’ll discuss next.

The error you can’t ignore

We use GORM, a Go Object Relational Mapper, in some of our codebases. Error handling in GORM is different than in idiomatic Go code, because it has a chainable API. Here’s an example:

if err := db.Where("name = ?", "jinzhu").First(&user).Error; err !=
nil {
 // error handling...
}

As you can imagine, it’s easy to write code like db.Where("name = ?", "jinzhu").First(&user) and not check that Error field.

At least it used to be easy to do that. We’ve now created a CodeQL query which detects GORM calls that don’t check the associated Error field and flags these calls in pull requests. You’ll also find a similar query for error checking functions which return pointers in the security-and-quality query suite for CodeQL.

Loopy performance problems

In addition to protecting against missing error checking, we also want to keep our database-querying code performant. “N+1 queries” are a common performance issue. This is where some expensive operation is performed once for every member of a set, so the code will get slower as the number of items increases. Database calls in a loop are often the culprit here; typically, you’ll get better performance from a batch query outside of the loop instead.

We created a custom CodeQL query, which looks for calls to any of the GORM methods that actually result in a query being performed. We filter that list of calls down to those that happen within a loop and fail CI if any are encountered. What’s nice about CodeQL is that we’re not limited to database calls directly within the body of a loop―calls within functions called directly or indirectly from the loop are caught too.

Using these queries

These queries are experimental, so we’ve not included them in our standard suites. However, you can use them by referencing a special query suite we’ve created.

First, create a file .github/codeql/go-developer-happiness.qls in the repository you would like to analyze:

- import: codeql-suites/go-developer-happiness.qls
  from: codeql-go

Next, set up a CodeQL workflow (or edit an existing one) and amend the “Initialize CodeQL” section of the template as follows:

- name: Initialize CodeQL
  uses: github/codeql-action/init@v1
  with:
    languages: go
    queries: ./.github/codeql/go-developer-happiness.qls

For more information and configuration examples, please refer to the documentation for running custom CodeQL queries in GitHub code scanning.

Making your own

Are there common “gotchas” in your codebase? Why not ease developer friction with some custom CodeQL queries of your own? You can learn more about writing CodeQL with our documentation and discussions―and also find out more about contributing queries back to the community―in the CodeQL repository at https://github.com/github/codeql. We look forward to seeing what you come up with!

GitHub Availability Report: August 2021

2021-09-01 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-09-01-github-availability-report-august-2021/

In August, we experienced two distinct incidents resulting in significant impact and degraded state of availability for Git operations, API requests, webhooks, issues, pull requests, GitHub Pages, GitHub Packages, and GitHub Actions services.

August 10 15:16 UTC (lasting one hour and 17 minutes)

This incident was caused when one of our MySQL database primaries entered a degraded state, affecting a number of internal services. This caused an impact to GitHub.com services requiring write access to this particular database cluster, which resulted in some users being unable to perform operations.

Investigation had identified an edge case in one of our most active applications, which caused the generation of a poorly performing query capable of impacting overall database capacity. This combined with application retry and queueing logic meant that the MySQL primary was placed into a state where the cluster was unable to automatically recover.

We have been able to address this query, as well as some of the application retry logic, to reduce the chance of recurrence in the future.

One of the novel elements to this incident was the breadth of impact across multiple services. This led to a discussion about the overall service status as we were reporting it within the incident, and, so we’d like to take this opportunity to discuss the approach we took at the time, as well as the way we look to increase our learning potential after the incident.

When we first introduced the monthly availability report, we aimed to provide post-incident reviews for major incidents that impact service availability, in addition to background on how we’re continuing to evolve the process. As part of our standard post-incident analysis process, we are using this incident as a valuable source of data to evaluate the responsiveness of our internal metrics and alerting. These systems guide our responders during incidents on both when to status and what degree of impact to status for. As a result, we’re continuing to tune and optimize these activities to ensure we are able to status—both quickly and accurately—so that we continue to earn the trust our users place in us everyday.

August 10 19:57 UTC (lasting 3 hours and 6 minutes)

Following ongoing maintenance of the Actions service, our service monitors detected a high error rate on workflow runs for new and in progress jobs, which affected the Actions service. This incident resulted in the failure of all queued jobs for a period of time. This was a new incident, unrelated to the earlier issue on August 10. We immediately reverted recent Actions deployments and started to investigate the issue.

The incident was caused by work to set up a new Actions Premium Runner microservice in the Actions service. The impacting portion of this work involved alterations to the service discovery process within the Actions microservices architecture. A bad service record pushed to this system resulted in many of the microservices being unable to make Service-to-Service calls.

Ultimately, the mitigation for this incident was to remove the bad record from the service discovery infrastructure. After investigating whether this mitigation would address the incident, we were able to confidently confirm that the bad record was the root cause of the issue, and removing it would restore the Actions service with no unintended side effects.

We have prioritized several changes as a result of this incident, including fixing this part of the Actions microservice discovery process to properly handle potential bad records. We’ve also added a broader scope of visibility into what’s changed recently across all of the Actions microservices, so we can quickly focus investigations in the correct place.

In Summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub Engineering blog.

Automating Multi-Armed Bandit testing during feature rollout

2021-09-01 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/multi-armed-bandit-system-recommendation

A/B testing is an experiment where a random e-commerce platform user is given two versions of a variable: a control group and a treatment group, to discover the optimal version that maximizes conversion. When running A/B testing, you can take the Multi-Armed Bandit optimisation approach to minimise the loss of conversion due to low performance.

In the traditional software development process, Multi-Armed Bandit (MAB) testing and rolling out a new feature are usually separate processes. The novel Multi-Armed Bandit System for Recommendation solution, hereafter the Multi-Armed Bandit Optimiser, proposes automating the Multi-Armed Bandit testing simultaneously while rolling out the new feature.

Advantages

Automates the MAB testing process during new feature rollouts.
Selects the optimal parameters based on predefined metrics of each use case, which results in an end-to-end solution without the need for user intervention.
Uses the Batched Multi-Armed Bandit and Monte Carlo Simulation, which enables it to process large-scale business scenarios.
Uses a feedback loop to automatically collect recommendation metrics from user event logs and to feed them to the Multi-Armed Bandit Optimiser.
Uses an adaptive rollout method to automatically roll out the best model to the maximum distribution capacity according to the feedback metrics.

Architecture

The following diagram illustrates the system architecture.

System architecture

The novel Multi-Armed Bandit System for Recommendation solution contains three building blocks.

Stream processing framework

A lightweight system that performs basic operations on Kafka Streams, such as aggregation, filtering, and mapping. The proposed solution relies on this framework to pre-process raw events published by mobile apps and backend processes into the proper format that can be fed into the feedback loop.

Feedback loop

A system that calculates the goal metrics and optimises the model traffic distribution. It runs a metrics server which pulls the data from Stalker, which is a time series database that stores the processed events in the last one hour. The metrics server invokes a Spark Job periodically to run the SQL queries that computes the pre-defined goal metrics: the Clickthrough Rate, Conversion Rate and so on, provided by users. The output of the job is dumped into an S3 bucket, and is picked up by optimiser runtime. It runs the Multi-Armed Bandit Optimiser to optimise the model traffic distribution based on the latest goal metrics.

Dynamic value receiver, or the GrabX variable

Multi-Armed Bandit Optimiser modules

The Multi-Armed Bandit Optimiser consists of the following modules:

Reward Update
Batched Multi-Armed Bandit Agent
Monte-Carlo Simulation
Adaptive Rollout

Multi-Armed Bandit Optimiser modules

The goal of the Multi-Armed Bandit Optimisation is to find the optimal Arm that results in the best predefined metrics, and then allocate the maximum traffic to that Arm.

The solution can be illustrated in the following problem. For K Arm, in which the action space A={1,2,…,K}, the Multi-Arm-Bandit Optimiser goal is to solve the one-shot optimisation problem of Formula .

Reward Update module

The Reward Update module collects a batch of the metrics. It calculates the Success and Failure counts, then updates the Beta distribution of each Arm with the Batched Multi-Armed Bandit algorithm.

Multi-Armed Bandit Agent module

In the Multi-Armed Bandit Agent module, each Arm’s metrics are modelled as a Beta distribution which is sampled with Thompson Sampling. The Beta distribution formula is:
Formula .

The Batched Multi-Armed Bandit algorithm updates the Beta distribution with the batch metrics. The optimisation algorithm can be described in the following method.

Batched Multi-Armed Bandit algorithm

Monte-Carlo Simulation module

The Monte-Carlo Simulation module runs the simulation for N repeated times to find the best Arm over a configurable simulation window. Then, it applies the simulated results as each Arm’s distribution percentage for the next round.

To handle different scenarios, we designed two strategies.

Max strategy: We count each Arm’s Success count’s result in Monte-Carlo Simulation, and then compute the next round distribution according to the success rate.
Mean strategy: We average each Arm’s Beta distribution probabilities’s result in Monte-Carlo Simulation, and then compute the next round distribution according to the averaged probabilities of each Arm.

Adaptive Rollout module

The Adaptive Rollout module rolls out the sampled distribution of each Multi-Armed Bandit Arm, in the form of Multi-Armed Bandit Arm Model ID and distribution, to the experimentation platform’s configuration variable. The resulting variable is then read from the online service. The process repeats as it collects feedback from the Adaptive Rollout metrics’ results in the feedback loop.

Multi-Armed Bandit for Recommendation Solution

In the GrabFood Recommended for You widget, there are several food recommendation models that categorise lists of merchants. The choice of the model is controlled through experiments at rollout, and the results of the experiments are analysed offline. After the analysis, data scientists and product managers rectify the model choice based on the experiment results.

The Multi-Armed Bandit System for Recommendation solution improves the process by speeding up the feedback loop with the Multi-Armed Bandit system. Instead of depending on offline data which comes out at T+N, the solution responds to minute-level metrics, and adjusts the model faster.

This results in an optimal solution faster. The proposed Multi-Armed Bandit for Recommendation solution workflow is illustrated in the following diagram.

Multi-Armed Bandit for Recommendation solution workflow

Optimisation metrics

The GrabFood recommendation uses the Effective Conversion Rate metrics as the optimisation objective. The Effective Conversion Rate is defined as the total number of checkouts through the Recommended for You widget, divided by the total widget viewed and multiplied by the coverage rate.

The events of views, clicks, and checkouts are collected over a 30-minute aggregation window and the coverage. A request with a checkout is considered as a success event, while a non-converted request is considered as a failure event.

Multi-Armed Bandit strategy

With the Multi-Armed Bandit Optimiser, the Beta distribution is selected to model the Effective Conversion Rate. The use of the mean strategy in the Monte-Carlo Simulation results in a more stable distribution.

Rollout policy

The Multi-Armed Bandit Optimiser uses the eater ID as the unique entity, applies a policy and assigns different percentages of eaters to each model, based on computed distribution at the beginning of each loop.

Fallback logic

The Multi-Armed Bandit Optimiser first runs model validation to ensure all candidates are suitable for rolling out. If the scheduled MAB job fails, it falls back to a default distribution that is set to 50-50% for each model.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub’s Engineering Team has moved to Codespaces

2021-08-11 Cory Wilkerson

Post Syndicated from Cory Wilkerson original https://github.blog/2021-08-11-githubs-engineering-team-moved-codespaces/

Today, GitHub is making Codespaces available to Team and Enterprise Cloud plans on github.com. Codespaces provides software teams a faster, more collaborative development environment in the cloud. Read more on our Codespaces page.

The GitHub.com codebase is almost 14 years old. When the first commit for GitHub.com was pushed, Rails was only two years old. AWS was one. Azure and GCP did not yet exist. This might not be long in COBOL time, but in internet time it’s quite a lot.

Over those 14 years, the core repository powering GitHub.com (github/github) has seen over a million commits. The vast majority of those commits come from developers building and testing on macOS.

A classic commit message for a classic commit

But our development platform is evolving. Over the past months, we’ve left our macOS model behind and moved to Codespaces for the majority of GitHub.com development. This has been a fundamental shift for our day-to-day development flow. As a result, the Codespaces product is stronger and we’re well-positioned for the future of GitHub.com development.

The status quo

Over the years, we’ve invested significant time and effort in making local development work well out of the box. Our scripts-to-rule-them-all approach has presented a familiar interface to engineers for some time now—new hires could clone github/github, run setup and bootstrap scripts, and have a local instance of GitHub.com running in a half-day’s time. In most cases things just worked, and when they didn’t, our bootstrap script would open a GitHub issue connecting the new hire with internal support. Our #friction Slack channel—staffed by helpful, kind engineers—could debug nearly any system configuration under the sun.

Run GitHub.com locally (eventually) with this one command!

Yet for all our efforts, local development remained brittle. Any number of seemingly innocuous changes could render a local environment useless and, worse still, require hours of valuable development time to recover. Mysterious breakage was so common and catastrophic that we’d codified an option for our bootstrap script: --nuke-from-orbit. When invoked, the script deletes as much as it responsibly can in an attempt to restore the local environment to a known good state.

And of course, this is a classic story that anyone in the software engineering profession will instantly recognize. Local development environments are fragile. And even when functioning perfectly, a single-context, bespoke local development environment felt increasingly out of step with the instant-on, access-from-anywhere world in which we now operate.

Collaborating on multiple branches across multiple projects was painful. We’d often find ourselves staring down a 45-minute bootstrap when a branch introduced new dependencies, shipped schema changes, or branched from a different SHA. Given how quickly our codebase changes (we’re deploying hundreds of changes per day), this was a regular source of engineering friction.

And we weren’t the only ones to take notice—in building Codespaces, we engaged with several best-in-class engineering organizations who had built Codespaces-like platforms to solve these same types of problems. At any significant scale, removing this type of productivity loss becomes a very clear productivity opportunity, very quickly.

This single log message will cause any GitHub engineer to break out in a cold sweat

Development infrastructure

In the infrastructure world, industry best practices have continued to position servers as a commodity. The idea is that no single server is unique, indispensable, or irreplaceable. Any piece could be taken out and replaced by a comparable piece without fanfare. If a server goes down, that’s ok! Tear it down and replace it with another one.

Our local development environments, however, are each unique, with their own special quirks. As a consequence, they require near constant vigilance to maintain. The next git pull or bootstrap can degrade your environment quickly, requiring an expensive context shift to a recovery effort when you’d rather be building software. There’s no convention of a warm laptop standing by.

But there’s a lot to be said for treating development environments as our own—they’re the context in which we spend the majority of our day! We tweak and tune our workbench in service of productivity but also as an expression of ourselves.

With Codespaces, we saw an opportunity to treat our dev environments much like we do infrastructure—a commodity we can churn—but still maintain the ability to curate our workbench. Visual Studio Code extensions, settings sync, and dotfiles repos bring our environment to our compute. In this context, a broken workbench is a minor inconvenience—now we can provision a new codespace at a known good state and get back to work.

Adopting Codespaces

Migrating to Codespaces addressed the shortcomings in our existing developer environments, motivated us to push the product further, and provided leverage to improve our overall development experience.

And while our migration story has a happy ending, the first stages of our transition were… challenging. The GitHub.com repository is almost 13 GB on disk; simply cloning the repository takes 20 minutes. Combined with dependency setup, bootstrapping a GitHub.com codespace would take upwards of 45 minutes. And once we had a repository successfully mounted into a codespace, the application wouldn’t run.

Those 14 years of macOS-centric assumptions baked into our bootstrapping process were going to have to be undone.

Working through these challenges brought out the best of GitHub. Contributors came from across the company to help us revisit past decisions, question long-held assumptions, and work at the source-level to decouple GitHub development from macOS. Finally, we could (albeit very slowly) provision working GitHub.com codespaces on Linux hosts, connect from Visual Studio Code, and ship some work. Now we had to figure out how to make the thing hum.

45 minutes to 5 minutes

Our goal with Codespaces is to embrace a model where development environments are provisioned on-demand for the task at hand (roughly a 1:1 mapping between branches and codespaces.) To support task-based workflows, we need to get as close to instant-on as possible. 45 minutes wasn’t going to meet our task-based bar, but we could see low-hanging fruit, ripe with potential optimizations.

Up first: changing how Codespaces cloned github/github. Instead of performing a full clone when provisioned, Codespaces would now execute a shallow clone and then, after a codespace was created with the most recent commits, unshallow repository history in the background. Doing so reduced clone time from 20 minutes to 90 seconds.

Our next opportunity: caching the network of software and services that support GitHub.com, inclusive of traditional Gemfile-based dependencies as well as services written in C, Go, and a custom build of Ruby. The solution was a GitHub Action that would run nightly, clone the repository, bootstrap dependencies, and build and push a Docker image of the result. The published image was then used as the base image in github/github’s devcontainer—config-as-code for Codespaces environments. Our codespaces would now be created at 95%+ bootstrapped.

These two changes, along with a handful of app and service level optimizations, took GitHub.com codespace creation time from 45 minutes to five minutes. But five minutes is still quite a distance from “instant-on.” Well-known studies have shown people can sustain roughly ten seconds of wait time before falling out of flow. So while we’d made tremendous strides, we still had a way to go.

5 minutes to 10 seconds

While five minutes represented a significant improvement, these changes involved tradeoffs and hinted at a more general product need.

Our shallow clone approach—useful for quickly launching into Codespaces—still required that we pay the cost of a full clone at some point. Unshallowing post-create generated load with distracting side effects. Any large, complex project would face a similar class of problems during which cloning and bootstrapping created contention for available resources.

What if we could clone and bootstrap the repository ahead of time so that by the time an engineer asked for a codespace we’d already done most of the work?

Enter prebuilds: pools of codespaces, fully cloned and bootstrapped, waiting to be connected with a developer who wants to get to work. The engineering investment we’ve made in prebuilds has returned its value many times over: we can now create reliable, preconfigured codespaces, primed and ready for GitHub.com development in 10 seconds.

New hires can go from zero to a functioning development environment in less time than it takes to install Slack. Engineers can spin off new codespaces for parallel workstreams with no overhead. When an environment falls apart—maybe it’s too far behind, or the test data broke something—our engineers can quickly create a new environment and move on with their day.

Increased leverage

The switch to Codespaces solved some very real problems for us: it eliminated the fragility and single-track model of local development environments, but it also gave us a powerful new point of leverage for improving GitHub’s developer experience.

We now have a wedge for performing additional setup and optimization work that we’d never consider in local environments, where the cost of these optimizations (in both time and patience) is too high. For instance, with prebuilds we now prime our language server cache and gem documentation, run pending database migrations, and enable both GitHub.com and GitHub Enterprise development modes—a task that would typically require yet another loop through bootstrap and setup.

With Codespaces, we can upgrade every engineer’s machine specs with a single configuration change. In the early stages of our Codespaces migration, we used 8 core, 16 GB RAM VMs. Those machines were sufficient, but GitHub.com runs a network of different services and will gladly consume every core and nibble of RAM we’re willing to provide. So we moved to 32 core, 64 GB RAM VMs. By changing a single line of configuration, we upgraded every engineer’s machine.

Instant upgrade—ship config and bypass the global supply chain bottleneck

Codespaces has also started to steal business from our internal “review lab” platform—a production-like environment where we preview changes with internal collaborators. Before Codespaces, GitHub engineers would need to commit and deploy to a review lab instance (which often required peer review) in order to share their work with colleagues. Friction. Now we ctrl+click, grab a preview URL, and send it on to a colleague. No commit, no push, no review, no deploy — just a live look at port 80 on my codespace.

Command line

Visual Studio Code is great. It’s the primary tool GitHub.com engineers use to interface with codespaces. But asking our Vim and Emacs users to commit to a graphical editor is less great. If Codespaces was our future, we had to bring everyone along.

Happily, we could support our shell-based colleagues through a simple update to our prebuilt image which initializes sshd with our GitHub public keys, opens port 22, and forwards the port out of the codespace.

From there, GitHub engineers can run Vim, Emacs, or even ed if they so desire.

This has worked exceedingly well! And, much like how Docker image caching led to prebuilds, the obvious next step is taking what we’ve done for the GitHub.com codespace and making it a first-class experience for every codespace.

Reception

Change is hard, doubly so when it comes to development environments. Thankfully, GitHub engineers are curious and kind—and quickly becoming Codespaces superfans.

I used codespaces yesterday while my dev environment was a little broken and I finished the entire features on codespaces before my dev env was done building lol
~@lindseyb

My friends, I’m here to tell you I was a Codespaces skeptic before this started and now I am not. This is the way.
~@iolsen

I really was more productive with respect to the Rails part of my work this week than I think I ever have been before. Everything was just so fast and reliable.
~@jclem

Whomever has worked on getting codespaces up and running, you enabled me to have an awesome first week!
~@bestra

I do solemnly swear that never again will my CPU have to compile ruby from source.
~@latentflip

Codespaces are now the default development environment for GitHub.com. That #friction Slack channel that we mentioned earlier to help debug local development environment problems? We’re planning to archive it.

We’re onboarding more services and more engineers throughout GitHub every day, and we’re discovering new stories about the value Codespaces can generate along the way. But at the core of each story, you’ll discover a consistent theme that resonates with every engineer: I found a better tool, I’m more productive now, and I’m not going back.

GitHub Availability Report: July 2021

2021-08-04 Keith Ballinger

Post Syndicated from Keith Ballinger original https://github.blog/2021-08-04-github-availability-report-july-2021/

In July, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

Protecting Personal Data in Grab’s Imagery

2021-07-26 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/protecting-personal-data-in-grabs-imagery

Image Collection Using KartaView

Starting a few years ago, we realised the strong demand to better understand the streets where our drivers and clients go, with the purpose to better fulfil their needs and also to be able to quickly adapt ourselves to the rapidly changing environment in the Southeast Asia cities.

One way to fulfil that demand was to create an image collection platform named KartaView which is Grab Geo’s platform for geotagged imagery. It empowers collection, indexing, storage, retrieval of imagery, and map data extraction.

KartaView is a public, partially open-sourced product, used both internally and externally by the OpenStreetMap community and other users. As of 2021, KartaView has public imagery in over 100 countries with various coverage degrees, and 60+ cities of Southeast Asia. Check it out at www.kartaview.com.

Figure 1 - KartaView platform — *Figure 1 – KartaView platform*

Why Image Blurring is Important

Many incidental people and licence plates are in the collected images, whose privacy is a serious concern. We deeply respect all of them and consequently, we are using image obfuscation as the most effective anonymisation method for ensuring privacy protection.

Because manually annotating the regions in the picture where faces and licence plates are located is impractical, this problem should be solved using machine learning and engineering techniques. Hence we detect and blur all faces and licence plates which could be considered as personal data.

Figure 2 - Sample blurred picture — *Figure 2 – Sample blurred picture*

In our case, we have a wide range of picture types: regular planar, very wide and 360 pictures in equirectangular format collected with 360 cameras. Also, because we are collecting imagery globally, the vehicle types, licence plates, and human environments are quite diverse in appearance, and are not handled well by off-the-shelf blurring software. So we built our own custom blurring solution which yielded higher accuracy and better cost-efficiency overall with respect to blurring of personal data.

Figure 3 - Example of equirectangular image where personal data has to be blurred — *Figure 3 – Example of equirectangular image where personal data has to be blurred*

Behind the scenes, in KartaView, there are a set of cool services which are able to derive useful information from the pictures like image quality, traffic signs, roads, etc. A big part of them are using deep learning algorithms which potentially can be negatively affected by running them over blurred pictures. In fact, based on the assessment we have done so far, the impact is extremely low, similar to the one reported in a well known study of face obfuscation in ImageNet [9].

Outline of Grab’s Blurring Process

Roughly, the processing steps are the following:

Transform each picture into a set of planar images. In this way, we further process all pictures, whatever the format they had, in the same way.
Use an object detector able to detect all faces and licence plates in a planar image having a standard field of view.
Transform the coordinates of the detected regions into original coordinates and blur those regions.

Figure 4 - Picture’s processing steps [8] — *Figure 4 – Picture’s processing steps [8]*

In the following section, we are going to describe in detail the interesting aspects of the second step, sharing the challenges and how we were solving them. Let’s start with the first and most important part, the dataset.

Dataset

Our current dataset consists of images from a wide range of cameras, including normal perspective cameras from mobile phones, wide field of view cameras and also 360 degree cameras.

It is the result of a series of data collections contributed by Grab’s data tagging teams, which may contain 2 classes of dataset that are of interest for us: FACE and LICENSE_PLATE.

The data was collected using Grab internal tools, stored in queryable databases, making it a system that gives the possibility to revisit the data and correct it if necessary, but also making it possible for data engineers to select and filter the data of interest.

Dataset Evolution

Each iteration of the dataset was made to address certain issues discovered while having models used in a production environment and observing situations where the model lacked in performance.

	Dataset v1	Dataset v2	Dataset v3
Nr. images	15226	17636	30538
Nr. of labels	64119	86676	242534

If the first version was basic, containing a rough tagging strategy we quickly noticed that it was not detecting some special situations that appeared due to the pandemic situation: people wearing masks.

This led to another round of data annotation to include those scenarios.
The third iteration addressed a broader range of issues:

Small regions of interest (objects far away from the camera)

Objects in very dark backgrounds

Rotated objects or even upside down

Variation of the licence plate design due to images from different countries and regions

People wearing masks

Faces in the mirror – see below the mirror of the motorcycle

But the main reason was because of a scenario where the recording, at the start or end (but not only), had close-ups of the operator who was checking the camera. This led to images with large regions of interest containing the camera operator’s face – too large to be detected by the model.

An investigation in the dataset structure, by splitting the data into bins based on the bbox sizes (in pixels), made something clear: the dataset was unbalanced.

We made bins for tag sizes with a stride of 100 pixels and went up to the max present in the dataset which accounted for 1 sample of size 2000 pixels. The majority of the labels were small in size and the higher we would go with the size, the less tags we would have. This made it clear that we would need more targeted annotations for our dataset to try to balance it.

All these scenarios required the tagging team to revisit the data multiple times and also change the tagging strategy by including more tags that were considered at a certain limit. It also required them to pay more attention to small details that may have been missed in a previous iteration.

Data Splitting

To better understand the strategy chosen for splitting the data we need to also understand the source of the data. The images come from different devices that are used in different geo locations (different countries) and are from a continuous trip recording. The annotation team used an internal tool to visualise the trips image by image and mark the faces and licence plates present in them. We would then have access to all those images and their respective metadata.

The chosen ratios for splitting are:

Train 70%
Validation 10%
Test 20%

Number of train images	12733
Number of validation images	1682
Number of test images	3221
Number of labeled classes in train set	60630
Number of labeled classes in validation set	7658
Number of of labeled classes in test set	18388

The split is not so trivial as we have some requirements and need to complete some conditions:

An image can have multiple tags from one or both classes but must belong to just one subset.
The tags should be split as close as possible to the desired ratios.
As different images can belong to the same trip in a close geographical relation we need to force them in the same subset, thus avoiding similar tags in train and test subsets, resulting in incorrect evaluations.

Data Augmentation

The application of data augmentation plays a crucial role while training the machine learning model. There are mainly three ways in which data augmentation techniques can be applied. They are:

Offline data augmentation – enriching a dataset by physically multiplying some of its images and applying modifications to them.
Online data augmentation – on the fly modifications of the image during train time with configurable probability for each modification.
Combination of both offline and online data augmentation.

In our case, we are using the third option which is the combination of both.

The first method that contributes to offline augmentation is a method called image view splitting. This is necessary for us due to different image types: perspective camera images, wide field of view images, 360 degree images in equirectangular format. All these formats and field of views with their respective distortions would complicate the data and make it hard for the model to generalise it and also handle different image types that could be added in the future.

For this we defined the concept of image views which are an extracted portion (view) of an image with some predefined properties. For example, the perspective projection of 75 by 75 degrees field of view patches from the original image.

Here we can see a perspective camera image and the image views generated from it:

Figure 5 - Original image — *Figure 5 – Original image*

Figure 6 - Two image views generated — *Figure 6 – Two image views generated*

The important thing here is that each generated view is an image on its own with the associated tags. They also have an overlapping area so we have a possibility to contain the same tag in two views but from different perspectives. This brings us to an indirect outcome of the first offline augmentation.

The second method for offline augmentation is the oversampling of some of the images (views). As mentioned above, we faced the problem of an unbalanced dataset, specifically we were missing tags that occupied high regions of the image, and even though our tagging teams tried to annotate as many as they could find, these were still scarce.

As our object detection model is an anchor-based detector, we did not even have enough of them to generate the anchor boxes correctly. This could be clearly seen in the accuracy of the previous trained models, as they were performing poorly on bins of big sizes.

By randomly oversampling images that contained big tags, up to a minimum required number, we managed to have better anchors and increase the recall for those scenarios. As described below, the chosen object detector for blurring was YOLOv4 which offers a large variety of online augmentations. The online augmentations used are saturation, exposure, hue, flip and mosaic.

Model

As of summer of 2021, the “to go” solution for object detection in images are convolutional neural networks (CNN), being a mature solution able to fulfil the needs efficiently.

Architecture

Most CNN based object detectors have three main parts: Backbone, Neck and (Dense or Sparse Prediction) Heads. From the input image, the backbone extracts features which can be combined in the neck part to be used by the prediction heads to predict object bounding-boxes and their labels.

Figure 7 - Anatomy of one and two-stage object detectors [1] — *Figure 7 – Anatomy of one and two-stage object detectors [1]*

The backbone is usually a CNN classification network pretrained on some dataset, like ImageNet-1K. The neck combines features from different layers in order to produce rich representations for both large and small objects. Since the objects to be detected have varying sizes, the topmost features are too coarse to represent smaller objects, so the first CNN based object detectors were fairly weak in detecting small sized objects. The multi-scale, pyramid hierarchy is inherent to CNNs so [2] introduced the Feature Pyramid Network which at marginal costs combines features from multiple scales and makes predictions on them. This or improved variants of this technique is used by most detectors nowadays. The head part does the predictions for bounding boxes and their labels.

YOLO is part of the anchor-based one-stage object detectors family being developed originally in Darknet, an open source neural network framework written in C and CUDA. Back in 2015 it was the first end-to-end differentiable network of this kind that offered a joint learning of object bounding boxes and their labels.

One reason for the big success of newer YOLO versions is that the authors carefully merged new ideas into one architecture, the overall speed of the model being always the north star.

YOLOv4 introduces several changes to its v3 predecessor:

Backbone – CSPDarknet53: YOLOv3 Darknet53 backbone was modified to use Cross Stage Partial Network (CSPNet [5]) strategy, which aims to achieve richer gradient combinations by letting the gradient flow propagate through different network paths.
Multiple configurable augmentation and loss function types, so called “Bag of freebies”, which by changing the training strategy can yield higher accuracy without impacting the inference time.
Configurable necks and different activation functions, they call “Bag of specials”.

Insights

For this task, we found that YOLOv4 gave a good compromise between speed and accuracy as it has doubled the speed of a more accurate two-stage detector while maintaining a very good overall precision/recall. For blurring, the main metric for model selection was the overall recall, while precision and intersection over union (IoU) of the predicted box comes second as we want to catch all personal data even if some are wrong. Having a multitude of possibilities to configure the detector architecture and train it on our own dataset we conducted several experiments with different configurations for backbones, necks, augmentations and loss functions to come up with our current solution.

We faced challenges in training a good model as the dataset posed a large object/box-level scale imbalance, small objects being over-represented in the dataset. As described in [3] and [4], this affects the scales of the estimated regions and the overall detection performance. In [3] several solutions are proposed for this out of which the SPP [6] blocks and PANet [7] neck used in YOLOv4 together with heavy offline data augmentation increased the performance of the actual model in comparison to the former ones.

As we have evaluated the model; it still has some issues:

Occlusion of the object, either by the camera view, head accessories or other elements:

These cases would need extra annotation in the dataset, just like the faces or licence plates that are really close to the camera and occupy a large region of interest in the image.

As we have a limited number of annotations of close objects to the camera view, the model has incorrectly learnt this, sometimes producing false positives in these situations:

Again, one solution for this would be to include more of these scenarios in the dataset.

What’s Next?

Grab spends a lot of effort ensuring privacy protection for its users so we are always looking for ways to further improve our related models and processes.

As far as efficiency is concerned, there are multiple directions to consider for both the dataset and the model. There are two main factors that drive the costs and the quality: further development of the dataset for additional edge cases (e.g. more training data of people wearing masks) and the operational costs of the model.

As the vast majority of current models require a fully labelled dataset, this puts a large work effort on the Data Entry team before creating a new model. Our dataset increased 4x for it’s third version, still there is room for improvement as described in the Dataset section.

As Grab extends its operation in more cities, new data is collected that has to be processed, this puts an increased focus on running detection models more efficiently.

Directions to pursue to increase our efficiency could be the following:

As plenty of unlabelled data is available from imagery collection, a natural direction to explore is self-supervised visual representation learning techniques to derive a general vision backbone with superior transferring performance for our subsequent tasks as detection, classification.
Experiment with optimisation techniques like pruning and quantisation to get a faster model without sacrificing too much on accuracy.
Explore new architectures: YOLOv5, EfficientDet or Swin-Transformer for Object Detection.
Introduce semi-supervised learning techniques to improve our model performance on the long tail of the data.

References

Alexey Bochkovskiy et al.. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv:2004.10934v1
Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. arXiv:1612.03144v2
Kemal Oksuz et al.. Imbalance Problems in Object Detection: A Review. arXiv:1909.00169v3
Bharat Singh, Larry S. Davis. An Analysis of Scale Invariance in Object Detection – SNIP. arXiv:1711.08189v2
Chien-Yao Wang et al. CSPNet: A New Backbone that can Enhance Learning Capability of CNN. arXiv:1911.11929v1
Kaiming He et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. arXiv:1406.4729v4
Shu Liu et al. Path Aggregation Network for Instance Segmentation. arXiv:1803.01534v4
http://blog.nitishmutha.com/equirectangular/360degree/2017/06/12/How-to-project-Equirectangular-image-to-rectilinear-view.html
Kaiyu Yang et al. Study of Face Obfuscation in ImageNet: arxiv.org/abs/2103.06191
Zhenda Xie et al. Self-Supervised Learning with Swin Transformers. arXiv:2105.04553v2

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Processing ETL tasks with Ratchet

2021-07-19 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/processing-etl-tasks-with-ratchet

Overview

At Grab, the Lending team is focused towards building products that help finance various segments of users, such as Passengers, Drivers, or Merchants, based on their needs. The team builds products that enable users to avail funds in a seamless and hassle-free way. In order to achieve this, multiple lending microservices continuously interact with each other. Each microservice handles different responsibilities, such as providing offers, storing user information, disbursing availed amounts to a user’s account, and many more.

In this tech blog, we will discuss what Data and Extract, Transform and Load (ETL) pipelines are and how they are used for processing multiple tasks in the Lending Team at Grab. We will also discuss Ratchet, which is a Go library, that helps us in building data pipelines and handling ETL tasks. Let’s start by covering the basis of Data and ETL pipelines.

What is a Data Pipeline?

A Data pipeline is used to describe a system or a process that moves data from one platform to another. In between platforms, data passes through multiple steps based on defined requirements, where it may be subjected to some kind of modification. All the steps in a Data pipeline are automated, and the output from one step acts as an input for the next step.

What is an ETL Pipeline?

An ETL pipeline is a type of Data pipeline that consists of 3 major steps, namely extraction of data from a source, transformation of that data into the desired format, and finally loading the transformed data to the destination. The destination is also known as the sink.

*Extract-Transform-Load (Source: TatvaSoft)*

The combination of steps in an ETL pipeline provides functions to assure that the business requirements of the application are achieved.

Let’s briefly look at each of the steps involved in the ETL pipeline.

Data Extraction

Data extraction is used to fetch data from one or multiple sources with ease. The source of data can vary based on the requirement. Some of the commonly used data sources are:

Database
Web-based storage (S3, Google cloud, etc)
Files
User Feeds, CRM, etc.

The data format can also vary from one use case to another. Some of the most commonly used data formats are:

SQL
CSV
JSON
XML

Once data is extracted in the desired format, it is ready to be fed to the transformation step.

Data Transformation

Data transformation involves applying a set of rules and techniques to convert the extracted data into a more meaningful and structured format for use. The extracted data may not always be ready to use. In order to transform the data, one of the following techniques may be used:

Filtering out unnecessary data.
Preprocessing and cleaning of data.
Performing validations on data.
Deriving a new set of data from the existing one.
Aggregating data from multiple sources into a single uniformly structured format.

Data Loading

The final step of an ETL pipeline involves moving the transformed data to a sink where it can be accessed for its use. Based on requirements, a sink can be one of the following:

Database
File
Web-based storage (S3, Google cloud, etc)

An ETL pipeline may or may not have a loadstep based on its requirements. When the transformed data needs to be stored for further use, the loadstep is used to move the transformed data to the storage of choice. However, in some cases, the transformed data may not be needed for any further use and thus, the loadstep can be skipped.

Now that you understand the basics, let’s go over how we, in the Grab Lending team, use an ETL pipeline.

Why Use Ratchet?

At Grab, we use Golang for most of our backend services. Due to Golang’s simplicity, execution speed, and concurrency support, it is a great choice for building data pipeline systems to perform custom ETL tasks.

Given that Ratchet is also written in Go, it allows us to easily build custom data pipelines.

Go channels are connecting each stage of processing, so the syntax for sending data is intuitive for anyone familiar with Go. All data being sent and received is in JSON, providing a nice balance of flexibility and consistency.

Utilising Ratchet for ETL Tasks

We use Ratchet for multiple ETL tasks like batch processing, restructuring and rescheduling of loans, creating user profiles, and so on. One of the backend services, named Azkaban, is responsible for handling various ETL tasks.

Ratchet uses Data Processors for building a pipeline consisting of multiple stages. Data Processors each run in their own goroutine so all of the data is processed concurrently. Data Processors are organised into stages, and those stages are run within a pipeline. For building an ETL pipeline, each of the three steps (Extract, Transform and Load) use a Data Processor for implementation. Ratchet provides a set of built-in, useful Data Processors, while also providing an interface to implement your own. Usually, the transform stage uses a Custom Data Processor.

*Data Processors in Ratchet (Source: Github)*

Let’s take a look at one of these tasks to understand how we utilise Ratchet for processing an ETL task.

Whitelisting Merchants Through ETL Pipelines

Whitelisting essentially means making the product available to the user by mapping an offer to the user ID. If a merchant in Thailand receives an option to opt for Cash Loan, it is done by whitelisting that merchant. In order to whitelist our merchants, our Operations team uses an internal portal to upload a CSV file with the user IDs of the merchants and other required information. This CSV file is generated by our internal Data and Risk team and handed over to the Operations team. Once the CSV file is uploaded, the user IDs present in the file are whitelisted within minutes. However, a lot of work goes in the background to make this possible.

Data Extraction

Once the Operations team uploads the CSV containing a list of merchant users to be whitelisted, the file is stored in S3 and an entry is created on the Azkaban service with the document ID of the uploaded file.

The data extraction step makes use of a Custom CSV Data Processor that uses the document ID to first create a PreSignedUrl and then uses it to fetch the data from S3. The data extracted is in bytes and we use commas as the delimiter to format the CSV data.

Data Transformation

In order to transform the data, we define a Custom Data Processor that we call a Transformer for each ETL pipeline. Transformers are responsible for applying all necessary transformations to the data before it is ready for loading. The transformations applied in the merchant whitelisting transformers are:

Convert data from bytes to struct.
Check for presence of all mandatory fields in the received data.
Perform validation on the data received.
Make API calls to external microservices for whitelisting the merchant.

As mentioned earlier, the CSV file is uploaded manually by the Operations team. Since this is a manual process, it is prone to human errors. Validation of data in the data transformation step helps avoid these errors and not propagate them further up the pipeline. Since CSV data consists of multiple rows, each row passes through all the steps mentioned above.

Data Loading

Whenever the merchants are whitelisted, we don’t need to store the transformed data. As a result, we don’t have a loadstep for this ETL task, so we just use an Empty Data Processor. However, this is just one of many use cases that we have. In cases where the transformed data needs to be stored for further use, the loadstep will have a Custom Data Processor, which will be responsible for storing the data.

Connecting All Stages

After defining our Data Processors for each of the steps in the ETL pipeline, the final piece is to connect all the stages together. As stated earlier, the ETL tasks have different ETL pipelines and each ETL pipeline consists of 3 stages defined by their Data Processors.

In order to connect these 3 stages, we define a Job Processor for each ETL pipeline. A Job Processor represents the entire ETL pipeline and encompasses Data Processors for each of the 3 stages. Each Job Processor implements the following methods:

SetSource: Assigns the Data Processor for the Extraction stage.
SetTransformer: Assigns the Data Processor for the Transformation stage.
SetDestination: Assigns the Data Processor for the Load stage.
Execute: Runs the ETL pipeline.

*Job processors containing Data Processor for each stage in ETL*

When the Azkaban service is initialised, we run the SetSource(), SetTransformer() and SetDestination() methods for each of the Job Processors defined. When an ETL task is triggered, the Execute() method of the corresponding Job Processor is run. This triggers the ETL pipeline and gradually runs the 3 stages of ETL pipeline. For each stage, the Data Processor assigned during initialisation is executed.

Conclusion

ETL pipelines help us in streamlining various tasks in our team. As showcased through the example in the above section, an ETL pipeline breaks a task into multiple stages and divides the responsibilities across these stages.

In cases where a task fails in the middle of the process, ETL pipelines help us determine the cause of the failure quickly and accurately. With ETL pipelines, we have reduced the manual effort required for validating data at each step and avoiding propagation of errors towards the end of the pipeline.

Through the use of ETL pipelines and schedulers, we at Lending have been able to automate the entire pipeline for many tasks to run at scheduled intervals without any manual effort involved at all. This has helped us tremendously in reducing human errors, increasing the throughput of the system and making the backend flow more reliable. As we continue to automate more and more of our tasks that have tightly defined stages, we foresee a growth in our ETL pipelines usage.

References

https://www.alooma.com/blog/what-is-a-data-pipeline

http://rkulla.blogspot.com/2016/01/data-pipeline-and-etl-tasks-in-go-using

https://medium.com/swlh/etl-pipeline-and-data-pipeline-comparison-bf89fa240ce9

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

App Moduralisation at Scale

2021-07-13 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/app-moduralisation-at-scale

Grab a coffee ☕️, sit back and enjoy reading. 😃

Wanna know how we improved our app’s build time performance and developer experience at Grab? Continue reading…

Where it all began

Imagine you are working on an app that grows continuously as more and more features are added to it, it becomes challenging to manage the code at some point. Code conflicts increase due to coupling, development slows down, releases take longer to ship, collaboration becomes difficult, and so on.

Grab superapp is one such app that offers many services like booking taxis, ordering food, payments using an e-wallet, transferring money to friends/families, paying at merchants, and many more, across Southeast Asia.

Grab app followed a monolithic architecture initially where the entire code was held in a single module containing all the UI and business logic for almost all of its features. But as the app grew, new developers were hired, and more features were built, it became difficult to work on the codebase. We had to think of better ways to maintain the codebase, and that’s when the team decided to modularise the app to solve the issues faced.

What is Modularisation?

Breaking the monolithic app module into smaller, independent, and interchangeable modules to segregate functionality so that every module is responsible for executing a specific functionality and will contain everything necessary to execute that functionality.

Modularising the Grab app was not an easy task as it brought many challenges along with it because of its complicated structure due to the high amount of code coupling.

Approach and Design

We divided the task into the following sub-tasks to ensure that only one out of many functionalities in the app was impacted at a time.

Setting up the infrastructure by creating Base/Core modules for Networking, Analytics, Experimentation, Storage, Config, and so on.
Building Shared Library modules for Styling, Common-UI, Utils, etc.
Incrementally building Feature modules for user-facing features like Payments Home, Wallet Top Up, Peer-to-Merchant (P2M) Payments, GrabCard and many others.
Creating Kit modules for the feature to feature module communication. This step helped us in building the feature modules in parallel.
Finally, the App module is used as a hub to connect all the other modules together using dependency injection (Dagger).

In the above diagram, payments-home, wallet top-up, and grabcard are different features provided by the Grab app. top-up-kit and grabcard-kit are bridges that expose functionalities from topup and grabcard modules to the payments-home module, respectively.

In the process of modularising the Grab app, we ensured that a feature module did not directly depend on other feature modules so that they could be built in parallel using the available CPU cores of the machine, hence reducing the overall build time of the app.

With the Kit module approach, we separated our code into independent layers by depending only on abstractions instead of concrete implementation.

Modularisation Benefits

Faster build times and hence faster CI: Gradle build system compiles only the changed modules and uses the binaries of all the non-affected modules from its cache. So the compilation becomes faster. Moreover, independent modules are run in parallel on different threads.
Fine dependency graph: Dependencies of a module are well defined.
Reusability across other apps: Modules can be used across different apps by converting them into an AAR SDK.
Scale and maintainability: Teams can work independently on the modules owned by them without blocking each other.
Well-defined code ownership: Clear responsibility on who owns which code.

Limitations

Requires more effort and time to modularise an app.
Separate configuration files to be maintained for each module.
Gradle sync time starts to grow.
IDE becomes very slow and its memory usage goes up a lot.
Parallel execution of the module depends on the machine’s capabilities.

Where we are now

There are more than 1,000 modules in the Grab app and are still counting.

At Grab, we have many sub-teams which take care of different features available in the app. Grab Financial Group (GFG) is one such sub-team that handles everything related to payments in the app. For example: P2P & P2M money transfers, e-Wallet activation, KYC, and so on.

We started modularising payments further in July 2020 as it was already bombarded with too many features and it was difficult for the team to work on the single payments module. The result of payments modularisation is shown in the following chart.

As of today, we have about 200+ modules in GFG and more than 95% of the modules take less than 15s to build.

Conclusion

Modularisation has helped us a lot in reducing the overall build time of the app and also, in improving the developer experience by breaking dependencies and allowing us to define code ownership. Having said that, modularisation is not an easy or a small task, especially for large projects with legacy code. However, with careful planning and the right design, modularisation can help in forming a well-structured and maintainable project.

Hope you enjoyed reading. Don’t forget to 👏.

References:

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

App Modularisation at Scale

2021-07-13 Grab Tech

Post Syndicated from Grab Tech original https://engineering.grab.com/app-modularisation-at-scale

Grab a coffee ☕️, sit back and enjoy reading. 😃

Wanna know how we improved our app’s build time performance and developer experience at Grab? Continue reading…

Where it all began

What is Modularisation?

Modularising the Grab app was not an easy task as it brought many challenges along with it because of its complicated structure due to the high amount of code coupling.

Approach and Design

We divided the task into the following sub-tasks to ensure that only one out of many functionalities in the app was impacted at a time.

Setting up the infrastructure by creating Base/Core modules for Networking, Analytics, Experimentation, Storage, Config, and so on.
Building Shared Library modules for Styling, Common-UI, Utils, etc.
Incrementally building Feature modules for user-facing features like Payments Home, Wallet Top Up, Peer-to-Merchant (P2M) Payments, GrabCard and many others.
Creating Kit modules for the feature to feature module communication. This step helped us in building the feature modules in parallel.
Finally, the App module is used as a hub to connect all the other modules together using dependency injection (Dagger).

With the Kit module approach, we separated our code into independent layers by depending only on abstractions instead of concrete implementation.

Modularisation Benefits

Faster build times and hence faster CI: Gradle build system compiles only the changed modules and uses the binaries of all the non-affected modules from its cache. So the compilation becomes faster. Moreover, independent modules are run in parallel on different threads.
Fine dependency graph: Dependencies of a module are well defined.
Reusability across other apps: Modules can be used across different apps by converting them into an AAR SDK.
Scale and maintainability: Teams can work independently on the modules owned by them without blocking each other.
Well-defined code ownership: Clear responsibility on who owns which code.

Limitations

Requires more effort and time to modularise an app.
Separate configuration files to be maintained for each module.
Gradle sync time starts to grow.
IDE becomes very slow and its memory usage goes up a lot.
Parallel execution of the module depends on the machine’s capabilities.

Where we are now

There are more than 1,000 modules in the Grab app and are still counting.

As of today, we have about 200+ modules in GFG and more than 95% of the modules take less than 15s to build.

Conclusion

Hope you enjoyed reading. Don’t forget to 👏.

References:

Join Us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Getting started

What’s next?

The bigger picture: developer productivity at GitHub

November 27 20:40 UTC (lasting 2 hours and 50 minutes)

In summary

Empowering support and on-call engineers

Finding the needle in the haystack

Automating playbooks

Looking to the future

Tip #1: A little YAML can make frontend work easier

Tip #2: A few DevOps tools to keep you moving fast

Git

Cloud-hosted integrated development environments (IDE)

Containers

Tip #3: Automated testing and continuous integration (CI) to stay one step ahead

Using GitHub Actions to run automated tests

Using GitHub Actions to create CI pipelines

Tip #4: Server orchestration tips for flexibility and speed

Tip #5: Repeatable tasks? Try scripting them with Bash or PowerShell

The bottom line

Additional resources

Why is this necessary?

Do I need to do anything?

How do I migrate my service?

Tell us what you think

Sparse index

Multi-pack reachability bitmaps

A new default merge strategy

Tidbits

The rest of the iceberg

The Git index

The index affects Git performance at scale

The sparse index

Building the sparse index safely

Example implementation detail: git diff

Implementation detail: ORT merge strategy

Testing the sparse index

The current state of the sparse index

Looking to the future

October 8 17:16 UTC (lasting 1 hour and 36 minutes)

In summary

Virtual partitions

Schema domains

SQL linters

Query linter

Transaction linter

Moving data without downtime

Vitess

Write-cutover process

Learnings

Results

More partitioning

Conclusion

Background

Kafka to DSP

MySQL to ES

Original Incremental Sync

Original Kafka streams

Stream Consumer infrastructure

Event Buffer procedure

Event Handler procedure

Issues in the original infrastructure

Optimised Incremental Sync

MySQL Binlog

Current Kafka streams

Stream Consumer optimisations

Event Handler optimisations

Optimisation 1

Achievements

Optimisation 2

Achievements

Event Buffer optimisation

Cascade Update optimisation

Optimisation

Achievements

Summary

Join Us

Plugging a memory leak

The error you can’t ignore

Loopy performance problems

Example implementation detail: `git diff`