Making GitHub CI workflow 3x faster

Post Syndicated from Keerthana Kumar original https://github.blog/2020-10-29-making-github-ci-workflow-3x-faster/

Welcome to the first deep dive of the Building GitHub blog series, providing a look at how teams across the GitHub engineering organization identify and address opportunities to improve our internal development tooling and infrastructure.

At GitHub, we use the Four Key Metrics of high performing software development to help frame our engineering fundamentals effort. As we measured Lead Time for Changes—the time it takes for code to be successfully running in production—we identified that developers waited an average of 45 minutes for a successful run of our continuous integration suite to complete before merging any change. This 45-minute lead time was repeated once more before deploying a merge branch. In a perfect scenario, a developer waited almost two hours after checking in code before the change went live on GitHub.com. This 45-minute CI now takes only 15 minutes to run! Here is a deep dive on how we made GitHub’s CI workflow 3x faster.

Analyzing the problem

At this moment the monumental Ruby monolith that powers millions of developers on GitHub.com, has over 7,000 test suites and over 5,000 test files. Every commit to a pull request triggers 25 CI jobs and requires 15 of those CI jobs to complete before merging a pull request. This meant that a developer at GitHub spent approximately 45 minutes and 600 cores of computing resources for every commit. That’s a lot of developer-hours and machine-hours that could be spent creating value for our customers.

Analyzing the types of CI jobs, we identified four categories: unit testing, linting/performance, integration testing, builds/deployments. All jobs except two of the integration testing jobs took less than 13 minutes to run. The two integration testing jobs were the bottleneck in our Lead Time for Changes. As it is true for most DevOps cycles, several test suites were also flaky. Although this blog post isn’t going to share how we solved for the flakiness of our tests, spoiler alert, a future post in this series will explain that process. Apart from being flaky, the two integration testing jobs increased developer friction and reduced productivity at GitHub.

Engineering decision

GitHub Enterprise Server, the on-premise offering of GitHub used by our enterprise customers, ships a new patch release every two weeks and a major release every quarter. The two long running test suites were added to the CI workflow to ensure a pull request did not break the GitHub experience for our Enterprise Server customers. It was also clear that these 45-minute test suites did not provide additional value blocking GitHub.com deployments that happen continuously throughout the day. Driven by customer obsession and developer satisfaction, we developed the deferred compliance tool.

Deferred compliance

The deferred compliance tool integrated along with our CI workflow system aims to strike a critical balance between improving Lead Time for Change in deploying GitHub.com and creating accountability for the quality of Enterprise Server. The long running CI jobs are no longer required to pass before a pull request is merged but the deferred compliance tool is monitoring for any test failure.

If a CI job fails, a GitHub issue with a deferred compliance label is created and the pull request author and code segment’s code owners are tagged. A warning message is sent on Slack to the developer and a 72-hour timer is kicked off. The developer now has 72 hours to fix the build, push a change or revert the pull request. A successful run of the CI job automatically closes the compliance issue and the 72-hour timer is turned off. If the CI job remains broken for more than 72 hours, all deployments to GitHub.com are halted, barring any exceptional situations, until the integration tests for Enterprise Server are fixed. This creates accountability and ownership for all our developers to build features that work flawlessly on GitHub.com and Enterprise Server. The 72-hour timer is customizable but our analysis showed that with a global team of developers, 72 hours reduced the possibility that a change merged by a developer in San Francisco on a Friday afternoon did not unintentionally block deployments for a developer in Sydney on Monday morning. Deferred compliance can be used for any long running CI run that does not need to block deployments while creating a call for action for CI run failures.

Key Takeaways

  • Internal engineering tooling is a powerful resource to support developers and at the same time provide guardrails for product consistency.
  • Focusing on a key metric allows us to identify bottlenecks and develop simple and creative solutions.
  • Comprehending historical context for past decisions and being customer obsessed provides us an opportunity to build a more thoughtful engineering design.

Overall, this project is a testimony that a simple solution can significantly improve developer productivity and that can have long-term positive implications to an engineering organization. And of course, since numbers matter, we made our CI 3x faster.