All posts by Keith Ballinger

An update on recent service disruptions

Post Syndicated from Keith Ballinger original https://github.blog/2022-03-23-an-update-on-recent-service-disruptions/

Over the past few weeks, we have experienced multiple incidents caused by degraded database health, which resulted in degraded service across our platform. We know these incidents impact many of our customers’ productivity, and we take that very seriously. We want to share what we know about these incidents while our team continues to address them.

The underlying theme of our issues over the past few weeks has been resource contention in our mysql1 cluster, which impacted the performance of a large number of our services and features during periods of peak load. Over the past several years, we’ve shared how we’ve been partitioning our main database and adding clusters to support our growth, but we are still actively working on this problem today. We will share more in our next Availability Report, but I’d like to be transparent and share what we know now.

Timeline

March 16 14:09 UTC (lasting 5 hours and 36 minutes)

At this time, GitHub saw increased load during peak hours on our mysql1 database, causing our database proxying technology to reach its maximum number of connections. This database is shared by multiple services and receives heavy read/write traffic. Write operations were unavailable for the duration of the outage, affecting git operations, webhooks, pull requests, API requests, issues, GitHub Packages, GitHub Codespaces, GitHub Actions, and GitHub Pages.

The incident appeared to be related to peak load combined with poor query performance under specific circumstances. Our MySQL clusters use a classic primary-replica setup for high availability, in which a single primary node accepts writes while the rest of the cluster consists of replica nodes that serve read traffic. We recovered by failing over to a healthy replica and began investigating how traffic patterns at peak load affected query performance.
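
As a minimal sketch of what that recovery step involves (node names and the health model are hypothetical and simplified far beyond the real tooling), a failover demotes the degraded primary and promotes the healthiest replica to accept writes:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool
    replication_lag_s: float  # seconds behind the primary
    is_primary: bool = False

def fail_over(primary: Node, replicas: list[Node]) -> Node:
    """Promote the healthy replica with the least replication lag."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    new_primary = min(candidates, key=lambda r: r.replication_lag_s)
    primary.is_primary = False      # demote the degraded primary
    new_primary.is_primary = True   # write traffic is redirected here
    return new_primary

# Example: the mysql1 primary is saturated; two replicas are healthy.
old = Node("mysql1-primary", healthy=False, replication_lag_s=0.0, is_primary=True)
replicas = [Node("mysql1-replica-a", True, 0.4), Node("mysql1-replica-b", True, 1.2)]
print(fail_over(old, replicas).name)  # -> mysql1-replica-a
```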

March 17 13:46 UTC (lasting 2 hours and 28 minutes)

The following day, we saw the same peak traffic pattern and load on mysql1. We were not able to pinpoint and address the query performance issues before this peak, and we decided to proactively fail over before the issue escalated. Unfortunately, this caused a new load pattern that introduced connectivity issues on the newly promoted primary, and applications were once again unable to connect to mysql1 while we worked to reset these connections. We identified the load pattern during this incident and subsequently added an index to fix the main performance problem.
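
To illustrate the principle only (the table and column names are invented, and SQLite from the Python standard library stands in for MySQL), a composite index lets a hot query use an index search instead of a full table scan:

```python
# Hypothetical schema; shown with SQLite purely to demonstrate the effect of an
# index on the query plan.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_deliveries (id INTEGER PRIMARY KEY, repository_id INTEGER, status TEXT)"
)

hot_query = "SELECT * FROM webhook_deliveries WHERE repository_id = ? AND status = 'pending'"
print(conn.execute("EXPLAIN QUERY PLAN " + hot_query, (1,)).fetchall())  # full table scan

# The fix: a composite index covering the query's filter columns.
conn.execute(
    "CREATE INDEX idx_deliveries_repo_status ON webhook_deliveries (repository_id, status)"
)
print(conn.execute("EXPLAIN QUERY PLAN " + hot_query, (1,)).fetchall())  # index search
```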

March 22 15:53 UTC (lasting 2 hours and 53 minutes)

While we had reduced the load seen in the previous incidents, we were not fully confident in the mitigations. We wanted to do more to analyze performance on this database to prevent future load patterns or performance issues. In this third incident, we enabled memory profiling on our database proxy in order to look more closely at the performance characteristics during peak load. At the same time, client connections to mysql1 started to fail, and we needed to perform another primary failover in order to recover.

March 23 14:49 UTC (lasting 2 hours and 51 minutes)

We saw a recurrence of the load characteristics that caused client connections to fail and again performed a primary failover in order to recover. To reduce load, we throttled webhook traffic and will continue to do so during peak load times while we investigate further mitigations.
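
As a hedged sketch of one common throttling approach (not necessarily the exact mechanism in production; the rate and burst values are arbitrary), a token bucket caps how quickly webhook deliveries reach the backend during peak load:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: allow a burst, then a steady rate."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should defer or queue this delivery

bucket = TokenBucket(rate_per_s=100, capacity=500)  # illustrative limits only
deliveries = [f"delivery-{i}" for i in range(1000)]
sent = [d for d in deliveries if bucket.allow()]
print(f"sent {len(sent)} now, deferred {len(deliveries) - len(sent)}")
```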

Next steps

To prevent these types of incidents from recurring, we have started an audit of load patterns on this database during peak hours, along with a series of performance fixes based on that audit. As part of this work, we are moving traffic to other databases to reduce load and speed up failover time, and we are reviewing our change management procedures, particularly as they relate to monitoring and changes during high load in production. As the platform continues to grow, we have been working to scale up our infrastructure, including sharding our databases and scaling hardware.

In summary

We sincerely apologize for the negative impact these disruptions have caused. We understand the impact these types of outages have on customers who rely on us to get their work done every day, and we are committed to ensuring we can gracefully handle disruption and minimize downtime. We look forward to sharing additional information as part of our March Availability Report in the next few weeks.

GitHub Availability Report: June 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-07-07-github-availability-report-june-2021/

In June, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: April 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-05-05-github-availability-report-april-2021/

In April, we experienced two incidents resulting in significant impact and degraded state of availability for API requests and the GitHub Packages service, specifically the GitHub Packages Container registry service.

April 1 21:30 UTC (lasting one hour and 34 minutes)

This incident was caused by failures in our DNS resolution, resulting in a degraded state of availability for the GitHub Packages Container registry service. During this incident, some of our internal services that support the Container registry experienced intermittent failures when trying to connect to dependent services. The inability to resolve these services’ addresses left users unable to push new container images to the Container registry or pull existing ones. The Container registry is currently in a public beta, and only beta users were impacted during this incident. The broader GitHub Packages service remained unaffected.

As a next step, we are looking at increasing the cache times of our DNS resolutions to decrease the impact of intermittent DNS resolution failures in the future.
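
As an illustrative sketch only (not the actual resolver configuration), caching successful lookups for a longer TTL and serving the last known answer when a lookup fails limits the blast radius of intermittent resolution errors:

```python
import socket
import time

# host -> (address, cache expiry timestamp)
_cache: dict[str, tuple[str, float]] = {}
CACHE_TTL_S = 300  # illustrative: longer than typical default resolver caching

def resolve(host: str) -> str:
    entry = _cache.get(host)
    if entry and entry[1] > time.time():
        return entry[0]                # fresh cached answer
    try:
        addr = socket.gethostbyname(host)
        _cache[host] = (addr, time.time() + CACHE_TTL_S)
        return addr
    except socket.gaierror:
        if entry:
            return entry[0]            # fall back to the stale cached answer
        raise                          # nothing cached to fall back on

print(resolve("github.com"))
```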

April 22 17:16 UTC (lasting 53 minutes)

Our service monitors detected an elevated error rate on API requests, which resulted in a degraded state of availability for repository creation. Upon further investigation, we traced the issue to a bug introduced by a recent data migration that isolated our secret scanning tables into their own cluster; the bug broke the application’s ability to write to the secret scanning database. The incident revealed a previously unknown dependency of repository creation on secret scanning, which is called for every repository created. Because of this dependency, repository creation was blocked until we were able to roll back the data migration.

As next steps, we are actively working with our vendor to update the data migration tool and have amended our migration process to include revised steps for remediation, in case similar incidents occur. Furthermore, our application code has been updated to remove the dependency on secret scanning for the creation of repositories.
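
Whether the call was removed outright or made best-effort is not specified here; as one hedged sketch of decoupling (function names are hypothetical), repository creation can treat the secret scanning notification as non-fatal:

```python
import logging

log = logging.getLogger("repository_creation")

def notify_secret_scanning(repo_id: int) -> None:
    """Stand-in for the call to the secret scanning service."""
    raise ConnectionError("secret scanning database unavailable")

def create_repository(repo_id: int) -> bool:
    # ... create the repository record itself (omitted) ...
    try:
        notify_secret_scanning(repo_id)  # best-effort, no longer load-bearing
    except Exception:
        log.warning("secret scanning notification failed for repo %s; continuing", repo_id)
    return True  # repository creation succeeds regardless

print(create_repository(42))
```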

In summary

From scaling the GitHub API to improving large monorepo performance, we will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: February 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-03-03-github-availability-report-february-2021/

Introduction

In February, we experienced no incidents resulting in service downtime to our core services.

This month’s GitHub Availability Report will provide initial details around an incident from March 1 that caused significant impact and a degraded state of availability for the GitHub Actions service.

March 1 09:59 UTC (lasting one hour and 42 minutes)

Our service monitors detected a high error rate on creating check suites for workflow runs. This incident resulted in the failure or delay of some queued jobs for a period of time. Additionally, some customers using the search/filter functionality to find workflows may have experienced incomplete search results.

In February, we experienced a few smaller and unrelated incidents affecting different parts of the GitHub Actions service. While these were localized to a subset of our customers and did not have broad impact, we take reliability very seriously and are conducting a thorough investigation into the contributing factors.

Due to the recency of the March 1 incident, we are still investigating the contributing factors and will provide a more detailed update in the March Availability Report, which will be published the first Wednesday of April. We will also share more about our efforts to minimize the impact of future incidents and increase the performance of the Actions service.

In summary

GitHub Actions has grown tremendously, and we know that the availability and performance of the service is critical to its continued success. We are committed to providing excellent service, reliability, and transparency to our users, and will continue to keep you updated on the progress we’re making to ensure this.

For more information, you can check out our status page for real-time updates on the availability of our systems and the GitHub Engineering blog for deep-dives around how we’re improving our engineering systems and infrastructure.

GitHub Availability Report: January 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-02-02-github-availability-report-january-2021/

Introduction

In January, we experienced one incident resulting in significant impact and degraded state of availability for the GitHub Actions service.

January 28 04:21 UTC (lasting 3 hours 53 minutes)

Our service monitors detected abnormal levels of errors affecting the Actions service. This incident resulted in the failure or delay of some queued jobs for a period of time. Jobs that were queued during the incident were run successfully after the issue was resolved.

We identified the issue as caused by an infrastructure error in our SQL database layer. The database failure impacted one of the core microservices that facilitates authentication and communication between the Actions microservices, which affected queued jobs across the service. In normal circumstances, automated processes would detect that the database was unhealthy and failover with minimal or no customer impact. In this case, the failure pattern was not recognized by the automated processes, and telemetry did not show issues with the database, resulting in a longer time to determine the root cause and complete mitigation efforts.

To help avoid this class of failure in the future, we are updating the automation processes in our SQL database layer to improve error detection and failovers. Furthermore, we are continuing to invest in localizing failures to minimize the scope of impact resulting from infrastructure errors.
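
As a hypothetical sketch of the kind of detection improvement described (signals and thresholds are invented), an automated failover decision can combine several health signals rather than relying on a single telemetry metric:

```python
from dataclasses import dataclass

@dataclass
class PrimaryHealth:
    connect_error_rate: float    # fraction of failed connection attempts
    write_latency_p99_ms: float  # 99th percentile write latency
    replication_broken: bool

def should_fail_over(h: PrimaryHealth) -> bool:
    """Fail over on any strong signal, or on a combination of weaker ones."""
    if h.replication_broken:
        return True
    strong = h.connect_error_rate > 0.5
    combined = h.connect_error_rate > 0.1 and h.write_latency_p99_ms > 2000
    return strong or combined

print(should_fail_over(PrimaryHealth(0.2, 3500.0, False)))  # -> True
```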

In summary

We’ll continue to keep you updated on the progress we’re making on ensuring reliability of our services. To learn more about how teams across GitHub identify and address opportunities to improve our engineering systems, check out the GitHub Engineering blog.

GitHub Availability Report: December 2020

Post Syndicated from Keith Ballinger original https://github.blog/2021-01-06-github-availability-report-december-2020/

Introduction

In December, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will provide a summary and follow-up details on how we addressed an incident mentioned in November’s report.

Follow-up to November 27 16:04 UTC (lasting one hour and one minute)

Upon further investigation into one of the incidents mentioned in November’s Availability Report, we discovered an edge case that triggered a large number of GitHub App token requests. This caused abnormal levels of replication lag within one of our MySQL clusters, specifically affecting the GitHub Actions service. This scenario amplified queries and increased database lag, which impacted the database nodes that process GitHub App token requests.

When a GitHub Action is invoked, the Action is passed a GitHub App token to perform tasks on GitHub. In this case, the database lag caused some of those token requests to fail because the database replicas did not have up-to-date information.

To help avoid this class of failure, we are updating the queries to prevent large quantities of token requests from overloading the database servers in the future.
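
The query changes themselves are not detailed here; as one illustrative way to blunt this kind of amplification (not necessarily the actual fix), repeated token requests for the same installation can reuse a cached token until it nears expiry instead of each reaching the database:

```python
import time

# installation id -> (token, expiry timestamp); all names here are hypothetical
_token_cache: dict[int, tuple[str, float]] = {}

def fetch_token_from_database(installation_id: int) -> tuple[str, float]:
    """Stand-in for the expensive path that queries the MySQL cluster."""
    return f"ghs_example_{installation_id}", time.time() + 3600

def get_app_token(installation_id: int) -> str:
    cached = _token_cache.get(installation_id)
    if cached and cached[1] - time.time() > 60:  # reuse until close to expiry
        return cached[0]
    token, expiry = fetch_token_from_database(installation_id)
    _token_cache[installation_id] = (token, expiry)
    return token

# Repeated workflow runs for the same installation hit the cache, not the database.
print([get_app_token(1234) for _ in range(3)])
```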

In summary

Whether we’re introducing a system to manage flaky tests or improving our CI workflow, we’ve continued to invest in our engineering systems and overall reliability. To learn more about what we’re working on, visit GitHub’s engineering blog.

GitHub Availability Report: November 2020

Post Syndicated from Keith Ballinger original https://github.blog/2020-12-02-availability-report-november-2020/

Introduction

In November, we experienced two incidents resulting in significant impact and degraded state of availability for issues, pull requests, and GitHub Actions services.

November 2 12:00 UTC (lasting 32 minutes)

The SSL certificate for *.githubassets.com expired, impacting web requests for the GitHub.com UI and services. An auto-generated issue had flagged that the certificate was within 30 days of expiration, but it was not addressed in time. Once the impact was reported, the on-call engineer remediated it promptly.
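
As a rough sketch of the kind of automated check involved (the hostname is only an example and the alert threshold is arbitrary), a monitor can compute days to expiry directly from the served certificate and alert well before the deadline:

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(hostname: str) -> float:
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.create_connection((hostname, 443)),
                         server_hostname=hostname) as tls:
        cert = tls.getpeercert()
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    not_after = not_after.replace(tzinfo=timezone.utc)
    return (not_after - datetime.now(timezone.utc)).total_seconds() / 86400

remaining = days_until_expiry("github.githubassets.com")  # example hostname
if remaining < 30:
    print(f"ALERT: certificate expires in {remaining:.0f} days")  # escalate, don't just file an issue
else:
    print(f"certificate ok, {remaining:.0f} days remaining")
```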

We are using this occurrence to evaluate our current processes, as well as our tooling and automation, within this area to reduce the likelihood of such instances in the future.

November 27 16:04 UTC (lasting one hour and one minute)

Our service monitors detected abnormal levels of replication lag within one of our MySQL clusters affecting the GitHub Actions service.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

In summary

We place great importance in the reliability of our services along with the trust that our users place in us every day. We’ll continue to keep you updated on the progress we’re making to ensure this. To learn more about what we’re working on, visit the GitHub engineering blog.

GitHub Availability Report: August 2020

Post Syndicated from Keith Ballinger original https://github.blog/2020-09-02-github-availability-report-august-2020/

Introduction

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Status page refresh

At GitHub, we are always striving to be more transparent and clear with our users. We know our customers trust us to communicate the current availability of our services, and we are now making that clearer for everyone to see. Previously, our status page shared a 90-day history of GitHub’s availability by service, but this history can distract users from what is actively happening, which during an incident is the most important piece of information. Starting today, the status page will display our current availability and inform users of any degradation in service with real-time updates from the team. The history view will continue to be available under the “incident history” link. As a reminder, if you want to receive updates on any status changes, you can subscribe to get notifications whenever GitHub creates, updates, or resolves an incident.

Check out the new GitHub Status Page.

Animation showing new GitHub Status page

Follow-up to July 13 08:18 UTC

Since the incident mentioned in July’s GitHub Availability Report, we’ve worked on a number of improvements to both our deployment tooling and to the way we configure our Kubernetes deployments, with the goal of improving the reliability posture of our systems.

First, we audited all the Kubernetes deployments used in production to remove all usages of the ImagePullPolicy of Always configuration.

Our philosophy when dealing with Kubernetes configuration is to make it easy for internal users to ship their code to production while continuing to follow best practices. For this reason, we implemented a change that automatically replaces the ImagePullPolicy of Always setting in all our Kubernetes-deployed applications, while still allowing experienced users with particular needs to opt out of this automation.

Second, we implemented a mechanism equivalent to a Kubernetes mutating admission controller, which we use to inject the latest version of our sidecar containers, identified by the SHA256 digest of the image.
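
A minimal sketch of what such a manifest transform might look like, assuming standard Kubernetes Pod spec fields and a placeholder image digest (this is not the actual tooling):

```python
# Illustrative transform: rewrite imagePullPolicy away from "Always" and pin an
# injected sidecar image to a specific (placeholder) SHA256 digest.
SIDECAR_IMAGE = ("registry.example.com/sidecar@sha256:"
                 "0000000000000000000000000000000000000000000000000000000000000000")

def transform_pod_spec(pod_spec: dict, opt_out: bool = False) -> dict:
    for container in pod_spec.get("containers", []):
        if not opt_out and container.get("imagePullPolicy") == "Always":
            # IfNotPresent lets pods restart even if the registry is unreachable.
            container["imagePullPolicy"] = "IfNotPresent"
        if container.get("name") == "sidecar":
            container["image"] = SIDECAR_IMAGE  # digest-pinned, not a mutable tag
    return pod_spec

spec = {"containers": [
    {"name": "app", "image": "app:latest", "imagePullPolicy": "Always"},
    {"name": "sidecar", "image": "sidecar:latest", "imagePullPolicy": "Always"},
]}
print(transform_pod_spec(spec))
```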

These changes allowed us to remove the strong coupling between Kubernetes Pods and the availability of our Docker registry in case of container restarts. We have more improvements in the pipeline that will help further increase the resilience of our Kubernetes deployments and we plan to share more information about those in the future.

In summary

We’re excited to be able to share these updates with you and look forward to future updates as we continue our efforts in making GitHub more resilient every day.