
GitHub Availability Report: June 2022

Post syndicated from Jakub Oleksy; original: https://github.blog/2022-07-06-github-availability-report-june-2022/

In June, we experienced four incidents that resulted in significant impact and a degraded state of availability for multiple GitHub.com services. This report also sheds light on an incident that impacted multiple GitHub.com services in May.

June 1 09:40 UTC (lasting 48 minutes)

During this incident, customers experienced delays in the startup of their GitHub Actions workflows. The cause of these delays was excessive load on a proxy server that routes traffic to the database.

At 09:37 UTC, the Actions service saw a marked increase in the time it took customer jobs to start. Our on-call engineer was paged and Actions was statused red. Once we started to investigate, we noticed that the pods running the proxy server for the database were crash-looping due to out-of-memory errors. A change was created to increase the available memory for these pods, which was fully rolled out by 10:08 UTC. We started to see recovery in Actions even before 10:08 UTC, and statused to yellow at 10:17 UTC. By 10:28 UTC, we were confident that the memory increase had mitigated the issue, and statused Actions green.

Ultimately, this issue was traced back to a set of data analysis queries being pointed at an incorrect database. The large load they placed on the database caused the crash loops and the broader impact. These queries have been moved to a dedicated analytics setup that does not serve production traffic.

We are adding alerts to identify increases in load on the proxy server so we can catch issues like this early. We are also investigating guardrails to ensure production database access is limited to services that own the data.
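As an illustration of the first remediation, here is a minimal sketch of a memory-pressure alert for the proxy, assuming a fixed pod memory limit; the metric source, limit, and paging hook are hypothetical stand-ins rather than GitHub's actual tooling.

```python
# Hypothetical sketch: page the on-call before the proxy pod is
# OOM-killed, rather than after it starts crash-looping.
POD_MEMORY_LIMIT_BYTES = 2 * 1024**3  # assumed 2 GiB pod limit
WARN_RATIO = 0.80                     # alert well before 100%

def page_oncall(message: str) -> None:
    print(f"PAGE: {message}")  # stand-in for a real paging integration

def check_proxy_memory(current_usage_bytes: int) -> None:
    ratio = current_usage_bytes / POD_MEMORY_LIMIT_BYTES
    if ratio >= WARN_RATIO:
        page_oncall(f"db-proxy memory at {ratio:.0%} of its limit")

check_proxy_memory(int(1.8 * 1024**3))  # pages at 90% of the limit
```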

June 21 17:02 UTC (lasting 1 hour and 10 minutes)

During this incident, which occurred shortly after the GA of Copilot, users with either a Marketplace or Sponsorship plan were unable to use Copilot. Users with those subscriptions received an error from the API responsible for creating authentication tokens. This impacted a little less than 20% of our active users at the time.

At approximately 16:45 UTC, we were alerted to elevated error rates in the API and began investigating the cause. We identified the issue and statused red. Our engineers worked quickly to roll out a fix to the API endpoint, and we saw API error rates begin to drop at approximately 17:45 UTC. By 18:00 UTC, we were no longer seeing the issue, but we waited 10 more minutes before statusing back to green to ensure there were no regressions.

We have increased our testing around this particular combination of subscription types, added these scenarios to our user testing, and will add additional data shape testing before future rollouts.
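To make the "data shape" idea concrete, here is a minimal sketch of parameterized coverage across subscription types, assuming a simple test harness; the plan names and the create_copilot_token function are illustrative, not GitHub's actual code.

```python
# Hypothetical sketch: exercise token creation for every subscription
# shape, including the Marketplace and Sponsorship plans that failed.
import unittest

SUBSCRIPTION_TYPES = ["free", "pro", "team", "marketplace", "sponsorship"]

def create_copilot_token(subscription_type: str) -> str:
    # Stand-in for the real authentication-token API call.
    return f"token-for-{subscription_type}"

class TestTokenCreation(unittest.TestCase):
    def test_all_subscription_shapes(self):
        for plan in SUBSCRIPTION_TYPES:
            with self.subTest(plan=plan):
                self.assertTrue(create_copilot_token(plan))

if __name__ == "__main__":
    unittest.main()
```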

June 28 17:16 UTC (lasting 26 minutes)

Our alerting systems detected degraded availability for Codespaces during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the July Availability Report, which will be published the first Wednesday of August.

June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the July Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow up to May 27 04:26 UTC (lasting 21 minutes) and May 27 07:36 UTC (lasting 1 hour and 21 minutes)

As mentioned in the May Availability Report, we are now providing a more detailed update on this incident following further investigation.

Both instances, at 04:26 and 07:36 UTC, were caused by the same contributing factors. In the first instance, an individual service team noticed higher than normal load and an increase in error rate on API requests and statused red. The load was particularly high on our login endpoint. While this did elevate error rates, it was not enough to cause a widespread outage, and in hindsight we likely should have statused yellow.

After follow-up indicated the load pattern had subsided, our on-call team determined it was safe to report the situation as mitigated and continued to investigate.

However, three hours later, we again experienced a degradation of service from sustained high traffic load, again concentrated on our login endpoint. We statused all services red, since we were seeing sustained error rates for a variety of clients and situations, and then updated individual service statuses based on their SLOs. Services that were affected by the load pattern statused to yellow, while services that were not impacted statused back to green.

Impact to GitHub.com from the second instance of the load pattern lasted about 15 minutes. We continued to see elevated traffic during this time and waited until a network-level mitigation was rolled out before statusing all affected services back to green.

In addition to the network-level mitigation, we used the data from this incident to add application-side mitigations for sustained load of this type, as well as to inform architectural changes that will make our services more resilient in the future.

Following this incident, we are improving our on-call procedures to ensure we always report the correct status level based on SLO review. While we always want to over-communicate issues to customers for awareness, we want to status red only when necessary.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on at the GitHub Engineering Blog.

GitHub Availability Report: May 2022

Post syndicated from Jakub Oleksy; original: https://github.blog/2022-06-01-github-availability-report-may-2022/

In May, we experienced three distinct incidents that resulted in significant impact and a degraded state of availability for multiple services across GitHub.com. This report also sheds light on the billing incident that impacted GitHub Actions and Codespaces users in April.

May 20 09:44 UTC (lasting 49 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Container registry databases. When we received the alert, we immediately began investigating. Because of the preemptive monitoring added after the April 25 08:59 UTC incident, the on-call engineer was already monitoring closely and prepared to run mitigations.

As CPU utilization on the database continued to rise, the Container registry began responding to requests with increased latency, followed by an internal server error for a percentage of requests. At this point we knew there was customer impact and changed the public status of the service. This increased CPU activity was due to a high volume of the “Put Manifest” command. Other package registries were not impacted.

The CPU utilization rose because the throttling criteria configured on the API side for this command were too permissive, and a database query was found to be non-performant at that degree of scale. This caused an outage for anyone using the GitHub Container registry. Users experienced latency when pushing or pulling packages, as well as slow access to the packages UI.

To limit impact, we throttled requests from all organizations and users. To restore normal operation, we then reset our database state by restarting our front-end servers and then the database.

To avoid this in the future, we have added separate rate limiting for operation types from specific organizations/users and will continue working on performance improvements for SQL queries.
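As a sketch of the per-organization rate limiting described above, here is a token bucket keyed by organization and operation type, with illustrative limits; this is an assumption-laden outline, not GitHub's actual throttling implementation.

```python
# Hypothetical sketch: token-bucket rate limiting keyed by
# (organization, operation), so a burst of "Put Manifest" calls from
# one org is throttled before it can saturate the database.
import time
from collections import defaultdict

RATE = 10.0   # tokens refilled per second, per (org, operation)
BURST = 20.0  # bucket capacity

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(org: str, operation: str) -> bool:
    bucket = _buckets[(org, operation)]
    now = time.monotonic()
    elapsed = now - bucket["last"]
    bucket["tokens"] = min(BURST, bucket["tokens"] + elapsed * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False  # caller should respond with HTTP 429

# e.g. gate each registry write with allow("some-org", "put_manifest")
```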

May 27 04:26 UTC (lasting 21 minutes)

Our alerting systems detected degraded availability for API requests during this time. Due to the recency of these incidents, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the June Availability Report, which will be published the first Wednesday of July.

May 27 07:36 UTC (lasting 1 hour and 21 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the June Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow up to April 14 20:35 UTC (lasting 4 hours and 53 minutes)

As we mentioned in the April Availability Report, we are now providing a more detailed update on this incident following further investigation.

On April 14, GitHub Actions and Codespaces customers started reporting incorrect charges for metered services shown in their GitHub billing settings. As a result, customers were hitting their GitHub spending limits and unable to run new Actions or create new Codespaces. We immediately started an incident bridge. Our first step was to unblock all customers by giving unlimited Actions and Codespaces usage for no additional charge during the time of this incident.

By looking at the timing and the list of recently pushed changes, we determined that the issue was caused by a code change in the metered billing pipeline. While attempting to improve the performance of our metered usage processor, we mistakenly multiplied Actions and Codespaces minutes by 1,000,000,000 to convert gigabytes into bytes, a conversion that was not necessary for these products. The change was made to shared metered billing code and was not expected to impact these products.
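To make the arithmetic of the failure concrete, here is an illustration of the bug pattern and its fix; the product names and functions are hypothetical, not GitHub's actual billing code.

```python
# Hypothetical illustration: a gigabytes-to-bytes conversion in shared
# billing code applied to products metered in minutes inflates their
# recorded usage a billion-fold.
GB_TO_BYTES = 1_000_000_000
BYTE_METERED_PRODUCTS = {"packages_storage"}  # assumed set

def record_usage_buggy(product: str, quantity: float) -> float:
    return quantity * GB_TO_BYTES  # converts unconditionally

def record_usage_fixed(product: str, quantity: float) -> float:
    if product in BYTE_METERED_PRODUCTS:
        return quantity * GB_TO_BYTES
    return quantity  # Actions/Codespaces minutes pass through unchanged

assert record_usage_buggy("actions_minutes", 10) == 10_000_000_000  # bug
assert record_usage_fixed("actions_minutes", 10) == 10
```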

To fix the issue, we reverted the code change and started repairing the corrupted data recorded during the incident. We did not re-enable metered billing for GitHub products until we had repaired the incorrect billing data, which happened 24 hours after this incident.

To prevent this incident from recurring, we added a Rubocop (Ruby static code analyzer) rule to block pull requests containing unsafe billing code updates. In addition, we added anomaly monitoring for the billed quantity, so that next time we are alerted before customers are impacted. We also tightened the release process to require a feature flag and an end-to-end test when shipping such changes.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on at the GitHub Engineering Blog.

GitHub Availability Report: April 2022

Post syndicated from Jakub Oleksy; original: https://github.blog/2022-05-04-github-availability-report-april-2022/

In April, we experienced three distinct incidents resulting in significant impact and a degraded state of availability for Codespaces and GitHub Packages.

April 1 07:07 UTC (lasting 5 hours and 32 minutes)

Our alerting detected an increase in failures to create new Codespaces and start existing stopped Codespaces in the US West region. We immediately updated the GitHub status page and began to investigate.

Upon further investigation, we determined that some secrets used by the Codespaces service had expired. Codespaces maintains warm pools of resources to protect our users from intermittent failures in our dependent services. However, in the US West region, those pools had emptied due to the expired secret. We did not have an early enough warning on pools reaching low thresholds, so we had no time to react before we ran out of capacity. As we worked to mitigate the incident, the pools in other regions also emptied due to the expired secret, and those regions began to see failures as well.

A limited number of GitHub engineers had access to rotate the secret, and communication issues delayed the start of the secret refresh process. The expired secret was eventually refreshed and rolled out to all regions, and the service was returned to full operation.

To prevent this failure pattern in the future, we now verify resources that expire and have monitors in place that alert well in advance if pool resources are not being maintained. We’ve also added monitors to notify us earlier when we approach resource exhaustion limits. In addition, we’ve begun migrating the service to a mechanism that doesn’t rely on secrets or credential rotation.
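A minimal sketch of the two monitors described above (expiry lead time and a warm-pool floor), with assumed thresholds and names; this is not Codespaces' actual monitoring code.

```python
# Hypothetical sketch: alert well before a credential expires, and
# alert when a regional warm pool drops below a safety floor.
from datetime import datetime, timedelta, timezone

EXPIRY_WARNING = timedelta(days=30)  # assumed lead time
POOL_FLOOR = 10                      # assumed minimum warm resources

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real paging integration

def check_secret(name: str, expires_at: datetime) -> None:
    if expires_at - datetime.now(timezone.utc) < EXPIRY_WARNING:
        alert(f"secret {name} expires within {EXPIRY_WARNING.days} days")

def check_pool(region: str, available: int) -> None:
    if available < POOL_FLOOR:
        alert(f"warm pool in {region} below floor: {available}/{POOL_FLOOR}")

check_secret("codespaces-cred", datetime.now(timezone.utc) + timedelta(days=7))
check_pool("us-west", 3)
```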

April 14 20:35 UTC (lasting 4 hours and 53 minutes)

We are still investigating the contributing factors and will provide a more detailed update in the May Availability Report, which will be published the first Wednesday of June. We will also share more about our efforts to minimize the impact of future incidents.

April 25 08:59 UTC (lasting 5 hours and 8 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Packages Registry databases, which started approximately one hour before any customer impact occurred. The threshold for this alert was relatively low, and it was not a paging alert, so we did not immediately investigate. CPU continued to rise on the database causing the Package Registry to start responding to requests with internal server errors, eventually causing customer impact. This increased activity was due to a high volume of the “Create Manifest” command used in an unexpected manner.

The throttling criteria configured at the database level weren’t enough to limit the above command, which caused an outage for anyone using the GitHub Packages Registry. Users were unable to push or pull packages and could not access the packages UI or the repository landing page.

After investigating, we determined there was a performance bug related to the high volume of “Create Manifest” commands. In order to limit impact and restore normal operation, we blocked the activity causing this problem. We are actively following up on this issue by improving the rate limiting in packages and fixing the performance problem that was uncovered. We’ve also modified database alerting thresholds and severity so we get alerted to unexpected issues more quickly (rather than after customer impact).

During this incident, we also discovered that the repository home page has a hard dependency on the packages infrastructure. When the package registry is down, the home pages for repositories that list packages also fail to load. We decoupled the package listing from the repository home page, but that required manual intervention during the outage. We are working on a fix that loosely binds the packages listing, so if it fails, it does not take down the repository home pages for repositories that list packages.
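A sketch of the loose binding described above: if the packages backend fails, the repository home page renders without the package list instead of failing outright. The function names are illustrative, not GitHub's actual code.

```python
# Hypothetical sketch: degrade gracefully when the packages backend
# is down, rather than letting it take the whole page with it.
def fetch_package_listing(repo: str) -> list[str]:
    raise TimeoutError("packages backend unavailable")  # simulated outage

def render_repo_home(repo: str) -> dict:
    page = {"repo": repo, "packages": [], "packages_unavailable": False}
    try:
        page["packages"] = fetch_package_listing(repo)
    except Exception:
        # Hide the packages panel; render everything else as normal.
        page["packages_unavailable"] = True
    return page

print(render_repo_home("octo-org/octo-repo"))  # page still renders
```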

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: March 2022

Post syndicated from Jakub Oleksy; original: https://github.blog/2022-04-06-github-availability-report-march-2022/

In March, we experienced a number of incidents that resulted in significant impact and a degraded state of availability for some core GitHub services. This blog post includes a detailed follow-up on a series of incidents that occurred due to degraded database stability, and a distinct incident impacting the Actions service.

Database Stability

Last month, we experienced a number of recurring incidents that impacted the availability of our services. We want to acknowledge the impact this had on our customers, and take this opportunity during our monthly report to provide additional details as a result of further investigations and share what we have learned.

Background

The underlying theme of these issues was resource contention in our mysql1 cluster, which impacted the performance of a large number of our services and features during periods of peak load.

Each of these incidents resulted in a degraded state of availability for write operations on our primary services (including Git, issues, and pull requests). While some read operations were not impacted, any user who performed a write operation that involved our mysql1 cluster was affected, as the database could not handle the load.

After the other services recovered, GitHub Actions queues were saturated. We re-enabled the queues gradually so they could catch up in real time, and as a result our status page noted the multi-hour outages. When Actions runs are delayed, CI completion and a host of other functions can be impacted as well.

What we learned

These incidents were characterized by a burst in load during peak hours of GitHub traffic. During these bursts, our mysql1 cluster was not able to handle the load generated by traffic on the system, and we were forced to fail over and take other mitigations, as mentioned in the previous post.

Some of these incidents were related to our efforts to improve visibility on the database, but all of them were related to the low amount of headroom we had on our primary database and thus its susceptibility to a few poorly performing queries.

Optimizing for stability

Because of this, even after we mitigated the initial causes of downtime due to poor query performance, we were still running with low headroom and decided to take a proactive approach to managing load by intentionally slowing down services during peak hours. Furthermore, we took a calculated approach to increase capacity on the database by further optimizing queries.

Rather than risk another site outage, we established lower performance alerting thresholds on the database and proactively throttled webhooks and Actions services (the two largest drivers of automated load on the system) as we approached unsafe margins of error on March 14 at 14:43 UTC. We understood the potential impact to our customers, but decided it would be safer to proactively limit load on the system than to risk another outage across multiple services.
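As a sketch of this kind of proactive load shedding: shed automated traffic progressively as database load approaches a conservative ceiling. The thresholds and traffic sources here are illustrative assumptions, not GitHub's actual throttling logic.

```python
# Hypothetical sketch: shed a growing fraction of automated traffic
# (webhooks, Actions) as database load nears capacity, sparing
# interactive traffic for as long as possible.
import random

THROTTLE_THRESHOLD = 0.70  # assumed fraction of safe database capacity
AUTOMATED_SOURCES = {"webhooks", "actions"}

def should_shed(db_load_ratio: float, source: str) -> bool:
    if db_load_ratio < THROTTLE_THRESHOLD or source not in AUTOMATED_SOURCES:
        return False
    # Shed proportionally more as load climbs toward full capacity.
    shed_fraction = (db_load_ratio - THROTTLE_THRESHOLD) / (1 - THROTTLE_THRESHOLD)
    return random.random() < shed_fraction

# At 85% load, roughly half of automated requests are deferred:
print(sum(should_shed(0.85, "webhooks") for _ in range(1000)))
```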

In the meantime, we implemented a series of optimizations between March 14 and March 28 that drove queries per second on this database down by over 50% and reduced our transaction volume by 70% at peak load times. Through these performance optimizations, we became more confident in our headroom, but given ongoing investigations, we did not want to chance any unwarranted impacts.

Minimizing impact to our users

After the incidents mentioned above, we took steps to make sure we would be in a position, if necessary, to shut down any services driving high peak load. This meant taking maintenance windows for three services starting on March 24. We proactively paused migrations and team synchronization during peak load due to their potential impact.

We also took maintenance windows for GitHub Actions even though we did not actually throttle any Actions runs and no customers were impacted during these windows. We did this to proactively notify customers of possible disruption: while it never proved necessary, we knew we would need to throttle GitHub Actions if we saw any significant database degradation during these windows. Although this may have caused uncertainty for some customers, we wanted to prepare them for any potential impact.

Next steps

Immediate changes

In addition to the improvements mentioned above, we have significantly reduced our database performance alerting thresholds so that we are not “running hot” and will be well positioned to take action before customers are impacted.

We have also accelerated work that was already in progress to continue to shard this particular cluster and apply the learnings from this incident to other clusters that already exist outside of mysql1.

Additional technical and organizational initiatives

Due to the nature of this incident, we have also dedicated a team of engineers to study our internal processes and procedures, observability, and change release processes. While we’re still actively revisiting this incident, we feel confident we have mitigated the initial issues and have the correct alerting and processes in place to make a recurrence unlikely.

We understand that the Actions service is critical to many of our customers. With new and ongoing investments across architecture and processes, we’ll continue to bring focus specifically to Actions reliability, including more graceful degradations when other GitHub services are experiencing issues, as well as faster recovery times.

March 29 10:26 UTC (lasting 57 minutes)

During an operation to move GitHub Actions and checks data to its own dedicated, sharded database cluster, a misconfiguration on the new database cluster caused the application to encounter errors. Once we reverted our changes, we were able to recover. This incident resulted in the failure or delay of some queued jobs for a period of time; jobs that were queued during the incident ran successfully once the issue was resolved.

The Actions and checks data resides in a multi-tenant database cluster. As part of our efforts to improve reliability and scale, we have been working on functionally partitioning the Actions data to its own sharded database cluster. The switchover to the new cluster involves gradually switching over reads and then switching over writes. Immediately after switching the write traffic, we noticed Actions SLOs were breached and initiated a revert back to the old database. After the revert, we saw an immediate improvement in availability.

Upon further investigation, we discovered that update and delete queries were processed correctly on the new cluster, but insert queries were failing because of missing permissions on the new cluster. All changes processed on the new cluster were replicated back to the old cluster before the switch back, ensuring data integrity.

We have paused further migration attempts until we fully investigate and apply our learnings. Furthermore, due to the risk associated with these operations, we will no longer attempt them during peak traffic hours, which occur between 12:00 and 21:00 UTC. From a technical perspective, we’re looking to scrutinize and improve our operational workflows for these database operations. Additionally, we will perform an audit of our configurations and topology across our environment to ensure we have properly covered them in our testing strategy. As part of these efforts, we uncovered a gap: we need to extend our pre-migration checklist with a step to verify permissions more thoroughly.
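As an illustration of that checklist step, here is a minimal pre-cutover probe that exercises INSERT, UPDATE, and DELETE inside a transaction and rolls back; sqlite3 stands in for the real MySQL client, and the table name is hypothetical.

```python
# Hypothetical sketch: before cutting writes over, probe every write
# operation on the new cluster so a missing grant (like the INSERT
# permission in this incident) surfaces during the checklist, not after.
import sqlite3  # stand-in for the real MySQL client library

def verify_write_permissions(conn, table: str) -> None:
    cur = conn.cursor()
    cur.execute(f"INSERT INTO {table} (id) VALUES (0)")
    cur.execute(f"UPDATE {table} SET id = 1 WHERE id = 0")
    cur.execute(f"DELETE FROM {table} WHERE id = 1")
    conn.rollback()  # leave no trace; we only wanted the errors, if any

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions_probe (id INTEGER)")
verify_write_permissions(conn, "actions_probe")
print("write path verified on new cluster")
```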

In summary

Every month we share an update on GitHub’s availability, including a description of any incidents that may have occurred and an update on how we are evolving our engineering systems and practices in response. Our hope is that by increasing our transparency and sharing what we’ve learned, everyone can gain from our experiences. At GitHub, we take the trust you place in us very seriously, and we hope this is a way for you to help hold us accountable for continuously improving our operational excellence, as well as our product functionality.

To learn more about our efforts to make GitHub more resilient every day, check out the GitHub engineering blog.