GitHub Availability Report: February 2022

Post Syndicated from Scott Sanders original https://github.blog/2022-03-02-github-availability-report-february-2022/

In February, we experienced one incident resulting in significant impact and degraded state of availability for GitHub.com, issues, pull requests, GitHub Actions, and GitHub Codespaces services.

February 2 19:05 UTC (lasting 13 minutes)

As mentioned in our January report, our service monitors detected a high rate of errors affecting a number of GitHub services.

Upon further investigation of this incident, we found that a routine deployment failed to generate the complete set of integrity hashes needed for Subresource Integrity. The resulting output was missing values needed to securely serve JavaScript assets on GitHub.com.

As a safety protocol, our default behavior is to return an error rather than render script tags without integrity values when a hash cannot be found in the integrities file. In this case, that meant GitHub.com started serving 500 error pages to all web users. As soon as the errors were detected, we rolled back to the previous deployment and resolved the incident. Throughout the incident, only browser-based access to GitHub.com was impacted, with API and Git access remaining healthy.
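The fail-closed behavior described above can be illustrated with a minimal sketch. This is not GitHub's implementation: the function names, the exception type, the asset path, and the shape of the integrities mapping are all assumptions made for illustration. It computes a Subresource Integrity value (a base64-encoded SHA-384 digest) and refuses to render a script tag when no hash is recorded.

```python
import base64
import hashlib


def sri_hash(content: bytes) -> str:
    """Return a Subresource Integrity value: 'sha384-' plus the base64 digest."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


class MissingIntegrityError(Exception):
    """Raised when an asset has no entry in the integrities mapping (hypothetical)."""


def script_tag(path: str, integrities: dict[str, str]) -> str:
    """Render a script tag, failing closed if the hash is missing."""
    try:
        integrity = integrities[path]
    except KeyError:
        # Error out instead of emitting a tag the browser cannot verify.
        raise MissingIntegrityError(f"no integrity hash for {path}") from None
    return (
        f'<script src="{path}" integrity="{integrity}" '
        'crossorigin="anonymous"></script>'
    )
```

With a complete mapping the tag renders normally. With an incomplete one, every page that references the missing asset raises, which is the trade-off the paragraph above describes: an error page instead of an unverified script.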

Since this incident, we have added further checks to our build process to ensure that the integrities are accurate and complete. We’ve also added checks for our main JavaScript resources to the health check for our deployment containers, and adjusted the build pipeline to ensure the integrity generation process is more robust and will not fail in a similar way in the future.
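A build-time check of this kind might look like the following sketch. The function name and the in-memory representation of assets and integrities are assumptions. The report does not describe how the real checks are implemented.

```python
import base64
import hashlib


def find_integrity_problems(
    assets: dict[str, bytes], integrities: dict[str, str]
) -> list[str]:
    """Return a sorted list of problems; an empty list means the
    integrities mapping is complete and accurate for these assets."""
    problems = []
    for path, content in assets.items():
        digest = hashlib.sha384(content).digest()
        expected = "sha384-" + base64.b64encode(digest).decode("ascii")
        recorded = integrities.get(path)
        if recorded is None:
            problems.append(f"missing: {path}")
        elif recorded != expected:
            problems.append(f"mismatch: {path}")
    return sorted(problems)
```

A build step could fail the deployment whenever the returned list is non-empty, so an incomplete mapping is caught before it reaches production.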

In summary

Every month, we share an update on GitHub’s availability, including a description of any incidents that may have occurred and an update on how we are evolving our engineering systems and practices in response. Whether in these reports or via our engineering blog, we look forward to keeping you updated on the progress and investments we’re making to ensure the reliability of our services.

You can also follow our status page for the latest on our availability.

GitHub Availability Report: January 2022

Post Syndicated from Scott Sanders original https://github.blog/2022-02-02-github-availability-report-january-2022/

In January, we experienced no incidents resulting in service downtime to our core services. However, we do want to acknowledge an incident in February that we are continuing to investigate.

February 2 19:12 UTC (lasting 26 minutes)

Our service monitors detected a high rate of errors for issues, pull requests, GitHub Codespaces, and GitHub Actions services. We have mitigated the incident and are confident it has been fully resolved.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

Please follow our status page for real-time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: November 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-12-01-github-availability-report-november-2021/

In November, we experienced one incident resulting in significant impact and degraded state of availability for core GitHub services, including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.

November 27 20:40 UTC (lasting 2 hours and 50 minutes)

We encountered a novel failure mode when processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step in a migration is to perform a rename to move the updated table into the correct place. During the final step of this migration, a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas for production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state, causing an increased load on healthy read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas to handle production requests, which impacted the availability of core GitHub services.
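The cascading effect can be made concrete with a toy capacity model. The numbers and function names below are invented for illustration and do not reflect GitHub's actual cluster sizes or load. The model assumes read traffic is spread evenly across healthy replicas and that a replica fails once its share exceeds its capacity.

```python
import math


def min_replicas_needed(total_load: float, capacity_each: float) -> int:
    """Smallest number of healthy replicas that can carry the load."""
    return math.ceil(total_load / capacity_each)


def cascades(
    total_load: float, capacity_each: float, replicas: int, failed: int
) -> bool:
    """True if losing `failed` replicas overloads the survivors.

    Once the survivors are overloaded, each further crash raises the
    per-replica load again, so the failure spreads instead of settling.
    """
    healthy = replicas - failed
    return healthy < min_replicas_needed(total_load, capacity_each)
```

For example, with 10 replicas that can each serve 100 units and a total load of 700 units, the cluster tolerates 3 simultaneous failures but not 4. Over-provisioning raises that threshold.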

During the incident mitigation, in an effort to increase capacity, we promoted all available internal replicas that were in a healthy state into the production path; however, the shift was not sufficient for full recovery. We also observed that read replicas serving production traffic would temporarily recover from their crash-recovery state only to crash again due to load. Based on this crash-recovery loop, we chose to prioritize data integrity over site availability by proactively removing production traffic from broken replicas until they were able to successfully process the table rename. Once the replicas recovered, we were able to move them back into production and restore enough capacity to return to normal operations.

Throughout the incident, write operations remained healthy and we have verified there was no data corruption.

To address this class of failure and reduce time to recover in the future, we continue to prioritize our functional partitioning efforts. Partitioning the cluster adds resiliency because migrations can then be run in canary mode on a single shard, reducing the potential impact of this failure mode. Additionally, we are actively updating internal procedures to increase the amount by which each cluster is over-provisioned.

As next steps, we’re continuing to investigate the specific failure scenario, and have paused schema migrations until we know more about how to safeguard against this issue. As we continue to test our migration tooling, we are identifying opportunities to improve it during such scenarios.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: October 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-11-04-github-availability-report-october-2021/

In October, we experienced one incident resulting in significant impact and degraded state of availability for the GitHub Codespaces service.

October 8 17:16 UTC (lasting 1 hour and 36 minutes)

A core Codespaces API response was inadvertently restructured as part of our Codespaces public API launch, impacting existing API clients dependent on a stable schema.

For the duration of the incident, new Codespaces could not be initiated from the Visual Studio Code Desktop client. Connections to the web editor and pre-existing desktop sessions were not interrupted, but they were degraded, with the extension displaying an error message and omitting Codespaces metadata from the Remote Explorer view.

The incident was mitigated once we rolled back the regression, at which point all clients could connect again, including with new Codespaces created during the incident. As our monitoring systems did not initially detect the impact of the regression, a subsequent and unrelated deployment was initiated, delaying our ability to revert the change. To ensure similar breaking changes are not introduced in the future, we are investing in tooling to support more rigorous end-to-end testing with the extension’s use of our API. Additionally, we are expanding our monitoring to better align with the user experience across the relevant internal service boundaries.
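One lightweight form of the end-to-end testing mentioned above is a contract check that compares a response against the fields a client depends on. The sketch below is an assumption-laden illustration: the field names are hypothetical and are not taken from the Codespaces API.

```python
# Hypothetical fields a client relies on; not the real Codespaces schema.
REQUIRED_FIELDS: dict[str, type] = {"id": int, "name": str, "state": str}


def schema_violations(response: dict, required: dict[str, type]) -> list[str]:
    """List every required field that is missing or has the wrong type."""
    violations = []
    for field, expected_type in required.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return violations
```

Running a check like this against a staging deployment before release would flag a restructured response, such as one that nests the same fields under a new key, before existing clients encounter it.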

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: August 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-09-01-github-availability-report-august-2021/

In August, we experienced two distinct incidents resulting in significant impact and degraded state of availability for Git operations, API requests, webhooks, issues, pull requests, GitHub Pages, GitHub Packages, and GitHub Actions services.

August 10 15:16 UTC (lasting 1 hour and 17 minutes)

This incident was caused when one of our MySQL database primaries entered a degraded state, affecting a number of internal services. This caused an impact to GitHub.com services requiring write access to this particular database cluster, which resulted in some users being unable to perform operations.

Our investigation identified an edge case in one of our most active applications, which caused the generation of a poorly performing query capable of impacting overall database capacity. Combined with application retry and queueing logic, this placed the MySQL primary into a state from which the cluster was unable to recover automatically.

We have been able to address this query, as well as some of the application retry logic, to reduce the chance of recurrence in the future.
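The report does not say how the retry logic was changed. As a general illustration of why retry behavior matters for an overloaded database, the sketch below shows a common pattern: a bounded number of attempts with capped exponential backoff and random jitter, so that failing clients back off rather than retrying immediately in lockstep. The function name and default values are assumptions.

```python
import random
import time


def call_with_backoff(
    operation,
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep=time.sleep,
):
    """Call `operation`, retrying on failure with capped, jittered backoff.

    The final failure is re-raised so callers can shed load instead of
    queueing more work against a struggling backend.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential growth, capped, with full jitter to spread retries.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injectable only so the behavior can be tested without waiting. In normal use the default `time.sleep` applies.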

One of the novel elements of this incident was the breadth of impact across multiple services. This led to a discussion about the overall service status as we were reporting it during the incident, so we’d like to take this opportunity to discuss the approach we took at the time, as well as how we look to increase our learning potential after the incident.

When we first introduced the monthly availability report, we aimed to provide post-incident reviews for major incidents that impact service availability, in addition to background on how we’re continuing to evolve the process. As part of our standard post-incident analysis process, we are using this incident as a valuable source of data to evaluate the responsiveness of our internal metrics and alerting. These systems guide our responders during incidents on both when to update our status and what degree of impact to report. As a result, we’re continuing to tune and optimize these activities to ensure we can report status both quickly and accurately, so that we continue to earn the trust our users place in us every day.

August 10 19:57 UTC (lasting 3 hours and 6 minutes)

Following ongoing maintenance of the Actions service, our service monitors detected a high error rate on workflow runs for new and in-progress jobs, which affected the Actions service. This incident resulted in the failure of all queued jobs for a period of time. This was a new incident, unrelated to the earlier issue on August 10. We immediately reverted recent Actions deployments and started to investigate the issue.

The incident was caused by work to set up a new Actions Premium Runner microservice in the Actions service. The impacting portion of this work involved alterations to the service discovery process within the Actions microservices architecture. A bad service record pushed to this system resulted in many of the microservices being unable to make Service-to-Service calls.

Ultimately, the mitigation for this incident was to remove the bad record from the service discovery infrastructure. After investigating whether this mitigation would address the incident, we were able to confidently confirm that the bad record was the root cause of the issue, and removing it would restore the Actions service with no unintended side effects.

We have prioritized several changes as a result of this incident, including fixing this part of the Actions microservice discovery process to properly handle potential bad records. We’ve also added a broader scope of visibility into what’s changed recently across all of the Actions microservices, so we can quickly focus investigations in the correct place.
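Handling bad records can start with validating each record before it is accepted into the registry. The sketch below is purely illustrative: the record fields, the validation rules, and the `Registry` class are assumptions, since the report does not describe the Actions service discovery system's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRecord:
    """Hypothetical service discovery record."""
    name: str
    host: str
    port: int


def record_problems(record: ServiceRecord) -> list[str]:
    """Return the reasons a record should be rejected (empty if valid)."""
    problems = []
    if not record.name.strip():
        problems.append("empty service name")
    if not record.host.strip():
        problems.append("empty host")
    if not 1 <= record.port <= 65535:
        problems.append(f"port out of range: {record.port}")
    return problems


class Registry:
    """In-memory registry that rejects bad records at write time."""

    def __init__(self) -> None:
        self._records: dict[str, ServiceRecord] = {}

    def register(self, record: ServiceRecord) -> None:
        problems = record_problems(record)
        if problems:
            raise ValueError("; ".join(problems))
        self._records[record.name] = record

    def lookup(self, name: str) -> ServiceRecord:
        return self._records[name]
```

Because a rejected record never replaces the existing entry, other services keep resolving the last known good address.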

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub Engineering blog.

GitHub Availability Report: May 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-06-02-github-availability-report-may-2021/

Introduction

In May, we experienced two incidents resulting in significant impact and degraded state of availability for API requests, GitHub Pages, GitHub Actions and the GitHub Packages service, specifically the GitHub Packages Container registry service.

May 8 06:46 UTC (lasting 46 minutes)

This incident was caused by failures in an underlying MySQL database, which caused some operations to time out for the GitHub Container registry service. During this incident, some customers viewing packages in the UI or interacting with the registry through “docker push” and “docker pull” may have experienced failures as the engineering team investigated the incident. After performing a failover to one of our database replicas, the affected systems were properly restored.

Our internal engineering team is now prioritizing work that will help ensure reduced impact to customers should such underlying outages happen again. This work includes creating internal documentation, dashboards, and enhanced alerts to quickly triage the cause of operation failures. We will also continue to actively maintain and increase replicas in different regions and availability zones that serve as a line of defense against unexpected region outages.

May 16 07:17 UTC (lasting 9 hours 48 minutes)

This incident was caused when a foreign key for scoped tokens exceeded max INT32, which resulted in high failure rates for GitHub Actions and GitHub Pages. It also prevented some access to operations against the GitHub API and low-level git commands, such as “push” and “pull”, using scoped tokens. We mitigated this with a long-running schema migration to change the foreign key to INT64.

Once the foreign key migration was successful, the internal engineering teams then worked to slowly remove token records stored in our cache layer that were considered invalid. After these cached records were removed, newly created API tokens were able to generate new records and API calls resumed working as expected.

Alerting and linting are already in place to help prevent integer overflows in the database. Unfortunately, these mechanisms were not sufficient in this case because the foreign key predated our linting. In response, we are manually auditing all our INT32 columns and investigating further improvements to our automation to help prevent these types of issues moving forward.
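An audit of this kind amounts to comparing each column's current maximum value against the signed 32-bit limit of 2,147,483,647. The sketch below assumes the maximum values have already been collected, for example by querying each column. The function name, the column names, and the 80% threshold are illustrative assumptions.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer


def columns_near_overflow(
    max_values: dict[str, int], threshold: float = 0.8
) -> list[str]:
    """Return the columns whose current maximum has used at least
    `threshold` of the signed 32-bit range, sorted by name."""
    limit = INT32_MAX * threshold
    return sorted(name for name, value in max_values.items() if value >= limit)
```

Foreign key columns need the same check as the primary keys they reference, because they hold the same values. That is the case that slipped through in this incident.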

Given the nature of this overflow, only a single GitHub Action used on a single repository received unauthorized access grants for a short period of time. We revoked these grants and confirmed that no unauthorized access was gained through the use of this Action in this repository.

Our internal engineering teams are actively working on reducing the impact and likelihood of this class of issue happening in the future. This work includes tooling to prevent database inconsistencies and improved alerting to allow faster remediation.

In summary

From our open source release of the GitHub Artifact Exporter to our adoption of OpenTelemetry SDKs, you can learn more about what we are working on to improve our internal development tooling and infrastructure in the GitHub Engineering Blog.