All posts by Jakub Oleksy

GitHub Availability Report: January 2024

Post Syndicated from Jakub Oleksy original https://github.blog/2024-02-14-github-availability-report-january-2024/

In January, we experienced three incidents that resulted in degraded performance across GitHub services.

January 09 12:20 UTC (lasting 140 minutes)

On January 9 between 12:20 and 14:40 UTC, services in one of our three sites experienced elevated latency for connections. This led to a sustained period of timed-out requests across a number of services, including but not limited to our Git backend. On average, 5% of requests (peaking at 10%) failed with a 5xx response or timed out during this period.

This was caused by an upgrade of hosts, which led to temporarily reduced capacity as the upgrade rolled through the fleet. While these hosts had plenty of capacity to handle the increased load, we found that the configured connection limit was lower than it should have been. We have increased that limit to prevent this from recurring. We have also identified improvements to our monitoring of connection limits and behavior and changes to reduce the risk of host upgrades leading to reduced capacity.

January 21 02:01 UTC (lasting 7 hours 33 minutes)

On January 21 at 2:01 UTC, we experienced an incident that affected customers using GitHub Codespaces. Customers encountered issues creating and resuming Codespaces in multiple regions due to operational issues with compute and storage resources.

Around 25% of customers were impacted, primarily in East US and West Europe. We re-routed traffic for Codespace creations to less impacted regions, but existing Codespaces in these regions may have been unable to resume during the incident.

By 7:30 UTC, we had recovered connectivity to all regions except West Europe, which had an extended recovery time due to increased load in that particular region. The incident was resolved on January 21 at 9:34 UTC once Codespace creations and resumes were working normally in all regions.

We are working to improve our alerting and resiliency to reduce the duration and impact of region-specific outages.

January 31 12:30 UTC (lasting 147 minutes)

On January 31, we deployed an infrastructure change to our load balancers in preparation for our longer term goal of IPv6 enablement at GitHub.com. This change was deployed to a subset of our global edge sites. The change had the unintended consequence of causing IPv4 addresses to start being passed as IPv4-mapped IPv6 addresses (for example, 10.1.2.3 became ::ffff:10.1.2.3) to our IP Allow List functionality. While our IP Allow List functionality was developed with IPv6 in mind, it wasn’t developed to handle these mapped addresses, and so it started blocking requests because it deemed them not to be in the defined list of allowed addresses. Request error rates peaked at 0.23% of all requests.
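
The fix for this class of bug is to normalize mapped addresses before evaluating them against the allow list. The sketch below, in Python and entirely hypothetical (the allow list contents and function names are illustrative, not GitHub's implementation), shows what that normalization looks like with the standard ipaddress module.

```python
import ipaddress

# Hypothetical allow list; a real one would come from the organization's settings.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.1.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_allowed(remote_addr: str) -> bool:
    """Return True if the client address falls inside any allowed network."""
    addr = ipaddress.ip_address(remote_addr)
    # An IPv4 client behind a dual-stack edge can arrive as ::ffff:10.1.2.3;
    # unwrap it back to 10.1.2.3 before comparing against IPv4 networks.
    if isinstance(addr, ipaddress.IPv6Address) and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    return any(addr in network for network in ALLOWED_NETWORKS)

print(is_allowed("10.1.2.3"))         # True
print(is_allowed("::ffff:10.1.2.3"))  # True only because the mapped form is unwrapped
```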

In addition to changes deployed to remediate the issues, we have taken steps to improve testing and monitoring to better catch these issues in the future.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: January 2024 appeared first on The GitHub Blog.

GitHub Availability Report: December 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2024-01-17-github-availability-report-december-2023/

In December, we experienced three incidents that resulted in degraded performance across GitHub services. All three are related to a broad secret rotation initiative in late December. While we have investigated and identified improvements from each of these individual incidents, we are also reviewing broader opportunities to reduce availability risk in our secrets management.

December 27 02:30 UTC (lasting 90 minutes)

While rotating HMAC secrets between GitHub’s frontend service and an internal service, we triggered a bug in how we fetch keys from Azure Key Vault. API calls between the two services started failing when we disabled a key in Key Vault while rolling back a rotation in response to an alert.

This resulted in all codespace creations failing between 02:30 and 04:00 UTC on December 27, approximately 15% of resumes failing, and degradation of other background functions. We temporarily re-enabled the key in Key Vault to mitigate the impact before deploying a change to continue the secret rotation. The original alert turned out to be a separate issue that was not customer-impacting and was fixed immediately after the incident.
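
One common way to make this kind of rotation safe is for the verifying service to accept both the outgoing and incoming secrets for the duration of the rotation, so disabling either key in the vault cannot break callers mid-rotation. The Python sketch below illustrates the pattern under that assumption; the secret values and function names are hypothetical and not GitHub's actual implementation.

```python
import hmac
import hashlib

# Hypothetical keys; during a rotation the verifier accepts both the outgoing
# and the incoming secret so that disabling one does not break callers.
ACTIVE_SECRETS = [b"new-shared-secret", b"old-shared-secret"]

def sign(payload: bytes, secret: bytes) -> str:
    """Compute the HMAC-SHA256 signature a caller would attach to a request."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str) -> bool:
    """Accept a request if it verifies against any currently active secret."""
    return any(
        hmac.compare_digest(sign(payload, secret), signature)
        for secret in ACTIVE_SECRETS
    )

body = b'{"action":"create_codespace"}'
assert verify(body, sign(body, ACTIVE_SECRETS[1]))  # the old key still verifies mid-rotation
```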

Learning from this, the team has improved the existing playbooks for HMAC key rotation and documentation of our Azure Key Vault implementation.

December 28 05:52 UTC (lasting 65 minutes)

Between 5:52 UTC and 6:47 UTC on December 28, certain GitHub email notifications were not sent due to failed authentication between backend services that generate notifications and a subset of our SMTP servers. This primarily impacted CI activity and Gist email notifications.

This was caused by a rotation of authentication credentials between frontend and internal services, during which the SMTP servers were not correctly updated with the new credentials. This triggered an alert for one of the two impacted notification services within minutes of the secret rotation. On-call engineers discovered the incorrect authentication configuration on the SMTP servers and applied changes to update it, which mitigated the impact.

Repair items have already been completed to update the relevant secrets rotation playbooks and documentation. While the monitor that did fire was sufficient in this case to engage on-call engineers and remediate the incident, we’ve completed an additional repair item to provide earlier alerting across all services moving forward.

December 29 00:34 UTC (lasting 68 minutes)

Users were unable to sign in or sign up for new accounts between 00:34 and 1:42 UTC on December 29. Existing sessions were not impacted.

This was caused by a credential rotation that was not mirrored in our frontend caches, causing a mismatch in behavior between signed-in and signed-out users. We resolved the incident by deploying the updated credentials to our cache service.

Repair items are underway to improve our monitoring of signed out user experiences and to better manage updates to shared credentials in our systems moving forward.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: December 2023 appeared first on The GitHub Blog.

GitHub Availability Report: November 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-12-13-github-availability-report-november-2023/

In November, we experienced one incident that resulted in degraded performance across GitHub services.

November 3 18:42 UTC (lasting 38 minutes)

Between 18:42 and 19:20 UTC on November 3, the GitHub authorization service experienced excessive application memory use, leading to failed authorization requests and users getting 404 or error responses on most page and API requests.

A performance and resilience optimization to the authorization microservice contained a memory leak that was exposed under high traffic. Testing did not expose the service to sufficient traffic to discover the leak, allowing it to graduate to production at 18:37 UTC. The memory leak under high load caused pods to crash repeatedly starting at 18:42 UTC, failing authorization checks in their default closed state. These failures started triggering alerts at 18:44 UTC. Rolling back the authorization service change was delayed as parts of the deployment infrastructure relied on the authorization service and required manual intervention to complete. Rollback completed at 19:08 UTC and all impacted GitHub features recovered after pods came back online.

To reduce the risk of future deployments, we implemented changes to our rollout strategy by including additional monitoring and checks, which automatically block a deployment from proceeding if key metrics are not satisfactory. To reduce our time to recover in the future, we have removed dependencies between the authorization service and the tools needed to roll back changes.
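
As an illustration of what such a rollout gate can look like, the hypothetical Python sketch below blocks a deployment stage when the canary error rate or memory growth exceeds a threshold. The metric names, thresholds, and fetch function are assumptions for the example, not GitHub's actual pipeline.

```python
import sys

# Hypothetical thresholds; a real gate would read these from the deploy pipeline.
MAX_ERROR_RATE = 0.01       # fraction of authorization checks allowed to fail
MAX_MEMORY_GROWTH_MB = 200  # per-pod memory growth since the rollout started

def fetch_canary_metrics() -> dict:
    """Placeholder for a query against the monitoring system."""
    return {"error_rate": 0.002, "memory_growth_mb": 35.0}

def gate_deployment() -> None:
    """Stop the rollout if the canary stage looks unhealthy."""
    metrics = fetch_canary_metrics()
    if metrics["error_rate"] > MAX_ERROR_RATE:
        sys.exit("blocking rollout: canary error rate too high")
    if metrics["memory_growth_mb"] > MAX_MEMORY_GROWTH_MB:
        sys.exit("blocking rollout: canary memory growth suggests a leak")
    print("canary healthy, promoting to the next stage")

if __name__ == "__main__":
    gate_deployment()
```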


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: November 2023 appeared first on The GitHub Blog.

GitHub Availability Report: October 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-11-13-github-availability-report-october-2023/

In October, we experienced two incidents that resulted in degraded performance across GitHub services.

October 17 10:59 UTC (lasting 2 hours and 49 minutes)

From 10:59 UTC to 13:48 UTC on October 17, the GitHub Codespaces service was degraded due to an authentication outage. This issue impacted 67% of users over this time period, who saw failures to create and start their codespaces. The regional authentication layer was throttled by a global third-party dependency due to increased load from onboarding a new Codespaces region. The Codespaces team mitigated manually by reducing load on the external dependency. Following the incident, the Codespaces team is actively evaluating and implementing scaling improvements to make the service more resilient to increasing demand. These include implementing regional-level caching to minimize calls to the dependency and incorporating measures to ensure the continued health of the authentication service in the event of errors.
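
A regional cache of authentication results is one way to keep load on a shared third-party dependency roughly constant as a region grows, since only cache misses reach the dependency. The Python sketch below is a minimal, hypothetical illustration of that idea; the class, TTL, and token-fetching callback are assumptions, not the Codespaces implementation.

```python
import time

class RegionalTokenCache:
    """Cache auth results per region so only cache misses hit the shared dependency."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._entries = {}  # user_id -> (fetched_at, token)

    def get_token(self, user_id: str, fetch_from_dependency) -> str:
        now = time.monotonic()
        cached = self._entries.get(user_id)
        if cached and now - cached[0] < self.ttl:
            return cached[1]                    # served locally, no external call
        token = fetch_from_dependency(user_id)  # the throttled third-party call
        self._entries[user_id] = (now, token)
        return token

cache = RegionalTokenCache(ttl_seconds=300)
print(cache.get_token("user-123", lambda uid: f"token-for-{uid}"))  # external call
print(cache.get_token("user-123", lambda uid: f"token-for-{uid}"))  # cache hit
```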

October 25 09:13 UTC (lasting 3 hours and 27 minutes cumulatively)

On October 25 through 26, GitHub Copilot experienced multiple short and partial outages which affected code completions.

GitHub Copilot completions are currently hosted in multiple regions globally. Users are typically routed to the nearest geographic region, but may be routed to other regions when the nearest region is unhealthy. Beginning at 09:13 UTC on October 25, GitHub Copilot began experiencing partial outages of individual regions, lasting approximately 12 minutes per region. These outages were due to the nodes hosting the completion model being upgraded by an automated process, and a subset of GitHub Copilot users experienced completion errors during this timeframe. The issue was fully resolved at 02:40 UTC on October 26.

To prevent similar outages in the future, we have disabled the automated upgrade behavior that we identified as the root cause and are prioritizing improvements to our global load balancing during regional outages.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: October 2023 appeared first on The GitHub Blog.

GitHub Availability Report: September 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-10-11-github-availability-report-september-2023/

In September, we experienced two incidents that resulted in degraded performance across GitHub services.

September 5 16:24 UTC (lasting 19 minutes)

On September 5, from 16:24-16:43 UTC, multiple GitHub services were down or degraded due to an outage in one of our primary databases. The primary host for a shared datastore for GitHub experienced an underlying file system write error, which affected availability for the majority of public-facing GitHub services. SAML login was affected, as was access to GitHub Actions, GitHub Issues, pull requests, GitHub Pages, GitHub API, Webhooks, GitHub Codespaces, and GitHub Packages.

The primary database suffered a partial host failure when the disk storage for the operating system became unreachable. In this case, our automatic failover was unable to detect the partial file system failure mode. We mitigated the incident by manually failing over to a healthy host; the failover was initiated 17 minutes after our first alert and completed 2 minutes later.

Since mitigating the incident, we have assessed the detailed impact to each affected service and identified resilience improvements to reduce the scope of any future incident with this shared dependency. Some of those improvements are complete, and the rest will be completed within our standard repair item SLAs. To increase the resiliency of our system, we have improved our automation to detect and initiate a failover for this type of partial host failure. Additionally, we have identified a source of resource contention that is consistent with this type of failure and deployed a fix to reduce the likelihood of recurrence.
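
Detecting a partial failure like this typically requires the health check to exercise the failing path itself, for example by periodically writing and fsyncing a small probe file on the affected disk. The Python sketch below is a hypothetical illustration of that approach; the path and the hand-off to failover automation are assumptions, not GitHub's tooling.

```python
import os
import uuid

DATA_DIR = "/var/lib/mysql"  # hypothetical path on the primary's disk

def disk_is_writable(directory: str) -> bool:
    """Attempt a small write and fsync; a partial filesystem failure surfaces here."""
    probe = os.path.join(directory, f".health-probe-{uuid.uuid4().hex}")
    try:
        fd = os.open(probe, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"ok")
            os.fsync(fd)  # force the bytes through the page cache to the device
        finally:
            os.close(fd)
            os.unlink(probe)
        return True
    except OSError:
        return False

if not disk_is_writable(DATA_DIR):
    print("disk is unhealthy; signal the failover automation to promote a new primary")
```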

September 19 20:36 UTC (lasting 7 hours 30 minutes)

On September 19 at 20:36 UTC, a migration of the primary datastore for GitHub Projects caused an incident that disrupted availability of 95% of GitHub Projects data for 3.5 hours. A misconfigured index constraint on the primary GitHub Projects database table caused GitHub Projects to become fully unavailable between 20:36 UTC and 00:06 UTC. By 00:06 UTC, we had restored GitHub Projects data to its state from the beginning of the incident. New project data created by users while the incident was being mitigated was fully recovered and available to users by 04:28 UTC.

In addition, a database replication interruption caused by our remediation steps degraded availability for some Git operations, APIs, and GitHub Issues for 1.25 hours, from 21:48 UTC to 23:00 UTC.

To prevent similar incidents in the future, we have improved validation of data migrations in testing and during rollout. We have evaluated and are making improvements to the constraints for any data migration to prevent the unexpected behavior that led to this data loss. To reduce the time to mitigate similar incidents, we are also rolling out improvements to reduce both the time to restore data and the time to fix replication issues.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: September 2023 appeared first on The GitHub Blog.

GitHub Availability Report: August 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-09-13-github-availability-report-august-2023/

In August, we experienced two incidents that resulted in degraded performance across GitHub services.

August 15 16:58 UTC (lasting 4 hours 29 minutes)

On August 15 at 16:58 UTC, GitHub started experiencing increasing delays in an internal job queue used to process webhooks. We statused GitHub Webhooks to yellow at 17:24 UTC. During this incident, customers experienced webhooks delays as long as 4.5 hours.

We determined that the delays were caused by a significant and sustained spike in webhook deliveries, which caused a backup of our webhook delivery queue. We mitigated the issue by blocking events from the sources of the increased load, which allowed the system to gradually recover as we processed the backlog of events. In response to this and other recent webhooks incidents, we made improvements that allow us to handle a higher volume of traffic and absorb load spikes without increasing delivery latency. We also improved our ability to manage load sources to prevent, and more quickly mitigate, any impact to our service.
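
One way to manage load sources like this is to throttle enqueued events per source with a token bucket, so a single noisy source sheds load before it can back up the shared queue. The Python sketch below illustrates the idea; the rate, burst, and source identifiers are hypothetical, not GitHub's actual limits.

```python
import time
from collections import defaultdict

# Hypothetical limits: each delivery source may enqueue at most RATE events/second,
# with a BURST-sized allowance for short spikes.
RATE = 500.0
BURST = 1000.0

class SourceThrottle:
    """Token-bucket throttle keyed by the source of webhook events."""

    def __init__(self):
        self._buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow(self, source_id: str) -> bool:
        tokens, last = self._buckets[source_id]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)  # refill since last event
        if tokens < 1.0:
            self._buckets[source_id] = [tokens, now]
            return False  # shed or delay this source's events
        self._buckets[source_id] = [tokens - 1.0, now]
        return True

throttle = SourceThrottle()
print(throttle.allow("repo-42"))  # True until repo-42 exceeds its budget
```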

August 29 02:36 UTC (lasting 49 minutes)

On August 29 at 02:36 UTC, GitHub systems experienced widespread delays in background job processing. This prevented webhook deliveries, GitHub Actions, and other asynchronously-triggered workloads throughout the system from running immediately as normal. While workloads were delayed by up to an hour, no data was lost, and systems ultimately recovered and resumed timely operation.

The component of our job queueing service responsible for dispatching jobs to workers failed due to an interaction between unexpected CPU throttling and short session timeouts for a Kafka consumer group. The Kafka consumer ended up stuck in a loop, unable to stabilize fast enough before timing out and restarting the coordination process. While the service continued to accept and record incoming work, it was unable to pass jobs on to workers until we mitigated the issue by shifting the load to the standby service and redeploying the primary service. We have extended our monitoring to allow quicker diagnosis of this failure mode and are pursuing additional changes to prevent recurrence.
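
In Kafka terms, the failure mode is a consumer that cannot heartbeat within its session timeout while it is CPU-throttled, so the group coordinator repeatedly ejects it and forces a rebalance. The kafka-python sketch below shows the kind of configuration trade-off involved; the topic, group, broker, and timeout values are illustrative assumptions, not the settings of GitHub's dispatcher.

```python
from kafka import KafkaConsumer  # kafka-python; illustrative only

# A hypothetical dispatcher consumer. Generous session and poll timeouts give the
# process headroom to survive CPU throttling without being ejected from the
# consumer group and triggering an endless rebalance loop.
consumer = KafkaConsumer(
    "job-dispatch",
    bootstrap_servers=["kafka-1:9092"],
    group_id="job-dispatcher",
    session_timeout_ms=45_000,      # was too short relative to observed pauses
    heartbeat_interval_ms=15_000,   # roughly a third of the session timeout
    max_poll_interval_ms=300_000,   # allow slow batches without a rebalance
    max_poll_records=100,           # smaller batches, more frequent heartbeats
)

for message in consumer:
    print("dispatching job at offset", message.offset)
```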


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: August 2023 appeared first on The GitHub Blog.

GitHub Availability Report: July 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-08-09-github-availability-report-july-2023/

In July, we experienced one incident that resulted in degraded performance across GitHub services.

July 21 13:07 UTC (lasting 59 minutes)

On July 21 at 13:07 UTC, GitHub experienced a partial power outage in one of our redundant data centers, which resulted in a loss of compute capacity. GitHub updated the status of six services to yellow at 13:12 UTC. The vast majority of customer impact occurred in the first 10 minutes up to 13:17 UTC as requests were internally rerouted to other nodes in the data center, but we elected to keep status at yellow until full capacity was restored out of an abundance of caution. As a result of this incident, we are conducting reviews of all power feeds with each of our datacenter partners. We have also identified improvements to reduce recovery time after power was restored and are evaluating ways to reduce the time to fail over all traffic.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: July 2023 appeared first on The GitHub Blog.

GitHub Availability Report: June 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-07-12-github-availability-report-june-2023/

In June, we experienced two incidents that resulted in degraded performance across GitHub services. 

June 7 16:11 UTC (lasting 2 hours 28 minutes)

On June 7 at 16:11 UTC, GitHub started experiencing increasing delays in an internal job queue used to process Git pushes. Our monitoring systems alerted our first responders after 19 minutes. During this incident, customers experienced GitHub Actions workflow run and webhook delays as long as 55 minutes, and pull requests did not accurately reflect new commits.

We immediately began investigating and found that the delays were caused by a customer making a large number of pushes to a repository with a specific data shape. The jobs processing these pushes became throttled when communicating with the Git backend, leading to increased job execution times. These slow jobs exhausted a worker pool, starving the processing of pushes for other repositories. Once the source was identified and temporarily disabled, the system gradually recovered as the backlog of jobs was completed. To prevent a recurrence, we updated the Git backend’s throttling behavior to fail faster and reduced the Git client timeout within the job to prevent it from hanging. We have additional repair items in place to reduce the time to detect, diagnose, and recover.
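
The fail-fast behavior described above amounts to bounding how long any single job may wait on the Git backend, so a throttled call returns an error instead of occupying a worker slot. The Python sketch below illustrates that with a subprocess timeout around a Git command; the command, timeout value, and function are hypothetical, not the actual job code.

```python
import subprocess

GIT_BACKEND_TIMEOUT = 30  # seconds; hypothetical value tuned so a stuck job fails fast

def process_push(repo_path: str, ref: str) -> bool:
    """Do the backend Git work for one push, giving up instead of hanging."""
    try:
        subprocess.run(
            ["git", "-C", repo_path, "rev-list", "--count", ref],
            check=True,
            capture_output=True,
            timeout=GIT_BACKEND_TIMEOUT,  # a throttled backend raises instead of blocking the worker
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Surface the failure so the job can be retried later rather than
        # occupying a worker slot and starving pushes from other repositories.
        return False

print(process_push("/tmp/example-repo", "HEAD"))  # False unless the path is a real repository
```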

June 29 14:50 UTC (lasting 32 minutes)

On June 29, starting from 17:39 UTC, GitHub was down for approximately 32 minutes in parts of North America, particularly the US East coast, and in South America.

GitHub takes measures to ensure that we have redundancy in our system for various disaster scenarios. We have been working on building redundancy for an earlier single point of failure in our network architecture at a second Internet edge facility. This facility was completed in January and has been actively routing production traffic since then in a high availability (HA) architecture alongside the first edge facility. As part of the facility validation steps, we performed a live failover test in order to verify that we could use this second Internet edge facility if the primary were to fail. Unfortunately, during this failover test we inadvertently caused a production outage.

The test exposed a network path configuration issue on the secondary side that prevented it from properly functioning as a primary, which resulted in the outage. This has since been fixed. We were immediately notified of the issue, and within two minutes of being alerted we reverted the change and brought the primary facility back online. Once it was online, it took time for traffic to be rebalanced and for our border routers to reconverge, restoring public connectivity to GitHub systems.

This failover test helped expose the configuration issue, and we are addressing the gaps in both configuration and our failover testing, which will help make GitHub more resilient. We recognize the severity of this outage and the importance of keeping GitHub available. Moving forward, we will continue our commitment to high availability, improving these tests and scheduling them in a way where potential customer impact is minimized as much as possible.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: May 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-06-14-github-availability-report-may-2023/

In May, we experienced four incidents that resulted in degraded performance across GitHub services. This report also sheds light into three April incidents that resulted in degraded performance across GitHub services.

April 26 23:11 UTC (lasting 51 minutes)

On April 26 at 23:11 UTC, a subset of users began to see a degraded experience with GitHub Copilot code completions. We publicly statused GitHub Copilot to yellow at 23:26 UTC, and to red at 23:41 UTC. As engineers identified the impact to be limited to a subset of requests, we statused back to yellow at 23:48 UTC. The incident was fully resolved on April 27 at 00:02 UTC, and we publicly statused green at 00:30 UTC.

The degradation consisted of a rolling partial outage across all three GitHub Copilot regions: US Central, US East, and Switzerland North. Each of these regions experienced approximately 15-20 minutes of degraded service during the global incident. At the peak, 6% of GitHub Copilot code completion requests failed.

We identified the root cause to be a faulty configuration change by an automated maintenance process. The process was initiated across all regions sequentially, and placed a subset of faulty nodes in service before the rollout was halted by operators. Automated traffic rollover from the failed nodes and regions helped to mitigate the issue.

Our efforts to prevent a similar incident in the future include both reducing the batch size and iteration speed of the automated maintenance process, and lowering our time to detection by adjusting our alerting thresholds.

April 27 08:59 UTC (lasting 57 minutes)

On April 27 at 08:59 UTC, our internal monitors notified us of degraded availability of GitHub Packages. Users would have noticed slow or failed GitHub Packages upload and download requests. Our investigation revealed a spike in connection errors to our primary database node. We quickly took action to resolve the issue by manually restarting the database. At 09:56 UTC, all errors were cleared and users experienced a complete recovery of the GitHub Packages service. A planned migration of the GitHub Packages database to a more robust platform was completed on May 2, 2023 to prevent this issue from recurring.

April 28 12:26 UTC (lasting 19 minutes)

On April 28 at 12:26 UTC, we were notified of degraded availability for GitHub Codespaces. Users in the East US region experienced failures when creating and resuming codespaces. At 12:45 UTC, we used regional failover to redirect East US codespace creates and resumes to the nearest healthy region, East US 2, and users experienced a complete and nearly immediate recovery of GitHub Codespaces.

Our investigation indicated our cloud provider had experienced an outage in the East US region, with virtual machines in that region experiencing internal operation errors. Virtual machines in the East US 2 region (and all other regions) were healthy, which enabled us to use regional failover to successfully recover GitHub Codespaces for our East US users. When our cloud provider’s outage was resolved, we were able to seamlessly direct all of our East US GitHub Codespaces users back with no downtime.

Long-term mitigation is focused on reducing our time to detection for outages such as this by improving our monitors and alerts, as well as reducing our time to mitigate by making our regional failover tooling and documentation more accessible.

May 4 15:53 UTC (lasting 30 minutes)

On May 4 at 15:23 UTC, our monitors detected degraded performance for Git Operations, GitHub APIs, GitHub Issues, GitHub Pull Requests, GitHub Webhooks, GitHub Actions, GitHub Pages, GitHub Codespaces, and GitHub Copilot. After troubleshooting, we mitigated the issue by performing a primary failover on our repositories database cluster. Further investigation indicated the root cause was connection pool exhaustion in our proxy layer, where prior updates to this configuration had been inconsistently applied. We audited and fixed our proxy layer connection pool configurations during this incident, and updated our configuration automation to dynamically apply config changes without disruption to ensure consistent configuration of database proxies moving forward.

May 09 11:27 UTC (lasting 10 hours and 44 minutes)

On May 9 at 11:27 UTC, users began to see failures to read or write Git data. These failures continued until 12:33 UTC, affecting Git Operations, GitHub Issues, GitHub Actions, GitHub Codespaces, GitHub Pull Requests, GitHub Webhooks, and GitHub APIs. Repositories and GitHub Pull Requests required additional time to fully recover job results and search capabilities, with recovery completing at 21:20 UTC. On May 11 at 13:33 UTC, similar failures occurred affecting the same services until 14:40 UTC. Again, GitHub Pull Requests required additional time to fully recover search capabilities, with recovery completing at 18:54 UTC. We discussed both of these events in a previous blog post and can confirm they share the same root cause.

Based on our investigation, we determined that the cause of this crash was a bug in the database version we are running, and that the conditions causing this bug were more likely to occur with a custom configuration on this data cluster. We updated our configuration to match the rest of our database clusters, and this cluster is no longer vulnerable to this kind of failure.

The bug has since been reported to the database maintainers, accepted as a private bug, and fixed. The fix is slated for a release expected in July.

There have been several directions of work in response to these incidents to avoid reoccurrence. We have focused on removing special case configurations of our database clusters to avoid unpredictable behavior from custom configurations. Across feature areas, we have also expanded tooling around graceful degradation of web pages when dependencies are unavailable.

May 10 12:38 UTC (lasting 11 hours and 56 minutes)

On May 10 at 12:38 UTC, issuance of auth tokens for GitHub Apps started failing, impacting GitHub Actions, GitHub API requests, GitHub Codespaces, Git Operations, GitHub Pages, and GitHub Pull Requests. We identified the cause of these failures to be a significant increase in write latency on a shared permissions database cluster. First responders mitigated the incident by identifying the data shape in new API calls that was causing very expensive database write transactions and timeouts in a loop, and then blocking the source. We shared additional details on this incident in a previous blog post, but we wanted to share an update on our follow-up actions. Beyond the immediate work to address the expensive query pattern that caused this incident, we completed an audit of other endpoints to identify and correct any similar patterns. We completed improvements to the observability of API errors and have further work in progress to improve diagnosis of unhealthy MySQL write patterns. We also completed improvements to tools, documentation, playbooks, and training, for both the technical diagnosis and our general incident response, to address issues encountered while mitigating this incident and to reduce the time to mitigate similar incidents in the future.

May 16 21:07 UTC (lasting 25 minutes)

On May 16 at 21:08 UTC, we were alerted to degradation of multiple services. GitHub Issues, GitHub Pull Requests, and Git Operations were unavailable, while the GitHub API, GitHub Actions, GitHub Pages, and GitHub Codespaces were all partially unavailable. Alerts indicated that the primary database of a cluster supporting key-value data had experienced a hardware crash. The cluster was left in such a state that our failover automation was unable to select a new primary to promote due to the risk of data loss. Our first responder evaluated the cluster, determined it was safe to proceed, and then manually triggered a failover to a new primary host 11 minutes after the server crash. We aim to reduce our response time moving forward and are looking into improving our alerting for cases like this. Long-term mitigation is focused on reducing dependency on this cluster as a single point of failure for much of the site.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: April 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-05-03-github-availability-report-april-2023/

In April, we experienced four incidents that resulted in degraded performance across GitHub services. This report also sheds light into three March incidents that resulted in degraded performance across GitHub services.

March 27 12:25 UTC (lasting 1 hour and 33 minutes)

On March 27 at 12:14 UTC, users began to see degraded experience with Git Operations, GitHub Issues, pull requests, GitHub Actions, GitHub API requests, GitHub Codespaces, and GitHub Pages. We publicly statused Git Operations 11 minutes later, initially to yellow, followed by red for other impacted services. Full functionality was restored at 13:17 UTC.

The cause was traced to a change in a frequently-used database query. The query alteration was part of a larger infrastructure change that had been rolled out gradually, starting in October 2022, then more quickly beginning February 2023, completing on March 20, 2023. The change increased the chance of lock contention, leading to increased query times and eventual resource exhaustion during brief load spikes, which caused the database to crash. An initial automatic failover solved this seamlessly, but the slow query continued to cause lock contention and resource exhaustion, leading to a second failover that did not complete. Mitigation took longer than usual because manual intervention was required to fully recover.

The query causing lock contention was disabled via feature flag and then refactored. We have added additional monitoring of relevant database resources so that we do not reach resource exhaustion and can detect similar issues earlier in our staged rollout process. Additionally, we have enhanced our query evaluation procedures related to database lock contention, along with improved documentation and training material.

March 29 14:21 UTC (lasting 4 hours and 57 minutes)

On March 29 at 14:10 UTC, users began to see a degraded experience with GitHub Actions with their workflows not progressing. Engineers initially statused GitHub Actions nine minutes later. GitHub Actions started recovering between 14:57 UTC and 16:47 UTC before degrading again. GitHub Actions fully recovered the queue of backlogged workflows at 19:03 UTC.

We determined the cause of the impact to be a degraded database cluster. Contributing factors included a new load source from a background job querying that database cluster, maxed out database transaction pools, and underprovisioning of vtgate proxy instances that are responsible for query routing, load balancing, and sharding. The incident was mitigated through throttling of job processing and adding capacity, including overprovisioning to speed processing of the backlogged jobs.

After the incident, we identified that a pool managed by the vtgate layer, found_rows_pool, was overwhelmed and unresponsive. This pool became flooded and stuck due to contention between inserting data into and reading data from the tables in the database. This contention prevented any new queries from progressing across our database cluster.

The health of our database clusters is a top priority for us and we have taken steps to reduce contentious queries on our cluster over the last few weeks. We also have taken multiple actions from what we learned in this incident to improve our telemetry and alerting to allow us to identify and act on blocking queries faster. We are carefully monitoring the cluster health and are taking a close look into each component to identify any additional repair items or adjustments we can make to improve long-term stability.

March 31 01:07 UTC (lasting 2 hours)

On March 31 at 00:06 UTC, a small percentage of users started to receive consistent 500 error responses on pull request files pages. At 01:07 UTC, the support team escalated reports from customers to engineering, who identified the cause and statused yellow nine minutes later. The fix was deployed to all production hosts by 02:07 UTC.

We determined the source of the bug to be a notification to promote a new feature. Only repository admins who had not enabled the new feature or dismissed the notification were impacted. An expiry date in the configuration of this notification was set incorrectly, which caused a constant that was still referenced in code to no longer be available.

We have taken steps to avoid similar issues in the future by auditing the expiry dates of existing notices, preventing future invalid configurations, and improving test coverage.

April 18 09:28 UTC (lasting 11 minutes)

On April 18 at 09:22 UTC, users accessing any issues or pull request related entities experienced consistent 5xx responses. Engineers publicly statused pull requests and issues to red six minutes later. At 09:33 UTC, the root cause self-healed and traffic recovered. The impact was an 11-minute outage of access to issues and pull request related artifacts. After fully validating traffic recovery, we statused green for issues at 09:42 UTC.

The root cause of this incident was a planned change in our database infrastructure to minimize the impact of unsuccessful deployments. As part of the progressive rollout of this change, we deleted nodes that were taking live traffic. When these nodes were deleted, there was an 11 minute window where requests to this database cluster failed. The incident was resolved when traffic automatically switched back to the existing nodes.

This planned rollout was a rare event. In order to avoid similar incidents, we have taken steps to review and improve our change management process. We are updating our monitoring and observability guidelines to check for traffic patterns prior to disruptive actions. Furthermore, we’re adding additional review steps for disruptive actions. We have also implemented a new checklist for change management for these types of infrequent administrative changes that will prompt the primary operator to document the change and associated risks along with mitigation strategies.

April 26 23:26 UTC (lasting 1 hour and 04 minutes)

On April 26 at 23:26 UTC, we were notified of an outage with GitHub Copilot. We resolved the incident at 00:29 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

April 27 08:59 UTC (lasting 57 minutes)

On April 27 at 08:59 UTC, we were notified of an outage with GitHub Packages. We resolved the incident at 09:56 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

April 28 12:26 UTC (lasting 19 minutes)

On April 28 at 12:26 UTC, we were notified of degraded availability for GitHub Codespaces. We resolved the incident at 12:45 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: March 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-04-05-github-availability-report-march-2023/

In March, we experienced six incidents that resulted in degraded performance across GitHub services. This report also sheds light into a February incident that resulted in degraded performance for GitHub Codespaces.

February 28 15:42 UTC (lasting 1 hour and 26 minutes)

On February 28 at 15:42 UTC, our monitors detected a higher than normal failure rate for creating and starting GitHub Codespaces in the East US region, caused by slower than normal VM allocation time from our cloud provider. To reduce impact to our customers during this time, we redirected codespace creations to a secondary region. Codespaces in other regions were not impacted during this incident.

To help reduce the impact of similar incidents in the future, we tuned our monitors to allow for quicker detection. We are also making architectural changes to enable failover for existing codespaces and to initiate failover automatically without human intervention.

March 01 12:01 UTC (lasting 3 hours and 6 minutes)

On March 1 at 12:01 UTC, our monitors detected higher than normal latency for requests to the Container, npm, NuGet, and RubyGems package registries. Initial impact was 5xx responses to 0.5% of requests, peaking at 10% during the incident. We determined the root cause to be an unhealthy disk on a VM node hosting our database, which caused significant performance degradation at the OS level. All MySQL servers on this problematic node experienced connection delays and slow query execution, leading to the high failure rate.

We mitigated this with a failover of the database. We recognize this incident lasted too long and have updated our runbooks for quicker mitigation short term. To address the reliability issue, we have architectural changes in progress to migrate the application backend to a new MySQL infrastructure. This includes improved observability and auto-recovery tooling, thereby increasing application availability.

March 02 23:37 UTC (lasting 2 hours and 18 minutes)

On March 2 at 23:37 UTC, our monitors detected an elevated amount of failed requests for GitHub Actions workflows, which appeared to be caused by TLS verification failures due to an unexpected SSL certificate bound to our CDN IP address. We immediately engaged our CDN provider to help investigate. The root cause was determined to be a configuration change on the provider’s side that unintentionally changed the SSL certificate binding to some of our GitHub Actions production IP addresses. We remediated the issue by removing the certificate binding.

Our team is now evaluating onboarding multiple DNS/CDN providers to ensure we can maintain consistent networking even when issues arise that are out of our control.

March 15 14:07 UTC (lasting 1 hour and 20 minutes)

On March 15 at 14:07 UTC, our monitors detected increased latency for requests to the Container, npm, NuGet, and RubyGems package registries. This was caused by an operation on the primary database host that was performed as part of routine maintenance. During the maintenance, a slow-running query that was not properly drained blocked all database resources. We remediated this issue by killing the slow-running query and restarting the database.

To prevent future incidents, we have paused maintenance until we investigate and address the draining of long-running queries. We have also updated our maintenance process to perform additional safety checks on long-running queries and blocking processes before proceeding with production changes.
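
A pre-maintenance safety check of this kind can be as simple as querying information_schema.processlist for statements that have been running longer than a threshold and aborting if any are found. The Python sketch below is a hypothetical illustration using the mysql-connector-python client; the host, credentials, and threshold are assumptions.

```python
import mysql.connector  # mysql-connector-python; illustrative only

MAX_QUERY_SECONDS = 60  # hypothetical safety threshold

def long_running_queries(host: str) -> list:
    """Return active statements that have been running longer than the threshold."""
    conn = mysql.connector.connect(host=host, user="maintenance", password="***")
    try:
        cursor = conn.cursor()
        cursor.execute(
            "SELECT id, time, LEFT(info, 80) FROM information_schema.processlist "
            "WHERE command <> 'Sleep' AND time > %s",
            (MAX_QUERY_SECONDS,),
        )
        return cursor.fetchall()
    finally:
        conn.close()

blockers = long_running_queries("packages-db-primary")
if blockers:
    raise SystemExit(f"aborting maintenance: {len(blockers)} long-running queries still active")
```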

March 27 12:25 UTC (lasting 1 hour and 33 minutes)

On March 27 at 12:25 UTC, we were notified of impact to pages, codespaces and issues. We resolved the incident at 13:29 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

March 29 14:21 UTC (lasting 4 hours and 29 minutes)

On March 29 at 14:21 UTC, we were notified of impact to pages, codespaces and actions. We resolved the incident at 18:50 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

March 31 01:16 UTC (lasting 52 minutes)

On March 31 at 01:16 UTC, we were notified of impact to Git operations, issues and pull requests. We resolved the incident at 02:08 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: February 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-03-01-github-availability-report-february-2023/

In February, we experienced three incidents that resulted in degraded performance across GitHub services. This report also sheds light into a January incident that resulted in degraded performance for GitHub Packages and GitHub Pages and another January incident that impacted Git users.

January 30 21:31 UTC (lasting 35 minutes)

On January 30 at 21:36 UTC, our alerting system detected a 500 error response increase in requests made to the Container registry. As a result, most builds on GitHub Pages and requests to GitHub Packages failed during the incident.

Upon investigation, we found that a change was made to the Container registry Redis configuration at 21:30 UTC to enforce authentication on Redis connections. However, an issue in the Container registry production deployment file, a hard-coded connection string, left client connections unable to authenticate, resulting in errors and failed connections.

At 22:12 UTC, we reverted the configuration change for Redis authentication. Container registry began recovering two minutes later, and GitHub Pages was considered healthy again by 22:21 UTC.

To help prevent future incidents, we improved management of secrets in the Container registry’s Redis deployment configurations and added extra test coverage for authenticated Redis connections.
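
The usual way to avoid hard-coded connection strings is to inject the Redis credentials from the deployment's secret store and verify them at startup. The Python sketch below, using the redis-py client, is a hypothetical illustration of that pattern; the environment variable names are assumptions, not the Container registry's configuration.

```python
import os
import redis  # redis-py; illustrative only

# Read the connection details from secrets injected at deploy time (shown here as
# environment variables) instead of baking a connection string into the manifest.
REDIS_URL = os.environ.get("REGISTRY_REDIS_URL", "redis://localhost:6379/0")
REDIS_PASSWORD = os.environ.get("REGISTRY_REDIS_PASSWORD")

client = redis.Redis.from_url(REDIS_URL, password=REDIS_PASSWORD)
client.ping()  # fails fast at startup if authentication is misconfigured
```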

January 30 18:35 UTC (lasting 7 hours)

On January 30 at 18:35 UTC, GitHub deployed a change which slightly altered the compression settings on source code downloads. This change altered the checksums of the resulting archive files, with unforeseen consequences for a number of communities. The contents of these files were unchanged, but many communities had come to rely on the precise layout of bytes also being unchanged. When we realized the impact, we reverted the change and communicated with affected communities.

We did not anticipate the broad impact this change would have on a number of communities and are implementing new procedures to prevent future incidents. This includes working through several improvements in our deployment of Git throughout GitHub and adding a checksum validation to our workflow.

See this related blog post for details about our plan going forward.

February 7 21:30 UTC (lasting 20 hours and 35 minutes)

On February 7 at 21:30 UTC, our monitors detected failures creating, starting, and connecting to GitHub Codespaces in the Southeast Asia region, caused by a datacenter outage of our cloud provider. To reduce the impact to our customers during this time, we redirected codespace creations to a secondary location, allowing new codespaces to be used. Codespaces in that region recovered automatically when the datacenter recovered, allowing existing codespaces in the region to be restarted. Codespaces in other regions were not impacted during this incident.

Based on learnings from this incident, we are evaluating expanding our regional redundancy and have started making architectural changes to better handle temporary regional and datacenter outages, including more regularly exercising our failover capabilities.

February 18 02:36 UTC (lasting 2 hours and 26 minutes)

On February 18 at 02:36 UTC, we became aware of errors in our application code pointing to connectivity issues with our MySQL databases. Upon investigation, we believe these errors were due to a few unhealthy deployments of our sharding middleware. At 03:30 UTC, we performed a re-deployment of the database infrastructure in an effort to remediate. Unfortunately, this propagated the issue to all Kubernetes pods, leading to system-wide errors. As a result, multiple services returned 500 error responses and GitHub users experienced issues signing in to GitHub.com.

At 04:30 UTC, we found that the database topology in 30% of our deployments was corrupted, which prevented applications from connecting to the database. We applied a copy of the correct database topology to all deployments, which resolved the errors across services by 05:00 UTC. Users were then able to sign in to GitHub.com.

To help prevent future incidents, we added a monitor to detect database topology errors so we can identify them well before they impact production systems. We have also improved our observability around topology reloads, both successful and erroneous ones. Finally, we are doing a deeper review of the contributing factors to this incident to learn and improve both our architecture and operations to prevent a recurrence.

February 28 16:05 UTC (lasting 1 hour and 26 minutes)

On February 28 at 16:05 UTC, we were notified of degraded performance for GitHub Codespaces. We resolved the incident at 17:31 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: January 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-02-01-github-availability-report-january-2023/

In January, we experienced two incidents: one that resulted in degraded performance for GitHub Packages and GitHub Pages, and another that impacted Git users.

January 30 21:48 UTC (lasting 35 minutes)

Our service monitors detected degraded performance for GitHub Packages and GitHub Pages. Most requests to the container registry were failing and some GitHub Pages builds were also impacted. We determined this was caused by a backend change and mitigated by reverting that change.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

January 30 18:35 UTC (lasting 7 hours)

We upgraded our production Git binary with a recent version from upstream. The updates included a change to use an internal implementation of gzip when generating archives. This resulted in subtle changes to the byte layout of the archives behind the “Download Source” links served by GitHub, leading to checksum mismatches. The contents of the archives were unchanged.

After becoming aware of the impact to many communities, we rolled back the compression change to restore the previous behavior.

Similar to the above, we are still investigating the contributing factors of this incident, and will provide a more thorough update in next month’s report.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: December 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2023-01-04-github-availability-report-december-2022/

In December, we did not experience any incidents that resulted in degraded performance across GitHub services. This report sheds light into an incident that impacted GitHub Packages and GitHub Pages in November.

November 25 16:34 UTC (lasting 1 hour and 56 minutes)

On November 25, 2022 at 14:39 UTC, our alerting systems detected an incident that impacted customers using GitHub Packages and GitHub Pages. The GitHub Packages team initially statused GitHub Packages to yellow, and after assessing impact, it statused to red at 15:06 UTC.

During this incident, customers experienced unavailability of packages for the container, npm, and NuGet registries. We were able to serve requests for the RubyGems and Maven registries. GitHub Packages’ unavailability also impacted GitHub Pages, as Pages builds were unable to pull packages, which resulted in CI build failures. Repository landing pages also saw timeouts while fetching package information.

GitHub Packages uses a third-party database to store data for the service, and the provider was experiencing an outage, which impacted GitHub Packages performance. The first responder connected with the provider’s support team to learn more about the region-specific outage. The provider then mitigated the issue before the first responder could fail over to another region. With the mitigation in place, GitHub Packages started to recover, along with GitHub Pages and the repository landing pages.

As follow-up action items, the team is exploring options to make GitHub Pages and repository landing pages more resilient to GitHub Packages outages. We are also investigating options for performing failovers quickly and automatically in case of regional outages.

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: November 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-12-07-github-availability-report-november-2022/

In November, we experienced two incidents that resulted in degraded performance across GitHub services. This report also sheds light into an incident that impacted GitHub Codespaces in October.

November 25 16:34 UTC (lasting 1 hour and 56 minutes)

Our alerting systems detected an incident that impacted customers using GitHub Packages and Pages. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the January Availability Report, which we will publish the first Wednesday of January.

October 26 00:47 UTC (lasting 3 hours and 47 minutes)

On October 26, 2022 at 00:47 UTC, our alerting systems detected a decrease in success rates for creates and resumes of Codespaces in the East US region. We initially statused yellow, as the incident affected only the East US region. As the incident persisted for several hours, we provided guidance to customers in the affected region to manually change their location to a nearby healthy region at 01:55 UTC, and statused red at 02:34 UTC due to the prolonged outage.

During this incident, customers were unable to create or resume Codespaces in the East US region. Customers could manually select an alternate region in which to create Codespaces, but could not do so for resumes.

Codespaces uses a third-party database to store data for the service and the provider was experiencing an outage, which impacted Codespaces performance. We were unable to immediately communicate with our East US database because our service does not currently have any replication of its regional data. Our services in the East US region returned to healthy status as soon as Codespaces engineers were able to engage with the third party to help mitigate the outage.

We identified several ways to improve our database resilience to regional outages while working with the third party during this incident and in follow up internal discussions. We are implementing regional replication and failover so that we can mitigate this type of incident more quickly in the future.

November 3 16:10 UTC (lasting 1 hour and 2 minutes)

On November 3, 2022 at 16:10 UTC, our alerting systems detected an increase in the time it took GitHub Actions workflow runs to start. We initially statused GitHub Actions to red, and after assessing impact we statused to yellow at 16:11 UTC.

During this incident, customers experienced high latency in receiving webhook deliveries, starting GitHub Actions workflow runs, and receiving status updates for in-progress runs. They also experienced an increase in error responses from repositories, pull requests, Codespaces, and the GitHub API. At its peak, a majority of repositories attempting to run a GitHub Actions workflow experienced delays longer than five minutes.

GitHub Actions listens to webhooks to trigger workflow runs, and while investigating we found that the run start delays were caused by a backup in the webhooks queue. At 16:29 UTC, we scaled out and accelerated processing of the webhooks queue as a mitigation. By 17:12 UTC, the webhooks queue was fully drained and we statused back to green.

We found that the webhook delays were caused by an inefficient database query for checking repository security advisory access, which was triggered by a high volume of poorly optimized API calls. This caused a backup in background jobs running across GitHub, which is why multiple services were impacted in addition to webhooks and GitHub Actions.

Following our investigation, we fixed the inefficient query for the repository security advisory access. We also reviewed the rate limits for this particular endpoint (as well as limits in this area) to ensure they were in line with our performance expectations. Finally, we increased the default throttling of the webhooks queue to avoid potential backups in the future. As a longer-term improvement to our resiliency, we are investigating options to reduce the potential for other background jobs to impact GitHub Actions workflows. We’ll continue to run game days and conduct enhanced training for first responders to better assess impact for GitHub Actions and determine the appropriate level of statusing moving forward.

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: October 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-11-02-github-availability-report-october-2022/

In October, we experienced four incidents that resulted in significant impact and a degraded state of availability for multiple GitHub services. This report also sheds light into an incident that impacted Codespaces in September.

October 26 00:47 UTC (lasting 3 hours and 47 minutes)

Our alerting systems detected an incident that impacted most Codespaces customers. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the November Availability Report, which we will publish the first Wednesday of December.

October 13 20:43 UTC (lasting 48 minutes)

On October 13, 2022 at 20:43 UTC, our alerting systems detected an increase in the Projects API error response rate. Due to the significant customer impact, we went to status red for Issues at 20:47 UTC. Within 10 minutes of the alert, we traced the cause to a recently-deployed change.

This change introduced a database validation that required a certain value to be present. However, it did not correctly set a default value in every scenario. This resulted in the population of null values in some cases, which produced an error when pulling certain records from the database.

We initiated a rollback of the change at 21:08 UTC. At 21:13 UTC, we began to see a steady decrease in the number of error responses from our APIs back to normal levels, changed the status of Issues to yellow at 21:24 UTC, and changed the status of Issues to green at 21:31 UTC once all metrics were healthy.

Following this incident, we have added mitigations to protect against missing values in the future, and we have improved testing around this particular area. We have also fixed our deployment dashboards, which contained some inaccurate data for pre-production errors. This will ensure that errors are more visible during the deploy process to help us prevent these issues from reaching production.
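
To sketch the idea, one way to guard against this class of problem is to enforce the default at both the application and database layers so a missed code path cannot write a null. The example below is purely illustrative (Python/SQLAlchemy rather than our actual stack, with invented table and column names).

```python
# Illustrative sketch only: a required column protected by both an
# application-level default and a database-level default.
from sqlalchemy import Column, Integer, String, text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ProjectItem(Base):
    __tablename__ = "project_items"

    id = Column(Integer, primary_key=True)
    # NOT NULL plus a server-side default: even if the application misses
    # a scenario, the database fills in a valid value instead of NULL.
    state = Column(String, nullable=False, server_default=text("'open'"), default="open")
```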

October 12 23:27 UTC (lasting 3 hours and 31 minutes)

On October 12, 2022 at 22:30 UTC, we rolled out a global configuration change for Codespaces. At 23:15 UTC, after the change had propagated to a variety of regions, we noticed new Codespace creation starting to trend downward and were alerted to issues from our monitors. At 23:27 UTC, we deemed the impact significant enough to status Codespaces yellow, and eventually red, based on continued degradation.

During the incident, we discovered that one of the older components of the backend system did not cope well with the configuration change, causing a schema conflict. This was not properly tested prior to the rollout. Additionally, this component version does not support gradual exposure across regions, so many regions were impacted at once. Once we detected the issue and determined the configuration change was the cause, we worked to carefully roll back the large schema change. Due to the complexity of the rollback, the procedure took an extended period of time. Once the rollback was complete and metrics tracking new Codespaces creations were healthy, we changed the status of the service back to green at 02:58 UTC.

After analyzing this incident, we determined we can remove our dependency on this older configuration type, and we have repair work in progress to eliminate it from Codespaces entirely. We have also verified that all future changes to any component will follow safe deployment practices (one test region followed by individual region rollouts) to avoid global impact in the future.
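
A staged rollout of that shape can be expressed as a simple gate between regions. The sketch below is hypothetical, with placeholder region names and health checks rather than our actual deployment tooling.

```python
# Hypothetical sketch of a staged rollout: one canary region, then one region
# at a time, halting if health checks regress after the change.
import time

ROLLOUT_ORDER = ["canary-region", "region-1", "region-2", "region-3"]

def apply_config_change(region: str) -> None:
    """Placeholder for pushing the configuration change to one region."""

def region_healthy(region: str) -> bool:
    """Placeholder for checking creation/start success metrics in a region."""
    return True

def staged_rollout() -> None:
    for region in ROLLOUT_ORDER:
        apply_config_change(region)
        time.sleep(15 * 60)          # bake time before evaluating metrics
        if not region_healthy(region):
            raise RuntimeError(f"halting rollout: {region} unhealthy after change")
```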

October 5 06:30 UTC (lasting 31 minutes)

On October 5, 2022 at 06:30 UTC, webhooks experienced a significant backlog of events caused by a high volume of automated user activity that generated a rapid series of create and delete operations. This activity triggered a large influx of webhook events. However, many of these events caused exceptions in our webhook delivery worker because the data needed to generate their webhook payloads had been deleted from the database. Attempting to retry these failed jobs tied up our worker, leaving it unable to process new incoming events and resulting in a severe backlog in our queues. Downstream services that rely on webhooks to receive their events were unable to receive them, which resulted in service degradation. We updated GitHub Actions to status red because the webhooks delay caused new job execution to be severely delayed.

Investigation into the source of the automated activity found automation that was creating and deleting many repositories in quick succession. As a mitigation, we disabled the automated accounts that were causing this activity in order to give us time to find a longer-term solution for such activity.

Disabling the automated accounts brought webhook deliveries back to normal, and the backlog was mitigated at 07:01 UTC. We also updated our webhook delivery workers to stop retrying any job whose underlying data no longer exists in the database. Once the fix was in place, the accounts were re-enabled and we encountered no further problems with our workers. We recognize that our services must be resilient to spikes in load and will make improvements based on what we’ve learned in this incident.
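
In essence, the change to the delivery workers amounts to distinguishing permanent failures from transient ones. The sketch below is hypothetical and not our actual worker code; the exception types and calls are placeholders.

```python
# Hypothetical sketch: retry transient delivery failures, but drop jobs whose
# source data has already been deleted, so they cannot tie up the worker.
class RecordNotFound(Exception):
    """Raised when the data needed to build a webhook payload no longer exists."""

class TransientDeliveryError(Exception):
    """Raised for failures that are worth retrying."""

def process_delivery(job, load_payload, deliver, requeue) -> None:
    try:
        payload = load_payload(job)      # may raise RecordNotFound
        deliver(payload)                 # may raise TransientDeliveryError
    except RecordNotFound:
        # The repository (or other record) was deleted; retrying can never
        # succeed, so drop the job instead of requeuing it.
        return
    except TransientDeliveryError:
        requeue(job)
```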

September 28 03:53 UTC (lasting 1 hour and 16 minutes)

On September 27, 2022 at 23:14 UTC, we performed a routine secret rotation procedure on Codespaces. On September 28, 2022 at 03:21 UTC, we received an internal report stating that port forwarding was not working on the Codespaces web client and began investigating. At 03:53 UTC, we statused yellow due to the broad user impact we were observing. Upon investigation, we found that we had missed a step in the secret rotation checklist a few hours earlier, which caused some downstream components to fail to pick up the new secret. This resulted in some traffic not reaching backend services as expected. At 04:29 UTC, we ran the missed rotation step, after which we quickly saw the port forwarding feature returning to a healthy state. At this point, we considered the incident to be mitigated.

We investigated why we did not receive automated alerts about this issue and found that our alerts were monitoring error rates but did not alert on a lack of overall traffic to the port forwarding backend. We have since improved our monitoring to include anomalies in traffic levels that cover this failure mode.

Several hours later, at 17:18 UTC, our monitors alerted us of an issue in a separate downstream component, which was similarly caused by the previously missed secret rotation step. We could see Codespaces creation and start failures increasing in all regions. The effect from the earlier secret rotation was not immediate because this secret is used in exchange for a token, which is cached for up to 24 hours. Our understanding was that the system would pick up the new secret without intervention, but in reality this secret was picked up only if the process was restarted. At 18:27 UTC, we restarted the service in all regions and could see that the VM pools, which were heavily drained before, started to slowly recover. To accelerate the draining of the backlog of queued jobs, we increased the pool size at 18:45 UTC. This helped all but two pools in West Europe, which were still not recovering. At 19:44 UTC, we identified an instance of the service in West Europe that had not been rotated along with the rest. We rotated that instance and quickly saw a recovery in the affected pools.

After the incident, we identified why multiple downstream components failed to pick up the rotated secret. We then added additional monitoring to identify which secret versions are in use across all components in the service to more easily track and verify secret rotations. To address this short term, we have updated our secret rotation checklist to include the missing steps and added additional verification steps to ensure the new secrets are picked up everywhere. Longer term, we are automating most of our secret rotation processes to avoid human error.
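
The kind of verification described here can be as simple as comparing the secret version each component reports against the latest version in the secret store and flagging anything stale. The sketch below is hypothetical, with invented component names and data sources.

```python
# Hypothetical sketch: find components still running on an old secret version.
def stale_components(latest_version: str, versions_in_use: dict) -> list:
    """Return components whose reported secret version is not the latest."""
    return [
        component
        for component, version in versions_in_use.items()
        if version != latest_version
    ]

# Example: one instance was missed during rotation and shows up as stale.
in_use = {"token-service-eastus": "v42", "token-service-westeurope": "v41"}
assert stale_components("v42", in_use) == ["token-service-westeurope"]
```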

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: September 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-10-05-github-availability-report-september-2022/

In September, we experienced one incident that resulted in significant impact and a degraded state of availability for multiple GitHub services. We also experienced one incident resulting in significant impact to Codespaces. We are still investigating that incident and will include it in next month’s report. This report also sheds light on an incident that impacted Codespaces in August and an incident that impacted GitHub Actions in August.

September 8 19:44 UTC (lasting 5 hours and 11 minutes)

On September 8, 2022 at 19:44 UTC, our monitoring detected an increase in the number of pull request merge failures. The impact was concentrated on Enterprise Managed Users (EMUs) with a small number of bot accounts also affected.

Within 45 minutes, we traced the cause to a data transition that removed inconsistent data from profile records. Unfortunately, the transition incorrectly operated on EMU accounts, removing some data that is required to successfully merge pull requests via the UI and our API. CLI merges were unaffected.

We restored the data from backup, but this took longer than we had anticipated. We simultaneously pursued a workaround in code, but opted not to proceed with it as it could have introduced data inconsistencies. Our restore operation resolved the issue, and our pull request monitors had recovered by September 9, 2022 at 00:55 UTC.

Following this incident, we have made changes to our data transition procedures to allow for faster restores and transitions that can be automatically rolled back without relying on backups. We are also working on multiple improvements to our testing processes as they relate to EMUs.

September 28 03:53 UTC (lasting 1 hour and 16 minutes)

Our alerting systems detected an incident that impacted most Codespaces customers. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the October Availability Report, which we will publish the first Wednesday of November.

Follow up to August 29 12:51 UTC (lasting 5 hours and 40 minutes)

On August 29, 2022 at 12:51 UTC, our monitoring detected an increase in Codespaces create and start errors. We also started seeing DNS-related networking errors in some running Codespaces where outbound DNS resolutions were failing. At 14:19 UTC, we updated the status for Codespaces from yellow to red due to broad user impact.

This incident was caused by an Ubuntu security patch in systemd that broke DNS resolution. In recent versions of Ubuntu, unattended upgrades for security fixes are enabled by default. Codespaces host VMs were using the default recommended settings to apply security patches automatically on running VMs. When this patch was published, Codespaces host VMs started installing and applying the patch after the VM was created. Once the patch was installed on a VM, DNS resolution was broken. Depending on when the patch was installed on the host VM, this led to a few different failure modes, including failures creating or starting Codespaces, or failures making outbound network calls inside a codespace that was already running.

Once we identified systemd’s DNS resolver configuration as the source of these errors, we were able to mitigate the issue by disabling systemd’s DNS resolver and manually configuring an upstream DNS resolver IP address. We deployed a change to the DNS configuration on the host VMs at 18:13 UTC. By 18:21 UTC, we started seeing positive signs of recovery in our metrics and changed the status to yellow. Ten minutes later, at 18:31 UTC, all metrics were fully healthy and the incident was resolved.

Following this incident, we are updating our DNS configuration to reduce dependencies on systemd’s DNS resolver. We are also investigating whether we should continue to use unattended upgrades for security patches. Disabling unattended upgrades will give us more deterministic behavior at runtime, preventing external changes from breaking Codespaces. We will remain fully capable of quickly patching VMs across our fleet even with unattended upgrades disabled.

Follow up to August 18 14:33 UTC (lasting 3 hours and 23 minutes)

This incident occurred in August but was left out of the August report because it did not result in a widespread outage. Several GitHub Actions customers experienced issues because of the degradation, so we decided to include it retroactively.

At 14:13 UTC, there was a sudden spike in traffic to GitHub Actions which resulted in a higher than usual write load on our services. A majority of our services handled this gracefully, but one of our internal services that is used for generating security tokens started returning 503 Service Unavailable errors to requests, triggering an alert to the engineering team. Further investigation revealed that the token database was experiencing a performance degradation which, compounded by the increased load, caused us to hit the database’s max concurrent connections limit. This was made worse by a mismatch between our client-side throttling limits and database capacity, which resulted in our throttling thresholds allowing more traffic than the database had capacity to handle.

We mitigated the issue by scaling up the impacted database while also allowing a higher number of concurrent connections to it. The impacted service went back to a healthy state and the incident was considered resolved at 17:36 UTC. In addition to the immediate actions, we have improved our monitoring and alerting to allow faster remediation. We are also evaluating changes to our throttling mechanisms to better account for this traffic pattern.
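
Keeping client-side throttling aligned with database capacity generally means deriving the per-client limit from the database’s connection budget. The sketch below is hypothetical, with invented numbers, and is not our actual throttling mechanism.

```python
# Hypothetical sketch: cap concurrent database work per client so the combined
# clients stay below the database's max connection limit, with headroom.
import threading

DB_MAX_CONNECTIONS = 500
CLIENT_INSTANCES = 20
PER_CLIENT_LIMIT = int(DB_MAX_CONNECTIONS * 0.8 / CLIENT_INSTANCES)

db_slots = threading.BoundedSemaphore(PER_CLIENT_LIMIT)

def run_query(execute, statement):
    # Blocks rather than opening a connection the database cannot take.
    with db_slots:
        return execute(statement)
```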

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: August 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-09-07-github-availability-report-august-2022/

In August, we experienced one incident resulting in significant impact and a degraded state of availability for Codespaces. This report also sheds light on an incident that impacted Codespaces in July.

August 29 12:51 UTC (lasting 5 hours and 40 minutes)

Our alerting systems detected an incident that impacted most Codespaces customers. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the September Availability Report, which will publish the first Wednesday of October.

Follow up to July 27 22:29 UTC (lasting 7 hours and 55 minutes)

As mentioned in the July Availability Report, we are now providing a more detailed update on this incident following further investigation. During this incident, a subset of codespaces in the East US and West US regions using 2-core and 4-core machine types could not be created or restarted.

On July 27, 2022 at approximately 21:30 UTC, we started experiencing a high rate of failures creating new virtual machines (VMs) for Codespaces in the East US and West US regions. The rate of codespace creations and starts on the 2-core and 4-core machine types exceeded the rate at which we could successfully create the VMs needed to run them, which eventually led to resource exhaustion of the underlying VM pools. At 22:29 UTC, the pools for 2-core and 4-core VMs were drained and unable to keep up with demand, so we statused yellow. Impacted codespaces took longer than normal to start while waiting for an available VM, and many ended up timing out and failing.

Each codespace runs on an isolated VM for security. The Codespaces platform builds a host VM image on a regular cadence, and then all host VMs are instantiated from that base image. This incident started when our cloud provider began rolling out an update in the East US and West US regions that was incompatible with the way we built our host VM image. Troubleshooting the failures was difficult because our cloud provider was reporting that the VMs were being created successfully even though some critical processes that were required to be started during VM creation were not running.

We applied temporary mitigations, including scaling up our VM pools to absorb the high failure rate, as well as adjusting timeouts to accelerate failure for VMs that were unlikely to succeed. While these mitigations helped, the failure rate continued to increase as our cloud provider’s update rolled out more broadly. Our cloud provider recommended adjusting our image generalization process in a way that would work with the new update. Once we made the recommended change to our image build pipeline, VM creation success rates recovered, allowing the backlog of queued codespace creation and start requests to be fulfilled.

Following this incident, we have audited our VM image building process to ensure it aligns with our cloud provider’s guidance to prevent similar issues going forward. In addition, we have improved our service logic and monitoring to be able to verify that all critical operations are executed during VM creation rather than only looking at the result reported by our cloud provider. We have also updated our alerting to detect VM creation failures earlier before there is any user impact. Together, these changes will prevent this class of issue from happening, detect other failure modes earlier, and enable us to quickly diagnose and mitigate other VM creation errors in the future.
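
Verifying critical operations rather than trusting the provider’s reported result can be as simple as checking for required processes before a VM is added to a pool. The sketch below is hypothetical, with placeholder process names rather than the real Codespaces components.

```python
# Hypothetical sketch: accept a freshly created VM only if every critical
# boot-time process is actually running, regardless of what the cloud
# provider's creation API reported.
REQUIRED_PROCESSES = ["codespace-agent", "port-forwarder"]

def vm_ready(running_processes: set) -> bool:
    """True only if every required process was started during VM creation."""
    missing = [p for p in REQUIRED_PROCESSES if p not in running_processes]
    return not missing

# A VM reported as "created" but missing an agent is rejected.
assert vm_ready({"codespace-agent", "port-forwarder"})
assert not vm_ready({"port-forwarder"})
```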

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on by visiting the GitHub Engineering Blog.

GitHub Availability Report: July 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-08-03-github-availability-report-july-2022/

In July, we experienced one incident that resulted in degraded performance for Codespaces. This report also sheds light on two incidents in June that impacted multiple GitHub.com services.

July 27 22:29 UTC (lasting 5 hours and 55 minutes)

Our alerting systems detected degraded availability for Codespaces in the US West and East regions during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the August Availability Report, which will publish the first Wednesday of September.

Follow up to June 28 17:16 UTC (lasting 26 minutes)

During this incident, Codespaces was made unavailable due to issues introduced when migrating a DNS record to a new load balancer.

Codespaces runs a set of microservices in each region where Codespaces can be created. In order to route requests to the nearest region for each user, we have a global DNS record that uses a load balancer to resolve to the nearest regional backend. When performing an infrastructure migration, we needed to switch this record to point to a new load balancer. To do that, we deleted the existing global record so we could replace it with a record that pointed to the new load balancer. Unfortunately, adding the replacement record failed, so any requests that relied on the global DNS record to reach Codespaces services were denied. Our alerting systems detected this almost immediately; however, our attempt to roll back the DNS update and switch to the old configuration also failed. We then disabled an endpoint in the old load balancer, after which the rollback succeeded and all metrics recovered (after some time due to DNS caching and TTLs).

As a follow-up, we are investigating safer mechanisms for testing the new load balancers and atomic DNS record updates, including setting up a mirrored testing DNS zone. We are also following up with our cloud provider to understand why the initial rollback failed and whether this is a bug.
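
One shape an atomic, make-before-break DNS update can take is to add and verify the new record before removing the old one, so resolution never has a gap. The sketch below is hypothetical; the `dns` client object and its methods are placeholders rather than any specific provider’s API.

```python
# Hypothetical sketch: make-before-break DNS cutover with verification.
def cutover(dns, name: str, old_target: str, new_target: str) -> None:
    dns.add_record(name, new_target)          # both targets answer during the switch
    if not dns.resolves(name, expect=new_target):
        dns.remove_record(name, new_target)   # roll back without touching the old path
        raise RuntimeError("new target failed verification; old record left in place")
    dns.remove_record(name, old_target)       # old record removed only after success
```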

Follow up to June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. This was due to excessive load on a proxy server that routes traffic to the database.

At approximately 14:14 UTC, the internal APIs that a data migration service uses to communicate with GitHub.com began returning 502 Bad Gateway errors to requests. This migration service allows customers to migrate to GitHub.com from other external sources, including GitHub Enterprise Server. As part of its exception handling, the service contains retry logic to requeue jobs. However, this logic captured all exceptions rather than just a subset. The 502 errors it caught triggered a bug that caused jobs to continuously requeue themselves to be retried. The situation quickly escalated when hundreds of thousands of jobs made identical API requests, overwhelming the database’s proxy server.

We mitigated the situation by pausing the processing of all new customer-initiated migrations performed with the data migration service at 15:07 UTC. We also pruned the queues of all jobs associated with in-progress migrations to alleviate the pressure on the proxy server. Approximately nine minutes later, we began to see affected services recover.

We have updated exception handling to only retry jobs in cases of a specific set of errors. We have also adjusted our logic to retry a fixed number of times before logging the exception and giving up. These actions eliminate the possibility of continuous requeuing. We are also investigating whether changes are needed to the rate limits of our internal APIs.
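
The corrected retry behavior comes down to retrying only a specific set of errors and giving up after a fixed number of attempts. The sketch below is hypothetical and not the migration service’s actual code; the exception types and job interface are illustrative.

```python
# Hypothetical sketch: bounded retries limited to known-transient errors.
import logging

RETRYABLE = (TimeoutError, ConnectionError)
MAX_ATTEMPTS = 5

def run_job(job, attempt: int, requeue) -> None:
    try:
        job()
    except RETRYABLE as exc:
        if attempt < MAX_ATTEMPTS:
            requeue(job, attempt + 1)
        else:
            logging.exception("giving up after %d attempts: %s", attempt, exc)
    except Exception:
        # Non-retryable errors (such as a 502 from an upstream API) are logged
        # and dropped rather than requeued indefinitely.
        logging.exception("non-retryable failure; not requeuing")
```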

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: June 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-07-06-github-availability-report-june-2022/

In June, we experienced four incidents resulting in significant impact and a degraded state of availability for multiple GitHub.com services. This report also sheds light on an incident that impacted multiple GitHub.com services in May.

June 1 09:40 UTC (lasting 48 minutes)

During this incident, customers experienced delays in the start up of their GitHub Actions workflows. The cause of these delays was excessive load on a proxy server that routes traffic to the database.

At 09:37 UTC, the Actions service saw a marked increase in the time it took customer jobs to start. Our on-call engineer was paged and Actions was statused red. Once we started to investigate, we noticed that the pods running the proxy server for the database were crash-looping due to out-of-memory errors. A change was created to increase the available memory for these pods, which fully rolled out by 10:08 UTC. We started to see recovery in Actions even before 10:08 UTC, and statused to yellow at 10:17 UTC. By 10:28 UTC, we were confident that the memory increase had mitigated the issue, and statused Actions green.

Ultimately, this issue was traced back to a set of data analysis queries being pointed at an incorrect database. The large load they placed on the database caused the crash loops and the broader impact. These queries have been moved to a dedicated analytics setup that does not serve production traffic.

We are adding alerts to identify increases in load to the proxy server to catch issues like this early. We are also investigating how we can put in guardrails to ensure production database access is limited to services that own the data.

June 21 17:02 UTC (lasting 1 hour and 10 minutes)

During this incident, shortly after the GA of Copilot, users with either a Marketplace or Sponsorship plan were unable to use Copilot. Users with those subscriptions received an error from our API responsible for creating authentication tokens. This impacted a little less than 20% of our active users at the time.

At approximately 16:45 UTC, we were alerted to elevated error rates in the API and began investigating causes. We were able to identify the issue and statused red. Our engineers worked quickly to roll out a fix to the API endpoint, and we saw API error rates begin to drop at approximately 17:45 UTC. By 18:00 UTC, we were no longer seeing this issue but decided to wait 10 more minutes before statusing back to green to ensure there were no regressions.

We have increased our testing around this particular combination of subscription types, added these scenarios to our user testing and will add additional data shape testing before future rollouts.

June 28 17:16 UTC (lasting 26 minutes)

Our alerting systems detected degraded availability for Codespaces during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the July Availability Report, which will be published the first Wednesday of August.

June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the July Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow up to May 27 04:26 UTC (lasting 21 minutes) and May 27 07:36 UTC (lasting 1 hour and 21 minutes)

As mentioned in the May Availability Report, we are now providing a more detailed update on this incident following further investigation.

Both instances, at 04:26 and 07:36 UTC, were caused by the same contributing factors. In the first instance, an individual service team noticed higher than normal load and an increase in error rate on API requests and statused red. The load was particularly high on our login endpoint. While this did elevate error rates, it was not enough to cause a widespread outage, and we likely should have statused yellow in this instance.

After follow-up that indicated the load pattern had subsided, our on-call team determined it was safe to report the situation was mitigated and began to investigate further.

However, three hours later, we again experienced a degradation of service from a sustained high load in traffic. This was again concentrated on our login endpoint. We statused all services red, since we were seeing sustained error rates for a variety of clients and situations, and then updated individual service statuses based on their SLOs. Services that were affected by the load pattern statused to yellow, while services that were not impacted statused back to green.

Impact to GitHub.com from the second instance of the load pattern lasted about 15 minutes. We continued to see elevated traffic during this time and waited until a network-level mitigation was rolled out before statusing all affected services back to green.

In addition to the network-level mitigation, we used data from this incident to add application-side mitigations for sustained load of this type, as well as to inform architectural changes we can make in the future to make our services more resilient.
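
One common application-side mitigation for sustained load on a single endpoint is a token bucket that sheds excess requests before they reach backend services. The sketch below is hypothetical, with invented rate and burst numbers, and is not our actual login protection.

```python
# Hypothetical sketch: a token-bucket limiter that rejects excess requests to
# an expensive endpoint once sustained traffic exceeds the allowed rate.
import time

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller returns 429 instead of doing the expensive work

login_limiter = TokenBucket(rate_per_second=1000, burst=2000)
```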

Following this incident, we are improving our on-call procedures to ensure we always report the correct status level based on SLO review. While we always want to over-communicate issues with customers for awareness, we want to only status red when necessary.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on by visiting the GitHub Engineering Blog.