
GitHub Availability Report: July 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-08-03-github-availability-report-july-2022/

In July, we experienced one incident that resulted in degraded performance for Codespaces. This report also sheds light on two incidents in June that impacted multiple GitHub.com services.

July 27 22:29 UTC (lasting 5 hours and 55 minutes)

Our alerting systems detected degraded availability for Codespaces in the US West and East regions during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the August Availability Report, which will be published the first Wednesday of September.

Follow-up to June 28 17:16 UTC (lasting 26 minutes)

During this incident, Codespaces was made unavailable due to issues introduced when migrating a DNS record to a new load balancer.

Codespaces runs a set of microservices in each region where Codespaces can be created. To route requests to the nearest region for each user, we have a global DNS record that uses a load balancer to resolve to the nearest regional backend. When performing an infrastructure migration, we needed to switch this record to point to a new load balancer. To do that, we deleted the existing global record so we could replace it with one that pointed to the new load balancer. Unfortunately, adding the replacement record failed, so any requests made to the global DNS record that pointed to Codespaces services were denied. Our alerting systems detected this almost immediately; however, our attempt to roll back the DNS update and switch to the old configuration also failed. We then disabled an endpoint in the old load balancer, after which the rollback succeeded and all metrics recovered (after some time, due to DNS caching and TTLs).

As a follow-up, we are investigating safer mechanisms for testing the new load balancers and atomic DNS record updates, including setting up a mirrored testing DNS zone. We are also following up with our cloud provider to understand why the initial rollback failed and whether this is a bug.
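To illustrate the difference between the delete-then-create sequence described above and an atomic record swap, here is a minimal Python sketch. The zone client and record names are hypothetical; this is not the actual tooling or cloud provider API involved. It only shows why a single upsert leaves no window in which the name resolves to nothing.

```python
# Hypothetical sketch: the zone client, record names, and API shape are
# illustrative assumptions, not actual DNS tooling or a cloud provider's API.

class DnsZone:
    """Minimal in-memory stand-in for a managed DNS zone."""
    def __init__(self):
        self.records = {}  # record name -> load balancer target

    def delete(self, name):
        del self.records[name]

    def create(self, name, target):
        if name in self.records:
            raise ValueError(f"record {name} already exists")
        self.records[name] = target

    def upsert(self, name, target):
        # Atomic from the resolver's point of view: the name always resolves,
        # either to the old target or to the new one, never to nothing.
        self.records[name] = target


def migrate_non_atomic(zone, name, new_lb):
    # Failure window: if create() fails after delete(), the name resolves to
    # nothing and every request against it is denied.
    zone.delete(name)
    zone.create(name, new_lb)  # any failure here leaves no record at all


def migrate_atomic(zone, name, new_lb):
    # Single update: a failure leaves the old record in place and serving traffic.
    zone.upsert(name, new_lb)
```

With the upsert approach, a failed update leaves the previous record serving traffic, which is the property an atomic DNS record update is meant to provide.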

Follow-up to June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. This was due to excessive load on a proxy server that routes traffic to the database.

At approximately 14:14 UTC, the internal APIs that a data migration service uses to communicate with GitHub.com began returning 502 Bad Gateway errors to requests. This migration service allows customers to migrate to GitHub.com from other external sources, including GitHub Enterprise Server. As part of its exception handling, the service contains retry logic to requeue jobs. However, this logic captured all exceptions rather than just a subset, and the 502 errors it caught triggered a bug that caused jobs to continuously requeue themselves to be retried. The situation quickly escalated when hundreds of thousands of jobs made identical API requests, overwhelming the database’s proxy server.

We mitigated the situation by pausing the processing of all new customer-initiated migrations performed with the data migration service at 15:07 UTC. We also pruned the queues of all jobs associated with in-progress migrations to alleviate the pressure on the proxy server. Approximately nine minutes later, we began to see affected services recover.

We have updated exception handling to only retry jobs in cases of a specific set of errors. We have also adjusted our logic to retry a fixed number of times before logging the exception and giving up. These actions eliminate the possibility of continuous requeuing. We are also investigating whether changes are needed to the rate limits of our internal APIs.
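As a rough illustration of that change, the sketch below retries only a narrow, named set of errors and gives up after a fixed number of attempts. The error types, attempt limit, and job shape are illustrative assumptions rather than the migration service’s actual code.

```python
import logging
from queue import Queue

# Illustrative sketch only: the job shape, error types, and limits are
# assumptions, not the actual migration service's implementation.
RETRYABLE_ERRORS = (TimeoutError, ConnectionError)  # a narrow subset, not Exception
MAX_ATTEMPTS = 5


def process(job, queue: Queue, attempt: int = 1) -> None:
    try:
        job()  # the unit of migration work
    except RETRYABLE_ERRORS:
        if attempt >= MAX_ATTEMPTS:
            # Stop after a fixed number of tries instead of requeueing forever.
            logging.exception("job failed after %d attempts; giving up", attempt)
            return
        queue.put((job, attempt + 1))  # bounded requeue
    except Exception:
        # Anything else is logged and dropped, never retried.
        logging.exception("job failed with a non-retryable error")
```

Bounding both the set of retryable errors and the number of attempts removes the failure mode where a persistent upstream error turns every job into an infinite requeue loop.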

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: June 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-07-06-github-availability-report-june-2022/

In June, we experienced four incidents resulting in significant impact and degraded state of availability for multiple GitHub.com services. This report also sheds light on an incident that impacted multiple GitHub.com services in May.

June 1 09:40 UTC (lasting 48 minutes)

During this incident, customers experienced delays in the startup of their GitHub Actions workflows. The cause of these delays was excessive load on a proxy server that routes traffic to the database.

At 09:37 UTC, the Actions service observed a marked increase in the time it took customer jobs to start. Our on-call engineer was paged and Actions was statused red. Once we started to investigate, we noticed that the pods running the proxy server for the database were crash-looping due to out-of-memory errors. A change was created to increase the available memory for these pods, which was fully rolled out by 10:08 UTC. We started to see recovery in Actions even before 10:08 UTC and statused to yellow at 10:17 UTC. By 10:28 UTC, we were confident that the memory increase had mitigated the issue, and we statused Actions green.

Ultimately, this issue was traced back to a set of data analysis queries being pointed at an incorrect database. The large load they placed on the database caused the crash loops and the broader impact. These queries have been moved to a dedicated analytics setup that does not serve production traffic.

We are adding alerts to identify increases in load to the proxy server to catch issues like this early. We are also investigating how we can put in guardrails to ensure production database access is limited to services that own the data.
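One way to express that kind of guardrail is a connection router that only hands out a production cluster to the service that owns its data, so analytics workloads cannot be pointed at a production database by mistake. This is a hypothetical sketch; the service names and ownership map are assumptions, not the actual topology.

```python
# Hypothetical guardrail sketch: cluster names and the ownership map are
# illustrative assumptions, not the actual database topology.
OWNERSHIP = {
    "actions": "mysql-actions",
    "analytics": "mysql-analytics",  # dedicated setup, serves no production traffic
}


def connection_target(service: str, requested_cluster: str) -> str:
    """Only hand out a production cluster to the service that owns its data."""
    owned = OWNERSHIP.get(service)
    if owned is None:
        raise PermissionError(f"{service} has no registered database cluster")
    if requested_cluster != owned:
        raise PermissionError(
            f"{service} may not connect to {requested_cluster}; use {owned}"
        )
    return owned
```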

June 21 17:02 UTC (lasting 1 hour and 10 minutes)

During this incident, shortly after the GA of Copilot, users with either a Marketplace or Sponsorship plan were unable to use Copilot. Users with those subscriptions received an error from our API responsible for creating authentication tokens. This impacted a little less than 20% of our active users at the time.

At approximately 16:45 UTC, we were alerted to elevated error rates in the API and began investigating the cause. We were able to identify the issue and statused red. Our engineers worked quickly to roll out a fix to the API endpoint, and we saw API error rates begin to drop at approximately 17:45 UTC. By 18:00 UTC, we were no longer seeing the issue, but we decided to wait 10 more minutes before statusing back to green to ensure there were no regressions.

We have increased our testing around this particular combination of subscription types, added these scenarios to our user testing and will add additional data shape testing before future rollouts.

June 28 17:16 UTC (lasting 26 minutes)

Our alerting systems detected degraded availability for Codespaces during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the July Availability Report, which will be published the first Wednesday of August.

June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the July Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow-up to May 27 04:26 UTC (lasting 21 minutes) and May 27 07:36 UTC (lasting 1 hour and 21 minutes)

As mentioned in the May Availability Report, we are now providing a more detailed update on this incident following further investigation.

Both instances, which occurred at 04:26 and 07:36 UTC, were caused by the same contributing factors. In the first instance, an individual service team noticed higher than normal load and an increase in error rate on API requests and statused red. The load was particularly high on our login endpoint. While this did elevate error rates, it was not enough to cause a widespread outage, and we likely should have statused yellow in this instance.

After follow-up that indicated the load pattern had subsided, our on-call team determined it was safe to report the situation was mitigated and began to investigate further.

However, three hours later, we again experienced a degradation of service from a sustained high load in traffic. This was again concentrated on our login endpoint. We statused all services red, since we were seeing sustained error rates for a variety of clients and situations, and then updated individual service statuses based on their SLOs. Services that were affected by the load pattern statused to yellow, while services that were not impacted statused back to green.

The duration of impact to GitHub.com from the second instance of the load pattern lasted about 15 minutes. We continued to see elevated traffic during this time and waited until a network-level mitigation was rolled out before statusing all affected services back to green.

In addition to the network-level mitigation, we were able to use the data from this incident to add mitigations on the application side for a sustained load of this type, as well as to inform architectural changes we can make in the future to make our services more resilient.

Following this incident, we are improving our on-call procedures to ensure we always report the correct status level based on SLO review. While we always want to over-communicate issues with customers for awareness, we want to only status red when necessary.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on at the GitHub Engineering Blog.

GitHub Availability Report: May 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-06-01-github-availability-report-may-2022/

In May, we experienced three distinct incidents that resulted in significant impact and degraded state of availability for multiple services across GitHub.com. This report also sheds light on the billing incident that impacted GitHub Actions and Codespaces users in April.

May 20 09:44 UTC (lasting 49 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Container registry databases. When we received the alert, we immediately began investigating. Because of the preemptive monitoring added after the April 25 incident at 8:59 UTC, the on-call engineer was already watching these metrics and was prepared to mitigate this incident.

As CPU utilization on the database continued to rise, the Container registry began responding to requests with increased latency, followed by an internal server error for a percentage of requests. At this point we knew there was customer impact and changed the public status of the service. This increased CPU activity was due to a high volume of the “Put Manifest” command. Other package registries were not impacted.

The increased CPU utilization occurred because the throttling criteria configured on the API side for this command were too permissive, and a database query was found to perform poorly at that scale. This caused an outage for anyone using the GitHub Container registry: users experienced latency when pushing or pulling packages, as well as slow access to the packages UI.

To limit impact, we throttled requests from all organizations and users. To restore normal operation, we had to reset our database state by restarting our front-end servers and then the database.

To avoid this in the future, we have added separate rate limits per operation type and per organization/user, and we will continue working on performance improvements for SQL queries.
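The sketch below shows the general shape of per-operation, per-organization rate limiting: a sliding-window counter keyed by (organization, operation), so one noisy command from one tenant is throttled without affecting others. The limits and key names are illustrative assumptions, not the registry’s actual configuration.

```python
import time
from collections import defaultdict

# Sketch of per-(org, operation) rate limiting; the limits and operation names
# are illustrative assumptions, not the registry's actual configuration.
LIMITS = {"put_manifest": 100}  # allowed requests per window for this operation


class RateLimiter:
    def __init__(self, limits, window_seconds=60):
        self.limits = limits
        self.window = window_seconds
        self.counters = defaultdict(list)  # (org, operation) -> request timestamps

    def allow(self, org: str, operation: str) -> bool:
        limit = self.limits.get(operation)
        if limit is None:
            return True  # this operation is not rate limited
        now = time.monotonic()
        key = (org, operation)
        # Drop timestamps that have aged out of the window.
        self.counters[key] = [t for t in self.counters[key] if now - t < self.window]
        if len(self.counters[key]) >= limit:
            return False  # throttle this org for this operation only
        self.counters[key].append(now)
        return True
```

A request handler would call `allow(org, "put_manifest")` before doing any database work and return an HTTP 429 when it comes back false, keeping a single hot operation from a single tenant from saturating the database.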

May 27 04:26 UTC (lasting 21 minutes)

Our alerting systems detected degraded availability for API requests during this time. Due to the recency of these incidents, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the June Availability Report, which will be published the first Wednesday of July.

May 27 07:36 UTC (lasting 1 hour and 21 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the June Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow-up to April 14 20:35 UTC (lasting 4 hours and 53 minutes)

As we mentioned in the April Availability Report, we are now providing a more detailed update on this incident following further investigation.

On April 14, GitHub Actions and Codespaces customers started reporting incorrect charges for metered services shown in their GitHub billing settings. As a result, customers were hitting their GitHub spending limits and were unable to run new Actions or create new Codespaces. We immediately started an incident bridge. Our first step was to unblock all customers by granting unlimited Actions and Codespaces usage at no additional charge for the duration of the incident.

By looking at the timing and the list of recently pushed changes, we determined that the issue was caused by a code change in the metered billing pipeline. While attempting to improve performance of our metered usage processor, Actions and Codespaces minutes were mistakenly multiplied by 1,000,000,000 to convert gigabytes into bytes, even though this conversion was not necessary for these products. The bug came from a change to shared metered billing code that was not expected to impact these products.

To fix the issue, we reverted the code change and started repairing the corrupted data recorded during the incident. We did not re-enable metered billing for GitHub products until we had repaired the incorrect billing data, which happened 24 hours after this incident.

To prevent this class of incident in the future, we added a Rubocop (Ruby static code analyzer) rule to block pull requests containing unsafe billing code updates. In addition, we added anomaly monitoring for billed quantities, so that we are alerted before customers are impacted. We also tightened the release process to require a feature flag and end-to-end tests when shipping such changes.
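A simplified sketch of the scoping problem described above: unit conversion should be keyed by each product’s billing unit, so a gigabytes-to-bytes conversion can never touch products metered in minutes. The product names and unit table are assumptions for illustration, not the actual billing pipeline.

```python
# Illustrative sketch: product names and billing units are assumptions, not
# the actual metered billing pipeline.
GIGABYTE = 1_000_000_000

BILLING_UNIT = {
    "packages_storage": "bytes",  # storage products are billed in bytes
    "actions": "minutes",         # compute products are billed in minutes
    "codespaces": "minutes",
}


def normalize_usage(product: str, quantity: float) -> float:
    """Convert gigabytes to bytes only for products metered in bytes."""
    if BILLING_UNIT.get(product) == "bytes":
        return quantity * GIGABYTE
    return quantity  # minute-based quantities pass through untouched


# Cheap regression guards: minutes are never multiplied, bytes always are.
assert normalize_usage("actions", 30) == 30
assert normalize_usage("packages_storage", 2) == 2 * GIGABYTE
```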

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on on the GitHub Engineering Blog.

GitHub Availability Report: April 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-05-04-github-availability-report-april-2022/

In April, we experienced three distinct incidents resulting in significant impact and degraded state of availability for Codespaces and GitHub Packages.

April 1 07:07 UTC (lasting 5 hours and 32 minutes)

Our alerting detected an increase in failures to create new Codespaces and start existing stopped Codespaces in the US West region. We immediately updated the GitHub status page and began to investigate.

Upon further investigation, we determined that some secrets used by the Codespaces service had expired. Codespaces maintains warm pools of resources to protect our users from intermittent failures in our dependent services. However, in the US West region, those pools had emptied due to the expired secret. In this case, we didn’t have an early enough warning that pools were reaching low thresholds, and we ran out of capacity before we could react. As we worked to mitigate the incident, the pools in other regions also emptied due to the expired secret, and those regions began to see failures as well.

A limited number of GitHub engineers had access to rotate the secret, and communication issues delayed the start of the secret refresh process. The expired secret was eventually refreshed and rolled out to all regions, and the service was returned to full operation.

To prevent this failure pattern in the future, we now verify resources that are due to expire and have monitors in place that alert well in advance if pool resources are not being maintained. We’ve also added monitors to notify us earlier when we approach resource exhaustion limits. In addition, we’ve started migrating the service to a mechanism that doesn’t rely on secrets or the need to rotate credentials.
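The sketch below shows the spirit of those monitors: alert while there is still pooled capacity left, and well before a credential expires. The thresholds, pool shape, and alert hook are assumptions, not the actual monitoring configuration.

```python
from datetime import datetime, timedelta, timezone

# Sketch only: thresholds, the pool shape, and the alert hook are illustrative
# assumptions, not the actual Codespaces monitoring configuration.
MIN_POOL_SIZE = 10
SECRET_EXPIRY_WARNING = timedelta(days=14)


def check_region(region: str, pool_size: int, secret_expires_at: datetime, alert) -> None:
    """Expects a timezone-aware expiry timestamp; `alert` is any callable sink."""
    now = datetime.now(timezone.utc)
    if pool_size < MIN_POOL_SIZE:
        # Alert while there is still capacity to absorb creation requests.
        alert(f"{region}: warm pool down to {pool_size} resources")
    if secret_expires_at - now < SECRET_EXPIRY_WARNING:
        alert(f"{region}: service secret expires at {secret_expires_at.isoformat()}")
```

Checking both signals means an expiring credential is caught directly, rather than being noticed only once the pools it feeds have already drained.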

April 14 20:35 UTC (lasting 4 hours and 53 minutes)

We are still investigating the contributing factors and will provide a more detailed update in the May Availability Report, which will be published the first Wednesday of June. We will also share more about our efforts to minimize the impact of future incidents.

April 25 8:59 UTC (lasting 5 hours and 8 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Packages Registry databases, starting approximately one hour before any customer impact occurred. The threshold for this alert was relatively low, and it was not a paging alert, so we did not immediately investigate. CPU utilization continued to rise on the database, causing the Packages Registry to start responding to requests with internal server errors and eventually causing customer impact. This increased activity was due to a high volume of the “Create Manifest” command used in an unexpected manner.

The throttling criteria configured at the database level weren’t sufficient to limit this command, and this caused an outage for anyone using the GitHub Packages Registry. Users were unable to push or pull packages and could not access the packages UI or the repository landing page.

After investigating, we determined there was a performance bug related to the high volume of “Create Manifest” commands. In order to limit impact and restore normal operation, we blocked the activity causing this problem. We are actively following up on this issue by improving the rate limiting in packages and fixing the performance problem that was uncovered. We’ve also modified database alerting thresholds and severity so we get alerted to unexpected issues more quickly (rather than after customer impact).

During this incident, we also discovered that the repository home page has a hard dependency on the packages infrastructure. When the package registry is down, the home pages for repositories that list packages also fail to load. We decoupled the package listing from the repository home page, but that required manual intervention during the outage. We are working on a fix that loosely binds the packages listing, so if it fails, it does not take down the repository home pages for repositories that list packages.
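Conceptually, loose binding means the home page treats the packages listing as optional: if the call fails, that section degrades to a placeholder instead of failing the whole page. The sketch below is hypothetical; the fetch function and page model are not GitHub’s actual view code.

```python
import logging

# Hypothetical sketch of loose coupling; the fetch function and page model are
# illustrative, not GitHub's actual repository view code.
def render_repository_home(repo: str, fetch_packages):
    page = {"name": repo, "packages": []}
    try:
        page["packages"] = fetch_packages(repo)
    except Exception:
        # A packages outage degrades one section instead of failing the page.
        logging.warning("package listing unavailable for %s", repo)
        page["packages_unavailable"] = True
    return page
```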

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: March 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-04-06-github-availability-report-march-2022/

In March, we experienced a number of incidents that resulted in significant impact and degraded state of availability to some core GitHub services. This blog post includes a detailed follow-up on a series of incidents that occurred due to degraded database stability, and a distinct incident impacting the Actions service.

Database Stability

Last month, we experienced a number of recurring incidents that impacted the availability of our services. We want to acknowledge the impact this had on our customers, and take this opportunity during our monthly report to provide additional details as a result of further investigations and share what we have learned.

Background

The common theme of these issues was resource contention in our mysql1 cluster, which impacted the performance of a large number of our services and features during periods of peak load.

Each of these incidents resulted in a degraded state of availability for write operations on our primary services (including Git, issues, and pull requests). While some read operations were not impacted, any user who performed a write operation that involved our mysql1 cluster was affected, as the database could not handle the load.

After the other services recovered, GitHub Actions queues were saturated. We re-enabled the queues gradually so they could catch up in real time, and as a result our status page noted multi-hour outages. When Actions runs are delayed, CI completion and a host of other functions can also be impacted.

What we learned

These incidents were characterized by a burst in load during peak hours of GitHub traffic. During these bursts, our mysql1 cluster was not able to handle the load generated by traffic on the system, and we were forced to fail over and take other mitigations, as mentioned in the previous post.

Some of these incidents were related to our efforts to improve visibility on the database, but all of them were related to the low amount of headroom we had on our primary database and thus its susceptibility to a few poorly performing queries.

Optimizing for stability

Because of this, even after we mitigated the initial causes of downtime due to poor query performance, we were still running with low headroom and decided to take a proactive approach to managing load by intentionally slowing down services during peak hours. Furthermore, we took a calculated approach to increase capacity on the database by further optimizing queries.

Rather than risk another site outage, we established lower performance alerting thresholds on the database and proactively throttled webhooks and Actions services (the two largest drivers of automated load on the system) as we approached unsafe margins of error on March 14 14:43 UTC. We understood the potential impact to our customers, but decided it would be safer to proactively limit load on the system rather than risk another outage on multiple services.

In the meantime, we implemented a series of optimizations between March 14 and March 28 that drove queries per second on this database down by over 50% and reduced our transaction volume by 70% at peak load times. Through these performance optimizations, we became more confident in our headroom, but given ongoing investigations, we did not want to chance any unwarranted impacts.

Minimizing impact to our users

After the incidents mentioned above, we took steps to make sure we would be in a position, if necessary, to shut down any services driving high peak load. This meant taking maintenance windows for three services starting on March 24. We proactively paused migrations and team synchronization during peak load due to their potential impact.

We also took maintenance windows for GitHub Actions even though we did not actually throttle any Actions runs and no customers were impacted during these windows. We did this to proactively notify customers of possible disruption. While throttling never proved necessary, we knew we would need to throttle GitHub Actions if we saw any significant database degradation during these windows. While this may have caused uncertainty for some customers, we wanted to prepare them for any potential impact.

Next steps

Immediate changes

In addition to the improvements mentioned above, we have significantly reduced our database performance alerting thresholds so that we are not “running hot” and will be well positioned to take action before customers are impacted.

We have also accelerated work that was already in progress to continue to shard this particular cluster and apply the learnings from this incident to other clusters that already exist outside of mysql1.

Additional technical and organizational initiatives

Due to the nature of this incident, we have also dedicated a team of engineers to study our internal processes and procedures, observability, and change release processes. While we’re still actively revisiting this incident, we feel confident we have mitigated the initial issues and we have the correct alerting and processes in place to ensure this problem is not likely to occur again.

We understand that the Actions service is critical to many of our customers. With new and ongoing investments across architecture and processes, we’ll continue to bring focus specifically to Actions reliability, including more graceful degradations when other GitHub services are experiencing issues, as well as faster recovery times.

March 29 10:26 UTC (lasting 57 minutes)

During an operation to move GitHub Actions and checks data to its own dedicated, sharded database cluster, a misconfiguration on the new database cluster caused the application to encounter errors. Once we reverted our changes, we were able to recover. This incident resulted in the failure or delay of some queued jobs for a period of time; jobs that were queued during the incident ran successfully once the issue was resolved.

The Actions and checks data resides in a multi-tenant database cluster. As part of our efforts to improve reliability and scale, we have been working on functionally partitioning the Actions data to its own sharded database cluster. The switchover to the new cluster involves gradually switching over reads and then switching over writes. Immediately after switching the write traffic, we noticed Actions SLOs were breached and initiated a revert to the old database. After the revert, we saw an immediate improvement in availability.

Upon further investigation, we discovered that update and delete queries were processed correctly on the new cluster, but insert queries were failing because of missing permissions on the new cluster. All changes processed on the new cluster were replicated back to the old cluster before the switch back, ensuring data integrity.

We have paused further migration attempts until we fully investigate and apply our learnings. Furthermore, due to the risk associated with these operations, we will no longer attempt them during peak traffic hours, which occur between 12:00 and 21:00 UTC. From a technical perspective, we’re looking to scrutinize and improve our operational workflows for these database operations. Additionally, we are going to perform an audit of our configurations and topology across our environment to ensure we have properly covered them in our testing strategy. As part of these efforts, we uncovered a gap: we need to extend our pre-migration checklist with a step to verify permissions more thoroughly.
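A pre-migration permission check of that kind might look like the following sketch, which verifies that the application account holds INSERT along with the other write privileges on the new cluster before any write traffic is switched. It assumes a MySQL DB-API cursor; the grantee and schema names are placeholders, and it only examines schema-level grants, not the actual checklist tooling.

```python
# Illustrative pre-cutover check, assuming a MySQL DB-API cursor; the grantee
# and schema names are placeholders, not the actual cluster configuration.
REQUIRED_PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


def verify_write_privileges(cursor, grantee: str, schema: str) -> None:
    """grantee uses MySQL's quoted form, e.g. "'actions_app'@'%'"."""
    cursor.execute(
        "SELECT PRIVILEGE_TYPE FROM information_schema.SCHEMA_PRIVILEGES "
        "WHERE GRANTEE = %s AND TABLE_SCHEMA = %s",
        (grantee, schema),
    )
    granted = {row[0] for row in cursor.fetchall()}
    missing = REQUIRED_PRIVILEGES - granted
    if missing:
        # Abort the switchover before any write traffic reaches the new cluster.
        raise RuntimeError(
            f"new cluster is missing privileges for {grantee}: {sorted(missing)}"
        )
```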

In summary

Every month we share an update on GitHub’s availability, including a description of any incidents that may have occurred and an update on how we are evolving our engineering systems and practices in response. Our hope is that by increasing our transparency and sharing what we’ve learned, everyone can gain from our experiences. At GitHub, we take the trust you place in us very seriously, and we hope this is a way for you to help hold us accountable for continuously improving our operational excellence, as well as our product functionality.

To learn more about our efforts to make GitHub more resilient every day, check out the GitHub engineering blog.

GitHub Availability Report: February 2022

Post Syndicated from Scott Sanders original https://github.blog/2022-03-02-github-availability-report-february-2022/

In February, we experienced one incident resulting in significant impact and degraded state of availability for GitHub.com, issues, pull requests, GitHub Actions, and GitHub Codespaces services.

February 2 19:05 UTC (lasting 13 minutes)

As mentioned in our January report, our service monitors detected a high rate of errors affecting a number of GitHub services.

Upon further investigation of this incident, we found that a routine deployment failed to generate the complete set of integrity hashes needed for Subresource Integrity. The resulting output was missing values needed to securely serve JavaScript assets on GitHub.com.

As a safety protocol, our default behavior is to error rather than render script tags without integrity attributes if a hash cannot be found in the integrities file. In this case, that meant github.com started serving 500 error pages to all web users. As soon as the errors were detected, we rolled back to the previous deployment and resolved the incident. Throughout the incident, only browser-based access to GitHub.com was impacted, with API and Git access remaining healthy.

Since this incident, we have added additional checks to our build process to ensure that the integrities are accurate and complete. We’ve also added checks for our main JavaScript resources to the health check for our deployment containers, and adjusted the build pipeline to ensure the integrity generation process is more robust and will not fail in a similar way in the future.
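For context, a Subresource Integrity value is a base64-encoded hash of the asset, and a build-time completeness check can fail the deployment if any shipped script lacks a matching entry. The sketch below makes assumptions about the file layout and the format of the integrities file; it is not GitHub’s actual pipeline.

```python
import base64
import hashlib
import json
import pathlib
import sys

# Sketch of a build-time completeness check; the directory layout and the
# integrities file format (filename -> "sha384-..." string) are assumptions.
def sri_hash(path: pathlib.Path) -> str:
    digest = hashlib.sha384(path.read_bytes()).digest()
    return "sha384-" + base64.b64encode(digest).decode()


def check_integrities(asset_dir: str, integrities_file: str) -> None:
    integrities = json.loads(pathlib.Path(integrities_file).read_text())
    bad = []
    for asset in pathlib.Path(asset_dir).glob("**/*.js"):
        expected = integrities.get(asset.name)
        if expected != sri_hash(asset):
            bad.append(asset.name)  # missing entry or stale hash
    if bad:
        # Fail the build instead of shipping script tags without integrities.
        sys.exit(f"incomplete or stale SRI entries: {bad}")
```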

In summary

Every month, we share an update on GitHub’s availability, including a description of any incidents that may have occurred and an update on how we are evolving our engineering systems and practices in response. Whether in these reports or via our engineering blog, we look forward to keeping you updated on the progress and investments we’re making to ensure the reliability of our services.

You can also follow our status page for the latest on our availability.

GitHub Availability Report: January 2022

Post Syndicated from Scott Sanders original https://github.blog/2022-02-02-github-availability-report-january-2022/

In January, we experienced no incidents resulting in service downtime to our core services. However, we do want to acknowledge an incident in February that we are continuing to investigate.

February 2 19:12 UTC (lasting 26 minutes)

Our service monitors detected a high rate of errors for issues, pull requests, GitHub Codespaces, and GitHub Actions services. We have mitigated the incident and are confident it has been fully resolved.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

Please follow our status page for real-time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: November 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-12-01-github-availability-report-november-2021/

In November, we experienced one incident resulting in significant impact and degraded state of availability for core GitHub services, including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.

November 27 20:40 UTC (lasting 2 hours and 50 minutes)

We encountered a novel failure mode when processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step in a migration is to perform a rename to move the updated table into the correct place. During the final step of this migration, a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas for production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state, causing increased load on the healthy read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas to handle production requests, which impacted the availability of core GitHub services.

During the incident mitigation, in an effort to increase capacity, we promoted all available internal replicas that were in a healthy state into the production path; however, the shift was not sufficient for full recovery. We also observed that read replicas serving production traffic would temporarily recover from their crash-recovery state only to crash again due to load. Based on this crash-recovery loop, we chose to prioritize data integrity over site availability by proactively removing production traffic from broken replicas until they were able to successfully process the table rename. Once the replicas recovered, we were able to move them back into production and restore enough capacity to return to normal operations.

Throughout the incident, write operations remained healthy and we have verified there was no data corruption.

To address this class of failure and reduce time to recover in the future, we continue to prioritize our functional partitioning efforts. Partitioning the cluster adds resiliency, since migrations can then be run in canary mode on a single shard, reducing the potential impact of this failure mode. Additionally, we are actively updating internal procedures to increase the amount by which each cluster is over-provisioned.

As next steps, we’re continuing to investigate the specific failure scenario and have paused schema migrations until we know more about safeguarding against this issue. As we continue to test our migration tooling, we are identifying opportunities to improve it in such scenarios.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: October 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-11-04-github-availability-report-october-2021/

In October, we experienced one incident resulting in significant impact and degraded state of availability for the GitHub Codespaces service.

October 8 17:16 UTC (lasting 1 hour and 36 minutes)

A core Codespaces API response was inadvertently restructured as part of our Codespaces public API launch, impacting existing API clients dependent on a stable schema.

For the duration of the incident, new Codespaces could not be initiated from the Visual Studio Code Desktop client. Connections to the web editor and pre-existing desktop sessions were not blocked, but they were degraded, with the extension displaying an error message and omitting Codespaces metadata from the Remote Explorer view.

The incident was mitigated once we rolled back the regression, at which point all clients could connect again, including to new Codespaces created during the incident. Because our monitoring systems did not initially detect the impact of the regression, a subsequent and unrelated deployment was initiated, delaying our ability to revert the change. To ensure similar breaking changes are not introduced in the future, we are investing in tooling to support more rigorous end-to-end testing of the extension’s use of our API. Additionally, we are expanding our monitoring to better align with the user experience across the relevant internal service boundaries.
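One form that end-to-end coverage can take is a contract test that pins the response fields existing clients depend on, so an inadvertent restructuring fails in CI rather than in the extension. The endpoint, test client, and field names below are illustrative assumptions, not the actual Codespaces API schema.

```python
# Contract-test sketch (pytest style); the endpoint, test client fixture, and
# field names are illustrative assumptions, not the actual Codespaces API.
REQUIRED_FIELDS = {"name", "state", "web_url", "machine"}


def test_codespace_response_keeps_required_fields(client):
    response = client.get("/user/codespaces")  # hypothetical API test client
    body = response.json()
    for codespace in body.get("codespaces", []):
        missing = REQUIRED_FIELDS - codespace.keys()
        # Any removed or renamed field is a breaking change for existing clients.
        assert not missing, f"breaking change: missing fields {missing}"
```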

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: August 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-09-01-github-availability-report-august-2021/

In August, we experienced two distinct incidents resulting in significant impact and degraded state of availability for Git operations, API requests, webhooks, issues, pull requests, GitHub Pages, GitHub Packages, and GitHub Actions services.

August 10 15:16 UTC (lasting one hour and 17 minutes)

This incident was caused when one of our MySQL database primaries entered a degraded state, affecting a number of internal services. This caused an impact to GitHub.com services requiring write access to this particular database cluster, which resulted in some users being unable to perform operations.

Investigation identified an edge case in one of our most active applications that caused the generation of a poorly performing query capable of impacting overall database capacity. Combined with application retry and queueing logic, this placed the MySQL primary into a state from which the cluster was unable to automatically recover.

We have been able to address this query, as well as some of the application retry logic, to reduce the chance of recurrence in the future.

One of the novel elements of this incident was the breadth of impact across multiple services. This led to a discussion about the overall service status as we were reporting it during the incident, so we’d like to take this opportunity to discuss the approach we took at the time, as well as how we look to increase our learning potential after the incident.

When we first introduced the monthly availability report, we aimed to provide post-incident reviews for major incidents that impact service availability, in addition to background on how we’re continuing to evolve the process. As part of our standard post-incident analysis process, we are using this incident as a valuable source of data to evaluate the responsiveness of our internal metrics and alerting. These systems guide our responders during incidents on both when to status and what degree of impact to status for. As a result, we’re continuing to tune and optimize these activities to ensure we are able to status, both quickly and accurately, so that we continue to earn the trust our users place in us every day.

August 10 19:57 UTC (lasting 3 hours and 6 minutes)

Following ongoing maintenance of the Actions service, our service monitors detected a high error rate on workflow runs for new and in-progress jobs, which affected the Actions service. This incident resulted in the failure of all queued jobs for a period of time. This was a new incident, unrelated to the earlier issue on August 10. We immediately reverted recent Actions deployments and started to investigate the issue.

The incident was caused by work to set up a new Actions Premium Runner microservice in the Actions service. The impacting portion of this work involved alterations to the service discovery process within the Actions microservices architecture. A bad service record pushed to this system resulted in many of the microservices being unable to make service-to-service calls.

Ultimately, the mitigation for this incident was to remove the bad record from the service discovery infrastructure. After investigating whether this mitigation would address the incident, we were able to confidently confirm that the bad record was the root cause of the issue and that removing it would restore the Actions service with no unintended side effects.

We have prioritized several changes as a result of this incident, including fixing this part of the Actions microservice discovery process to properly handle potential bad records. We’ve also added a broader scope of visibility into what’s changed recently across all of the Actions microservices, so we can quickly focus investigations in the correct place.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub Engineering blog.

GitHub Availability Report: June 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-07-07-github-availability-report-june-2021/

In June, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real-time updates and watch our blog for next month’s availability report. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: May 2021

Post Syndicated from Scott Sanders original https://github.blog/2021-06-02-github-availability-report-may-2021/

Introduction

In May, we experienced two incidents resulting in significant impact and degraded state of availability for API requests, GitHub Pages, GitHub Actions and the GitHub Packages service, specifically the GitHub Packages Container registry service.

May 8 06:46 UTC (lasting 46 minutes)

This incident was caused by failures in an underlying MySQL database, which caused some operations to time out for the GitHub Container registry service. During this incident, some customers viewing packages in the UI or interacting with the registry through “docker push” and “docker pull” may have experienced failures as the engineering team investigated the incident. After performing a failover to one of our database replicas, the affected systems were properly restored.

Our internal engineering team is now prioritizing work that will help ensure reduced impact to customers should such underlying outages happen again. This work includes creating internal documentation, dashboards, and enhanced alerts to quickly triage the cause of operation failures. We will also continue to actively maintain and increase replicas in different regions and availability zones that serve as a line of defense against unexpected region outages.

May 16 07:17 UTC (lasting 9 hours and 48 minutes)

This incident was caused when a foreign key for scoped tokens exceeded max INT32, which resulted in high failure rates for GitHub Actions and GitHub Pages. It also prevented some access to operations against the GitHub API and low-level git commands, such as “push” and “pull”, using scoped tokens. We mitigated this with a long-running schema migration to change the foreign key to INT64.

Once the foreign key migration was successful, the internal engineering teams then worked to slowly remove token records stored in our cache layer that were considered invalid. After these cached records were removed, newly created API tokens were able to generate new records and API calls resumed working as expected.

Alerting and linting are already in place to help prevent integer overflows in the database. Unfortunately, these mechanisms were not sufficient in this case due to it being a foreign key that predated our linting. In response, we are manually auditing all our INT32 columns and investigating further improvements to our automation to help prevent these types of issues moving forward.
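An audit of this kind can be driven from the database’s own metadata: enumerate 32-bit integer columns and compare each column’s current maximum against the INT32 ceiling. The sketch below assumes a MySQL DB-API cursor, and the warning threshold is an arbitrary illustrative choice, not the actual audit tooling.

```python
# Audit sketch, assuming a MySQL DB-API cursor; the schema filter and warning
# ratio are illustrative assumptions, not the actual audit process.
INT32_MAX = 2**31 - 1


def list_int32_columns(cursor):
    """Enumerate 32-bit integer columns so their current values can be audited."""
    cursor.execute(
        "SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME "
        "FROM information_schema.COLUMNS "
        "WHERE DATA_TYPE = 'int' "
        "AND TABLE_SCHEMA NOT IN ('mysql', 'information_schema', "
        "'performance_schema', 'sys')"
    )
    return cursor.fetchall()


def near_overflow(cursor, schema, table, column, ratio=0.7):
    """Flag a column whose current maximum is approaching INT32_MAX.

    Identifiers come from information_schema above, so interpolating them into
    the query is safe in this sketch.
    """
    cursor.execute(f"SELECT MAX(`{column}`) FROM `{schema}`.`{table}`")
    (current,) = cursor.fetchone()
    return current is not None and current >= ratio * INT32_MAX
```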

Given the nature of this overflow, only a single GitHub Action used on a single repository received unauthorized access grants for a short period of time. We revoked these grants and confirmed that no unauthorized access was gained through the use of this Action in this repository.

Our internal engineering teams are actively working on reducing the impact and likelihood of this class of issue happening in the future. This work includes tooling to prevent database inconsistencies and improved alerting to allow faster remediation.

In summary

From our open source release of the GitHub Artifact Exporter to our adoption of OpenTelemetry SDKs, you can learn more about what we are working on to improve our internal development tooling and infrastructure in the GitHub Engineering Blog.

GitHub Availability Report: April 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-05-05-github-availability-report-april-2021/

In April, we experienced two incidents resulting in significant impact and degraded state of availability for API requests and the GitHub Packages service, specifically the GitHub Packages Container registry service.

April 1 21:30 UTC (lasting one hour and 34 minutes)

This incident was caused by failures in our DNS resolution, resulting in a degraded state of availability for the GitHub Packages Container registry service. During this incident, some of our internal services that support the Container registry experienced intermittent failures when trying to connect to dependent services. The inability to resolve requests to these services resulted in users being unable to push new container images to the Container registry as well as pull existing images. The Container registry is currently in a public beta, and only beta users were impacted during this incident. The broader GitHub Packages service remained unaffected.

As a next step, we are looking at increasing the cache times of our DNS resolutions to decrease the impact of intermittent DNS resolution failures in the future.

April 22 17:16 UTC (lasting 53 minutes)

Our service monitors detected an elevated error rate on API requests, which resulted in a degraded state of availability for repository creation. Upon further investigation, we identified that the issue was caused by a bug in a recent data migration. While migrating our secret scanning tables to their own cluster, a bug broke the application’s ability to write to the secret scanning database. The incident revealed a previously unknown dependency of repository creation on secret scanning, which is called for every repository created. Due to this dependency, repository creation was blocked until we were able to roll back the data migration.

As next steps, we are actively working with our vendor to update the data migration tool and have amended our migration process to include revised steps for remediation, in case similar incidents occur. Furthermore, our application code has been updated to remove the dependency on secret scanning for the creation of repositories.

In summary

From scaling the GitHub API to improving large monorepo performance, we will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.