Tag Archives: Enterprise

3 ways to meet compliance needs without slowing down agility

2023-02-24 Mark Paulsen

Post Syndicated from Mark Paulsen original https://github.blog/2023-02-24-3-ways-to-meet-compliance-needs-without-slowing-down-agility/

In the previous blog, Setting the foundations for compliance, we set the groundwork for developer-enabled compliance that will keep your teams happy, in the flow, secure, compliant, and auditable.

Today, we’ll walk through three practical ways that you can start meeting your compliance requirements without having to revolutionize or transform the culture in your company—all while enabling your developers to be more productive and happy.

1. Do the basics well and often

The first way to start meeting your compliance requirements is often overlooked because it’s so simple. No matter what industry you are in, whether it’s finance, automotive, government, or tech, there are some basic quick compliance wins that may be part of your existing developer workflows and culture.

Code review

A key part of writing clean code is code review. But having a repeatable and traceable code review process is also a foundational component of your compliance program. This ensures risks within software delivery are found and mitigated early—and are also auditable for future reviews.

GitHub makes code review easy, since pull requests are part of the existing workflow that millions of developers use every day. GitHub also makes it easy for compliance testers and auditors to review pull requests with access to immutable audit logs in a central location.

Separation of duties

In some enterprises, separation of duties has been defined as a person having too much manual control over transactions or processes. This can lead to apprehension when modern cloud native practices are introduced.

Thankfully, there is guidance from the Industry that supports a more modern approach to separation of duties. The PCI-DSS requirements guide avoids the term person and provides a more cloud native friendly approach by focusing on functions and accounts:

“The purpose of this requirement is to separate the development and test functions from the production functions. For example, a developer can use an administrator-level account with elevated privileges in the development environment and have a separate account with user-level access to the production environment.”

This approach aligns well with the 12 factor application methodology for cloud native. The Build, release, run factor explanation states, “Each release must enforce a strict separation across the build, release, and run stages. Each should be tagged with a unique ID and support the ability to roll back.” Teams can fully automate their delivery to production, without having to worry about a person manually exceeding too much control, as long as there is separation of functions and traceability back to unique IDs. With the added assurance that they are aligned to industry best practices and requirements, such as PCI-DSS.

To enable separation of duties, you have to have a clear identity and access management strategy. Thankfully, we don’t have to reinvent the wheel. GitHub Enterprise has several options to help you manage access to your overall development environment:

You can configure SAML single sign-on for your enterprise. This is an additional check, allowing you to confirm the authenticity of your users against your own identity provider (while still using your own GitHub account).
You can then synchronize team memberships with groups in your identity provider. As a result, group membership changes in your identity provider update the team membership (and therefore associated access) in GitHub.
Alternatively, you could adopt GitHub Enterprise Managed Users (EMUs). This is a flavor of GitHub Enterprise where you can only log in with an account that is centrally managed through your identity provider. The user does not have to log in to GitHub with a personal account and use single sign on to access company resources. (For more information on this, check out this blog post on exploring EMUs and the benefits they can bring.)

AI-powered compliance

In our last blog we briefly covered AI-enabled compliance and some of the existing opportunities for security, manufacturing, and banks. There are also several other opportunities on the horizon that could further optimize the basics of compliance.

It is entirely possible that a generative AI tool could soon be leveraged to help ensure that separation of duties is enforced in a declarative DevOps workflow by creating unit tests on its own. Because of the non-deterministic nature of generative AI, each time it runs it may have a different result and the unit tests may include risk scenarios that nobody has thought of yet. This can add an amazing level of true risk-based compliance.

One of the major benefits of addressing compliance often in your delivery is an increased level of trust. But quantifying trust can be extremely difficult—especially within a regulated industry that wants to leverage deep learning solutions. There is work being driven by AI results to help provide trust quantification for these types of solutions, which will not only enable continuous compliance but will also help enterprises implement new technologies that can increase business value.

Dependency management

Companies are increasing their reliance on open source software in their supply chain. As a result, optimized, repeatable, and audible controls for dependency management are becoming a cornerstone of compliance programs across industries.

From a GitHub perspective, Dependabot can provide confidence that your supply chain is secure. When Dependabot is configured, developers are alerted as soon as a security threat is found and gives them the ability to take action in their normal workflows.

By leveraging Dependabot, you will receive an immutable and auditable alert when a repository uses a software dependency with a known vulnerability. Pull requests can be triggered by either your developers or security teams, which give you another set of auditable artifacts for future review.

Approvals

While most organizations have approval processes, they can be slow and occur too late in the process. Google explains that peer reviews can help to streamline the change approval process, based on results from the DORA 2019 state of DevOps report.

There could be many control points in your delivery pipeline that may require approval, such as sign-off of requirements, testing results, security reviews, or production operational readiness confirmations before a release.

Depending on your internal structure and alignment, you may require teams to provide sign-off at different stages throughout the process. In your build and test stage, you can use manual approvals in a pull request to gather the needed reviews.

Once your builds and tests are complete, it’s time to release your code into your infrastructure. Talking about deployment patterns and strategies is beyond the scope of this post. However, you may require approvals as part of the deployment process.

To obtain these approvals, you should use environments for your deployments. Environments are used to describe a deployment target (such as staging, testing, and production). By using environments, you can require a manual approval to take place before GitHub Actions begins the deployment job.

In both instances, remember that there is a tradeoff when deciding the number of approvals required. Setting the number of required reviews too high means that you may impact your pace of delivery, as manual intervention is required. Where possible, consider automating checks to reduce the manual overhead, thereby optimizing your overall agility.

2. Have a common understanding of concepts and terms

Again, this may be overlooked since it sounds so obvious. But there are probably many concepts and terms that developers, DevOps and cloud native practitioners use on a daily basis that may be totally incomprehensible to compliance testers and auditors.

Back in 2015, the author of The Phoenix Project, Gene Kim, and several other authors, created the DevOps Audit Defense Toolkit. As we mentioned in our first blog, the goal of this document was to “educate IT management and practitioners on the audit process so they can demonstrate to auditors they understand the business risks and are properly mitigating those risks.” But where do you start?

Below is a basic cheat-sheet of terms for GitHub related control objectives and a mapping to the world of compliance, regulations, risk, and audit. This could be a good place to start building a common understanding between developers and your compliance and audit friends.

Objective

Control

Financial Reporting

Industry Frameworks

The Code Review Control ensures that security requirements have been addressed and that the code is understandable, maintainable, and properly formatted.

Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub.Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.

SOX: Change Management

COSO: Control Activities—Conduct application change management.

NIST Cyber: DE.CM-4 —Malicious code is detected.

PCI-DSS: 6.3.2: Review all custom code for vulnerabilities, manually or via automation.

SLSA: Two-person review is an industry best practice for catching mistakes and deterring bad behavior.

Controls and processes should be in place to scan code repositories for passwords and security tokens. Prevention measures should be enforced to ensure secrets do not reach production.

Secret scanning will scan your entire Git history on all branches present in your GitHub repository for secrets. GitHub push protection will check for high-confidence secrets as developers push code and block the push if a secret is identified.

SOX: IT Security

COSO: Control Activities—Improve security

NIST Cyber: PR.DS-5: Protections against data leaks are implemented.

PCI-DSS: 6.5.3: Protect against all insufficiently secure cryptographic key storage.

To help bring this together, there is a great example of the benefits of a common understanding between developers and compliance testers and auditors in the latest Forrester report on the economic impact of GitHub Enterprise Cloud and Advanced Security. The report highlights one of the benefits of an automated documentation process that both helps developers and auditors:

“These new, standardized documentation structures made it easier for auditors to find and compile the documentation necessary to be compliant. This helped the interviewees’ organizations save time preparing for industry compliance and security audits.”

3. Helping developers understand where they fit into the three lines of defense

The three lines of defense is a model used for risk management and governance. It helps clarify the roles and responsibilities of teams involved in addressing risk.

Think of the engineering team and engineering leads as the first line of defense, as they continue shipping software to their end-users. Making sure this team has the appropriate level of skills to identify risk within their solution is crucial.

The second line of defense is typically achieved at scale through frameworks, policies, and tools at an organizational level. This acts as oversight for the first line of defense on how well they are implementing risk mitigation techniques and consistently measuring and defining risk across the organization.

Internal audit is the third line of defense. Just like a red team and blue team compliment each other, you can think of internal audit as the final puzzle piece in the three lines of defense. They evaluate the effectiveness of risk management and governance, working with senior management and external regulators to raise awareness of the controls being executed.

Let’s use an analogy. A hockey team is not just structured to score goals but also prevent goals being scored against them.

Forwards: they are there to score and assist on goals. But they can be called on to work with the other positions in defense. Think of them as the first line of defense. Their job is to create solutions which provide business value, but are also secure.
Defenders: their main role is clearly preventing goals. But they partner with the forwards on the current priority (offense or defense). They are like the second line of defense, providing risk oversight and collaborating with the engineering teams.
Goaltender: the last line of defense. They can be thought of as the equivalent of an internal audit. They are independent of the forwards and defenders with different responsibilities, tools, and perspective.

Hockey teams that have very strong forwards but weak defenders and goalkeepers are rarely successful. Teams with a strong defense but weak offense are rarely successful, either. It takes all three aspects working in harmony to be successful.

This applies in business, too. If your solutions meet your customers’ requirements but are insecure, your business will fail. If your solutions are secure but aren’t user friendly or providing value, it will result in failure. Success is achieved by finding the right balance of value and risk management.

Developers can see that they are there to score goals for their team; creating the software that runs our world. But they need to support the defensive capabilities of the team, ensuring their software is secure. Their success is dependent on the success of the wider team, without having to take on additional responsibilities.

Next steps

We’ve taken a whistle-stop tour on how you can bring compliance into your development flow. From branch protection rules and pull requests to CODEOWNERS and environment approvals, there are several GitHub capabilities that can help you naturally focus on compliance.

This is only one step in solving the problem. A common language between compliance and DevOps practitioners is crucial in demonstrating the implemented measures. With that common language, it is clear that everyone must think about compliance; engineering teams are a part of the three lines of defense.

Next up in the series, we’ll be talking about how to ensure compliance in developer workflows.

Ready to increase developer velocity and collaboration while remaining secure and compliant? See how GitHub Enterprise can help.

GitHub Availability Report: January 2023

2023-02-01 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2023-02-01-github-availability-report-january-2023/

In January, we experienced two incidents. One that resulted in degraded performance for GitHub Packages and GitHub Pages, and another that impacted git users.

January 30 21:48 UTC (lasting 35 minutes)

Our service monitors detected degraded performance for GitHub Packages and GitHub Pages. Most requests to the container registry were failing and some GitHub Pages builds were also impacted. We determined this was caused by a backend change and mitigated by reverting that change.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

January 30 18:35 UTC (lasting 7 hours)

We upgraded our production Git binary with a recent version from upstream. The updates included a change to use an internal implementation of gzip when generating archives. This resulted in subtle changes to the contents of the “Download Source” links served by GitHub, leading to checksum mismatches. No content was changed.

After becoming aware of the impact to many communities, we rolled back the compression change to restore the previous behavior.

Similar to the above, we are still investigating the contributing factors of this incident, and will provide a more thorough update in next month’s report.

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: December 2022

2023-01-04 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2023-01-04-github-availability-report-december-2022/

In December, we did not experience any incidents that resulted in degraded performance across GitHub services. This report sheds light into an incident that impacted GitHub Packages and GitHub Pages in November.

November 25 16:34 UTC (lasting 1 hour and 56 minutes)

On November 25, 2022 at 14:39 UTC, our alerting systems detected an incident that impacted customers using GitHub Packages and GitHub Pages. The GitHub Packages team initially statused GitHub Packages to yellow, and after assessing impact, it statused to red at 15:06 UTC.

During this incident, customers experienced unavailability of packages for container, npm, and NuGet registries. We were able to serve requests for RubyGems and Maven registries. GitHub Packages’ unavailability also impacted GitHub Pages, as they were not able to pull packages, which resulted in CI build failures. Repository landing pages also saw timeouts while fetching packages information.

GitHub Packages uses a third-party database to store data for the service and the provider was experiencing an outage, which impacted GitHub Packages performance. The first responder connected with the provider’s support team to learn more about the region specific outage. The provider then mitigated the issue before the first responder could do the failover to another region. With the mitigation in place, GitHub Packages started to recover along with GitHub Pages and the repository landing pages.

As follow up action items, the team is exploring options to make GitHub Pages and repository landing pages more resilient to GitHub Packages outages. We are also investigating options where failovers can be performed quickly and automatically in case of regional outages.

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: November 2022

2022-12-07 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-12-07-github-availability-report-november-2022/

In November, we experienced two incidents that resulted in degraded performance across GitHub services. This report also sheds light into an incident that impacted GitHub Codespaces in October.

November 25 16:34 UTC (lasting 1 hour and 56 minutes)

Our alerting systems detected an incident that impacted customers using GitHub Packages and Pages. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the January Availability Report, which we will publish the first Wednesday of January.

October 26 00:47 UTC (lasting 3 hours and 47 minutes)

On October 22, 2022 at 00:47 UTC, our alerting systems detected a decrease in success rates for creates and resumes of Codespaces in the East US region. We initially statused yellow, as the incident affected only the East US region. As the incident persisted for several hours, we provided guidance to customers in the affected region to manually change their location to a nearby healthy region at 01:55 UTC, and statused red at 2:34 UTC due to the prolonged outage.

During this incident, customers were unable to create or resume Codespaces in the East US region. Customers could manually select an alternate region in which to create Codespaces, but could not do so for resumes.

Codespaces uses a third-party database to store data for the service and the provider was experiencing an outage, which impacted Codespaces performance. We were unable to immediately communicate with our East US database because our service does not currently have any replication of its regional data. Our services in the East US region returned to healthy status as soon as Codespaces engineers were able to engage with the third party to help mitigate the outage.

We identified several ways to improve our database resilience to regional outages while working with the third party during this incident and in follow up internal discussions. We are implementing regional replication and failover so that we can mitigate this type of incident more quickly in the future.

November 3 16:10 UTC (lasting 1 hour and 2 minutes)

On November 3, 2022 at 16:10 UTC, our alerting systems detected an increase in the time it took GitHub Actions workflow runs to start. We initially statused GitHub Actions to red, and after assessing impact we statused to yellow at 16:11 UTC.

During this incident, customers experienced high latency in receiving webhook deliveries, starting GitHub Actions workflow runs, and receiving status updates for in-progress runs. They also experienced an increase in error responses from repositories, pull requests, Codespaces, and the GitHub API. At its peak, a majority of repositories attempting to run a GitHub Actions workflow experienced delays longer than five minutes.

GitHub Actions listens to webhooks to trigger workflow runs, and while investigating we found that the run start delays were caused by a backup in the webhooks queue. At 16:29 UTC, we scaled out and accelerated processing of the webhooks queue as a mitigation. By 17:12 UTC, the webhooks queue was fully drained and we statused back to green.

We found that the webhook delays were caused by an inefficient database query for checking repository security advisory access, which was triggered by a high volume of poorly optimized API calls. This caused a backup in background jobs running across GitHub, which is why multiple services were impacted in addition to webhooks and GitHub Actions.

Following our investigation, we fixed the inefficient query for the repository security advisory access. We also reviewed the rate limits for this particular endpoint (as well as limits in this area) to ensure they were in line with our performance expectations. Finally, we increased the default throttling of the webhooks queue to avoid potential backups in the future. As a longer-term improvement to our resiliency, we are investigating options to reduce the potential for other background jobs to impact GitHub Actions workflows. We’ll continue to run game days and conduct enhanced training for first responders to better assess impact for GitHub Actions and determine the appropriate level of statusing moving forward.

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: October 2022

2022-11-02 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-11-02-github-availability-report-october-2022/

In October, we experienced four incidents that resulted in significant impact and degraded state of availability to multiple GitHub services. This report also sheds light into an incident that impacted Codespaces in September.

October 26 00:47 UTC (lasting 3 hours and 47 minutes)

Our alerting systems detected an incident that impacted most Codespaces customers. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the November Availability Report, which we will publish the first Wednesday of December.

October 13 20:43 UTC (lasting 48 minutes)

On October 13, 2022 at 20:43 UTC, our alerting systems detected an increase in the Projects API error response rate. Due to the significant customer impact, we went to status red for Issues at 20:47 UTC. Within 10 minutes of the alert, we traced the cause to a recently-deployed change.

This change introduced a database validation that required a certain value to be present. However, it did not correctly set a default value in every scenario. This resulted in the population of null values in some cases, which produced an error when pulling certain records from the database.

We initiated a roll back of the change at 21:08 UTC. At 21:13 UTC, we began to see a steady decrease in the number of error responses from our APIs back to normal levels, changed the status of Issues to yellow at 21:24 UTC, and changed the status of Issues to green at 21:31 UTC once all metrics were healthy.

Following this incident, we have added mitigations to protect against missing values in the future, and we have improved testing around this particular area. We have also fixed our deployment dashboards, which contained some inaccurate data for pre-production errors. This will ensure that errors are more visible during the deploy process to help us prevent these issues from reaching production.

October 12 23:27 UTC (lasting 3 hours and 31 minutes)

On October 12, 2022 at 22:30 UTC, we rolled out a global configuration change for Codespaces. At 23:15 UTC, after the change had propagated to a variety of regions, we noticed new Codespace creation starting to trend downward and were alerted to issues from our monitors. At 23:27 UTC, we deemed the impact significant enough to status Codespaces yellow, and eventually red, based on continued degradation.

During the incident, it was discovered that one of the older components of the backend system did not cope well with the configuration change, causing a schema conflict. This was not properly tested prior to the rollout. Additionally, this component version does not support gradual exposure across regions—so many regions were impacted at once. Once we detected the issue and determined the configuration change was the cause, we worked to carefully roll back the large schema change. Due to the complexity of the rollback, the procedure took an extended period of time. Once the rollback was complete and metrics tracking new Codespaces creations were healthy, we changed the status of the service back to green at 02:58 UTC.

After analyzing this incident, we determined we can eliminate our dependency on this older configuration type and have repair work in progress to eliminate this type of configuration from Codespaces entirely. We have also verified that all future changes to any component will follow safe deployment practices (one test region followed by individual region rollouts) to avoid global impact in the future.

October 5 06:30 UTC (lasting 31 minutes)

On October 5, 2022 at 06:30 UTC, webhooks experienced a significant backlog of events caused by a high volume of automated user activity causing a rapid series of create and delete operations. This activity triggered a large influx of webhook events. However, many of these events caused exceptions in our webhook delivery worker because data needed to generate their webhook payloads had been deleted from the database. Attempting to retry these failed jobs tied up our worker and it was unable to process new incoming events, resulting in a severe backlog in our queues. Downstream services that rely on webhooks to receive their events were unable to receive them, which resulted in service degradation. We updated GitHub Actions to status red because the webhooks delay caused new job execution to be severely delayed.

Investigation into the source of the automated activity led us to find that there was automation creating and deleting many repositories in quick succession. As a mitigation, we disabled the automated accounts that were causing this activity in order to give us time to find a longer term solution for such activity.

Once the automated accounts were disabled, it brought the webhook deliveries back to normal and the backlog got mitigated at 07:01 UTC. We also updated our webhook delivery workers to not retry any jobs for which it was determined that the data did not exist in the database. Once the fix was put in place, the accounts were re-enabled and no further problems were encountered with our worker. We recognize that our services must be resilient to spikes in load and will make improvements based on what we’ve learned in this incident.

September 28 03:53 UTC (lasting 1 hour and 16 minutes)

On September 27, 2022 at 23:14 UTC, we performed a routine secret rotation procedure on Codespaces. On September 28, 2022 at 03:21 UTC, we received an internal report stating that port forwarding was not working on the Codespaces web client and began investigating. At 03:53 UTC. we statused yellow due to the broad user impact we were observing. Upon investigation, we found that we missed a step in the secret rotation checklist a few hours earlier, which caused some downstream components to fail to pick up the new secret. This resulted in some traffic not reaching backend services as expected. At 04:29 UTC, we ran the missed rotation step, after which we quickly saw the port forwarding feature returning to a healthy state. At this moment, we considered the incident to be mitigated. We investigated why we did not receive automated alerts about this issue and found that our alerts were monitoring error rates but did not alert for lack of overall traffic to the port forwarding backend. We have since improved our monitoring to include anomalies in traffic levels that cover this failure mode.

Several hours later, at 17:18 UTC, our monitors alerted us of an issue in a separate downstream component, which was similarly caused by the previous missed secret rotation step. We could see Codespaces creation and start failures increasing in all regions. The effect from the earlier secret rotation was not immediate because this secret is used in exchange for a token, which is cached for up to 24 hours. Our understanding was that the system would pick up the new secret without intervention, but in reality this secret was picked up only if the process was restarted. At 18:27 UTC, we restarted the service in all regions and could see that the VM pools, which were heavily drained before, started to slowly recover. To accelerate the draining of the backlog of queued jobs we increased the pool size at 18:45 UTC. This helped all but two pools in West Europe, which were still not recovering. At 19:44 UTC, we identified an instance of the service in West Europe that was not rotated along the rest. We rotated that instance and quickly saw a recovery in the affected pools.

After the incident, we identified why multiple downstream components failed to pick up the rotated secret. We then added additional monitoring to identify which secret versions are in use across all components in the service to more easily track and verify secret rotations. To address this short term, we have updated our secret rotation checklist to include the missing steps and added additional verification steps to ensure the new secrets are picked up everywhere. Longer term, we are automating most of our secret rotation processes to avoid human error.

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: September 2022

2022-10-05 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-10-05-github-availability-report-september-2022/

In September, we experienced one incident that resulted in significant impact and degraded state of availability to multiple GitHub services. We also experienced one incident resulting in significant impact to Codespaces. We are still investigating that incident and will include it in next month’s report. This report also sheds light into an incident that impacted Codespaces in August and an incident that impacted GitHub Actions in August.

September 8 19:44 UTC (lasting 5 hours and 11 minutes)

On September 8, 2022 at 19:44 UTC, our monitoring detected an increase in the number of pull request merge failures. The impact was concentrated on Enterprise Managed Users (EMUs) with a small number of bot accounts also affected.

Within 45 minutes, we traced the cause to a data transition that removed inconsistent data from profile records. Unfortunately, the transition incorrectly operated on EMU accounts, removing some data that is required to successfully merge pull requests via the UI and our API. CLI merges were unaffected.

We restored the data from backup, but this took longer than we had anticipated. We simultaneously pursued a workaround in code, but opted not to proceed with it as it could introduce data inconsistencies. Our restore operation resolved the issue with our pull request monitors having recovered by September 9, 2022 at 00:55 UTC.

Following this incident, we have made changes to our data transition procedures to allow for faster restores and transitions that can be automatically rolled back without relying on backups. We are also working on multiple improvements to our testing processes as they relate to EMUs.

September 28 03:53 UTC (lasting 1 hour and 16 minutes)

Our alerting systems detected an incident that impacted most Codespaces customers. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the October Availability Report, which we will publish the first Wednesday of November.

Follow up to August 29 12:51 UTC (lasting 5 hours and 40 minutes)

On August 29, 2022 at 12:51 UTC, our monitoring detected an increase in Codespaces create and start errors. We also started seeing DNS-related networking errors in some running Codespaces where outbound DNS resolutions were failing. At 14:19 UTC, we updated the status for Codespaces from yellow to red due to broad user impact.

This incident was caused by an Ubuntu security patch in systemd that broke DNS resolution. In recent versions of Ubuntu, unattended upgrades for security fixes are enabled by default. Codespaces host VMs were using the default recommended settings to apply security patches automatically on running VMs. When this patch was published, Codespaces host VMs started installing and applying the patch after the VM was created. Once the patch was installed on a VM, DNS resolution was broken. Depending on the timing of when the patch was installed on the host VM, this led to a few different failure modes, including failure creating/starting Codespaces or failure, making outbound network calls inside of a codespace that was already running.

Once we identified systemd’s DNS resolver configuration as the source of these errors, we were able to mitigate the issue by disabling systemd’s DNS resolver and manually configuring an upstream DNS resolver IP address. We deployed a change to the DNS configuration on the host VMs at 18:13 UTC. By 18:21 UTC, we started seeing positive signs of recovery in our metrics and changed the status to yellow. Ten minutes later, at 18:31 UTC, all metrics were fully healthy and the incident was resolved.

Following this incident, we are updating our DNS configuration to reduce dependencies on systemd’s DNS resolver. We are also investigating whether we should continue to use unattended upgrades for security patches. Disabling unattended upgrades will give us more deterministic behavior at runtime, preventing external changes from breaking Codespaces. We will remain fully capable of quickly patching VMs across our fleet even with unattended upgrades disabled.

Follow up to August 18 14:33 UTC (lasting 3 hours and 23 minutes)

This incident occurred in August but was left out of the August report because it did not result in a widespread outage. Several GitHub Actions customers experienced issues because of the degradation so we decided to include it retroactively.

At 14:13 UTC, there was a sudden spike in traffic to GitHub Actions which resulted in a higher than usual write load on our services. A majority of our services handled this graciously, but one of our internal services that is used for generating security tokens started returning 503 Service Unavailable errors to requests, triggering an alert to the engineering team. Further investigation revealed that the token database was experiencing a performance degradation which, compounded by the increased load, caused us to hit the database’s max concurrent connections limit. This was made worse due to a mismatch between our client-side throttling limits and database capacity, which resulted in our throttling thresholds allowing more traffic than this database had capacity to handle.

We mitigated the issue by scaling up the impacted database while also allowing a higher number of concurrent connections to it. The impacted service went back to a healthy state and the incident was considered resolved at 17:36 UTC. In addition to the immediate actions, we have improved our monitoring and alerting to allow faster remediation. We are also evaluating changes to our throttling mechanisms to better account for this traffic pattern.

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

Account WAF now available to Enterprise customers

2022-09-19 Daniele Molteni

Post Syndicated from Daniele Molteni original https://blog.cloudflare.com/account-waf/

Account WAF now available to Enterprise customers

Do you manage more than a single domain? If the answer is yes, now you can manage a single WAF configuration for all your enterprise domains.

Cloudflare has been built around the concept of zone, which is broadly equivalent to a domain. Customers can add multiple domains to a Cloudflare account, and every domain has its own independent security configuration. If you deploy a rule to block bots on example.com, you will need to rewrite the same rule on example.org. You’ll then need to visit the dashboard of every zone when you want to update it. This applies to all WAF products including Managed, Firewall and Rate Limiting rules.

If you have just two domains that’s not a big deal. But if you manage hundreds or thousands of domains like most large organizations do. Dealing with individual domains becomes time-consuming, expensive or outright impractical. Of course, you could build automation relying on our API or Terraform. This will work seamlessly but not all organizations have the capabilities to manage this level of complexity. Furthermore, having a Terraform integration doesn’t fully replicate the experience or give the confidence provided by interacting with a well-designed UI.

Following Cloudflare philosophy of making it easy to deploy security products, we are launching Account WAF.

Welcome to the simpler world of Account WAF

You might wonder why an organization might have thousands of domains, but this is actually very common.

For example, an e-commerce business can have tens of marketing domains for all its brands localized in different countries, they’ll have APIs that power their e-commerce sites and mobile applications, applications integrated with partners, logistics services or payment systems, domains used by employees, and so on and so forth. The structure of these accounts can be very complex.

Now, let’s imagine that you need to deal with the simple use case of deploying Cloudflare Managed ruleset across all your production domains.

Without Account WAF you’d need to track down all the correct domains and visit the WAF page of each one of them, deploy the ruleset and possibly add overrides to select only the attack vectors you are interested in. This is messy and mistakes are easy.

With Account WAF, you can now deploy a managed ruleset just once while providing the list of hostnames where you want it on. With deploying here we refer to writing a filter that defines what requests we should run (or execute) the ruleset on. The filter works like a normal WAF Custom Rule, where you can take advantage of the power of the Wirefilter syntax and use any parameter of the HTTP request, metadata and computed values, such as Bot Score or our new WAF Attack Score. For example, you can run a ruleset only on traffic with a specific User Agent, or only on your API traffic.

You can deploy these rulesets multiple times on your account, so you can have different settings for different groups of domains. For example, you might want to deploy OWASP with different sensitivity levels for your staging domains versus your production domains, or enforce a minimum level of security across all zones (e.g. for legal protection or compliance), before tailoring the security posture of the most sensitive domains. Furthermore, if in the future you are going to add a new domain to your production environment, you can simply add it to the rule filter, and we will start protecting these requests too.

It works for all WAF features

You can follow a similar flow if you want to deploy WAF Custom or Rate Limiting rules. However, in this case, to simplify management of large numbers of rules, we introduced the concept of Custom Rulesets. Like with managed rules, a ruleset is a group of rules, this time they are user defined. Like in the example above, you can deploy a custom ruleset on a user-defined filter to scope on what portion of your traffic you want to run these rules.

For example, consider the situation where you want to create two rules for all your domains: one that blocks traffic from a set of countries and then one rule to only allow requests with a non-malicious WAF Attack Score. You will create a custom ruleset with these two rules and then deploy it across your entire account.

One thing to note is that Account WAF rulesets (Managed, Custom and Rate Limiting) can be deployed on traffic to domains on Enterprise plans. You won’t be able to run rulesets on traffic of Free, Pro or Biz domains. This condition is enforced by the UI when writing a deployment filter.

Finally, you can follow the same flow to deploy custom rulesets that contain rate limiting rules. Custom rulesets are designed to contain either custom or rate limiting rules, at this stage these rules cannot be combined in the same ruleset. Please note that the Rate Limiting section will be available in October.

Who gets it?

Account WAF is an Enterprise only feature. If you are an Enterprise customer on our new Advanced plan, you will get access to the new feature automatically this week. If you are not on our Advanced plan, please reach out to your account team to learn more.

GitHub Availability Report: August 2022

2022-09-07 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-09-07-github-availability-report-august-2022/

In August, we experienced one incident resulting in significant impact and degraded state of availability to Codespaces. This report also sheds light into an incident that impacted Codespaces in July.

August 29 12:51 UTC (lasting 5 hours and 40 minutes)

Our alerting systems detected an incident that impacted most Codespaces customers. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the September Availability Report, which will publish the first Wednesday of October.

Follow up to July 27 22:29 UTC (lasting 7 hours and 55 minutes)

As mentioned in the July Availability Report, we are now providing a more detailed update on this incident following further investigation. During this incident, a subset of codespaces in the East US and West US regions using 2-core and 4-core machine types could not be created or restarted.

On July 27, 2022 at approximately 21:30 UTC we started experiencing a high rate of failures creating new virtual machines (VMs) for Codespaces in the East US and West US regions. The rate of codespace creations and starts on the 2-core and 4-core machine types exceeded the rate of successful VM creations needed to run, which eventually led to resource exhaustion of the underlying VMs. At 22:29 UTC, the pools for 2-core and 4-core VMs were drained and unable to keep up with demand, so we statused yellow. Impacted codespaces took longer than normal to start while waiting for an available VM, and many ended up timing out and failing.

Each codespace runs on an isolated VM for security. The Codespaces platform builds a host VM image on a regular cadence, and then all host VMs are instantiated from that base image. This incident started when our cloud provider began rolling out an update in the East US and West US regions that was incompatible with the way we built our host VM image. Troubleshooting the failures was difficult because our cloud provider was reporting that the VMs were being created successfully even though some critical processes that were required to be started during VM creation were not running.

We applied temporary mitigations, including scaling up our VM pools to absorb the high failure rate, as well as adjusting timeouts to accelerate failure for VMs that were unlikely to succeed. While these mitigations helped, the failure rate continued to increase as our cloud provider’s update rolled out more broadly. Our cloud provider recommended adjusting our image generalization process in a way that would work with the new update. Once we made the recommended change to our image build pipeline, VM creation success rates recovered and enabled the backlog of queued codespace creation and start requests to be fulfilled with VMs to run the codespaces.

Following this incident, we have audited our VM image building process to ensure it aligns with our cloud provider’s guidance to prevent similar issues going forward. In addition, we have improved our service logic and monitoring to be able to verify that all critical operations are executed during VM creation rather than only looking at the result reported by our cloud provider. We have also updated our alerting to detect VM creation failures earlier before there is any user impact. Together, these changes will prevent this class of issue from happening, detect other failure modes earlier, and enable us to quickly diagnose and mitigate other VM creation errors in the future.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on on the GitHub Engineering Blog.

GitHub Availability Report: July 2022

2022-08-03 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-08-03-github-availability-report-july-2022/

In July, we experienced one incident that resulted in degraded performance for Codespaces. This report also sheds light into two incidents in June that impacted multiple GitHub.com services.

July 27 22:29 UTC (lasting 5 hours and 55 minutes)

Our alerting systems detected degraded availability for Codespaces in the US West and East regions during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on cause and remediation in the August Availability Report, which will publish the first Wednesday of September.

Follow up to June 28 17:16 UTC (lasting 26 minutes)

During this incident, Codespaces was made unavailable due to issues introduced when migrating a DNS record to a new load balancer.

Codespaces runs a set of microservices in each region where Codespaces can be created. In order to route requests to the nearest region for each user, we have a global DNS record that uses a load balancer to resolve to the nearest regional backend. When performing an infrastructure migration, we needed to switch this record to point to a new load balancer. In order to do that, we deleted the existing global record in order to replace it with a record that pointed to the new balancer. Unfortunately, adding the new replacement record failed. Thus, any requests made to the global DNS record that pointed to Codespaces services were denied. Our alerting systems detected this almost immediately; however, our attempt to rollback the DNS update to switch to the old configuration also failed. We then disabled an endpoint in the old load balancer, upon which the rollback succeeded and all metrics recovered (after some time due to DNS caching and TTL).

As a follow-up, we are investigating safer mechanisms for testing the new load balancers and atomic DNS record updates, including setting up a mirrored testing DNS zone. We are also following up with our cloud provider to understand why the initial rollback failed and whether this is a bug.

Follow up to June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. This was due to excessive load on a proxy server that routes traffic to the database.

At approximately 14:14 UTC, the internal APIs that a data migration service uses to communicate with GitHub.com began returning 502 Service Unavailable errors to requests. This migration service allows customers to migrate to GitHub.com from other external sources, including GitHub Enterprise Server. As part of its exception handling, the service contains retry logic to requeue jobs. However, this logic captured all exceptions rather than just a subset. The 502 errors it caught triggered a bug that caused jobs to continuously requeue themselves to be retried. The situation quickly escalated when hundreds of thousands of jobs made identical API requests, overwhelming the database’s proxy server.

We mitigated the situation by pausing the processing of all new customer-initiated migrations performed with the data migration service at 15:07 UTC. We also pruned the queues of all jobs associated with in-progress migrations to alleviate the pressure on the proxy server. Approximately nine minutes later, we began to see affected services recover.

We have updated exception handling to only retry jobs in cases of a specific set of errors. We have also adjusted our logic to retry a fixed number of times before logging the exception and giving up. These actions eliminate the possibility of continuous requeuing. We are also investigating whether changes are needed to the rate limits of our internal APIs.

In summary

Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

6 strategic ways to level up your CI/CD pipeline

2022-07-19 Damian Brady

Post Syndicated from Damian Brady original https://github.blog/2022-07-19-6-strategic-ways-to-level-up-your-ci-cd-pipeline/

In today’s world, a well-tuned CI/CD pipeline is a critical component for any development team looking to build and ship high-quality software fast. But here’s the thing: It’s rare you’ll find two CI/CD pipelines that are exactly the same. And that’s by design. Every CI/CD pipeline should be built to meet a team’s specific needs.

Despite this, there are levels of maturity when building a CI/CD pipeline that range from basic implementations to more advanced automation workflows. But wherever you are on your CI/CD journey, there are a few things you can do to level up your CI/CD pipeline.

With that, here are six strategic things I often see missing from CI/CD pipelines that can help any developer or team advance and improve their workflows.

Need a primer on how to build a CI/CD pipeline on GitHub? Check out our guide

1. Add performance, device compatibility, and accessibility testing

Performance, device compatibility, and accessibility testing are often a manual exercise—and something that some teams are only partially doing. Manually testing for these things can slow down your delivery cycle, so many teams either eat the costs or just don’t do it.

But if these things are important to you—and they should be—there are tools that can be included in your CI/CD pipeline to automate the testing for and discovery of any issues.

Performance and device compatibility testing

One tool, for example, is Playwright which can do end-to-end testing, automated testing, and everything in between. You can also use it to do UI testing so you can catch issues in your product.

Visual regression testing

There’s another class of tools that can help you automate visual regression testing to make sure you haven’t changed the UI when you weren’t intending to do so. That means you haven’t introduced any unexpected UI changes. This can be super useful for device compatibility testing too. If something looks bad on one device, you can quickly correct it.

Accessibility testing

This is another incredibly impactful class of automated tests to add to your CI/CD pipeline. Why? Because every one of your customers should be valuable to you—and if even just a fraction of your customers have trouble using your product, that matters.

There are a ton of accessibility testing tools that can tell you things like if you have appropriate content for screen readers or if the colors on your website make sense to someone with color blindness. A great example is Pa11y, an open source tool you can use to run automated accessibility tests via the command line or Node.js.

2. Incorporate more automated security testing

Security should always be part of your software delivery pipeline, and it’s incredibly vital in today’s environments. Even still, I’ve seen a number of teams and companies who aren’t incorporating automated security tests in their CI/CD pipelines and instead treat security as something that happens after the DevOps process takes place.

Here’s the good news: There are a lot of tools that can help you do this without too much effort—including GitHub-native tools like Dependabot, code scanning, secret scanning, and if you’re a GitHub Enterprise user, you can bundle all the security functionality GitHub offers and more with GitHub Advanced Security. But even with a free GitHub account, you still can use Dependabot on any public or private repository, and code scanning and secret scanning are available on all public repositories, too.

Dependabot, for example, can help you mitigate any potential issues in your dependencies by scanning them for outdated packages and automatically creating pull requests for teams to fix them. It can also be configured to automatically update any project dependencies, too.

This is super impactful. Developers and teams often don’t update their dependencies because of the time it takes—or, sometimes they even just forget to update their dependencies. Dependencies are a legitimate source of vulnerabilities that are all too often overlooked.

Additionally, code scanning and secret scanning are offered on the GitHub platform and can be built into your CI/CD pipeline to improve your security profile. Where code scanning offers SAST capabilities that show if your code itself contains any known vulnerabilities, secret scanning makes sure you’re not leaking any credentials to your repositories. It can also be used to prevent any pushes to your repository if there are any exposed credentials.

The biggest thing is that teams should treat security as something you do throughout the SDLC—and, not just before and after something goes to production. You should, of course, always be checking for security issues. But the earlier you can catch issues, the better (hello DevSecOps). So including security testing within your CI/CD pipeline is an essential practice.

A screenshot of automated security testing workflows on GitHub.

3. Build a phased testing strategy

Phased testing is a great strategy for making sure you’re able to deliver secure software fast and at scale. But it’s also something that takes time to build. And consequently, a lot of teams just aren’t doing it.

Often, developers will put all or most of their automated testing at the build phase in their CI/CD pipelines. That means the build can take a long time to execute. And while there’s nothing necessarily wrong with this, you may find that it takes longer to get feedback on your code.

With phased testing, you can catch the big things early and get faster feedback on your codebase. The goal is to have a quick build that rapidly tests the fundamentals with simpler tests such as unit tests. After this, you may then perhaps deploy your build to a test environment to execute additional tests such as some accessibility testing, user testing, and other things that may take longer to execute. This means you’re working your way through a number of possible issues starting with the most critical elements first.

As you get closer to production in a phased testing model, you’ll want to test more and more things. This will likely include key items such as regression testing to make sure previous bugs aren’t reappearing in your codebase. At this stage, things are less likely to go wrong. But you’ll want to effectively catch the big things early and then narrow your testing down to ensure you’re shipping a very high-quality application.

Oh, and of course, there’s also testing in production, which is its own thing. But you can incorporate post-deployment tests into your production environment. You may have a hypothesis you want to test about if something works in production and execute tests to find out. At GitHub, we do this a lot by releasing new features behind feature flags and then enabling that flag for a subset of our user base to collect feedback.

4. Invest in blue-green deployments for easier rollouts

When it comes to releasing a new version of an application, what’s one word you think of? For me, the big word is “stress” (although “excitement” and “relief” are a close second and third). Blue-green deployments are one way to improve how you roll out a new version of an application in your CI/CD pipeline, but it can also be a bit more complex, too.

In the simplest terms, a blue-green deployment involves having two or more versions of your application in production and slowly moving your users from an older version to a newer one. This means that when you need to update or deploy a new version of an application, it goes to an “unused” production environment, and you can slowly move your users across safely.

The benefit of this is you can quickly roll back any changes by redirecting users to another prod environment. It also leads to drastically reduced downtime while you’re deploying a new application version. You can get everything set up in the environment and then just point people to a new one.

Blue-green deployments are perfect when you have two environments that are interchangeable. In reality with larger systems, you may have a suite of web servers or a number of serverless applications running. In practice, this means you might be using a load balancer that can distribute traffic across multiple locations. The canonical example of a load balancer is nginx—but every cloud has its own offerings (like Azure Front Door or Elastic Load Balancing on AWS).

This kind of strategy is common among organizations using Kubernetes. You may have a number of pods that are running and when you do a deployment, Kubernetes will deploy updates to new instances and redirects traffic. The management of which ones are up and running operates under the same principles as blue-green deployments—but you’re also navigating a far more complex architecture.

5. Adopt infrastructure-as-code for greater flexibility

Infrastructure provisioning is the practice of building IT infrastructure as you need it—and some teams will adopt infrastructure-as-code (IaC) in their CI/CD pipelines to provision resources automatically at specific points in the pipeline.

I strongly recommend doing this. The goal of IaC is that when you’re deploying your application, you’re also deploying your infrastructure. That means you always know what your infrastructure looks like in production, and your testing environment is also replicable to what’s in production.

There are two benefits to building IaC into your CI/CD pipeline:

It helps you make sure that your application and the infrastructure it runs on are routinely being tested in tandem. The old school way of doing things was to say that this is a production machine and it looks like this—and this is our testing machine and we want it to be as close to production as possible. But almost always, you’ll find that production environments change over time—and it makes it harder to know what your production environment is.
It helps you mitigate any real-time issues with your infrastructure. That means if your production server goes down, it’s not a disaster—you can just re-deploy it (and even automate your redeployment at that).

Last but not least: building IaC into your CI/CD pipeline means you can more effectively do things like blue-green deployments. You can deploy a new version of an application—code and infrastructure included—and reroute your DNS to go to that version. If it doesn’t work, that’s fine—you can quickly roll back to your previous version.

A screenshot of a GitHub Actions Terraform workflow.

6. Create checkpoints for automated rollbacks

Ideally, you want to avoid ever having to roll back a software release. But let’s be honest. We all make mistakes and sometimes code that worked in your development or test environment doesn’t work perfectly in production.

When you need to roll back a release to a previous application version, automation makes it much easier to do so quickly. I think of a rollback as a general term for mitigating production problems by reverting to a previous version, whether that’s redeploying or restoring from backup. If you have a great CI/CD pipeline, you can ideally fix a problem and roll out an update immediately—so you can avoid having to go to a previous app version.

Looking for more ways to improve your CI/CD pipeline?

Try exploring the GitHub Marketplace for CI/CD and automation workflow templates. At the time I’m writing this, there are more than 14,000 pre-built, community-developed CI/CD and automation actions in the GitHub Marketplace. And, of course, you can always build your own custom workflows with GitHub Actions.

Explore the GitHub Marketplace

Additional resources

GitHub Availability Report: June 2022

2022-07-06 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-07-06-github-availability-report-june-2022/

In June, we experienced four incidents resulting in significant impact and degraded state of availability to multiple GitHub.com services. This report also sheds light into an incident that impacted multiple GitHub.com services in May.

June 1 09:40 UTC (lasting 48 minutes)

During this incident, customers experienced delays in the start up of their GitHub Actions workflows. The cause of these delays was excessive load on a proxy server that routes traffic to the database.

At 09:37 UTC, Actions service noticed a marked increase in the time it takes customer jobs to start. Our on-call engineer was paged and Actions was statused red. Once we started to investigate, we noticed that the pods running the proxy server for the database were crash-looping due to out-of-memory errors. A change was created to increase the available memory to these pods, which fully rolled out by 10:08 UTC. We started to see recovery in Actions even before 10:08 UTC, and statused to yellow at 10:17 UTC. By 10:28 UTC, we were confident that the memory increase had mitigated the issue, and statused Actions green.

Ultimately, this issue was traced back to a set of data analysis queries being pointed at an incorrect database. The large load they placed on the database caused the crash loops and the broader impact. These queries have been moved to a dedicated analytics setup that does not serve production traffic.

We are adding alerts to identify increases in load to the proxy server to catch issues like this early. We are also investigating how we can put in guardrails to ensure production database access is limited to services that own the data.

June 21 17:02 UTC (lasting 1 hour and 10 minutes)

During this incident, shortly after the GA of Copilot, users with either a Marketplace or Sponsorship plan were unable to use Copilot. Users with those subscriptions received an error from our API responsible for creating authentication tokens. This impacted a little less than 20% of our active users at the time.

At approximately 16:45 UTC, we were alerted and noticed elevated error rates in the API and began investigating causes. We were able to identify the issue and statused red. Our engineers worked quickly to roll out a fix to the API endpoint and we saw API error rates begin lowering at approximately 17:45 UTC. By 18:00 UTC, we were no longer seeing this issue but decided to wait for 10 more minutes to status back to green to ensure there were no regressions.

We have increased our testing around this particular combination of subscription types, added these scenarios to our user testing and will add additional data shape testing before future rollouts.

June 28 17:16 UTC (lasting 26 minutes)

Our alerting systems detected degraded availability for Codespaces during this time. Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the July Availability Report, which will be published the first Wednesday of August.

June 29 14:48 UTC (lasting 1 hour and 27 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, GitHub Packages, and GitHub Pages were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the July Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow up to May 27 04:26 UTC (lasting 21 minutes) and May 27 07:36 UTC (lasting 1 hour and 21 minutes)

As mentioned in the May Availability Report, we are now providing a more detailed update on this incident following further investigation.

Both instances that occurred at 04:26 and 07:36 UTC were caused by the same contributing factors. In the first instance, an individual service team noticed higher than normal load and an increase in error rate on API requests and statused red. The load was particularly high on our login endpoint. While this did elevate error rates, it was not enough to cause a widespread outage and we should have likely statused yellow in this instance.

After follow-up that indicated the load pattern had subsided, our on-call team determined it was safe to report the situation was mitigated and began to investigate further.

However, three hours later, we again experienced a degradation of service from a sustained high load in traffic. This was again concentrated on our login endpoint. We statused all services red, since we were seeing sustained error rates for a variety of clients and situations, and then updated individual service statuses based on their SLOs. Services that were affected by the load pattern statused to yellow, while services that were not impacted statused back to green.

The duration of impact to GitHub.com from the second instance of the load pattern lasted about 15 minutes. We continued to see elevated traffic during this time and waited until a network-level mitigation was rolled out before statusing all affected services back to green.

In addition to network mitigation, we were able to use the data from this incident to add additional mitigations on the application side for a sustained load of this type, as well as inform architectural changes we can make in the future to make our services more resilient.

Following this incident, we are improving our on-call procedures to ensure we always report the correct status level based on SLO review. While we always want to over-communicate issues with customers for awareness, we want to only status red when necessary.

In summary

One developer’s journey bringing Dependabot to GitHub Enterprise Server

2022-06-07 Landon Grindheim

Post Syndicated from Landon Grindheim original https://github.blog/2022-06-07-one-developers-journey-bringing-dependabot-to-github-enterprise-server/

If you’re like me, you’re still excited by last week’s news that Dependabot is generally available on GitHub Enterprise Server (GHES). Developers using GHES can now let Dependabot secure their dependencies and keep them up-to-date. You know who would have loved that? Me at my last job.

Before joining GitHub, I spent five years working on teams that relied on GHES to host our code. As a GHES user, I really, really wanted Dependabot. Here’s why.

Dependencies

One constant pain point for my previous teams was staying on top of dependencies. Creating a Rails project with rails new results in an app with 74 dependencies, Django apps start with 88 dependencies, and a project initialized with Create React App will have 1,432 dependencies!

Unfortunately, security vulnerabilities happen, and they can expose your customers to existential risk, so it’s important they are handled as soon as they’re published.

As I’m most familiar with the Ruby ecosystem, I’ll use Nokogiri, a gem for parsing XML and HTML, to illustrate the process of manually resolving a vulnerability. Nokogiri has been a dependency of every Rails app I’ve maintained. It’s also seen seven vulnerabilities since 2019. To fix these manually, we’ve had to:

Clone `my_rails_app`
Track down and parse the Nokogiri release notes
Patch Nokogiri in `my_rails_app` to a non-vulnerable version
Push the changes and open a pull request
Wait for CI to pass
Get the necessary reviews
Deploy, observe, and merge

This is just one of (at least) 74 dependencies in one Rails app. My team maintained 14 Rails apps in our microservices-based architecture, so we needed to repeat the process for each app. A single vulnerability would eat up days of engineering time. That’s just one dependency in one ecosystem. We also worked on apps written in Elixir, Python, JavaScript, and PHP.

If an engineer was patching vulnerabilities, they couldn’t pursue feature work, the thing our customers could actually see. This would, understandably, lead to conversations about which vulnerabilities were most likely to be exploited and which we could tolerate for now.

If we had Dependabot security updates, that process would have started with a pull request. What took an engineer days to complete on their own could have been done before lunch.

We could have invested in keeping all of our dependencies up-to-date. Incremental upgrades are typically easier to perform and pose less risk. They also give bad actors less time to find and exploit vulnerabilities. One of my previous teams was still running Rails 3.2, which was no longer maintained when Rails 6 was released six years later. As support phased out, we had to apply our own security patches to our codebase instead of getting them from the framework. This made upgrading even harder. We spent years trying to get to a supported version, but other product priorities always won out.

If my team had Dependabot version updates, Dependabot would have opened pull requests each time a new version of Rails was released. We’d still need to make changes to ensure our apps were compliant with the new versions, but the changes would be made incrementally, making the lift much lighter. But we didn’t have Dependabot. We had to upgrade manually, and that meant upgrading didn’t happen until it became a P0.

A new home

I joined GitHub in 2021 to work on Dependabot. Being intimately familiar with the challenges Dependabot could help address, I wanted to be part of the solution. Little did I know, the team was just starting the process of bringing Dependabot to GHES. Call it serendipity, a dream come true, or tea leaves arranged just so.

I quickly realized why Dependabot wasn’t already on GHES. GitHub acquired Dependabot in 2019, and it took some time to scale Dependabot to be able to secure GitHub’s millions of repositories. To achieve this, we ported the service’s backend to run on Moda, GitHub’s internal Kubernetes-based platform. The dependency update jobs that result in pull requests were updated to run on lightweight Firecracker VMs, allowing Dependabot to create millions of pull requests in just hours. It was an impressive effort by a small team.

That effort, however, didn’t lend itself to the architecture of GHES, where everything runs on a single server with limited resources. An auto-scaling backend and network of VMs wasn’t an option. Instead, we needed to port Dependabot’s backend to run on Nomad, the container orchestration option on GHES. The jobs running on Firecracker VMs needed to run on our customers’ hardware. Fortunately, organizations can self-host GitHub Actions runners in GHES, so we adapted them to run on GitHub Actions. We also had to adjust our development processes to support continuous delivery in the cloud and less frequent GHES releases.

The result is that developers relying on GHES now have the option to have their dependencies updated for them. Now, my former teammates can update their dependencies by:

Viewing the already opened pull request
Reviewing the pull request and the included release notes
Deploying, observing, and merging

We’re really proud of that. As for me, I get the immense satisfaction of knowing that I built something that will directly benefit my former teammates. It doesn’t get much better than that!

Guess what? GitHub is hiring. What would you like to make better?

If you’re inspired to work at GitHub, we’d love for you to join us. Check out our Careers page to see all of our current job openings.

Dedicated remote-first company with flexible hours
Building great products used by tens of millions of people and companies around the world
Committed to nurturing a diverse and inclusive workplace
And so much more!

GitHub Availability Report: April 2022

2022-05-05 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-05-04-github-availability-report-april-2022/

In April, we experienced three distinct incidents resulting in significant impact and degraded state of availability for Codespaces and GitHub Packages.

April 01 7:07 UTC (lasting 5 hours and 32 minutes)

Our alerting detected an increase in failures to create new Codespaces and start existing stopped Codespaces in the US West region. We immediately updated the GitHub status page and began to investigate.

Upon further investigation, we determined that some secrets that are used by the Codespaces service had expired. Codespaces maintains warm pools of resources to protect our users from intermittent failures in our dependent services. However, in the US West region, those pools were empty of resources due to the expired secret. In this case, we didn’t have an early enough warning on pools reaching low thresholds and didn’t have time to react until we ran out of capacity. As we worked to mitigate the incident, the pools in other regions also emptied due to the expired secret, and those regions began to see failures as well.

A limited number of GitHub engineers had access to rotate the secret, and communication issues delayed the start of the secret refresh process. The expired secret was eventually refreshed and rolled out to all regions, and the service was returned to full operation.

To prevent this failure pattern in the future, we now verify resources that expire and have monitors in place that alert well in advance if pool resources are not being maintained. We’ve also added monitors to notify us earlier when we approach resource exhaustion limits. In addition, we’ve initiated migrating the service to use a mechanism that doesn’t rely on secrets or the need to rotate credentials.

April 14 20:35 UTC (lasting 4 hours and 53 minutes)

We are still investigating the contributing factors and will provide a more detailed update in the May Availability Report, which will be published the first Wednesday of June. We will also share more about our efforts to minimize the impact of future incidents.

April 25 8:59 UTC (lasting 5 hours and 8 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Packages Registry databases, which started approximately one hour before any customer impact occurred. The threshold for this alert was relatively low, and it was not a paging alert, so we did not immediately investigate. CPU continued to rise on the database causing the Package Registry to start responding to requests with internal server errors, eventually causing customer impact. This increased activity was due to a high volume of the “Create Manifest” command used in an unexpected manner.

The throttling criteria configured at the database level wasn’t enough to limit the above command, and this caused an outage for anyone using the GitHub Packages Registry. Users were unable to push or pull packages, as well as being unable to access the packages UI or the repository landing page.

After investigating, we determined there was a performance bug related to the high volume of “Create Manifest” commands. In order to limit impact and restore normal operation, we blocked the activity causing this problem. We are actively following up on this issue by improving the rate limiting in packages and fixing the performance problem that was uncovered. We’ve also modified database alerting thresholds and severity so we get alerted to unexpected issues more quickly (rather than after customer impact).

During this incident, we also discovered that the repository home page has a hard dependency on the packages infrastructure. When the package registry is down, the home pages for repositories that list packages also fail to load. We decoupled the package listing from the repository home page, but that required manual intervention during the outage. We are working on a fix that loosely binds the packages listing, so if it fails, it does not take down the repository home pages for repositories that list packages.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Availability Report: March 2022

2022-04-06 Jakub Oleksy

Post Syndicated from Jakub Oleksy original https://github.blog/2022-04-06-github-availability-report-march-2022/

In March, we experienced a number of incidents that resulted in significant impact and degraded state of availability to some core GitHub services. This blog post includes a detailed follow-up on a series of incidents that occurred due to degraded database stability, and a distinct incident impacting the Actions service.

Database Stability

Last month, we experienced a number of recurring incidents that impacted the availability of our services. We want to acknowledge the impact this had on our customers, and take this opportunity during our monthly report to provide additional details as a result of further investigations and share what we have learned.

Background

The underlying theme of these issues was due to resource contention in our mysql1 cluster, which impacted the performance of a large number of our services and features during periods of peak load.

Each of these incidents resulted in a degraded state of availability for write operations on our primary services (including Git, issues, and pull requests). While some read operations were not impacted, any user who performed a write operation that involved our mysql1 cluster was affected, as the database could not handle the load.

After the other services recovered, GitHub Actions queues were saturated. We enabled the queues gradually to catch up in real time, and as a result our status page noted the multi-hour outages. When Actions are delayed, it can also impact CI completion and a host of other functions.

What we learned

These incidents were characterized by a burst in load during peak hours of GitHub traffic. During these bursts, our mysql1 cluster was not able to handle the load generated by traffic on the system and we were forced to fail-over and take other mitigations, as mentioned in the previous post.

Some of these incidents were related to our efforts to improve visibility on the database, but all of them were related to the low amount of headroom we had on our primary database and thus its susceptibility to a few poorly performing queries.

Optimizing for stability

Because of this, even after we mitigated the initial causes of downtime due to poor query performance, we were still running with low headroom and decided to take a proactive approach to managing load by intentionally slowing down services during peak hours. Furthermore, we took a calculated approach to increase capacity on the database by further optimizing queries.

Rather than risk another site outage, we established lower performance alerting thresholds on the database and proactively throttled webhooks and Actions services (the two largest drivers of automated load on the system) as we approached unsafe margins of error on March 14 14:43 UTC. We understood the potential impact to our customers, but decided it would be safer to proactively limit load on the system rather than risk another outage on multiple services.

In the meantime, we implemented a series of optimizations between March 14 and March 28 that drove queries per second on this database down by over 50% and reduced our transaction volume by 70% at peak load times. Through these performance optimizations, we became more confident in our headroom, but given ongoing investigations, we did not want to chance any unwarranted impacts.

Minimizing impact to our users

After the incidents mentioned above, we took steps to make sure we would be in a position, if necessary, to shut down any services driving high peak load. This meant taking maintenance windows for three services starting on March 24. We proactively paused migrations and team synchronization during peak load due to their potential impact.

We also took maintenance windows for GitHub Actions even though we did not actually throttle any actions and no customers were impacted during these windows. We did this in order to proactively notify customers of possible disruption. While it didn’t end up being the case, we knew we would need to throttle GitHub Actions if we saw any significant database degradation during these time windows. While this may have caused uncertainty for some customers, we wanted to prepare them for any potential impact.

Next steps

Immediate changes

In addition to the improvements mentioned above, we have significantly reduced our database performance alerting thresholds so that we are not “running hot” and will be well positioned to take action before customers are impacted.

We have also accelerated work that was already in progress to continue to shard this particular cluster and apply the learnings from this incident to other clusters that already exist outside of mysql1.

Additional technical and organizational initiatives

Due to the nature of this incident, we have also dedicated a team of engineers to study our internal processes and procedures, observability, and change release processes. While we’re still actively revisiting this incident, we feel confident we have mitigated the initial issues and we have the correct alerting and processes in place to ensure this problem is not likely to occur again.

We understand that the Actions service is critical to many of our customers. With new and ongoing investments across architecture and processes, we’ll continue to bring focus specifically to Actions reliability, including more graceful degradations when other GitHub services are experiencing issues, as well as faster recovery times.

March 29 10:26 UTC (lasting 57 minutes)

During an operation to move GitHub Actions and checks data to its own dedicated, sharded database cluster, a misconfiguration on the new database cluster caused the application to encounter errors. Once we reverted our changes, we were able to recover. This incident resulted in the failure or delay of some queued jobs for a period of time. Once mitigation was initiated, jobs that were queued during the incident were run successfully after the issue was resolved.

The Actions and checks data resides in a multi-tenant database cluster. As part of our efforts to improve reliability and scale, we have been working on functionally partitioning the Actions data to its own sharded database cluster. The switch over to the new cluster involves gradually switching over reads and then switching over writes. Immediately after switching the write traffic, we noticed Actions SLOs were breached and initiated a revert back to the old database. After we reverted back to the old database, we saw an immediate improvement in availability.

Upon further investigation, we discovered that update and delete queries were processed correctly on the new cluster, but insert queries were failing because of missing permissions on the new cluster. All changes processed on the new cluster were replicated back to the old cluster before the switch back, ensuring data integrity.

We have paused any attempts for migrations until we fully investigate and apply our learnings. Furthermore, due to the risk associated with these operations, we will no longer be attempting them during peak traffic hours, which occur between 12:00 and 21:00 UTC. From a technical perspective, we’re looking to scrutinize and improve our operational workflows for these database operations. Additionally, we are going to be performing an audit of our configurations and topology across our environment, to ensure we have properly covered them in our testing strategy. As part of these efforts, we uncovered a gap where we need to extend our pre-migration checklist with a step to verify permissions more thoroughly.

In summary

Every month we share an update on GitHub’s availability, including a description of any incidents that may have occurred and an update on how we are evolving our engineering systems and practices in response. Our hope is that by increasing our transparency and sharing what we’ve learned, everyone can gain from our experiences. At GitHub, we take the trust you place in us very seriously, and we hope this is a way for you to help hold us accountable for continuously improving our operational excellence, as well as our product functionality.

To learn more about our efforts to make GitHub more resilient every day, check out the GitHub engineering blog.

GitHub Availability Report: February 2022

2022-03-03 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2022-03-02-github-availability-report-february-2022/

In February, we experienced one incident resulting in significant impact and degraded state of availability for GitHub.com, issues, pull requests, GitHub Actions, and GitHub Codespaces services.

February 2 19:05 UTC (lasting 13 minutes)

As mentioned in our January report, our service monitors detected a high rate of errors affecting a number of GitHub services.

Upon further investigation of this incident, we found that a routine deployment failed to generate the complete set of integrity hashes needed for Subresource Integrity. The resulting output was missing values needed to securely serve Javascript assets on GitHub.com.

As a safety protocol, our default behavior is to error rather than rendering script tags without integrities, if a hash cannot be found in the integrities file. In this case, that means that github.com started serving 500 error pages to all web users. As soon as the errors were detected, we rolled back to the previous deployment and resolved the incident. Throughout the incident, only browser-based access to GitHub.com was impacted, with API and Git access remaining healthy.

Since this incident, we have added additional checks to our build process to ensure that the integrities are accurate and complete. We’ve also added checks for our main Javascript resources to the health check for our deployment containers, and adjusted the build pipeline to ensure the integrity generation process is more robust and will not fail in a similar way in the future.

In summary

Every month, we share an update on GitHub’s availability, including a description of any incidents that may have occurred and an update on how we are evolving our engineering systems and practices in response. Whether in these reports or via our engineering blog, we look forward to keeping you updated on the progress and investments we’re making to ensure the reliability of our services.

You can also follow our status page for the latest on our availability.

GitHub Availability Report: January 2022

2022-02-03 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2022-02-02-github-availability-report-january-2022/

In January, we experienced no incidents resulting in service downtime to our core services. However, we do want to acknowledge an incident in February that we are continuing to investigate.

February 2 19:12 UTC (lasting 26 minutes)

Our service monitors detected a high rate of errors for issues, pull requests, GitHub Codespaces, and GitHub Actions services. We have mitigated the incident and are confident it has been fully resolved.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

Please follow our status page for real time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: December 2021

2022-01-05 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2022-01-05-github-availability-report-december-2021/

In December, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: November 2021

2021-12-01 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-12-01-github-availability-report-november-2021/

In November, we experienced one incident resulting in significant impact and degraded state of availability for core GitHub services, including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks.

November 27 20:40 UTC (lasting 2 hours and 50 minutes)

We encountered a novel failure mode when processing a schema migration on a large MySQL table. Schema migrations are a common task at GitHub and often take weeks to complete. The final step in a migration is to perform a rename to move the updated table into the correct place. During the final step of this migration a significant portion of our MySQL read replicas entered a semaphore deadlock. Our MySQL clusters consist of a primary node for write traffic, multiple read replicas for production traffic, and several replicas that serve internal read traffic for backup and analytics purposes. The read replicas that hit the deadlock entered a crash-recovery state causing an increased load on healthy read replicas. Due to the cascading nature of this scenario, there were not enough active read replicas to handle production requests which impacted the availability of core GitHub services.

During the incident mitigation, in an effort to increase capacity, we promoted all available internal replicas that were in a healthy state into the production path; however, the shift was not sufficient for full recovery. We also observed that read replicas serving production traffic would temporarily recover from their crash-recovery state only to crash again due to load. Based on this crash-recovery loop, we chose to prioritize data integrity over site availability by proactively removing production traffic from broken replicas until they were able to successfully process the table rename. Once the replicas recovered, we were able to move them back into production and restore enough capacity to return to normal operations.

Throughout the incident, write operations remained healthy and we have verified there was no data corruption.

To address this class of failure and reduce time to recover in the future, we continue to prioritize our functional partitioning efforts. Partitioning the cluster adds resiliency given migrations can then be run in canary mode on a single shard—reducing the potential impact of this failure mode. Additionally, we are actively updating internal procedures to increase the amount each cluster is over-provisioned.

As next steps, we’re continuing to investigate the specific failure scenario, and have paused schema migrations until we know more on safeguarding against this issue. As we continue to test our migration tooling, we are classifying opportunities to improve it during such scenarios.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

GitHub Availability Report: October 2021

2021-11-04 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-11-04-github-availability-report-october-2021/

In October, we experienced one incident resulting in significant impact and degraded state of availability for the GitHub Codespaces service.

October 8 17:16 UTC (lasting 1 hour and 36 minutes)

A core Codespaces API response was inadvertently restructured as part of our Codespaces public API launch, impacting existing API clients dependent on a stable schema.

For the duration of the incident, new Codespaces could not be initiated from the Visual Studio Code Desktop client. Connections to the web editor and pre-existing desktop sessions were not impacted, but degraded, with the extension displaying an error message while omitting Codespaces metadata from the Remote Explorer view.

The incident was mitigated once we rolled back the regression, at which point all clients could connect again, including with new Codespaces created during the incident. As our monitoring systems did not initially detect the impact of the regression, a subsequent and unrelated deployment was initiated, delaying our ability to revert the change. To ensure similar breaking changes are not introduced in the future, we are investing in tooling to support more rigorous end-to-end testing with the extension’s use of our API. Additionally, we are expanding our monitoring to better align with the user experience across the relevant internal service boundaries.

In summary

GitHub Availability Report: September 2021

2021-10-06 Scott Sanders

Post Syndicated from Scott Sanders original https://github.blog/2021-10-06-github-availability-report-september-2021/

In September, we experienced no incidents resulting in service downtime to our core services.

Please follow our status page for real time updates. To learn more about what we’re working on, check out the GitHub engineering blog.

1. Do the basics well and often

Code review

Separation of duties

AI-powered compliance

Dependency management

Approvals

2. Have a common understanding of concepts and terms

3. Helping developers understand where they fit into the three lines of defense

Next steps

In summary

In summary

Welcome to the simpler world of Account WAF

It works for all WAF features

Who gets it?

In summary

July 27 22:29 UTC (lasting 5 hours and 55 minutes)

Follow up to June 28 17:16 UTC (lasting 26 minutes)

Follow up to June 29 14:48 UTC (lasting 1 hour and 27 minutes)

In summary

1. Add performance, device compatibility, and accessibility testing

Performance and device compatibility testing

Visual regression testing

Accessibility testing

2. Incorporate more automated security testing

3. Build a phased testing strategy

4. Invest in blue-green deployments for easier rollouts

5. Adopt infrastructure-as-code for greater flexibility

There are two benefits to building IaC into your CI/CD pipeline:

6. Create checkpoints for automated rollbacks

Looking for more ways to improve your CI/CD pipeline?

Additional resources

June 1 09:40 UTC (lasting 48 minutes)

June 21 17:02 UTC (lasting 1 hour and 10 minutes)

June 28 17:16 UTC (lasting 26 minutes)

June 29 14:48 UTC (lasting 1 hour and 27 minutes)

Follow up to May 27 04:26 UTC (lasting 21 minutes) and May 27 07:36 UTC (lasting 1 hour and 21 minutes)

In summary

Dependencies

A new home

April 01 7:07 UTC (lasting 5 hours and 32 minutes)

April 14 20:35 UTC (lasting 4 hours and 53 minutes)

April 25 8:59 UTC (lasting 5 hours and 8 minutes)

In summary

Database Stability

Background

What we learned

Optimizing for stability

Minimizing impact to our users

Next steps

Immediate changes

Additional technical and organizational initiatives

March 29 10:26 UTC (lasting 57 minutes)

In summary

February 2 19:05 UTC (lasting 13 minutes)

In summary

February 2 19:12 UTC (lasting 26 minutes)

November 27 20:40 UTC (lasting 2 hours and 50 minutes)

In summary

October 8 17:16 UTC (lasting 1 hour and 36 minutes)

In summary

The collective thoughts of the interwebz