Tag Archives: Engineering

Building GitHub with Ruby and Rails

Post Syndicated from Adam Hess original https://github.blog/2023-04-06-building-github-with-ruby-and-rails/

Since the beginning, GitHub.com has been a Ruby on Rails monolith. Today, the application is nearly two million lines of code and more than 1,000 engineers collaborate on it daily. We deploy as often as 20 times a day, and nearly every week one of those deploys is a Rails upgrade.

Upgrading Rails weekly

Every Monday a scheduled GitHub Action workflow triggers an automated pull request, which bumps our Rails version to the latest commit on the Rails main branch for that day. All our builds run on this new version of Rails. Once all the builds pass, we review the changes and ship it the next day. Starting an upgrade on Monday you will already have an open pull request linking the changes this Rails upgrade proposes and a completed build.

This process is a far stretch from how we did Rails upgrades only a few years ago. In the past, we spent months migrating from our custom fork of Rails to a newer stable release, and then we maintained two Gemfiles to ensure we’d remain compatible with the upcoming release. Now, upgrades take under a week. You can read more about this process in this 2018 blog post. We work closely with the community to ensure that each Rails release is running in production before the release is officially cut.

There are real tangible benefits to running the latest version of Rails:

  • We give developers at GitHub the very best version of our tools by providing the latest version of Rails. This ensures users can take advantage of all the latest improvements including better database connection handling, faster view rendering, and all the amazing work happening in Rails every day.
  • We have removed nearly all of our Rails patches. Since we are running on the latest version of Rails, instead of patching Rails and waiting for a change, developers can suggest the patch to Rails itself.
  • Working on Rails is now easier than ever to share with your team! Instead of telling your team you found something in Rails that will be fixed in the next release, you can work on something in Rails and see it the following week!
  • Maintaining more up-to-date dependencies gives us a better security posture. Since we already do weekly upgrades, adding an upgrade when there is a security advisory is standard practice and doesn’t require any extra work.
  • There are no “big bang” migrations. Since each Rails upgrade incorporates only a small number of changes, it’s easier to understand and dig into if there are incompatibilities. The worst issues from a tough upgrade are unexpected changes from an unknown location. These issues can be mitigated by this upgrade strategy.
  • Catching bugs in the main branch and contributing back strengthens our engineering team and helps our developers deepen their expertise and understanding of our application and its dependencies.

Testing Ruby continuously

Naturally, we have a similar process for Ruby upgrades. In February 2022, shortly after upgrading to Ruby 3.1, we started building and testing Ruby shas from 3.2-alpha in a parallel build. When CI runs for the GitHub Rails application, two versions of the builds run: one build uses the Ruby version we are running in production and one uses the latest Ruby commit including the latest changes in Ruby, which we update weekly.

While we build Ruby with every change, GitHub only ships numbered Ruby versions to production. The builds help us maintain compatibility with the upcoming Ruby version and give us insight into what Ruby changes are coming.

In early December 2022, with CI giving us confidence we were compatible before the usual Christmas release of Ruby 3.2, we were able to test Ruby release candidates with a portion of production traffic and give the Ruby team insights into any changes we noticed. For example, we could reproduce an increase in allocations due to keyword argument handling that was fixed before the release of Ruby 3.2 due to this process. We also identified a subtle change when to_str and #to_i is applied. Because we upgrade all the time, identifying and resolving these issues was standard practice.

This weekly upgrade process for Ruby allowed us to upgrade our monolith from Ruby 3.1 to Ruby 3.2 within a month of release. After all, we had already tested and run it in production! At this point, this was the fastest Ruby upgrade we had ever done. We broke this record with the release of Ruby 3.2.1, which we adopted on release day.

This upgrade process has proved to be invaluable for our collaboration with the Ruby core team. A nice side effect of having these builds is that we are able to easily test and profile our own Ruby changes before we suggest them upstream. This can make it easier for us to identify regressions in our own application and better understand the impact of changes on a production environment.

Should I do it, too?

Our ability to do frequent Ruby and Rails upgrades is due to some engineering maturity at GitHub. Doing weekly Rails upgrades requires a thorough test suite with many great engineers working to maintain and improve it. We also gain confidence from having great test environments along with progressive rollout deploys. Our test suite is likely to catch problems, and if it doesn’t, we are confident we will catch it during deploy before it reaches customers.

If you have these tools, you should also upgrade Rails weekly and test using the latest Ruby. GitHub is a better Rails app because of it and it has enabled work from my team that I am really proud of.

Ruby champion Eileen Uchitelle explains why investing in Rails is important in her Rails Conf 2022 Keynote:

Ultimately, if more companies treated the framework as an extension of the application, it would result in higher resilience and stability. Investment in Rails ensures your foundation will not crumble under the weight of your application. Treating it as an unimportant part of your application is a mistake and many, many leaders make this mistake.

Thanks to contributions from people around the world, using Ruby is better than ever. GitHub, along with hundreds of other companies, benefits from Ruby and Rails continuing to improve. Upgrading regularly and investing in our frameworks is a staple of the work we do on the Ruby Architecture team at GitHub. We are always grateful for the Ruby community and glad that we can give back in a way that improves our application and tools as much as it improves them for everyone else.

GitHub Availability Report: March 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-04-05-github-availability-report-march-2023/

In March, we experienced six incidents that resulted in degraded performance across GitHub services. This report also sheds light into a February incident that resulted in degraded performance for GitHub Codespaces.

February 28 15:42 UTC (lasting 1 hour and 26 minutes)

On February 28 at 15:42 UTC, our monitors detected a higher than normal failure rate for creating and starting GitHub Codespaces in the East US region, caused by slower than normal VM allocation time from our cloud provider. To reduce impact to our customers during this time, we redirected codespace creations to a secondary region. Codespaces in other regions were not impacted during this incident.

To help reduce the impact of similar incidents in the future, we tuned our monitors to allow for quicker detection. We are also making architectural changes to enable failover for existing codespaces and to initiate failover automatically without human intervention.

March 01 12:01 UTC (lasting 3 hour and 06 minutes)

On March 1 at 12:01 UTC, our monitors detected a higher than normal latency for requests to Container, NPM, Nuget, and RubyGems Packages registries. Initial impact was 5xx responses to 0.5% of requests, peaking at 10% during this incident. We determined the root cause to be an unhealthy disk on a VM node hosting our database, which caused significant performance degradation at the OS level. All MySQL servers on this problematic node experienced connection delays and slow query execution, leading to the high failure rate.

We mitigated this with a failover of the database. We recognize this incident lasted too long and have updated our runbooks for quicker mitigation short term. To address the reliability issue, we have architectural changes in progress to migrate the application backend to a new MySQL infrastructure. This includes improved observability and auto-recovery tooling, thereby increasing application availability.

March 02 23:37 UTC (lasting 2 hour and 18 minutes)

On March 2 at 23:37 UTC, our monitors detected an elevated amount of failed requests for GitHub Actions workflows, which appeared to be caused by TLS verification failures due to an unexpected SSL certificate bound to our CDN IP address. We immediately engaged our CDN provider to help investigate. The root cause was determined to be a configuration change on the provider’s side that unintentionally changed the SSL certificate binding to some of our GitHub Actions production IP addresses. We remediated the issue by removing the certificate binding.

Our team is now evaluating onboarding multiple DNS/CDN providers to ensure we can maintain consistent networking even when issues arise that are out of our control.

March 15 14:07 UTC (lasting 1 hour and 20 minutes)

On March 15 at 14:07 UTC, our monitors detected increased latency for requests to Container, NPM, Nuget, RubyGems package registries. This was caused by an operation on the primary database host that was performed as part of routine maintenance. During the maintenance, a slow running query that was not properly drained blocked all database resources. We remediated this issue by killing the slow running query and restarting the database.

To prevent future incidents, we have paused maintenance until we investigate and address the draining of long running queries. We have also updated our maintenance process to perform additional safety checks on long running queries and blocking processes before proceeding with production changes.

March 27 12:25 UTC (lasting 1 hour and 33 minutes)

On March 27 at 12:25 UTC, we were notified of impact to pages, codespaces and issues. We resolved the incident at 13:29 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

March 29 14:21 UTC (lasting 4 hour and 29 minutes)

On March 29 at 14:21 UTC, we were notified of impact to pages, codespaces and actions. We resolved the incident at 18:50 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.

March 31 01:16 UTC (lasting 52 minutes)

On March 31 at 01:16 UTC, we were notified of impact to Git operations, issues and pull requests. We resolved the incident at 02:08 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

Evolution of quality at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/evolution-of-quality

To achieve our vision of becoming the leading superapp in Southeast Asia, we constantly need to balance development velocity with maintaining the high quality of the Grab app. Like most tech companies, we started out with the traditional software development lifecycle (SDLC) but as our app evolved, we soon noticed several challenges like high feature bugs and production issues.  

In this article, we dive deeper into our quality improvement journey that officially began in 2019, the challenges we faced along the way, and where we stand as of 2022.

Background

Figure 1 – Software development life cycle (SDLC) sample

When Grab first started in 2012, we were using the Agile SDLC (Figure 1) across all teams and features. This meant that every new feature went through the entire process and was only released to app distribution platforms (PlayStore or AppStore) after the quality assurance (QA) team manually tested and signed off on it.

Over time, we discovered that feature testing took longer, with more bugs being reported and impact areas that needed to be tested. This was the same for regression testing as QA engineers had to manually test each feature in the app before a release. Despite the best efforts of our QA teams, there were still many major and critical production issues reported on our app – the highest numbers were in 2019 (Figure 2).

Figure 2 – Critical open production issue (OPI) trend

This surge in production issues and feature bugs was directly impacting our users’ experience on our app. To directly address the high production issues and slow testing process, we changed our testing strategy and adopted shift-left testing.

Solution

Shift-left testing is an approach that brings testing forward to the early phases of software development. This means testing can start as early as the planning and design phases.

Figure 3 – Shift-left testing

By adopting shift-left testing, engineering teams at Grab are able to proactively prevent possible defect leakage in the early stages of testing, directly addressing our users’ concerns without delaying delivery times.

With shift-left testing, we made three significant changes to our SDLC:

  • Software engineers conduct acceptance testing
  • Incorporate Definition of Ready (DoR) and Definition of Done (DoD)
  • Balanced testing strategy

Let’s dive deeper into how we implemented each change, the challenges, and learnings we gained along the way.

Software engineers conduct acceptance testing

Acceptance testing determines whether a feature satisfies the defined acceptance criteria, which helps the team evaluate if the feature fulfills our consumers’ needs. Typically, acceptance testing is done after development. But our QA engineers still discovered many bugs and the cost of fixing bugs at this stage is more expensive and time-consuming. We also realised that the most common root causes of bugs were associated with insufficient requirements, vague details, or missing test cases.

With shift-left testing, QA engineers start writing test cases before development starts and these acceptance tests will be executed by the software engineers during development. Writing acceptance tests early helps identify potential gaps in the requirements before development begins. It also prevents possible bugs and streamlines the testing process as engineers can find and fix bugs even before the testing phase. This is because they can execute the test cases directly during the development stage.

On top of that, QA and Product managers also made Given/When/Then (GWT) the standard for acceptance criteria and test cases, making them easier for all stakeholders to understand.

Step by Step style GWT format
  1. Open the Grab app
  2. Navigate to home feed
  3. Tap on merchant entry point card
  4. Check that merchant landing page is shown
Given user opens the app
And user navigates to the home feed
When the user taps on the merchant entry point card
Then the user should see the merchant’s landing page

By enabling software engineers to conduct acceptance testing, we minimised back-and-forth discussions within the team regarding bug fixes and also, influenced a significant shift in perspective – quality is everyone’s responsibility.

Another key aspect of shift-left testing is for teams to agree on a standard of quality in earlier stages of the SDLC. To do that, we started incorporating Definition of Ready (DoR) and Definition of Done (DoD) in our tasks.

Incorporate Definition of Ready (DoR) and Definition of Done (DoD)

As mentioned, quality checks can be done before development even begins and can start as early as backlog grooming and sprint planning. The team needs to agree on a standard for work products such as requirements, design, engineering solutions, and test cases. Having this alignment helps reduce the possibility of unclear requirements or misunderstandings that may lead to re-work or a low-quality feature.

To enforce consistent quality of work products, everyone in the team should have access to these products and should follow DoRs and DoDs as standards in completing their tasks.

  • DoR: Explicit criteria that an epic, user story, or task must meet before it can be accepted into an upcoming sprint. 
  • DoD: List of criteria to fulfill before we can mark the epic, user story, or task complete, or the entry or exit criteria for each story state transitions. 

Including DoRs and DoDs have proven to improve delivery pace and quality. One of the first teams to adopt this observed significant improvements in their delivery speed and app quality – consistently delivering over 90% of task commitments, minimising technical debt, and reducing manual testing times.

Unfortunately, having these two changes alone were not sufficient – testing was still manually intensive and time consuming. To ease the load on our QA engineers, we needed to develop a balanced testing strategy.  

Balanced testing strategy

Figure 4 – Test automation strategy

Our initial automation strategy only included unit testing, but we have since enhanced our testing strategy to be more balanced.

  • Unit testing
  • UI component testing
  • Backend integration testing
  • End-to-End (E2E) testing

Simply having good coverage in one layer does not guarantee good quality of an app or new feature. It is important for teams to test vigorously with different types of testing to ensure that we cover all possible scenarios before a release.

As you already know, unit tests are written and executed by software engineers during the development phases. Let’s look at what the remaining three layers mean.

UI component testing

This type of testing focuses on individual components within the application and is useful for testing specific use cases of a service or feature. To reduce manual effort from QA engineers, teams started exploring automation and introduced a mobile testing framework for component testing.

This UI component testing framework used mocked API responses to test screens and interactions on the elements. These UI component tests were automatically executed whenever the pipeline was run, which helped to reduce manual regression efforts. With shift-left testing, we also revised the DoD for new features to include at least 70% coverage of UI component tests.

Backend integration testing

Backend integration testing is especially important if your application regularly interacts with backend services, much like the Grab app. This means we need to ensure the quality and stability of these backend services. Since Grab started its journey toward becoming a superapp, more teams started performing backend integration tests like API integration tests.

Our backend integration tests also covered positive and negative test cases to determine the happy and unhappy paths. At the moment, majority of Grab teams have complete test coverage for happy path use cases and are continuously improving coverage for other use cases.

End-to-End (E2E) testing

E2E tests are important because they simulate the entire user experience from start to end, ensuring that the system works as expected. We started exploring E2E testing frameworks, from as early as 2015, to automate tests for critical services like logging in and booking a ride.

But as Grab introduced more services, off-the-shelf solutions were no longer a viable option, as we noticed issues like automation limitations and increased test flakiness. We needed a framework that is compatible with existing processes, stable enough to reduce flakiness, scalable, and easy to learn.

With this criteria in mind, our QA engineering teams built an internal E2E framework that could make API calls, test different account-based scenarios, and provide many other features. Multiple pilot teams have started implementing tests with the E2E framework, which has helped to reduce regression efforts. We are continuously improving the framework by adding new capabilities to cover more test scenarios.

Now that we’ve covered all the changes we implemented with shift-left testing, let’s take a look at how this changed our SDLC.

Impact

Figure 5 – Updated SDLC process

Since the implementation of shift-left testing, we have improved our app quality without compromising our project delivery pace. Compared to 2019, we observed the following improvements within the Grab superapp in 2022:

  • Production issues with “Major and Critical” severity bugs found in production were reduced by 60%
  • Bugs found in development phase with “Major and Critical” severity were reduced by 40%

What’s next?

Through this journey, we recognise that there’s no such thing as a bug-free app – no matter how much we test, production issues still happen occasionally. To minimise the occurrence of bugs, we’re regularly conducting root cause analyses and writing postmortem reports for production incidents. These allow us to retrospect with other teams and come up with corrective actions and prevention plans. Through these continuous learnings and improvements, we can continue to shape the future of the Grab superapp.

Special thanks to Sori Han for designing the images in this article.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Determine the best technology stack for your web-based projects

Post Syndicated from Grab Tech original https://engineering.grab.com/determining-tech-stack

In the current technology landscape, startups are developing rapidly. This usually leads to an increase in the number of engineers in teams, with the goal of increasing the speed of product development and delivery frequency. However, this growth often leads to a diverse selection of technology stacks being used by different teams within the same organisation.

Having different technology stacks within a team could lead to a bigger problem in the future, especially if documentation is not well-maintained. The best course of action is to pick just one technology stack for your projects, but it begs the question, “How do I choose the best technology stack for my projects?”

One such example is OVO, which is an Indonesian payments, rewards, and financial services platform within Grab. We share our process and analysis to determine the best technology stack that complies with precise standards. By the end of the article, you may also learn to choose the best technology stack for your needs.

Background

In recent years, we have seen massive growth in modern web technologies, such as React, Angular, Vue, Svelte, Django, TypeScript, and many more. Each technology has its benefits. However, having so many choices can be confusing when you must determine which technologies are best for your projects. To narrow down the choices, a few aspects, such as scalability, stability, and usage in the market, must be considered.

That’s the problem that we used to face. Most of our legacy services were not standardised and were written in different languages like PHP, React, and Vue. Also, the documentation for these legacy services is not well-structured or regularly updated.

Current technology stack usage in OVO

We realised that we had two main problems:

  • Various technology stacks (PHP, Vue, React, Nuxt, and Go) maintained simultaneously, with incomplete documentation, may consume a lot of time to understand the code, especially for engineers unfamiliar with the frameworks or even a new hire.
  • Context switching when reviewing code makes it hard to review other teammates’ merge requests on complex projects and quickly offer better code suggestions.

To prevent these problems from recurring, teams must use one primary technology stack.

After detailed comparisons, we narrowed our choices to two options – React and Vue – because we have developed projects in both technologies and already have the user interface (UI) library in each technology stack.

Taken from ulam.io

Next, we conducted a more detailed research and exploration for each technology. The main goals were to find the unique features, scalability, ease of migration, and compatibility for the UI library for React and Vue. To test the compatibility of each UI library, we also used a sample UI on one of our upcoming projects and sliced it.

Here’s a quick summary of our exploration:

Metrics Vue React
UI Library Compatibility Doesn’t require much component development Doesn’t require much component development
Scalability Easier to upgrade, slower in releasing major updates, clear migration guide Quicker release of major versions, supports gradual updates
Others Composition API, strong community (Vue Community) Latest version (v18) of React gradual updates, doesn’t support IE

From this table, we found that the differences between these frameworks are miniscule, making it tough for us to determine which to use. Ultimately, we decided to step back and see the Big Why

Solution

The Big Why here was “Why do we need to standardise our technology stack?”. We wanted to ease the onboarding process for new hires and reduce the complexity, like context switching, during code reviews, which ultimately saves time.

As Kleppmann (2017) states, “The majority of the cost of software is in its ongoing maintenance”. In this case, the biggest cost was time. Increasing the ease of maintenance would reduce the cost, so we decided to use maintainability as our north star metric.

Kleppmann (2017) also highlighted three design principles in any software system:

  • Operability: Make it easy to keep the system running.
  • Simplicity: Easy for new engineers to understand the system by minimising complexity.
  • Evolvability: Make it easy for engineers to make changes to the system in the future.

Keeping these design principles in mind, we defined three metrics that our selected tech stack must achieve:

  • Scalability
    • Keeping software and platforms up to date
    • Anticipating possible future problems
  • Stability of the library and documentation
    • Establishing good practices and tools for development
  • Usage in the market
    • The popularity of the library or framework and variety of coding best practices

Metrics Vue React
Scalability Framework

Operability
Easier to update because there aren’t many approaches to writing Vue.

Evolvability
Since Vue is a framework, it needs fewer steps to upgrade.

Library
Supports gradual updates but there will be many different approaches when upgrading React on our services.
Stability of the library and documentation Has standardised documentation Has many versions of documentation
Usage on Market Smaller market share.

Simplicity
We can reduce complexity for new hires, as the Vue standard in OVO remains consistent with standards in other companies.

Larger market share.

Many React variants are currently in the market, so different companies may have different folder structures/conventions.

Screenshot taken from https://www.statista.com/ on 2022-10-13

After conducting a detailed comparison between Vue and React, we decided to use Vue as our primary tech stack as it best aligns with Kleppmann’s three design principles and our north star metric of maintainability. Even though we noticed a few disadvantages to using Vue, such as smaller market share, we found that Vue is still the better option as it complies with all our metrics.

Moving forward, we will only use one tech stack across our projects but we decided not to migrate technology for existing projects. This allows us to continue exploring and learning about other technologies’ developments. One of the things we need to do is ensure that our current projects are kept up-to-date.

Implementation

After deciding on the primary technology stack, we had to do the following:

  • Define a boilerplate for future Vue projects, which will include items like a general library or dependencies, implementation for unit testing, and folder structure, to align with our north star metric.
  • Update our existing UI library with new components and the latest Vue version.
  • Perform periodic upgrades to existing React services and create a standardised code structure with proper documentation.

With these practices in place, we can ensure that future projects will be standardised, making them easier for engineers to maintain.

Impact

There are a few key benefits of standardising our technology stack.

  • Scalability and maintainability: It’s much easier to scale and maintain projects using the same technology stack. For example, when implementing security patches on all projects due to certain vulnerabilities in the system or libraries, we will need one patch for each technology. With only one stack, we only need to implement one patch across all projects, saving a lot of time.
  • Faster onboarding process: The onboarding process is simplified for new hires because we have standardisation between all services, which will minimise the amount of context switching and lower the learning curve.
  • Faster deliveries: When it’s easier to implement a change, there’s a compounding impact where the delivery process is shortened and release to production is quicker. Ultimately, faster deliveries of a new product or feature will help increase revenue.

Learnings/Conclusion

For every big decision, it is important to take a step back and understand the Big Why or the main motivation behind it, in order to remain objective. That’s why after we identified maintainability as our north star metric, it was easier to narrow down the choices and make detailed comparisons.

The north star metric, or deciding factor, might differ vastly, but it depends on the problems you are trying to solve.

References

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How the Grafana Alerting team scales their issue management with GitHub Projects

Post Syndicated from Mario Rodriguez original https://github.blog/2023-03-15-how-the-grafana-alerting-team-scales-their-issue-management-with-github-projects/

At GitHub you’ve heard us talk about how we are using GitHub Projects and GitHub Actions to plan and track our work and now we’ve asked one of our customers, Grafana Labs, to share how their teams are approaching work in a new way. Whether they are managing open source requests, operational tasks, or escalations, the Grafana Labs Alerting team uses GitHub Projects to manage all these issues efficiently.

Headshot photograph of Armand Grillet, a white male with short brownish hair and a short beard wearing round tortoise shell glasses. Let’s hear more from Armand Grillet, Senior Engineering Manager at Grafana Labs, including how his teams use tasklists to break work into manageable tasks, use a common set of labels to filter tasks, create multiple views on a single project to meet the needs of different teams and stakeholders, and use automation to enable engineers to stay focused on the code.


Grafana is a leading open source platform for monitoring and observability, which is why the Grafana Labs GitHub organization is the center of our engineering efforts with nearly 1,000 repositories, including eight having more than 2,000 stars. In addition to this open source work, Grafana Labs engineers also work on the Grafana Cloud observability platform and its customers’ escalations.

As the manager of the Grafana Alerting and service-level objectives (SLOs) backend teams at Grafana Labs it was essential to have one project board that benefited our multiple stakeholders: team members, other employees, as well as the open source community. Our GitHub Project has offered us the opportunity to do just that. You can even make a copy of our board and adapt it for your own project needs, using the ‘make a copy’ functionality.

One view for each team, assembled around task lists

Our four teams each have a view in the Grafana Alerting project: “Backend”,”Front-end”,”UX”,and “Docs”.

Grafana Alerting has contributors working across four teams: backend, front-end, UX, and docs. Each of these contributor types has their own view in our project. The field options (thus the columns) are the same for each of these views:

  • Inbox—not reviewed yet
  • Waiting for input—open source issues that need more details
  • Backlog—reviewed, priority depending on the milestone set
  • In progress
  • In review
  • Done

These columns (custom fields) are intentionally generic. They can work for all teams, no matter whether the issue has been written by someone in the community or by someone internally. We use project filters based on labels, such as “area/frontend” to allow issues to be automatically added to the correct views once they are added to our project.

Big issues that we work on over a quarter use tasklists to breakdown the main pieces of work.
Big issues that we work on over a quarter use tasklists to breakdown the main pieces of work.
Note: Tasklists are currently in private beta; you can sign your organization up for access on the GitHub Feature Preview portal

For bigger issues, we make use of the GitHub tasklist feature to break down the work into tasks. We use labels to filter the tasks to be included in the relevant team’s view. This creates views that provide two different, but useful, kinds of information. For example, for the docs team:

  • Smaller issues with only the label “type/docs” are items the docs team needs to work on.
  • Bigger issues containing task lists with the label “type/docs” along with labels for additional areas (for example, “area/frontend”) are issues with a dependency on the docs team, but the item is not owned by “docs.” For these bigger issues, if the status is “in progress” someone in the docs team should start checking how they can help.

As a manager of the backend team and project lead for Grafana Alerting, this workflow gives me peace of mind. Even if our docs team members miss some meetings, the Docs view in our project is always accurate, because engineers maintain the status of issues.

The four team views, combined with our “Epics” view (the first view in our board) that lists our big issues for an entire quarter, allow everyone to see our progress on Grafana Alerting. GitHub users who are part of the Grafana Labs organization can see all issues, whereas Github users within our community can only see issues in public repositories. As most of our issues live in the Grafana public repository this allows us to be transparent by default.

Whilst we use a private repository for issues relating to our operational work, we use the same labels in all our alerting-related repositories so that we can use project filters easily. Having common labels in many repositories creates an incentive to have the same labels in other repositories, especially new ones, even if they do not relate to alerting. This growing commonality makes searching for issues across multiple repositories easier, which is particularly useful for our product managers.

Custom fields to create tailored views

A valuable feature of GitHub Projects is the ability to have different columns (custom fields) per view. This allows us to view not only smaller issues but also larger issues covering an entire quarter. Our team also handles engineering escalations that are worked at a faster pace compared to normal issues.

As a team, our three custom fields are:

  • Status: the default field for all issues except escalations (as mentioned above).
  • Quarter: used for bigger issues that include tasklists.
  • Escalation: used to capture more granular status of each escalation.

With these custom fields, we can have custom views, such as our epic view, which gives us a birds-eye view of our quarterly goals, and our escalations view, which lets us review the state of escalations that need engineering work.

The escalations view uses the escalation status custom field with special values such as “Waiting for release.”

Thanks to the adaptability of GitHub Projects, we finally have one ‘source-of-truth’ to reference normal issues, big issues, and escalations.

Enhancing projects with GitHub Actions

For escalations, which are urgent to solve, we also use GitHub Actions to notify us on Slack or use Grafana OnCall if this is a high priority escalation.

Combining GitHub Projects and GitHub Actions offers endless possibilities. Actions like github/issue-labeler allow us to have repositories shared by different teams with issues labeled automatically depending on keywords used in an issue. These labels are then used by other actions to add the issue to the right view or to send notifications to external systems. Opening the project and seeing that new relevant issues have been added automatically, ready for triage, feels like magic.

An issue labeled “alerting” and “prio/3” has been created by support in “grafana/support-escalations.” This adds the issue to the right project and notifies the current on-call on Slack via GitHub Actions.

We often have escalations related to features requested by customers. Being able to link to an escalation in an open source issue allows engineers and product managers to prioritize together on one platform while keeping this information confidential from GitHub users outside our organization.

These automations have been important in terms of engineers’ motivation and our productivity. For example, the Grafana Alerting team receives around five issues from support regarding escalations per week. During the last quarter of 2022, at any given time, the team had on average only four open escalations due to the automation motivating engineers to prioritize this work. This is a significant reduction from the average of 20 open escalations at any given time prior to our shift to GitHub Projects and GitHub Actions.

Flow chart showing the origins and routes that an item can take on its way to ending up in a GitHub project.
Combining GitHub Projects and GitHub Actions allows us to route feedback from various sources through the most relevant repos and then onto a single project.

GitHub Projects at scale

Since we started using GitHub Projects, our team has been growing. An important moment for me in our scaling was the creation of a new project for the team working on SLOs. Originally, this team had no escalations and a small number of engineers. We were able to easily make a copy of the ‘Alerting’ project, changing the filters to accommodate the new team’s repositories, too. The important question was whether the workflows we had for a big team would be too much for such a new team. The answer was yes, there were too many views, creating unnecessary overhead, but it was very easy to reduce the number of views to adapt to the new team’s needs. Using the existing project as a starting point for the SLO team enabled us to get a new project up and running fast.

After a year of using GitHub Projects, we have seen its ability to handle our different types of issues and to adapt to the needs of our various teams at Grafana Labs. We find GitHub Projects to be a flexible planning and tracking solution that enables engineers to stay focussed on code and gives issues more visibility. GitHub Projects keeps the burden for contributors low whilst still providing the views managers need to understand and plan team efforts. GitHub Projects is the only tool the Grafana Alerting team needs to do project planning.

Highlights from Git 2.40

Post Syndicated from Taylor Blau original https://github.blog/2023-03-13-highlights-from-git-2-40/

The open source Git project just released Git 2.40 with features and bug fixes from over 88 contributors, 30 of them new.

We last caught up with you on the latest in Git when 2.39 was released. To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.


  • Longtime readers will recall our coverage of git jump from way back in our Highlights from Git 2.19 post. If you’re new around here, don’t worry: here’s a brief refresher.

    git jump is an optional tool that ships with Git in its contrib directory. git jump wraps other Git commands, like git grep and feeds their results into Vim’s quickfix list. This makes it possible to write something like git jump grep foo and have Vim be able to quickly navigate between all matches of “foo” in your project.

    git jump also works with diff and merge. When invoked in diff mode, the quickfix list is populated with the beginning of each changed hunk in your repository, allowing you to quickly scan your changes in your editor before committing them. git jump merge, on the other hand, opens Vim to the list of merge conflicts.

    In Git 2.40, git jump now supports Emacs in addition to Vim, allowing you to use git jump to populate a list of locations to your Emacs client. If you’re an Emacs user, you can try out git jump by running:

    M-x grepgit jump --stdout grep foo

    [source]

  • If you’ve ever scripted around a Git repository, you may be familiar with Git’s cat-file tool, which can be used to print out the contents of arbitrary objects.

    Back when v2.38.0 was released, we talked about how cat-file gained support to apply Git’s mailmap rules when printing out the contents of a commit. To summarize, Git allows rewriting name and email pairs according to a repository’s mailmap. In v2.38.0, git cat-file learned how to apply those transformations before printing out object contents with the new --use-mailmap option.

    But what if you don’t care about the contents of a particular object, and instead want to know the size? For that, you might turn to something like --batch-check=%(objectsize), or -s if you’re just checking a single object.

    But you’d be mistaken! In previous versions of Git, both the --batch-check and -s options to git cat-file ignored the presence of --use-mailmap, leading to potentially incorrect results when the name/email pairs on either side of a mailmap rewrite were different lengths.

    In Git 2.40, this has been corrected, and git cat-file -s and --batch-check with will faithfully report the object size as if it had been written using the replacement identities when invoked with --use-mailmap.

    [source]

  • While we’re talking about scripting, here’s a lesser-known Git command that you might not have used: git check-attr. check-attr is used to determine which gitattributes are set for a given path.

    These attributes are defined and set by one or more .gitattributes file(s) in your repository. For simple examples, it’s easy enough to read them off from a .gitattributes file, like this:

    $ head -n 2 .gitattributes 
    * whitespace=!indent,trail,space 
    *.[ch] whitespace=indent,trail,space diff=cpp
    

    Here, it’s relatively easy to see that any file ending in *.c or *.h will have the attributes set above. But what happens when there are more complex rules at play, or your project is using multiple .gitattributes files? For those tasks, we can use check-attr:

    $ git check-attr -a git.c 
    git.c: diff: cpp 
    git.c: whitespace: indent,trail,space
    

    In the past, one crucial limitation of check-attr is that it required an index, meaning that if you wanted to use check-attr in a bare repository, you had to resort to temporarily reading in the index, like so:

    TEMP_INDEX="$(mktemp ...)" 
    
    git read-tree --index-output="$TEMP_INDEX" HEAD 
    GIT_INDEX_FILE="$TEMP_INDEX" git check-attr ... 
    

    This kind of workaround is no longer required in Git 2.40 and newer. In Git 2.40, check-attr supports a new --source= to scan for .gitattributes in, meaning that the following will work as an alternative to the above, even in a bare repository:

    $ git check-attr -a --source=HEAD^{tree} git.c 
    git.c: diff: cpp 
    git.c: whitespace: indent,trail,space
    

    [source]

  • Over the years, there has been a long-running effort to rewrite old parts of Git from their original Perl or Shell implementations into more modern C equivalents. Aside from being able to use Git’s own APIs natively, consolidating Git commands into a single process means that they are able to run much more quickly on platforms that have a high process start-up cost, such as Windows.

    On that front, there are a couple of highlights worth mentioning in this release:

    In Git 2.40, git bisect is now fully implemented in C as a native builtin. This is the result of years of effort from many Git contributors, including a large handful of Google Summer of Code and Outreachy students.

    Similarly, Git 2.40 retired the legacy implementation of git add --interactive, which also began as a Shell script and was re-introduced as a native builtin back in version 2.26, supporting both the new and old implementation behind an experimental add.interactive.useBuiltin configuration.

    Since that default has been “true” since version 2.37, the Git project has decided that it is time to get rid of the now-legacy implementation entirely, marking the end of another years-long effort to improve Git’s performance and reduce the footprint of legacy scripts.

    [source, source]

  • Last but not least, there are a few under-the-hood improvements to Git’s CI infrastructure. Git has a handful of long-running Windows-specific CI builds that have been disabled in this release (outside of the git-for-windows repository). If you’re a Git developer, this means that your CI runs should complete more quickly, and consume fewer resources per push.

    On a similar front, you can now configure whether or not pushes to branches that already have active CI jobs running should cancel those jobs or not. This may be useful when pushing to the same branch multiple times while working on a topic.

    This can be configured using Git’s ci-config mechanism, by adding a special script called skip-concurrent to a branch called ci-config. If your fork of Git has that branch then Git will consult the relevant scripts there to determine whether CI should be run concurrently or not based on which branch you’re working on.

    [source, source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.40, or any previous version in the Git repository.

How GitHub Docs’ new search works

Post Syndicated from Peter Bengtsson original https://github.blog/2023-03-09-how-github-docs-new-search-works/

Until recently, the site-search on GitHub Docs was an in-memory solution. While it was a great starting point, we ultimately needed a solution that would scale with our growing needs, so we rewrote it in Elasticsearch. In this blog post, we share how the implementation works and how you can impress users with your site-search by doing the same.

How it started

Our previous solution couldn’t scale because it required loading all of the records into memory (the Node.js code that Express.js runs). This means that we’d need to scrape all the searchable text—title, headings, breadcrumbs, content, etc.—and store that data somewhere so it could quickly be loaded in the Node process runtime. To store the data, we used the Git repository itself, so when we built the Docker image run in Azure, it would have access to all the searchable text from disk.

It would take a little while to load in all the searchable text, so we generated a serialized index from the searchable text, and stored that on disk too. This solution is OKish if your data is small, but we have eight languages and five different versions per language.

Running Elasticsearch locally

The reason we picked Elasticsearch over other alternatives is a story for another day, but one compelling argument is that it’s possible to run it locally on your laptop. For the majority of contributors who make copy edits, the task of installing Elasticsearch locally isn’t necessary. So by default, our /api/search Express.js middleware looks something like this:

if (process.env.ELASTICSEARCH_URL) {
  router.use('/search', search)
} else {
  router.use(
    '/search',
    createProxyMiddleware({
      target: 'https://docs.github.com',
  ...

When someone uses http://localhost:4000/api/search on their laptop (or Codespaces) it simply forwards the Elasticsearch stuff to our production server. That way, engineers (like myself) who are debugging the search engine locally can start an Elasticsearch on http://localhost:9200 and set that in their .env file. Now engineers can quickly try new search query techniques entirely with their own local Elasticsearch.

The search implementation

A core idea in our search implementation is that we make only one single query to Elasticsearch, which contains our entire specification for how we want results ranked. Rather than sending a single request, we could try a more specific search query first, then, if there are too few results, we could attempt a second, less defined search query:

// NOTE! This is NOT what we do.

let result = await client.search({ index, body: searchQueryStrict })
if (result.hits.length === 0) {
  // nothing found when being strict, try again with a loose query
  result = await client.search({ index, body: searchQueryLoose })
}

To ensure we get the most exact results, we use boosts and a matrix of various matching techniques. In somewhat pseudo code it looks like this:

If the query is multiple terms (we’ll explain what _explicit means later):

[
  { match_phrase: { title_explicit: [Object] } },
  { match_phrase: { title: [Object] } },
  { match_phrase: { headings_explicit: [Object] } },
  { match_phrase: { headings: [Object] } },
  { match_phrase: { content: [Object] } },
  { match_phrase: { content_explicit: [Object] } },
  { match: { title_explicit: [Object] } },
  { match: { headings_explicit: [Object] } },
  { match: { content_explicit: [Object] } },
  { match: { title: [Object] } },
  { match: { headings: [Object] } },
  { match: { content: [Object] } },
  { match: { title_explicit: [Object] } },
  { match: { headings_explicit: [Object] } },
  { match: { content_explicit: [Object] } },
  { match: { title: [Object] } },
  { match: { headings: [Object] } },
  { match: { content: [Object] } },
  { fuzzy: { title: [Object] } }
]

If the query is a single term:

[
  { match: { title_explicit: [Object] } },
  { match: { headings_explicit: [Object] } },
  { match: { content_explicit: [Object] } },
  { match: { title: [Object] } },
  { match: { headings: [Object] } },
  { match: { content: [Object] } },
  { fuzzy: { title: [Object] } }
]

Sure, it can look like a handful of queries, but Elasticsearch is fast. Most of the total time is the network time to send the query and receive the results. In fact, the time it takes to execute the entire search query (excluding the networking) hovers quite steadily at 20 milliseconds on our Elasticsearch server.

About 55% of all searches on docs.github.com are multi-term queries, for example, actions rest. Meaning that in roughly half the cases, we can use the simplified queries because we can omit things like the match_phrase parts of the total query.

Before getting into the various combinations of searches, there’s a section later about what “explicit” means. Essentially, each field is indexed twice. E.g. title and title_explicit. It’s the same content underneath but it’s tokenized differently, which affects how it matches to queries—and that difference is exploited by having different boostings.

The nodes that make up the matrix are:

Fields:

  • title (the <h1> text)
  • headings (the <h2> texts)
  • content (the bulk of the article text)

Analyzer:

  • explicit (no stemming and no synonyms)
  • regular (full Snowball stemming and possibly synonyms)

Matches: (on multi-term queries)

  • match_phrase
  • match with OR (docs that contain “foo” OR “bar”)
  • match with AND (docs that contain “foo” AND “bar”)

Each of these combinations has a unique boost number, which boosts the ranking of the matched result. The actual number doesn’t matter much, but what matters is that the boost numbers are different from each other. For example, a match on the title has a slightly higher boost than a match on the content. And a match where all words are present has a slightly higher boost than when only some of the words match. Another example: if the search term is docker action, the users would prefer to see “Creating a Docker container action” ahead of “Publishing Docker images” or “Metadata syntax for GitHub Actions.”

Each of the above nodes has a boost calculation that looks something like this:

const BOOST_PHRASE = 10.0
const BOOST_TITLE = 4.0
const BOOST_HEADINGS = 3.0
const BOOST_CONTENT = 1.0
const BOOST_AND = 2.5
const BOOST_EXPLICIT = 3.5

...
match_phrase: { title_explicit: { boost: BOOST_EXPLICIT * BOOST_PHRASE * BOOST_TITLE, query } },
match: { headings: { boost: BOOST_HEADINGS * BOOST_AND, query, operator: 'AND' } },
...

If you exclusively print out what the boost value becomes for each node in the matrix you get:

[
  { match_phrase: { title_explicit: 140 } },
  { match_phrase: { title: 40 } },
  { match_phrase: { headings_explicit: 105 } },
  { match_phrase: { headings: 30 } },
  { match_phrase: { content: 10 } },
  { match_phrase: { content_explicit: 35 } },
  { match: { title_explicit: 35, operator: 'AND' } },
  { match: { headings_explicit: 26.25, operator: 'AND' } },
  { match: { content_explicit: 8.75, operator: 'AND' } },
  { match: { title: 10, operator: 'AND' } },
  { match: { headings: 7.5, operator: 'AND' } },
  { match: { content: 2.5, operator: 'AND' } },
  { match: { title_explicit: 14 } },
  { match: { headings_explicit: 10.5 } },
  { match: { content_explicit: 3.5 } },
  { match: { title: 4 } },
  { match: { headings: 3 } },
  { match: { content: 1 } },
  { fuzzy: { title: 0.1 } }
]

So that’s what we send to Elasticsearch. It’s like a complex wishlist saying, “I want a bicycle for Christmas. But if you have a pink one, even better. And a pink one with blue stripes is even better still. Actually, bestest would be a pink one with blue stripes and a brass bell.”

What’s important, in terms of an ideal implementation and search result for users, is that we use our human and contextual intelligence to define these parameters. Some of it is fairly obvious and some is more subtle. For example, we think a phrase match on the title without the need for stemming is the best possible match, so that gets the highest boost.

Why explicit boost is important

If someone types “creating repositories” it should definitely match a title like “Create a private GitHub repository” because of the stems creating => creat <= create and repositories => repository <= repository. We should definitely include those matches that take stemming into account. But if there’s an article that explicitly uses the words that match what the user typed, like “Creating private GitHub repositories,” then we want to boost that article’s ranking because we think that’s more relevant to the searcher.

Another good example of this is the special keyword working-directory which is an actual exact term that can appear inside the content. If someone searches for working-directory, we don’t want to let an (example) title like “Directories that work” overpower the rankings when working-directory and “Directories that work” are both deconstructed to the same two stems [ 'work', 'directori' ].

The solution we rely on is to make two matches: one with stemming and one without. Each one has a different boost. It’s similar to saying, “I’m looking for a ‘Peter’ but if there’s a ‘Petter’ or ‘Pierre’ or ‘Piotr’ that’ll also do. But ideally, ‘Peter’ first and foremost. In fact, give me all the results, but ‘Peter’ first.” So this is what we do. The stemming is great but it can potentially “overpower” search results. This helps with specific keywords that might appear to be English prose. For example “working-directory” even looks like a regular English expression but it’s actually a hardcoded specific keyword.

In terms of code, it looks like this:

// Creating the index...
await client.indices.create({
  mappings: {
    properties: {
      url: { type: 'keyword' },
      title: { type: 'text', analyzer: 'text_analyzer', norms: false },
      title_explicit: { type: 'text', analyzer: 'text_analyzer_explicit', norms: false },
      content: { type: 'text', analyzer: 'text_analyzer' },
      content_explicit: { type: 'text', analyzer: 'text_analyzer_explicit' },
      // ...snip...
      },
    },
    // ...snip...
  })
// Searching...
matchQueries.push(
  ...[
    { match: { title_explicit: { boost: BOOST_EXPLICIT * BOOST_TITLE, query } } },
    { match: { content_explicit: { boost: BOOST_EXPLICIT * BOOST_CONTENT, query } } },
    { match: { title: { boost: BOOST_TITLE, query } } },
    { match: { content: { boost: BOOST_CONTENT, query } } },
    // ...snip...
])

Ranking is not that easy

Search is quickly becoming an art. It’s not enough to just match the terms and display a list of documents that match the terms of the input. For starters, the ranking is crucially important. This is especially true when a search term yields tens or hundreds of matching documents. Years of depending on Google has taught us all to expect the first search result to be the one we want.

To make this a great experience, we try to infer which document the user is genuinely looking for by using pageview metrics as a way to determine which page is most popular. Sure, if the most popular one is offered first, and gets the clicks, it’ll just get even more popular, but it’s a start. We get a lot of pageviews from users Googling something. But we also get regular pageview metrics from users simply navigating themselves to the pages they have figured out has the best information for them.

At the moment, we gather pageview metrics for the top 1,000 most popular URLs. Then we rank them and normalize it to a number between (and including) 0.0 to 1.0. Whatever the number is we add +1.0 and then we multiply this number in Elasticsearch with the match score.

Suppose a search query finds two documents that match, and their match score is 15.6 and 13.2 based on the search implementation mentioned above. Now, suppose that match of 13.2 is on a popular page, its popularity number might be 0.75, so it becomes 13.2 * (1 + 0.75) = 23.1. And the other one that matched a little bit better, has a popularity number of 0.44, so its final number becomes 15.6 * (1 + 0.44) = 22.5, which is less. In conclusion, it gives documents that might not be term-for-term as “matchy” as others a chance to rise above. This also ensures that a hugely popular document that only matched vaguely in the content won’t “overpower” other matches that might have matched in the title.

It’s a tricky challenge, but that’s what makes it so much fun. You have to bake in some human touch to the code as a way of trying to think like your users. It’s also an algorithm that will never reach perfection because even with more and better metrics, the landscape of users and where they come from, is constantly changing. But we’ll try to keep up.

What’s next?

Elasticsearch has a functionality where you can define aliases for words, which they call synonyms. (For example, repo = repository.) The challenge with this feature is how it’s managed and maintained by writers in a convenient and maintainable way.

Currently at GitHub, we base our popularity numbers by pageview metrics. It would be interesting to dig deeper into which page the user lands on. As an example, suppose the reader isn’t sure how to find what they need, so they start on a product landing page (we have about 20 of those) and slowly make their way deeper into the content and finally arrive on the page that supplied the knowledge or answer they want. This process would give each page an equal measure, and that’s not great.

Another exciting idea is to record all the times a search result URL is not clicked on, especially when it shows up higher in the ranking. That would decrease the risk of popular listings only getting more popular. If you record when something is “corrected” by humans, that could be a very powerful signal.

You could also tailor the underlying search based on other contextual variables. For example, if a user is currently inside the REST API docs you could infer that REST API-related docs are slightly preferred when searching for something ambiguous like “billing” (i.e. prefer REST API “About billing” not Billing and payments “Setting your billing email.”).

What are you looking for next? Have you tried the search on https://docs.github.com recently and found that it wasn’t giving you the best search result? Please let us know and get in touch.

Migrating from Role to Attribute-based Access Control

Post Syndicated from Grab Tech original https://engineering.grab.com/migrating-to-abac

Grab has always regarded security as one of our top priorities; this is especially important for data platform teams. We need to control access to data and resources in order to protect our consumers and ensure compliance with various, continuously evolving security standards.

Additionally, we want to keep the process convenient, simple, and easily scalable for teams. However, as Grab continues to grow, we have more services and resources to manage and it becomes increasingly difficult to keep the process frictionless. That’s why we decided to move from Role-Based Access Control (RBAC) to Attribute-Based Access Control (ABAC) for our Kafka Control Plane (KCP).

In this article, you will learn how Grab’s streaming data platform team (Coban) deleted manual role and permission management of hundreds of roles and resources, and reduced operational overhead of requesting or approving permissions to zero by moving from RBAC to ABAC.

Introduction

Kafka is widely used across Grab teams as a streaming platform. For decentralised Kafka resource (e.g. topic) management, teams have the right to create, update, or delete based on their needs. As the data platform team, we implemented a KCP to ensure that these operations are only performed by authorised parties, especially on multi-tenant Kafka clusters.

For internal access management, Grab uses its own Identity and Access Management (IAM) service, based on RBAC, to support authentication and authorisation processes:

  • Authentication verifies the identity of a user or service, for example, if the provided token is valid or expired.
  • Authorisation determines their access rights, for example, whether users can only update and/or delete their own Kafka topics.

In RBAC, roles, permissions, actions, resources, and the relationships between them need to be defined in the IAM service. They are used to determine whether a user can access a certain resource.

In the following example, we can see how IAM concepts come together. The Coban engineer role belongs to the Engineering-coban group and has permission to update the retention topic. Any engineer added to the Engineering-coban group will also be able to update the topic retention.

Following the same concept, each team using the KCP has its own roles, permissions, and resources created in the system. However, there are some disadvantages to this approach:

  • It leads to a significant growth in the number of access control artifacts both platform and user teams need to manage, and increased time and effort to debug access control issues. We start off by finding which group the engineer belongs to and locating the group that should be used for KCP, and then trace to role and permissions.
  • All group membership access requests of new joiners need to be reviewed and approved by their direct managers. This leads to a lot of backlog as new joiners might have multiple groups to join and managers might not be able to review them timely. In some cases, roles need to be re-applied or renewed every 90 days, which further adds to the delay.
  • Group memberships are not updated to reflect active members in the team, leaving some engineers with access they don’t need and others with access they should have but don’t.

Solution

With ABAC, access management becomes a lot easier. Any new joiner to a specific team gets the same access rights as everyone on that team – no need for manual approval from a manager. However, for ABAC to work, we need these components in place:

  • User attributes: Who is the subject (actor) of a request?
  • Resource attributes: Which object (resource) does the actor want to deal with?
  • Evaluation engine: How do we decide if the actor is allowed to perform the action on the resource?

User attributes

All users have certain attributes depending on the department or team they belong to. This data is then stored and synced automatically with the human resource management system (HRMS) tool, which acts as a source of truth for Grab-wide data, every time a user switches teams, roles, or leaves the company.

Resource attributes

Resource provisioning is an authenticated operation. This means that KCP knows who sent the requests and what each request/action is about. Similarly, resource attributes can be derived from their creators. For new resource provisioning, it is possible to capture the resource tags and store them after authentication. For existing resources, a major challenge was the need to backfill the tagging and ensure a seamless transition from the user’s perspective. In the past, all resource provisioning operations were done by a centralised platform team and most of the existing resource attributes are still under platform team’s ownership.

Evaluation engine

We chose to use Open Policy Agent (OPA) as our policy evaluation engine mainly for its wide community support, applicable feature set, and extensibility to other tools and platforms in our system. This is also currently used by our team for Kafka authorisation. The policies are written in Rego, the default language supported by OPA.

Architecture and implementation

With ABAC, the access control process looks like this:

User attributes

Authentication is handled by the IAM service. In the /generate_token call, a user requests an authentication token from KCP before calling an authenticated endpoint. KCP then calls IAM to generate a token and returns it to the user.

In the /create_topic call, the user includes the generated token in the request header. KCP takes the token and verifies the token validity with IAM. User attributes are then extracted from the token payload for later use in request authorisation.

Some of the common attributes we use for our policy are user identifier, department code, and team code, which provide details like a user’s department and work scope.

When it comes to data governance and central platform and identity teams, one of the major challenges was standardising the set of attributes to be used for clear and consistent ABAC policies across platforms so that their lifecycle and changes could be governed. This was an important shift in the mental model for attribute management over the RBAC model.

Resource attributes

For newly created resources, attributes will be derived from user attributes that are captured during the authentication process.

Previously with RBAC, existing resources did not have the required attributes. Since migrating to ABAC, the implementation has tagged newly created resources and ensured that their attributes are up to standard. IAM was also still doing the actual authorisation using RBAC.

It is also important to note that we collaborated with data governance teams to backfill Kafka resource ownership. Having accurate ownership of resources like data lake or Kafka topics enabled us to move toward a self-service model and remove bottlenecks from centralised platform teams.

After identifying most of the resource ownership, we started switching over to ABAC. The transition was smooth and had no impact on user experience. The remaining unidentified resources were tagged to lost-and-found and could be reclaimed by service teams when they needed permission to manage them.

Open Policy Agent

The most common question when implementing the policy is “how do you define ownership by attributes?”. With respect to the principle of least privilege, each policy must be sufficiently strict to limit access to only the relevant parties. In the end, we aligned as an organisation on defining ownership by department and team.

We created a simple example below to demonstrate how to define a policy:

package authz

import future.keywords

default allow = false

allow if {
        input.endpoint == "updateTopic"
        is_owner(input.resource_attributes)
}

is_owner(md) if {
        md.department == input.user_attributes.department
        md.team == input.user_attributes.team
}

In this example, we start with denying access to everyone. If the updateTopic endpoint is called and the department and team attributes between user and resource are matched, access is allowed.

With a similar scenario, we would need 1 role, 1 action, 1 resource, and 1 mapping (a.k.a permission) between action and resource. We will need to keep adding resources and permissions when we have new resources created. Compared to the policy above, no other changes are required.

With ABAC, there are no further setup or access requests needed when a user changes teams. The user will be tagged to different attributes, automatically granted access to the new team’s resources, and excluded from the previous team’s resources.

Another consideration we had was making sure that the policy is well-written and transparent in terms of change history. We decided to include this as part of our application code so every change is accounted for in the unit test and review process.

Authorisation

The last part of the ABAC process is authorisation logic. We added the logic to the middleware so that we could make a call to OPA for authorisation.

To ensure token validity after authentication, KCP extracts user attributes from the token payload and fetches resource attributes from the resource store. It combines the request metadata such as method and endpoint, along with the user and resource attributes into an OPA request. OPA then evaluates the request based on the redefined policy above and returns a response.

Auditability

For ABAC authorisation, there are two key areas of consideration:

  • Who made changes to the policy, who deployed, and when the change was made
  • Who accessed what resource and when

We manage policies in a dedicated GitLab repository and changes are submitted via merge requests. Based on the commit history, we can easily tell who made changes, reviewed, approved, and deployed the policy. 

For resource access, OPA produces a decision log containing user attributes, resource attributes, and the authorisation decision for every call it serves. The log is kept for five days in Kibana for debugging purposes, then moved to S3 where it is kept for 28 days.

Impact

The move to ABAC authorisation has improved our controls as compared to the previous RBAC model, with the biggest impact being fewer resources to manage. Some other benefits include:

  • Optimised resource allocation: Discarded over 200 roles, 200 permissions, and almost 3000 unused resources from IAM services, simplifying our debugging process. Now, we can simply check the user and resource attributes as needed.
  • Simplified resource management: In the three months we have been using ABAC, about 600 resources have been added without any increase in complexity for authorisation, which is significantly lesser than the RBAC model.
  • Reduction in delays and waiting time: Engineers no longer have to wait for approval for KCP access.
  • Better governance over resource ownership and costs: ABAC allowed us to have a standardised and accurate tagging system of almost 3000 resources.

Learnings

Although ABAC does provide significant improvements over RBAC, it comes with its own caveats:

  • It needs a reliable and comprehensive attribute tagging system to function properly. This only became possible after roughly three months of identifying and tagging the ownership of existing resources by both automated and manual methods.
  • Tags should be kept up to date with the company’s growth. Teams could lose access to their resources if they are wrongly tagged. It needs a mechanism to keep up with changes, or people will unexpectedly lose access when user and resource attributes are changed.

What’s next?

  • To keep up with organisational growth, KCP needs to start listening to the IAM stream, which is where all IAM changes are published. This will allow KCP to regularly update user attributes and refresh resource attributes when restructuring occurs, allowing authorisation to be done with the right data.
  • Constant collaboration with HR to ensure that we maintain sufficient user attributes (no extra unused information) that remain clean so ABAC works as expected.

References

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Keeping the Cloudflare API ‘all green’ using Python-based testing

Post Syndicated from Elie Mitrani original https://blog.cloudflare.com/keeping-cloudflare-api-all-green-using-python-based-testing/

Keeping the Cloudflare API 'all green' using Python-based testing

Keeping the Cloudflare API 'all green' using Python-based testing

At Cloudflare, we reuse existing core systems to power multiple products and testing of these core systems is essential. In particular, we require being able to have a wide and thorough visibility of our live APIs’ behaviors. We want to be able to detect regressions, prevent incidents and maintain healthy APIs. That is why we built Scout.

Scout is an automated system periodically running Python tests verifying the end to end behavior of our APIs. Scout allows us to evaluate APIs in production-like environments and thus ensures we can green light a production deployment while also monitoring the behavior of APIs in production.

Why Scout?

Before Scout, we were using an automated test system leveraging the Robot Framework. This older system was limiting our testing capabilities. In fact, we could not easily match json responses against keys we were looking for. We would abandon covering different behaviors of our APIs as it was impossible to decide on which resources a given test suite would run. Two different test suites would create false negatives as they were running on the same account.

Regarding schema validation, only API responses were validated against a json schema and tests would not fail if the response did not match the schema. Moreover, It was impossible to validate API requests.

Test suites were run in a queue, making the delay to a new feature assessment dependent on the number of test suites to run. The queue would as well potentially make newer test suites run the following day. Hence we often ended up with a mismatch between tests and APIs versions. Test steps could not be run in parallel either.

We could not split test suites between different environments. If a new API feature was being developed it was impossible to write a test without first needing the actual feature to be released to production.

We built Scout to overcome all these difficulties. We wanted the developer experience to be easy and we wanted Scout to be fast and reliable while spotting any live API issue.

A Scout test example

Scout is built in Python and leverages the functionalities of Pytest. Before diving into the exact capabilities of Scout and its architecture, let’s have a quick look at how to use it!

Following is an example of a Scout test on the Rulesets API (the docs are available here):

from scout import requires, validate, Account, Zone

@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])
@requires(
    account=Account(
        entitlements={"rulesets.max_rules_per_ruleset": 2),
    zone=Zone(plan="ENT",
        entitlements={"rulesets.firewall_custom_phase_allowed": True},
        account_entitlements={"rulesets.max_rules_per_ruleset": 2 }))
class TestZone:
    def test_create_custom_ruleset(self, cfapi):
        response = cfapi.zone.request(
            "POST",
            "rulesets",
            payload=f"""{{
            "name": "My zone ruleset",
            "description": "My ruleset description",
            "phase": "http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\""
                }}
            ]
        }}""")
        response.expect_json_success(
            200,
            result=f"""{{
            "name": "My zone ruleset",
            "version": "1",
            "source": "firewall_custom",
            "phase": "http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\"",
                    "enabled": true,
                    ...
                }}
            ],
            ...
        }}""")

A Scout test is a succession of roundtrips of requests and responses against a given API. We use the functionalities of Pytest fixtures and marks to be able to target specific resources while validating the request and responses.  Pytest marks in Scout allow to provide an extra set of information to test suites. Pytest fixtures are contexts with information and methods which can be used across tests to enhance their capabilities. Hence the conjunction of marks with fixtures allow Scout to build the whole harness required to run a test suite against APIs.

Being able to exactly describe the resources against which a given test will run provides us confidence the live API behaves as expected under various conditions.

The cfapi fixture provides the capability to target different resources such as a Cloudflare account or a zone. In the test above, we use a Pytest mark @requires to describe the characteristics of the resources we want, e.g. we need here an account with a flag allowing us to have 2 rules for a ruleset. This will allow the test to only be run against accounts with such entitlements.

The @validate mark provides the capability to validate requests and responses to a given OpenAPI schema (here the rulesets OpenAPI schema). Any validation failure will be reported and flagged as a test failure.

Regarding the actual requests and responses, their payloads are described as f-strings, in particular the response f-string can be written as a “semi-json”:

 response.expect_json_success(
            200,
            result=f"""{{
            "name": "My zone ruleset",
            "version": "1",
            "source": "firewall_custom",
            "phase": "phase_http_request_firewall_custom",
            "kind": "zone",
            "rules": [
                {{
                    "description": "My rule",
                    "action": "block",
                    "expression": "http.host eq \"fake.net\"",
                    "enabled": true,
                    ...
                }}
            ],
            ...
        }}""")

Among many test assertions possible, Scout can assert the validity of a partial json response and it will log the information. We added the handling of ellipsis as an indication for Scout not to care about any further fields at a given json nesting level. Hence, we are able to do partial matching on JSON API responses, thus focusing only on what matters the most in each test.

Once a test suite run is complete, the results are pushed by the service and stored using Cloudflare Workers KV. They are displayed via a Cloudflare Worker.

Keeping the Cloudflare API 'all green' using Python-based testing

Scout is run in separate environments such as production-like and production environments. It is part of our deployment process to verify Scout is green in our production-like environment prior to deploying to production where Scout is also used for monitoring purposes.

How we built it

The core of Scout is written in Python and it is a combination of three components interacting together:

Keeping the Cloudflare API 'all green' using Python-based testing
  • The Scout plugin: a Pytest plugin to write tests easily
  • The Scout service: a scheduler service to run the test suites periodically
  • The Scout Worker: a collector and presenter of test reports

The Scout plugin

This is the core component of the Scout system. It allows us to write self explanatory tests while ensuring a high level of compliance against OpenAPI schemas and verifying the APIs’ behaviors.

Keeping the Cloudflare API 'all green' using Python-based testing

The Scout plugin architecture can be split into three components: setup, resource allocator, and runners. Setup is a conjunction of multiple sub components in charge of setting up the plugin.

The Registry contains all the information regarding a pool of accounts and zones we use for testing. As an example, entitlements are flags gating customers for using products features, the Registry provides the capability to describe entitlements per account and zone so that Scout can run a test against a specific setup.

As explained earlier, Scout can validate requests and responses against OpenAPI schemas. This is the responsibility of validators. A validator is built per OpenAPI schema and can be selected via the @validate mark we saw above.

@validate(schema="rulesets", ignorePaths=["accounts/[^/]+/rules/lists"])

As soon as a validator is selected, all the interaction of a given test with an API will be validated. If there is a validation failure, it will be marked as a test failure.

Last element of the setup, the config reader. It is the sub component in charge of providing all the URLs and authentication elements required for the Scout plugin to communicate with APIs.

Next in the chain, the resources allocator. This component is in charge of consuming the configuration and objects of the setup to build multiple runners. This is a factory which will make available the runners in the cfapi fixture.

response = cfapi.zone.request(method, path, payload)

When such a line of code is processed, it is the actual method request of the zone runner allocated for the test which is executed. Actually, the resources allocator is able to provide specialized runners (account, zone or default) which grant the possibility of targeting specific API endpoints for a given account or zone.

Runners are in charge of handling the execution of requests, managing the test expectations and using the validators for request/response schema validation.

Any failure on expectation or validation and any exceptions are recorded in the stash. The stash is shared across all runners. As such, when a test setup, run or cleanup is processed, the timeline of execution and potential retries are logged in the stash. The stash contents are later used for building the test suite reports.

Scout is able to run multiple test steps in parallel. Actually, each resource couple (Account Runner, Zone Runner) is associated with a Pytest-xdist worker which runs test steps independently. There can be as many workers as there are resource couples. An extra “default” runner is provided for reaching our different APIs and/or URLs with or without authentication.

Testing a test system was not the easiest part. We have been required to build a fake API and assert the Scout plugin would behave as it should in different situations. We reached and maintained a test coverage confidence which was considered good (close to 90%) for using the Scout plugin permanently.

The Scout service

The Scout service is meant to schedule test suites periodically. It is a configurable scheduler providing a reporting harness for the test suites as well as multiple metrics. It was a design decision to build a scheduler instead of using cron jobs.

We wanted to be aware of any scheduling issue as well as run issues. For this we used Prometheus metrics. The problem is that the Prometheus default configuration is to scrape metrics advertised by services. This scraping happens periodically and we were concerned about the eventuality of missing metrics if a cron job was to finish prior to the next Prometheus metrics scraping. As such we decided a small scheduler was better suited for overall observability of the test runs. Among the metrics the Scout service provides are network failures, general test failures, reporting failures, tests lagging and more.

Keeping the Cloudflare API 'all green' using Python-based testing

The Scout service runs threads on configured periods. Each thread is a test suite run as a separate Pytest with Scout plugin process followed by a reporting execution consuming the results and publishing them to the relevant parties.

The reporting component provided to each thread publishes the report to Workers KV and notifies us on chat in case there is a failure. Reporting takes also care of publishing the information relevant for building API testing coverage. In fact it is mandatory for us to have coverage of all the API endpoints and their possible methods so that we can achieve a wide and thorough visibility of our live APIs.

As a fallback, if there are any thread failure, test failure or reporting failure we are alerted based on the Prometheus metrics being updated across the service execution. The logs of the Scout service as well as the logs of each Pytest-Scout plugin execution provide the last resort information if no metrics are available and reporting is failing.

The service can be deployed with a minimal YAML configuration and be set up for different environments. We can for example decide to run different test suites based on the environment, publish or not to Cloudflare Workers, set different periods and retry mechanisms and so on.

We keep the tests as part of our code base alongside the configuration of the Scout service, and that’s about it, the Scout service is a separate entity.

The Scout Worker

It is a Cloudflare worker in charge of fetching the most recent Worker KVs and displaying them in an eye pleasing manner. The Scout service publishes a test report as JSON, thus the Scout worker parses the report and displays its content based on the status of the test suite run.

For example, we present below an authentication failure during a test which resulted in such a display in the worker:

Keeping the Cloudflare API 'all green' using Python-based testing

What does Scout let us do

Through leveraging the capabilities of Pytest and Cloudflare Workers, we have been able to build a configurable, robust and reliable system which allows us to easily write self explanatory tests for our APIs.

We can validate requests and responses against OpenAPI schemas and test behaviors over specific resources while getting alerted through multiple means if something goes wrong.

For specific use cases, we can write a test verifying the API behaves as it should, the configuration to be pushed at the edge is valid and a given zone will react as it should to security threats. Thus going beyond an end-to-end API test.

Scout quickly became our permanent live tester and monitor of APIs. We wrote tests for all endpoints to maintain a wide coverage of all our APIs. Scout has since been used for verifying an API version prior to its deployment to production. In fact, after a deployment in a production-like environment we can know in a couple of minutes if a new feature is good to go to production and assess if it is behaving correctly.

We hope you enjoyed this deep dive description into one of our systems!

How to automate your dev environment with dev containers and GitHub Codespaces

Post Syndicated from Kedasha Kerr original https://github.blog/2023-03-06-how-to-automate-your-dev-environment-with-dev-containers-and-github-codespaces/

When I started my first role as a software engineer, I remember taking about four days to set up my local development environment. I had so many issues with missing dependencies, incorrect versions, and failed installations. When I finally finished setting up all the tools and software I needed to be a productive member of the team, I cloned one of our repositories to my machine, set up my environment variables, ran npm run dev and received so many errors because I forgot to install the dependencies (and read the README) or switch to the right node version. Ugh! I can’t tell you how many times this happened to me in my first year!

Back then, I wished I had a way that was streamlined—something I set up only once, that just worked every time I accessed the repository. Although I did learn how to automate my computer setup with a Brewfile, I wish I could just get to coding in a repository without thinking about configuration.

Gif from the animated show Spongebob Squarepants of a character picking up a computer as if to toss it away, saying "I'll show you automated."
source

When I think about how we work on projects in a repository, I realize that many of the processes we need to get started on that project can be automated with the help of dev containers, in this case, by using a devcontainer.json file and Codespaces.

Let’s take a look at how we can automate our dev environment by adding a dev container to this open source project—Tech is Hiring in GitHub Codespaces.

For a TLDR of this post, GitHub Codespaces enables you to start coding faster when coupled with dev containers. See image below for a summary of how:

Image of a table entitled "Start coding faster with Codespaces." The left column is labeled "Old Way" and the right column is labeled "New Way." The rest of this blog post will enumerate the items listed in the image.

Now, let’s get some definitions out of the way.

What is GitHub Codespaces?

GitHub Codespaces is a development environment in the cloud. It is hosted by GitHub in an isolated environment (Docker container) that runs on a virtual machine. If you’re not familiar with virtual machines or Docker containers, take a look at these videos: what is a virtual machine? and what are Docker containers?.

Currently, individual developers have 60 hours of free codespaces usage per month, so definitely take advantage of this awesomeness to build from anywhere.

What are dev containers?

Dev containers allow us to run our code in a preconfigured, isolated environment (container). It gives us the ability to predefine our dev environment in our repositories and use a consistent, reliable environment across the board without worrying about configuration files—since it’s all set up for us from the beginning with a devcontainer.json file.

What is the devcontainer.json file?

The devcontainer.json file is a document that defines how your project will build, display, and run. Simply put, it allows us to specify the behavior we want in our dev environment. For example, if we have a devcontainer.json file that includes installing the ESLint extension in VS Code, once we open up a workspace, ESLint will be automatically installed for us.

Automating your workflow with dev containers and GitHub Codespaces

To start using GitHub Codespaces, we don’t need to set up a devcontainer.json file. By default, GitHub Codespaces uses the universal dev container image, which caters to a vast array of languages and tools. This means, whenever you open up a new codespace without a devcontainer.json file your codespace will automagically load so you can code instantly. However, adding a devcontainer.json file gives us the ability to automate a lot of our dev environment workflows to our liking.

Okay, okay, that was a lot of chatter—let’s now get into what you really came here for!

Gif of the character Mary Poppins pinching shut the mouth of her talking umbrella handle and telling it, "The will be quite enough of that, thank you." She then opens the umbrella and floats away.
source

Using the open source project, Tech is Hiring, let’s walk through how we typically work with a repository using our local dev environment.

What work typically looks like

At first glance, we see that this project uses Nextjs, Tailwind CSS, Chakra UI, TypeScript, Storybook, Vite, Cypress, Axios, and Reactjs as some of its dependencies. We’d need to install all these dependencies to our local machine to get this project running.

  1. Let’s clone the repository, and cd into the project.

    Gif of the terminal output as the Tech is Hiring repository is cloned.

  2. Then, let’s install dependencies to get the project running locally.

    Gif showing terminal output as the necessary dependencies are installed.

  3. This project uses storybook, so let’s run both storybook and spin up the actual app locally.

    Gif of the user's desktop with a terminal app running commands.

The process is not so bad, but it took a bit of time. We also need to check to make sure we’re using the correct node version, check if we need any environment variables, and if there are any runtime errors to resolve. Thankfully, I didn’t encounter any errors while working on this, but it still took a bit of time.

Going faster with GitHub Codespaces and dev containers

Let’s make the process better by adding a devcontainer.json file to this project and opening it in GitHub Codespaces to see what happens.

We can either use the VS Code command palette to add a pre-existing dev container or we can write the configuration file ourselves (which we’ll do below).

  1. Let’s first create a .devcontainer folder in the root of the project and a devcontainer.json file in the new folder.

    Creating a a `.devcontainer` folder in the root of the project and a `devcontainer.json` file in the new folder.

  2. Now, let’s automate installing dependencies, starting the dev server, opening a preview of our app on localhost:3000, and installing vscode extensions. Once we get everything configured, your json file should look like this:

    {
      // image being used
       "image": "mcr.microsoft.com/devcontainers/universal:2",
      // set minimum cpu
       "hostRequirements": {
           "cpus": 4
       },
       // install dependencies and start app
       "updateContentCommand": "npm install",
       "postAttachCommand": "npm run dev",
       // open app.tsx once container is built
       "customizations": {
           "codespaces": {
               "openFiles": [
                   "src/pages/_app.tsx"
               ]
           },
           // install some vscode extensions
           "vscode": {
               "extensions": [
                   "dbaeumer.vscode-eslint",
                   "github.vscode-pull-request-github",
                   "eamodio.gitlens",
                   "christian-kohler.npm-intellisense"
               ]
           }
       },
       // connect to remote server
       "forwardPorts": [3000],
       // give port a label and open a preview of the app
       "portsAttributes": {
          "3000": {
             "label": "Application",
             "onAutoForward": "openPreview"
           }
         }
    }
    

    Sidenote: I’ve broken down the purpose of the properties in this file for you by adding comments before each. To learn more about each property, continue reading at container.dev. I also installed a few extensions that are not needed, but I wanted to show you that you could automate installing extensions, too!

  3. Let’s commit the file and merge it into the main branch, then open up the project in GitHub Codespaces.

    Committing a file, merging it to the main branch, and then opening GitHub Codespaces.

    When we open up the application in GitHub Codespaces, our dependencies will be installed, the server will start, and a preview will automatically open for us. If we needed environment variables to work on this repository, those would have already been configured for us as a repository secret on GitHub. We also didn’t need to install hefty node_modules to our machine.

I call this a win!

Comparing both ways

There’s plenty more that we can do with dev containers and GitHub Codespaces to automate our dev environment. But let’s summarize what we just did in GitHub Codespaces and with the help of dev containers:

  • We clicked the GitHub Codespaces button on the GitHub repository.
  • Everything was setup/installed for us (thanks json file!).
  • We got to work and started coding.

Now, isn’t that better?

Wrapping up

So, what’s the point of GitHub Codespaces and why should you care as a developer? Well, for one, you can automate most of the startup processes you need to access a repository. You can also do a lot more customizations to your dev environment with dev containers. Take a look at all the options you have—and watch out for my next blog post where I’ll go through the anatomy of a devcontainer.json file.

Secondly, you can code from anywhere. I hate it when I’m not able to access one of my side projects on a different machine because that one machine is configured perfectly to suit the project. With GitHub Codespaces, you can start coding at the click of a button and from any machine that supports a modern browser.

I encourage you to get started with GitHub Codespaces today and try adding a devcontainer.json file to one of your projects! I promise you won’t regret it.

Until next time, happy coding!

Why Python keeps growing, explained

Post Syndicated from Rizel Scarlett original https://github.blog/2023-03-02-why-python-keeps-growing-explained/

Which programming language has been around for more than three decades and continues to grow in popularity each year?

If you guessed Python, you nailed it. In the 2022 Octoverse report, we found that Python remains the second most-used programming language on GitHub. Interestingly, Python’s use grew more than 22 percent year over year with more than four million developers on GitHub using it at some point in 2022.

In this article, we’ll dive into a brief history of Python, its benefits, its use cases, and seek to answer why a program language conceived in the 1980s continues to dominate development. And, since this is GitHub, we’ll also offer a few useful tips and tricks for developers new to—and experienced in—Python.

So, what is Python? 🤔

Python is a high-level, interpreted programming language with a simple syntax, which makes it easily readable and extremely user- and beginner-friendly. Originally built to satisfy Guido Van Rossum’s desire for a programming language that was simple to use and beautiful to look at, Python was first released to the world in 1991.

Fun fact: Python was named after the BBC TV show, “Monty Python’s Flying Circus.”

Since its development, it has grown to have widespread applicability for developers, data scientists, researchers, and more. But how, you may ask, can a coding language be simple and beautiful to look at? Here’s some proof:

Python

print("Hello world.")

vs.

Java

public class HelloWorld {
    public static void main (String[]args) {
      System.out.println.("Hello world");
    }
}

Since Python is a general-purpose language, it can be used in a variety of applications, and its uncomplicated nature makes it an excellent language for automating tasks, building websites or software, and analyzing data.

Python also has several other characteristics that make it popular amongst developers and engineers. These include:

  • It’s easy to read. Python code uses English keywords rather than punctuation, and its line breaks help define the code blocks. In practice, this means you can identify what the code is designed to do simply by looking at it.

  • It’s open source. You can download the source code, modify it, and use it however you want.

  • It’s portable. Some languages require you to modify code to run on different platforms, but Python is a cross-platform language, which means you can run the same code on any operating system with a Python interpreter.

  • It’s extendable. Python code can be written in other languages (such as C++), and users can add low-level modules to the Python interpreter to customize and optimize their tools.

  • It has a broad standard library. This library is available for anyone to access and means that users don’t have to write code for every single function—they can access built-in modules that help with issues in everyday programming and more.

What is Python commonly used for? 💻

Python can be used for just about anything, from web and software development to machine learning and artificial intelligence (AI). Let’s take a look at some of its most common use cases.

import antigravity

def main():
    antigravity.fly()

if __name__ == '__main__':
    main()

Run this command to check out an inside joke among Python developers.

Using Python for web and software development

Python is a popular language for web and software development because you can create complex, multi-protocol applications while maintaining concise, readable syntax. In fact, some of the most popular applications were built with Python. Plus, Python’s open source community provides developers with an extensive amount of reusable code, frameworks, and support. Case in point: Django is one of the most-used Python frameworks designed by experienced developers to help others accelerate their application build times and avoid issues that might balk their progress.

Using Python for task automation

One of Python’s key benefits is its ability to automate manual, repetitive tasks. With Python, you can learn how to automate just about anything by using either built-in modules or pre-written code from its robust library. Or you can write your own custom scripts to perform specific actions. For example, you can easily automate emails with the “smtplib” module or copy files with the “shutil” module. Python also has a robust set of testing frameworks, which makes it an excellent language for test automation. Frameworks such as Pytest, Behave, and Robot allow developers to write simple yet effective tests to ensure the quality of their builds.

Using Python for machine learning and data science

Here’s a fun fact: Python is the top preferred language for data science and research. Since its syntax is easily understandable and adaptable, people with little-to-no development experience can easily learn Python and use it to manipulate data for research, reporting, predictable or regression analyses, and more. Collecting and parsing data can be a time-consuming task for data scientists. Python is also one of the top languages for training machine learning (ML) models. Through specific algorithms, these models can analyze and identify patterns in data to make predictions or decisions based on that data. They also constantly evolve based on outputs of previous datasets to confront new variables. Data scientists and developers training ML models often utilize libraries, such as NumPy, Pandas, and Matplotlib, to automate functions like cleaning, data transformation, and visualization.

Using Python for financial analysis

Similar to how Python can assist data scientists with the heavy lift of large data sets, Python is widely used in the financial industry to quickly perform complex computations. Stock markets generate huge amounts of data, and Python can be used to import data on stock prices and generate strategies through algorithms to identify trading opportunities. The language can also be used for portfolio optimization, risk management, financial modeling and visualization, cryptocurrency analysis, and even fraud detection.

Using Python for and artificial intelligence

Python can also be found in some of the most complex, artificial intelligence (AI) technologies—and it’s actually one of the preferred languages for AI. Python’s concise and readable code allows developers to create consistent, reliable systems, and its vast library provides a number of frameworks like PyBrain, which offers developers powerful algorithms for machine learning tasks. Plus, Python’s visualization capabilities can help convert these large datasets for AI or ML into comprehensible graphs or reports. Interestingly enough, OpenAI, the artificial intelligence research lab, utilizes the Python framework, Pytorch, as their standard framework for deep learning, which trains its AI systems.

In addition to its relative simplicity to learn, there are a few other reasons why Python continues to consistently grow in popularity. These include:

  • It’s more productive. Compared to some other more complex programming languages like C++, Python’s syntax allows users to do more with less and cut down on time and effort to write the same lines of code.

  • It has an expansive, supportive community of users. Even the best developers run into problems— and this is where user communities can become an invaluable resource. Python has a huge community with documentation, tutorials, tips, and tricks to master the language. The Python community on GitHub, for example, offers everything from information on the latest version of the language to bug reports and update notes.

  • It’s academic. Python has become the go-to language in academia with some students even encountering Python as early as elementary school. (Believe it or not, there are children’s picture books dedicated to Python.) While computer science students are often taught Python, its use extends beyond that discipline into other areas of STEM and academic research. For example, Python can be used to solve differential equations, perform statistical analyses, simulate and track particle diffusion, and more.

  • It has high corporate demand. Because of its wide scale applicability in development and data analysis work, learning and knowing Python is often considered a top-skill among job seekers. According to Statista, Python was the third most demanded language in 2022 by recruiters worldwide.

The bottom line

Python is everywhere—and it’s been used to build a significant number of the technologies, websites, and even systems most people encounter on a daily basis. It powers everything from your favorite video streaming service to the ML algorithms that can help you make your next cryptocurrency trade. And for an even broader scope example (pun absolutely intended), NASA uses Python to power data analysis with its sophisticated James Webb Space Telescope, which makes it one of the few programming languages that is, quite literally, out of this world. 🚀

How to get started with Python 📓

A quick Google search will yield hundreds of resources out there to jumpstart your Python journey—and that can quickly get a little overwhelming. To simplify things, here are a few helpful GitHub repositories to help you get started with Python:

To get started, download the latest version of Python.

Start building on GitHub today

GitHub offers two easier ways to start working with Python: GitHub Codespaces and GitHub Copilot.

You can start building today for free with GitHub Codespaces, which every developer on GitHub gets 60 free hours of use time per month to spin up a development environment in the cloud from any device at speed. Check out the Django quick start template to begin coding right in your browser!

You can also use GitHub Copilot, GitHub’s AI pair programmer, to write your first lines of Python. Here’s how:

  1. Install the GitHub Copilot extension into your code editor.
  2. Describe the purpose of your project in a comment.
  3. Write a comment describing which libraries you may need.
  4. Start tabbing and let GitHub Copilot suggest lines of code to help you learn new techniques or methods.

From machine learning to data analysis, Python’s versatility allows it to continue its explosive growth with developers and non-developers alike. Experiment with Python through GitHub or on your local machine to be part of this growth and get started today!

GitHub Availability Report: February 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-03-01-github-availability-report-february-2023/

In February, we experienced three incidents that resulted in degraded performance across GitHub services. This report also sheds light into a January incident that resulted in degraded performance for GitHub Packages and GitHub Pages and another January incident that impacted Git users.

January 30 21:31 UTC (lasting 35 minutes)

On January 30 at 21:36 UTC, our alerting system detected a 500 error response increase in requests made to the Container registry. As a result, most builds on GitHub Pages and requests to GitHub Packages failed during the incident.

Upon investigation, we found that a change was made to the Container registry Redis configuration at 21:30 UTC to enforce authentication on Redis connections. There was an issue with the Container registry production deployment file where client connections were unable to authenticate due to a hard coded connection string, resulting in errors and preventing successful connections.

At 22:12 UTC, we reverted the configuration change for Redis authentication. Container registry began recovering two minutes later, and GitHub Pages was considered healthy again by 22:21 UTC.

To help prevent future incidents, we improved management of secrets in the Container registry’s Redis deployment configurations and added extra test coverage for authenticated Redis connections.

January 30 18:35 UTC (lasting 7 hours)

On January 30 at 18:35 UTC, GitHub deployed a change which slightly altered the compression settings on source code downloads. This change altered the checksums of the resulting archive files, resulting in unforeseen consequences for a number of communities. The contents of these files were unchanged, but many communities had come to rely on the precise layout of bytes also being unchanged. When we realized the impact we reverted the change and communicated with affected communities.

We did not anticipate the broad impact this change would have on a number of communities and are implementing new procedures to prevent future incidents. This includes working through several improvements in our deployment of Git throughout GitHub and adding a checksum validation to our workflow.

See this related blog post for details about our plan going forward.

February 7 21:30 UTC (lasting 20 hours and 35 minutes)

On February 7 at 21:30 UTC, our monitors detected failures creating, starting, and connecting to GitHub Codespaces in the Southeast Asia region, caused by a datacenter outage of our cloud provider. To reduce the impact to our customers during this time, we redirected codespace creations to a secondary location, allowing new codespaces to be used. Codespaces in that region recovered automatically when the datacenter recovered, allowing existing codespaces in the region to be restarted. Codespaces in other regions were not impacted during this incident.

Based on learnings from this incident, we are evaluating expanding our regional redundancy and have started making architectural changes to better handle temporary regional and datacenter outages, including more regularly exercising our failover capabilities.

February 18 02:36 UTC (lasting 2 hours and 26 minutes)

On February 18 at 02:36 UTC, we became aware of errors in our application code pointing to connectivity issues to our MySQL databases. Upon investigation, we believe these errors were due to a few unhealthy deployments of our sharding middleware. At 03:30 UTC, we performed a re-deployment of the database infrastructure in an effort to remediate. Unfortunately, this propagated the issue to all Kubernetes pods, leading to system-wide errors. As a result, multiple services returned 500 error responses and GitHub users were experiencing issues signing in to GitHub.com.

At 04:30 UTC, we found that the database topology in 30% of our deployments was corrupted, which prevented applications from connecting to the database. We applied a copy of the correct database topology to all deployments, which resolved the errors across services by 05:00 UTC. Users were then able to sign in to GitHub.com.

To help prevent future incidents, we added a monitor to detect database topology errors so we can identify this well in advance of these changes impacting production systems. We have also improved our observability around topology reloads, both successful and erroneous ones. We are also doing a deeper review of the contributing factors to this incident to learn and improve both our architecture and operations to prevent a recurrence.

February 28 16:05 UTC (lasting 1 hour and 26 minutes)

On February 28 at 16:05 UTC, we were notified of degraded performance for GitHub Codespaces. We resolved the incident at 17:31 UTC.

Due to the recency of this incident, we are still investigating the contributing factors and will provide a more detailed update in next month’s report.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub Actions: Introducing faster GitHub-hosted x64 macOS runners

Post Syndicated from Stephen Glass original https://github.blog/2023-03-01-github-actions-introducing-faster-github-hosted-x64-macos-runners/

Today, GitHub is releasing a public beta for all new, more powerful hosted macOS runners for GitHub Actions. Teams who are looking to speed up their macOS jobs now have new options to meet their needs.

Faster GitHub-hosted macOS runners

When developers use GitHub-hosted runners for GitHub Actions, GitHub is always working to give teams performance and flexibility to quickly build for different platforms. The new GitHub-hosted macOS runners are based on the most current x64 Apple hardware and reduce the time it takes to run your jobs. This means less time waiting on your CI and more time building and rapidly testing code.

In addition to the existing 3-core macOS hosted runners, we are also introducing a new “XL” size, 12-core macOS runner to allow you to pay for that extra performance when you need it. Increasing your runner size is as simple as changing a single line in your workflow file, making it super easy to speed up lagging builds, and with on-demand availability when you want it, you only pay for what you use.

PyTorch, the open source machine learning framework, has used the new runner to improve build times from 1.5 hours to 30 minutes!

If you’re already using the 3-core macOS runners, simply request to join the beta, and we’ll give you access to the hardware to our new fleet. To try the new 12-core runner runner, update the runs-on: key in your GitHub Actions workflow YAML to target the macos-12-xl or macos-latest-xl runner images once you’re in the beta.

The new macOS XL runner is always billed in both private and public repositories and so does not consume any of the free included GitHub Action minutes. To learn more about runner per job minute pricing, check out the docs.

Learn more and join the beta

To learn more about using the new macOS runner, check out the hosted runner docs.

Apple silicon is on the way!

We’ve started racking up and testing out Apple silicon-based machines in the hosted runner data centers. Follow this roadmap item to track progress. We hope to be offering a limited beta in CY2023 depending on hardware availability, but keep your eyes on the GitHub blog for more information.

Securing GitOps pipelines

Post Syndicated from Grab Tech original https://engineering.grab.com/securing-gitops-pipeline

Introduction

Grab’s real-time data platform team, Coban, has been managing infrastructure resources via Infrastructure-as-code (IaC). Through the IaC approach, Terraform is used to maintain infrastructure consistency, automation, and ease of deployment of our streaming infrastructure, notably:

With Grab’s exponential growth, there needs to be a better way to scale infrastructure automatically. Moving towards GitOps processes benefits us in many ways:

  • Versioned and immutable: With our source code being stored in Git repositories, the desired state of infrastructure is stored in an environment that enforces immutability, versioning, and retention of version history, which helps with auditing and traceability.
  • Faster deployment: By automating the process of deploying resources after code is merged, we eliminate manual steps and improve overall engineering productivity while maintaining consistency.
  • Easier rollbacks: It’s as simple as making a revert for a Git commit as compared to creating a merge request (MR) and commenting Atlantis commands, which add extra steps and contribute to a higher mean-time-to-resolve (MTTR) for incidents.

Background

Originally, Coban implemented automation on Terraform resources using Atlantis, an application that operates based on user comments on MRs.

Fig. 1 User flow with Atlantis

We have come a long way with Atlantis. It has helped us to automate our workflows and enable self-service capabilities for our engineers. However, there were a few limitations in our setup, which we wanted to improve:

  • Course grained: There is no way to restrict the kind of Terraform resources users can create, which introduces security issues. For example, if a user is one of the Code owners, they can create another IAM role with Admin privileges with approval from their own team anywhere in the repository.
  • Limited automation: Users are still required to make comments in their MR such as atlantis apply. This requires the learning of Atlantis commands and is prone to human errors.
  • Limited capability: Having to rely entirely on Terraform and Hashicorp Configuration Language (HCL) functions to validate user input comes with limitations. For example, the ability to validate an input variable based on the value of another has been a requested feature for a long time.
  • Not adhering to Don’t Repeat Yourself (DRY) principle: Users need to create an entire Terraform project with boilerplate codes such as Terraform environment, local variables, and Terraform provider configurations to create a simple resource such as a Kafka topic.

Solution

We have developed an in-house GitOps solution named Khone. Its name was inspired by the Khone Phapheng Waterfall. We have evaluated some of the best and most widely used GitOps products available but chose not to go with any as the majority of them aim to support Kubernetes native or custom resources, and we needed infrastructure provisioning that is beyond Kubernetes. With our approach, we have full control of the entire user flow and its implementation, and thus we benefit from:

  • Security: The ability to secure the pipeline with many customised scripts and workflows.
  • Simple user experience (UX): Simplified user flow and prevents human errors with automation.
  • DRY: Minimise boilerplate codes. Users only need to create a single Terraform resource and not an entire Terraform project.
Fig. 2 User flow with Khone

With all types of streaming infrastructure resources that we support, be it Kafka topics or Flink pipelines, we have identified they all have common properties such as namespace, environment, or cluster name such as Kafka cluster and Kubernetes cluster. As such, using those values as file paths help us to easily validate users input and de-couple them from the resource specific configuration properties in their HCL source code. Moreover, it helps to remove redundant information to maintain consistency. If the piece of information is in the file path, it won’t be elsewhere in resource definition.

Fig. 3 Khone directory structure

With this approach, we can utilise our pipeline scripts, which are written in Python and perform validations on the types of resources and resource names using Regular Expressions (Regex) without relying on HCL functions. Furthermore, we helped prevent human errors and improved developers’ efficiency by deriving these properties and reducing boilerplate codes by automatically parsing out other necessary configurations such as Kafka brokers endpoint from the cluster name and environment.

Pipeline stages

Khone’s pipeline implementation is designed with three stages. Each stage has different duties and responsibilities in verifying user input and securely creating the resources.

Fig. 4 An example of a Khone pipeline

Initialisation stage

At this stage, we categorise the changes into Deleted, Created or Changed resources and filter out unsupported resource types. We also prevent users from creating unintended resources by validating them based on resource path and inspecting the HCL source code in their Terraform module. This stage also prepares artefacts for subsequent stages.

Fig. 5 Terraform changes detected by Khone

Terraform stage

This is a downstream pipeline that runs either the Terraform plan or Terraform apply command depending on the state of the MR, which can either be pending review or merged. Individual jobs run in parallel for each resource change, which helps with performance and reduces the overall pipeline run time.

For each individual job, we implemented multiple security checkpoints such as:

  • Code inspection: We use the python-hcl2 library to read HCL content of Terraform resources to perform validation, restrict the types of Terraform resources users can create, and ensure that resources have the intended configurations. We also validate whitelisted Terraform module source endpoint based on the declared resource type. This enables us to inherit the flexibility of Python as a programming language and perform validations more dynamically rather than relying on HCL functions.
  • Resource validation: We validate configurations based on resource path to ensure users are following the correct and intended directory structure.
  • Linting and formatting: Perform HCL code linting and formatting using Terraform CLI to ensure code consistency.

Furthermore, our Terraform module independently validates parameters by verifying the working directory instead of relying on user input, acting as an additional layer of defence for validation.

path = one(regexall(join("/",
[
    "^*",
    "(?P<repository>khone|khone-dev)",
    "resources",
    "(?P<namespace>[^/]*)",
    "(?P<resource_type>[^/]*)",
    "(?P<env>[^/]*)",
    "(?P<cluster_name>[^/]*)",
    "(?P<resource_name>[^/]*)$"
]), path.cwd))

Metric stage

In this stage, we consolidate previous jobs’ status and publish our pipeline metrics such as success or error rate.

For our metrics, we identified actual users by omitting users from Coban. This helps us measure success metrics more consistently as we could isolate metrics from test continuous integration/continuous deployment (CI/CD) pipelines.

For the second half of 2022, we achieved a 100% uptime for Khone pipelines.

Fig. 6 Khone’s success metrics for the second half of 2022

Preventing pipeline config tampering

By default, with each repository on GitLab that has CI/CD pipelines enabled, owners or administrators would need to have a pipeline config file at the root directory of the repository with the name .gitlab-ci.yml. Other scripts may also be stored somewhere within the repository.

With this setup, whenever a user creates an MR, if the pipeline config file is modified as part of the MR, the modified version of the config file will be immediately reflected in the pipeline’s run. Users can exploit this by running arbitrary code on the privileged GitLab runner.

In order to prevent this, we utilise GitLab’s remote pipeline config functionality. We have created another private repository, khone-admin, and stored our pipeline config there.

Fig. 7 Khone’s remote pipeline config

In Fig. 7, our configuration is set to a file called khone-gitlab-ci.yml residing in the khone-admin repository under snd group.

Preventing pipeline scripts tampering

We had scripts that ran before the MR and they were approved and merged to perform preliminary checks or validations. They were also used to run the Terraform plan command. Users could modify these existing scripts to perform malicious actions. For example, they could bypass all validations and directly run the Terraform apply command to create unintended resources.

This can be prevented by storing all of our scripts in the khone-admin repository and cloning them in each stage of our pipeline using the before_script clause.

default:
  before_script:
    - rm -rf khone_admin
    - git clone --depth 1 --single-branch https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git khone_admin

Even though this adds an overhead to each of our pipeline jobs and increases run time, the amount is insignificant as we have optimised the process by using shallow cloning. The Git clone command included in the above script with depth=1 and single-branch flag has reduced the time it takes to clone the scripts down to only 0.59 seconds.

Testing our pipeline

With all the security measures implemented for Khone, this raises a question of how did we test the pipeline? We have done this by setting up an additional repository called khone-dev.

Fig. 8 Repositories relationship

Pipeline config

Within this khone-dev repository, we have set up a remote pipeline config file following this format:

<File Name>@<Repository Ref>:<Branch Name>

Fig. 9 Khone-dev’s remote pipeline config

In Fig. 9, our configuration is set to a file called khone-gitlab-ci.yml residing in the khone-admin repository under the snd group and under a branch named ci-test. With this approach, we can test our pipeline config without having to merge it to master branch that affects the main Khone repository. As a security measure, we only allow users within a certain GitLab group to push changes to this branch.

Pipeline scripts

Following the same method for pipeline scripts, instead of cloning from the master branch in the khone-admin repository, we have implemented a logic to clone them from the branch matching our lightweight directory access protocol (LDAP) user account if it exists. We utilised the GITLAB_USER_LOGIN environment variable that is injected by GitLab to each individual CI job to get the respective LDAP account to perform this logic.

default:
  before_script:
    - rm -rf khone_admin
    - |
      if git ls-remote --exit-code --heads "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" "$GITLAB_USER_LOGIN" > /dev/null; then
        echo "Cloning khone-admin from dev branch ${GITLAB_USER_LOGIN}"
        git clone --depth 1 --branch "$GITLAB_USER_LOGIN" --single-branch "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" khone_admin
      else
        echo "Dev branch ${GITLAB_USER_LOGIN} not found, cloning from master instead"
        git clone --depth 1 --single-branch "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.myteksi.net/snd/khone-admin.git" khone_admin
      fi

What’s next?

With security being our main focus for our Khone GitOps pipeline, we plan to abide by the principle of least privilege and implement separate GitLab runners for different types of resources and assign them with just enough IAM roles and policies, and minimal network security group rules to access our Kafka or Kubernetes clusters.

Furthermore, we also plan to maintain high standards and stability by including unit tests in our CI scripts to ensure that every change is well-tested before being deployed.

References

Special thanks to Fabrice Harbulot for kicking off this project and building a strong foundation for it.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How to build a consistent workflow for development and operations teams

Post Syndicated from Mark Paulsen original https://github.blog/2023-02-28-how-to-build-a-consistent-workflow-for-development-and-operations-teams/

In GitHub’s recent 2022 State of the Octoverse report, HashiCorp Configuration Language (HCL) was the fastest growing programming language on GitHub. HashiCorp is a leading provider of Infrastructure as Code (IaC) automation for cloud computing. HCL is HashiCorp’s configuration language used with tools like Terraform and Vault to deliver IaC capabilities in a human-readable configuration file across multi-cloud and on-premises environments.

HCL’s growth shows the importance of bringing together the worlds of infrastructure, operations, and developers. This was always the goal of DevOps. But in reality, these worlds remain siloed for many enterprises.

In this post we’ll look at the business and cultural influences that bring development and operations together, as well as security, governance, and networking teams. Then, we’ll explore how GitHub and HashiCorp can enable consistent workflows and guardrails throughout the entire CI/CD pipeline.

The traditional world of operations (Ops)

Armon Dadgar, co-founder of HashiCorp, uses the analogy of a tree to explain the traditional world of Ops. The trunk includes all of the shared and consistent services you need in an enterprise to get stuff done. Think of things like security requirements, Active Directory, and networking configurations. A branch represents the different lines of business within an enterprise, providing services and products internally or externally. The leaves represent the different environments and technologies where your software or services are deployed: cloud, on-premises, and container environment, among others.

In many enterprises, the communication channels and processes between these different business areas can be cumbersome and expensive. If there is a significant change to the infrastructure or architecture, multiple tickets are typically submitted to multiple teams for reviews and approvals across different parts of the enterprise. Change Advisory Boards are commonly used to protect the organization. The change is usually unable to proceed unless the documentation is complete. Commonly, there’s a set of governance logs and auditable artifacts which are required for future audits.

Wouldn’t it be more beneficial for companies if teams had an optimized, automated workflow that could be used to speed up delivery and empower teams to get the work done in a set of secure guardrails? This could result in significant time and cost savings, leading to added business value.

After all, a recent Forrester report found that over three years, using GitHub drove 433% ROI for a composite organization simply with the combined power of all GitHub’s enterprise products. Not to mention the potential for time savings and efficiency increase, along with other qualitative benefits that come with consistency and streamlining work.

Your products and services would be deployed through an optimized path with security and governance built-in, rather than a sluggish, manual and error-prone process. After all, isn’t that the dream of DevOps, GitOps, and Cloud Native?

Introducing IaC

Let’s use a different analogy. Think of IaC as the blueprint for resources (such as servers, databases, networking components, or PaaS services) that host our software and services.

If you were architecting a hospital or a school, you wouldn’t use the same overall blueprint for both scenarios as they serve entirely different purposes with significantly different requirements. But there are likely building blocks or foundations that can be reused across the two designs.

IaC solutions, such as HCL, allow us to define and reuse these building blocks, similarly to how we reuse methods, modules, and package libraries in software development. With it being IaC, we can start adopting the same recommended practices for infrastructure that we use when collaborating and deploying on applications.

After all, we know that teams that adopt DevOps methodologies will see improved productivity, cloud-enabled scalability, collaboration, and security.

A better way to deliver

With that context, let’s explore the tangible benefits that we gain in codifying our infrastructure and how they can help us transform our traditional Ops culture.

Storing code in repositories

Let’s start with the lowest-hanging fruit. With it being IaC, we can start storing infrastructure and architectural patterns in source code repositories such as GitHub. This gives us a single source of truth with a complete version history. This allows us to easily rollback changes if needed, or deploy a specific version of the truth from history.

Teams across the enterprise can collaborate in separate branches in a Git repository. Branches allow teams and individuals to be productive in “their own space” and not have to worry about negatively impacting the in-progress work of other teams, away from the “production” source of truth (typically, the main branch).

Terraform modules, the reusable building blocks mentioned in the last section, are also stored and versioned in Git repositories. From there, modules can be imported to the private registry in Terraform Cloud to make them easily discoverable by all teams. When a new release version is tagged in GitHub, it is automatically updated in the registry.

Collaborate early and often

As we discussed above, teams can make changes in separate branches to not impact the current state. But what happens when you want to bring those changes to the production codebase? If you’re unfamiliar with Git, then you may not have heard of a pull request before. As the name implies, we can “pull” changes from one branch into another.

Pull requests in GitHub are a great way to collaborate with other users in the team, being able to get peer reviews so feedback can be incorporated into your work. The pull request process is deliberately very social, to foster collaboration across the team.

In GitHub, you could consider setting branch protection rules so that direct changes to your main branch are not allowed. That way, all users must go through a pull request to get their code into production. You can even specify the minimum number of reviewers needed in branch protection rules.

Tip: you could use a special type of file, the CODEOWNERS file in GitHub, to automatically add reviewers to a pull request based on the files being edited. For example, all HCL files may need a review by the core infrastructure team. Or IaC configurations for line of business core banking systems might require review by a compliance team.

Unlike Change Advisory Boards, which typically take place on a specified cadence, pull requests become a natural part of the process to bring code into production. The quality of the decisions and discussions also evolves. Rather than being a “yes/no” decision with recommendations in an external system, the context and recommendations can be viewed directly in the pull request.

Collaboration is also critical in the provisioning process, and GitHub’s integrations with Terraform Cloud will help you scale these processes across multiple teams. Terraform Cloud offers workflow features like secure storage for your Terraform state and direct integration with your GitHub repositories for a turnkey experience around the pull request and merge lifecycle.

Bringing automated quality reviews into the process

Building on from the previous section, pull requests also allow us to automatically check the quality of the changes that are being proposed. It is common in software to check that the application still compiles correctly, that unit tests pass, that no security vulnerabilities are introduced, and more.

From an IaC perspective, we can bring similar automated checks into our process. This is achieved by using GitHub status checks and gives us a clear understanding of whether certain criteria has been met or not.

GitHub Actions are commonly used to execute some of these automated checks in pull requests on GitHub. To determine the quality of IaC, you could include checks such as:

  • Validating that the code is syntactically correct (for example, Terraform validate).
  • Linting the code to ensure a certain set of standards are being followed (for example, TFLint or Terraform format).
  • Static code analysis to identify any misconfigurations in your infrastructure at “design time” (for example, tfsec or terrascan).
  • Relevant unit or integration tests (using tools such as Terratest).
  • Deploying the infrastructure into a “smoke test”environment to verify that the infrastructure configuration (along with a known set of parameters) results deploy into a desired state.

Getting started with Terraform on GitHub is easy. Versions of Terraform are installed on our Linux-based GitHub-hosted runners, and HashiCorp has an official GitHub Action to set up Terraform on a runner using a Terraform version that you specify.

Compliance as an automated check

We recently blogged about building compliance, security, and audit into your delivery pipelines and the benefits of this approach. When you add IaC to your existing development pipelines and workflows, you’ll have the ability to describe previously manual compliance testing and artifacts as code directly into your HCL configurations files.

A natural extension to IaC, policy as code allows your security and compliance teams to centralize the definitions of your organization’s requirements. Terraform Cloud’s built-in support for the HashiCorp Sentinel and Open Policy Agent (OPA) frameworks allows policy sets to be automatically ingested from GitHub repositories and applied consistently across all provisioning runs. This ensures policies are applied before misconfigurations have a chance to make it to production.

An added bonus mentioned in another recent blog is the ability to leverage AI-powered compliance solutions to optimize your delivery even more. Imagine a future where generative AI could create compliance-focused unit-tests across your entire development and infrastructure delivery pipeline with no manual effort.

Security in the background

You may have heard of Dependabot, our handy tool to help you keep your dependencies up to date. But did you know that Dependabot supports Terraform? That means you could rely on Dependabot to help keep your Terraform provider and module versions up to date.

Checks complete, time to deploy

With the checks complete, it’s now time for us to deploy our new infrastructure configuration! Branching and deployment strategies is beyond the scope of this post, so we’ll leave that for another discussion.

However, GitHub Actions can help us with the deployment aspect as well! As we explained earlier, getting started with Terraform on GitHub is easy. Versions of Terraform are installed on our Linux-based GitHub-hosted runners, and HashiCorp has an official GitHub Action to set up Terraform on a runner using a Terraform version that you specify.

But you can take this even further! In Terraform, it is very common to use the command terraform plan to understand the impact of changes before you push them to production. terraform apply is then used to execute the changes.

Reviewing environment changes in a pull request

HashiCorp provides an example of automating Terraform with GitHub Actions. This example orchestrates a release through Terraform Cloud by using GitHub Actions. The example takes the output of the terraform plan command and copies the output into your pull request for approval (again, this depends on the development flow that you’ve chosen).

Reviewing environment changes using GitHub Actions environments

Let’s consider another example, based on the example from HashiCorp. GitHub Actions has a built-in concept of environments. Think of these environments as a logical mapping to a target deployment location. You can associate a protection rule with an environment so that an approval is given before deploying.

So, with that context, let’s create a GitHub Action workflow that has two environments—one which is used for planning purposes, and another which is used for deployment:

name: 'Review and Deploy to EnvironmentA'
on: [push]

jobs:
  review:
    name: 'Terraform Plan'
    environment: environment_a_plan
    runs-on: ubuntu-latest

    steps:
      - name: 'Checkout'
        uses: actions/checkout@v2

      - name: 'Terraform Setup'
        uses: hashicorp/setup-terraform@v2
        with:
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}

      - name: 'Terraform Init'
        run: terraform init


      - name: 'Terraform Format'
        run: terraform fmt -check

      - name: 'Terraform Plan'
        run: terraform plan -input=false

  deploy:
    name: 'Terraform'
    environment: environment_a_deploy
    runs-on: ubuntu-latest
    needs: [review]

    steps:
      - name: 'Checkout'
        uses: actions/checkout@v2

      - name: 'Terraform Setup'
        uses: hashicorp/setup-terraform@v2
        with:
          cli_config_credentials_token: ${{ secrets.TF_API_TOKEN }}

      - name: 'Terraform Init'
        run: terraform init

      - name: 'Terraform Plan'
        run: terraform apply -auto-approve -input=false

Before executing the workflow, we can create an environment in the GitHub repository and associate protection rules with the environment_a_deploy. This means that a review is required before a production deployment.

Learn more

Check out HashiCorp’s Practitioner’s Guide to Using HashiCorp Terraform Cloud with GitHub for some common recommendations on getting started. And find out how we at GitHub are using Terraform to deliver mission-critical functionality faster and at lower cost.

10 things you didn’t know you could do with GitHub Codespaces

Post Syndicated from Rizel Scarlett original https://github.blog/2023-02-28-10-things-you-didnt-know-you-could-do-with-github-codespaces/

Ever feel like you’re coding on a plane mid-flight? When I first learned to code about five years ago, my laptop was painstakingly slow, but I couldn’t afford a better one. That’s why I relied on browser-based IDEs like jsbin.com to run my code.

Now fast forward to today, where GitHub Codespaces provides a fully-fledged, browser-based Integrated Development Environment (IDE) on a virtual machine. This means you can code without draining your local machine’s resources. Cloud-powered development is a game-changer for folks with less powerful machines, but that barely scratches the surface of GitHub Codespaces’ versatility.

In this blog post, we’ll discuss a few ways that you can get the most out of GitHub Codespaces!

Generate AI images 🎨

You can run Stable Diffusion with GitHub Codespaces. Like DALL-E and Midjourney, Stable Diffusion is one of many machine-learning models using deep learning to convert text into art.

Stable Diffusion takes the following steps to convert text into art:

  • AI receives an image
  • AI adds noise to the image until the image becomes completely unrecognizable. (Noise is another way of saying pixelated dots.)
  • AI removes the noise until it produces a clear, high-quality image
  • AI learns from a database called LAION-Aesthetics. The database has image-text pairs to learn to convert text into images.

This entire process is resource-intensive! Experts recommend using a computer with a powerful graphics processing unit (GPU) to run data-heavy tasks like Stable Diffusion. However, not everyone has a computer with that type of computing power (including me), so instead, I use GitHub Codespaces.

Since your codespace is hosted on a virtual machine, you can set the machine type from 2-core to 32-core. You can request access to a GPU-powered codespace if you need a more powerful machine. This means a machine learning engineer can use an iPad or Chromebook to perform data-heavy deep learning computations via GitHub Codespaces.

Check out my DEV post and repository to learn more about how I generated art inside my codespace.

Below you can see the code and AI image that GitHub Codespaces helped me produce:


from torch import autocast # Change prompt for image here! prompt = "a cartoon black girl with cotton candy hair and a pink dress standing in front of a pink sky with cotton candy clouds" with autocast(device): image = pipe(prompt, height=768, width=768).images[0] image

Computer-generated drawing-style image of a Black woman wearing a pink t-shirt, standing in front of a green field and some trees, with a pink cloud floating above her in the blue sky.

Manage GitHub Codespaces from the command line 🧑‍💻

Back in the 2000s, I would inadvertently close my documents, or my computer would suddenly shut down, and my work would be lost forever. Fortunately, it’s 2023, and many editors protect against that, including GitHub Codespaces. If you close a codespace, it will save your work even if you forget to commit your changes.

However, if you’re done with a codespace because you pushed your code to a repository, you can feel free to delete it. GitHub Codespace management is important for the following reasons:

  • You don’t want to get confused with all your GitHub Codespaces and accidentally delete the wrong one.
  • By default, inactive GitHub Codespaces automatically delete every 30 days.

You can manage your GitHub Codespaces through the GitHub UI or GitHub CLI. The GitHub CLI allows you to do things like:

Create a codespace

gh codespace create -r OWNER/REPO_NAME [-b BRANCH]

List all your codespaces

gh codespace list

Delete a codespace:

gh codespace delete -c CODESPACE-NAME

Rename a codespace

gh codespace edit -c CODESPACE-NAME -d DISPLAY-NAME

Change the machine type of a codepace

gh codespace edit -m MACHINE-TYPE-NAME

This is not an exhaustive list of all ways you can manage GitHub Codespaces via the CLI. You learn more about managing your codespaces from the command line here!

Pair program with a teammate 👥

Pair programming when you’re working remotely is challenging. It’s harder to share your screen and your code when you’re not sitting next to your teammate. However, it’s worthwhile when it works because it helps both parties improve their communication and technical skills. Installing the Live Share extension and forward ports can make remote pair programming easier with GitHub Codespaces.

Share your forwarded ports

You can share the URL of your locally hosted web app and make it available to your teammates or collaborators. If you right-click on the port you’d like to share, you’ll see the option to switch the port visibility from “Private” to “Public.” If you copy the value in the local address and share it with necessary collaborators, they can view, test, and debug your locally hosted web app even though they’re in a different room than you.

Screenshot of how to share the URL of your locally hosted web app by setting its port to public.

Live Share

The Live Share extension allows you and your teammate(s) to type in the same project simultaneously. It highlights your teammate’s cursor versus your cursor, so you can easily identify who is typing, and it’s completely secure!

Screenshot of code in an editor showing sections highlighted and labeled with the names of users currently working on those lines.

Pair program with AI 🤖

Visual Studio Code’s marketplace offers a wide variety of extensions compatible with GitHub Codespaces, including GitHub Copilot. If you need to exchange ideas while you code, but there’s no one around, try installing the GitHub Copilot extension to pair program with AI. GitHub Copilot is an AI-powered code completion tool that helps you write code faster by suggesting code snippets and completing code for you.

Beyond productivity through code completion, GitHub Copilot empowers developers of varied backgrounds. One of my favorite GitHub Copilot tools is “Hey, GitHub,” a hands-free, voice-activated AI programmer. Check out the video below to learn how “Hey, GitHub” improves the developer experience for folks with limited physical dexterity or visual impairments.



Learn about more use cases for GitHub Copilot here.

Gif demonstrating how Copilot can recommend lines of code.

Teach people to code 👩‍🏫

While educating people about coding via bootcamps, classrooms, meetups, and conferences, I’ve learned that teaching people how to code is hard. Sometimes in workshops, attendees spend most of the time trying to set up their environment so that they can follow along. However, teaching people to code or running a workshop in GitHub Codespaces creates a better experience for all participants. Instead of expecting beginner developers to understand how to clone repositories to work with a template, they can open up a codespace and work in a development environment you’ve configured. Now, you can have the peace of mind that everyone has the same environment as you and can easily follow along. This reduces tons of time, stress, and chaos.

We made GitHub Codespaces more accessible for teachers and students by offering 180 free hours of usage (equivalent to five assignments per month for a class of 50). Check out my tutorial on configuring GitHub Codespaces and leveraging extensions like CodeTour to run self-guided workshops.

Learn how CS50, Harvard’s largest class, uses GitHub Codespaces to teach students to code.

Learn a new framework 📝

When you’re learning to code, it’s easy to over-invest your time into following tutorials. Learning is often more effective when you balance tutorial consumption with building projects. GitHub Codespaces’ quickstart templates are a speedy and efficient way to learn a framework.

Quickstart templates include boilerplate code, forwarded ports, and a configured development container for some of the most common application frameworks, including Next.js, React.js, Django, Express, Ruby on Rails, Preact, Flask, and Jupyter Notebook. Templates provide developers with a sandbox to build, test, and debug applications in a codespace. It only takes one click to open a template and experiment with a new framework.

Experimenting with a framework in a codespace can help developers better understand a framework’s structure and functionalities, leading to better retention and mastery. For example, I’m not familiar with Jupyter Notebook, but I’ve used a Jupyter Notebook template in GitHub Codespaces. The boilerplate code helped me to experiment with and better understand the structure of a Jupyter Notebook. I even built a small project with it!

Check out this blog post and template, where we use GitHub Codespaces to guide developers through writing their first lines of React.

Screenshot showing a list of available codespace templates.

Store environment variables 🔐

In the past, I’ve had multiple moments where I’ve accidentally shared or mishandled my environment variables. Here are a few cringe-worthy moments where I’ve mishandled environment variables:

  • I’ve shown them to an audience during a live demo.
  • I’ve pushed them to GitHub.
  • I couldn’t figure out a great way to share the values with teammates.

To avoid these moments, I could’ve used GitHub Codespaces secrets. You can store API Keys, environment variables, and secret credentials in your GitHub Codespaces secrets for each of your GitHub Codespaces.

After you store them in your GitHub Codespaces secrets, you can refer to the environment variables in your code as: process.env.{SUPER_SECRET_API_KEY}.

Check out this tutorial for more information on securely storing your environment variables with GitHub Codespaces

Screenshot showing how to securely store your environment variables with GitHub Codespaces.

Onboard developers faster 🏃💨

Typically, software engineers are responsible for setting up their local environment when they join a team. Local environment setup can take a day or a week, depending on the quality of the documentation. Fortunately, organizations can automate the onboarding process using GitHub Codespaces to configure a custom environment. When a new engineer joins the team, they can open a codespace and skip the local environment setup because the required extensions, dependencies, and environment variables exist within the codespace.

In 2021, the GitHub Engineering Team migrated from our macOS model to GitHub Codespaces for the development of GitHub.com. After over 14 years, our repository accrued almost 13 GB on disk. It took approximately 45 minutes to clone, set up dependencies, and bootstrap our repository. After fully migrating to GitHub Codespaces, it only takes 10 seconds to launch GitHub.com in a preconfigured, reliable codespace.

However, you don’t have to take our word for it. Ketan Shah, Manager of Information Services at TELUS, a telecommunications company, attests, “GitHub shaves off entire days from the new employee onboarding process. New developers are up and running within minutes because everything is all in one place.”

Learn more about GitHub Codespaces can quickly onboard developers:

Conduct technical interviews 💼

Take-home projects, technical screens, and live coding are often painful, but necessary parts of the interview experience for software developers, but using GitHub Codespaces can help ease some of the difficulties. By providing candidates with a well-configured and reliable environment, interviewers can help reduce anxiety and improve performance. With GitHub Codespaces, candidates no longer have to worry about setting up their environment. Interviewers can use GitHub codespaces to provide real-time feedback and collaborate using built-in tools like Visual Studio Code Live Share. Furthermore, GitHub Codespaces offers a secure and isolated environment for the interview process, ensuring that sensitive information and data remain protected.

Learn how various engineering teams at GitHub use GitHub Codespaces to conduct interviews.

Code in the cloud with your preferred editor ☁

I love Visual Studio Code; it’s my primary editor. However, depending on your role and other factors, you might have a different preference. GitHub Codespaces supports the following editors, in addition to Visual Studio Code:

  • Jupyter Notebook
  • IntelliJ IDEA
  • RubyMine
  • GoLand
  • PyCharm
  • PhpStorm
  • WebStorm

But, wait, there’s more!

GitHub Codespaces’ superpower is that you can code from any device and get a standardized environment as long as you have internet. However, it also includes features that make software development workflows easier. With 60 hours of free access per month for GitHub users and 90 hours per month for GitHub Pro users, there’s no reason not to try it out and experience its benefits for yourself. So why not give it a try today?

3 ways to meet compliance needs without slowing down agility

Post Syndicated from Mark Paulsen original https://github.blog/2023-02-24-3-ways-to-meet-compliance-needs-without-slowing-down-agility/

In the previous blog, Setting the foundations for compliance, we set the groundwork for developer-enabled compliance that will keep your teams happy, in the flow, secure, compliant, and auditable.

Today, we’ll walk through three practical ways that you can start meeting your compliance requirements without having to revolutionize or transform the culture in your company—all while enabling your developers to be more productive and happy.

1. Do the basics well and often

The first way to start meeting your compliance requirements is often overlooked because it’s so simple. No matter what industry you are in, whether it’s finance, automotive, government, or tech, there are some basic quick compliance wins that may be part of your existing developer workflows and culture.

Code review

A key part of writing clean code is code review. But having a repeatable and traceable code review process is also a foundational component of your compliance program. This ensures risks within software delivery are found and mitigated early—and are also auditable for future reviews.

GitHub makes code review easy, since pull requests are part of the existing workflow that millions of developers use every day. GitHub also makes it easy for compliance testers and auditors to review pull requests with access to immutable audit logs in a central location.

Separation of duties

In some enterprises, separation of duties has been defined as a person having too much manual control over transactions or processes. This can lead to apprehension when modern cloud native practices are introduced.

Thankfully, there is guidance from the Industry that supports a more modern approach to separation of duties. The PCI-DSS requirements guide avoids the term person and provides a more cloud native friendly approach by focusing on functions and accounts:

“The purpose of this requirement is to separate the development and test functions from the production functions. For example, a developer can use an administrator-level account with elevated privileges in the development environment and have a separate account with user-level access to the production environment.”

This approach aligns well with the 12 factor application methodology for cloud native. The Build, release, run factor explanation states, “Each release must enforce a strict separation across the build, release, and run stages. Each should be tagged with a unique ID and support the ability to roll back.” Teams can fully automate their delivery to production, without having to worry about a person manually exceeding too much control, as long as there is separation of functions and traceability back to unique IDs. With the added assurance that they are aligned to industry best practices and requirements, such as PCI-DSS.

To enable separation of duties, you have to have a clear identity and access management strategy. Thankfully, we don’t have to reinvent the wheel. GitHub Enterprise has several options to help you manage access to your overall development environment:

  • You can configure SAML single sign-on for your enterprise. This is an additional check, allowing you to confirm the authenticity of your users against your own identity provider (while still using your own GitHub account).
  • You can then synchronize team memberships with groups in your identity provider. As a result, group membership changes in your identity provider update the team membership (and therefore associated access) in GitHub.
  • Alternatively, you could adopt GitHub Enterprise Managed Users (EMUs). This is a flavor of GitHub Enterprise where you can only log in with an account that is centrally managed through your identity provider. The user does not have to log in to GitHub with a personal account and use single sign on to access company resources. (For more information on this, check out this blog post on exploring EMUs and the benefits they can bring.)

AI-powered compliance

In our last blog we briefly covered AI-enabled compliance and some of the existing opportunities for security, manufacturing, and banks. There are also several other opportunities on the horizon that could further optimize the basics of compliance.

It is entirely possible that a generative AI tool could soon be leveraged to help ensure that separation of duties is enforced in a declarative DevOps workflow by creating unit tests on its own. Because of the non-deterministic nature of generative AI, each time it runs it may have a different result and the unit tests may include risk scenarios that nobody has thought of yet. This can add an amazing level of true risk-based compliance.

One of the major benefits of addressing compliance often in your delivery is an increased level of trust. But quantifying trust can be extremely difficult—especially within a regulated industry that wants to leverage deep learning solutions. There is work being driven by AI results to help provide trust quantification for these types of solutions, which will not only enable continuous compliance but will also help enterprises implement new technologies that can increase business value.

Dependency management

Companies are increasing their reliance on open source software in their supply chain. As a result, optimized, repeatable, and audible controls for dependency management are becoming a cornerstone of compliance programs across industries.

From a GitHub perspective, Dependabot can provide confidence that your supply chain is secure. When Dependabot is configured, developers are alerted as soon as a security threat is found and gives them the ability to take action in their normal workflows.

By leveraging Dependabot, you will receive an immutable and auditable alert when a repository uses a software dependency with a known vulnerability. Pull requests can be triggered by either your developers or security teams, which give you another set of auditable artifacts for future review.

Approvals

While most organizations have approval processes, they can be slow and occur too late in the process. Google explains that peer reviews can help to streamline the change approval process, based on results from the DORA 2019 state of DevOps report.

There could be many control points in your delivery pipeline that may require approval, such as sign-off of requirements, testing results, security reviews, or production operational readiness confirmations before a release.

Depending on your internal structure and alignment, you may require teams to provide sign-off at different stages throughout the process. In your build and test stage, you can use manual approvals in a pull request to gather the needed reviews.

Once your builds and tests are complete, it’s time to release your code into your infrastructure. Talking about deployment patterns and strategies is beyond the scope of this post. However, you may require approvals as part of the deployment process.

To obtain these approvals, you should use environments for your deployments. Environments are used to describe a deployment target (such as staging, testing, and production). By using environments, you can require a manual approval to take place before GitHub Actions begins the deployment job.

In both instances, remember that there is a tradeoff when deciding the number of approvals required. Setting the number of required reviews too high means that you may impact your pace of delivery, as manual intervention is required. Where possible, consider automating checks to reduce the manual overhead, thereby optimizing your overall agility.

2. Have a common understanding of concepts and terms

Again, this may be overlooked since it sounds so obvious. But there are probably many concepts and terms that developers, DevOps and cloud native practitioners use on a daily basis that may be totally incomprehensible to compliance testers and auditors.

Back in 2015, the author of The Phoenix Project, Gene Kim, and several other authors, created the DevOps Audit Defense Toolkit. As we mentioned in our first blog, the goal of this document was to “educate IT management and practitioners on the audit process so they can demonstrate to auditors they understand the business risks and are properly mitigating those risks.” But where do you start?

Below is a basic cheat-sheet of terms for GitHub related control objectives and a mapping to the world of compliance, regulations, risk, and audit. This could be a good place to start building a common understanding between developers and your compliance and audit friends.

Objective Control Financial Reporting Industry Frameworks
The Code Review Control ensures that security requirements have been addressed and that the code is understandable, maintainable, and properly formatted. Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub.Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch. SOX: Change Management

COSO: Control Activities—Conduct application change management.

NIST Cyber: DE.CM-4 —Malicious code is detected.

PCI-DSS: 6.3.2: Review all custom code for vulnerabilities, manually or via automation.

SLSA: Two-person review is an industry best practice for catching mistakes and deterring bad behavior.

Controls and processes should be in place to scan code repositories for passwords and security tokens. Prevention measures should be enforced to ensure secrets do not reach production. Secret scanning will scan your entire Git history on all branches present in your GitHub repository for secrets. GitHub push protection will check for high-confidence secrets as developers push code and block the push if a secret is identified. SOX: IT Security

COSO: Control Activities—Improve security

NIST Cyber: PR.DS-5: Protections against data leaks are implemented.

PCI-DSS: 6.5.3: Protect against all insufficiently secure cryptographic key storage.

To help bring this together, there is a great example of the benefits of a common understanding between developers and compliance testers and auditors in the latest Forrester report on the economic impact of GitHub Enterprise Cloud and Advanced Security. The report highlights one of the benefits of an automated documentation process that both helps developers and auditors:

“These new, standardized documentation structures made it easier for auditors to find and compile the documentation necessary to be compliant. This helped the interviewees’ organizations save time preparing for industry compliance and security audits.”

3. Helping developers understand where they fit into the three lines of defense

The three lines of defense is a model used for risk management and governance. It helps clarify the roles and responsibilities of teams involved in addressing risk.

Think of the engineering team and engineering leads as the first line of defense, as they continue shipping software to their end-users. Making sure this team has the appropriate level of skills to identify risk within their solution is crucial.

The second line of defense is typically achieved at scale through frameworks, policies, and tools at an organizational level. This acts as oversight for the first line of defense on how well they are implementing risk mitigation techniques and consistently measuring and defining risk across the organization.

Internal audit is the third line of defense. Just like a red team and blue team compliment each other, you can think of internal audit as the final puzzle piece in the three lines of defense. They evaluate the effectiveness of risk management and governance, working with senior management and external regulators to raise awareness of the controls being executed.

Let’s use an analogy. A hockey team is not just structured to score goals but also prevent goals being scored against them.

  • Forwards: they are there to score and assist on goals. But they can be called on to work with the other positions in defense. Think of them as the first line of defense. Their job is to create solutions which provide business value, but are also secure.
  • Defenders: their main role is clearly preventing goals. But they partner with the forwards on the current priority (offense or defense). They are like the second line of defense, providing risk oversight and collaborating with the engineering teams.
  • Goaltender: the last line of defense. They can be thought of as the equivalent of an internal audit. They are independent of the forwards and defenders with different responsibilities, tools, and perspective.

Hockey teams that have very strong forwards but weak defenders and goalkeepers are rarely successful. Teams with a strong defense but weak offense are rarely successful, either. It takes all three aspects working in harmony to be successful.

This applies in business, too. If your solutions meet your customers’ requirements but are insecure, your business will fail. If your solutions are secure but aren’t user friendly or providing value, it will result in failure. Success is achieved by finding the right balance of value and risk management.

Developers can see that they are there to score goals for their team; creating the software that runs our world. But they need to support the defensive capabilities of the team, ensuring their software is secure. Their success is dependent on the success of the wider team, without having to take on additional responsibilities.

Next steps

We’ve taken a whistle-stop tour on how you can bring compliance into your development flow. From branch protection rules and pull requests to CODEOWNERS and environment approvals, there are several GitHub capabilities that can help you naturally focus on compliance.

This is only one step in solving the problem. A common language between compliance and DevOps practitioners is crucial in demonstrating the implemented measures. With that common language, it is clear that everyone must think about compliance; engineering teams are a part of the three lines of defense.

Next up in the series, we’ll be talking about how to ensure compliance in developer workflows.

Ready to increase developer velocity and collaboration while remaining secure and compliant? See how GitHub Enterprise can help.

New zoom freezing feature for Geohash plugin

Post Syndicated from Grab Tech original https://engineering.grab.com/geohash-plugin

Introduction

Geohash is an encoding system with a unique identifier for each region on the planet. Therefore, all geohash units can be associated with an individual set of digits and letters.

Geohash is a plugin built by Grab that is available in the Java OpenStreetMap Editor (JOSM) tool, which comes in handy for those who work on precise areas based on geohash units.

Background

Up until recently, users of the Geohash JOSM plugin were unable to stop the displaying of new geohashes with every zoom-in or zoom-out. This meant that every time they changed the zoom, new geohashes would be displayed, and this became bothersome for many users when it was unneeded. The previous behaviour of the plugin when zooming in and out is depicted in the following short video:


This led to the implementation of the zoom freeze feature, which helps users toggle between Enable zoom freeze and Disable zoom freeze, based on their needs.

Solution

As you can see in the following image, a new label was created with the purpose of freezing or unfreezing the display of new geohashes with each zoom change:


By default, this label says “Enable zoom freeze”, and when zoom freezing is enabled, the label changes to “Disable zoom freeze”.

In order to see how zoom freezing works, let’s consider the following example: a user wants to zoom inside the geohash with the code w886hu, without triggering the display of smaller geohashes inside of it. For this purpose, the user will enable the zoom freezing feature by clicking on the label, and then they will proceed with the zoom. The map will look like this:


It is apparent from the image that no new geohashes were created. Now, let’s say the user has finished what they wanted to do, and wants to go back to the “normal” geohash visualisation mode, which means disabling the zoom freeze option. After clicking on the label that now says ‘Disable zoom freeze’, new, smaller geohashes will be displayed, according to the current zoom level:


The functionality is illustrated in the following short video:


Another effect that enabling zoom freeze has is that it disables the ‘Display larger geohashes’ and ‘Display smaller geohashes’ options, since the geohashes are now fixed. The following images show how these options work before and after disabling zoom freeze:



To conclude, we believe that the release of this new feature will benefit users by making it more comfortable for them to zoom in and out of a map. By turning off the display of new geohashes when this is unwanted, map readability is improved, and this translates to a better user experience.

Impact/Limitations

In order to start using this new feature, users need to update the Geohash JOSM plugin.

What’s next?

Grab has come a long way in map-making, from using open source map-making software and developing its own suite of map-making tools to contributing to the open-source map community and building and launching GrabMaps. To find out more, read How KartaCam powers GrabMaps and KartaCam delivers comprehensive, cost-effective mapping data.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

The technology behind GitHub’s new code search

Post Syndicated from Timothy Clem original https://github.blog/2023-02-06-the-technology-behind-githubs-new-code-search/

From launching our technology preview of the new and improved code search experience a year ago, to the public beta we released at GitHub Universe last November, there’s been a flurry of innovation and dramatic changes to some of the core GitHub product experiences around how we, as developers, find, read, and navigate code.

One question we hear about the new code search experience is, “How does it work?” And to complement a talk I gave at GitHub Universe, this post gives a high-level answer to that question and provides a small window into the system architecture and technical underpinnings of the product.

So, how does it work? The short answer is that we built our own search engine from scratch, in Rust, specifically for the domain of code search. We call this search engine Blackbird, but before I explain how it works, I think it helps to understand our motivation a little bit. At first glance, building a search engine from scratch seems like a questionable decision. Why would you do that? Aren’t there plenty of existing, open source solutions out there already? Why build something new?

To be fair, we’ve tried and have been trying, for almost the entire history of GitHub, to use existing solutions for this problem. You can read a bit more about our journey in Pavel Avgustinov’s post, A brief history of code search at GitHub, but one thing sticks out: we haven’t had a lot of luck using general text search products to power code search. The user experience is poor, indexing is slow, and it’s expensive to host. There are some newer, code-specific open source projects out there, but they definitely don’t work at GitHub’s scale. So, knowing all that, we were motivated to create our own solution by three things:

  1. We’ve got a vision for an entirely new user experience that’s about being able to ask questions of code and get answers through iteratively searching, browsing, navigating, and reading code.
  2. We understand that code search is uniquely different from general text search. Code is already designed to be understood by machines and we should be able to take advantage of that structure and relevance. Searching code also has unique requirements: we want to search for punctuation (for example, a period or open parenthesis); we don’t want stemming; we don’t want stop words to be stripped from queries; and, we want to search with regular expressions.
  3. GitHub’s scale is truly a unique challenge. When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time). Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle. For the beta, you can currently search almost 45 million repositories, representing 115 TB of code and 15.5 billion documents.

At the end of the day, nothing off the shelf met our needs, so we built something from scratch.

Just use grep?

First though, let’s explore the brute force approach to the problem. We get this question a lot: “Why don’t you just use grep?” To answer that, let’s do a little napkin math using ripgrep on that 115 TB of content. On a machine with an eight core Intel CPU, ripgrep can run an exhaustive regular expression query on a 13 GB file cached in memory in 2.769 seconds, or about 0.6 GB/sec/core.

We can see pretty quickly that this really isn’t going to work for the larger amount of data we have. Code search runs on 64 core, 32 machine clusters. Even if we managed to put 115 TB of code in memory and assume we can perfectly parallelize the work, we’re going to saturate 2,048 CPU cores for 96 seconds to serve a single query! Only that one query can run. Everybody else has to get in line. This comes out to a whopping 0.01 queries per second (QPS) and good luck doubling your QPS—that’s going to be a fun conversation with leadership about your infrastructure bill.

There’s just no cost-effective way to scale this approach to all of GitHub’s code and all of GitHub’s users. Even if we threw a ton of money at the problem, it still wouldn’t meet our user experience goals.

You can see where this is going: we need to build an index.

A search index primer

We can only make queries fast if we pre-compute a bunch of information in the form of indices, which you can think of as maps from keys to sorted lists of document IDs (called “posting lists”) where that key appears. As an example, here’s a small index for programming languages. We scan each document to detect what programming language it’s written in, assign a document ID, and then create an inverted index where language is the key and the value is a posting list of document IDs.

Forward index

Doc ID Content
1 def lim
puts “mit”
end
2 fn limits() {
3 function mits() {

Inverted index

Language Doc IDs (postings)
JavaScript 3, 8, 12, …
Ruby 1, 10, 13, …
Rust 2, 5, 11, …

For code search, we need a special type of inverted index called an ngram index, which is useful for looking up substrings of content. An ngram is a sequence of characters of length n. For example, if we choose n=3 (trigrams), the ngrams that make up the content “limits” are lim, imi, mit, its. With our documents above, the index for those trigrams would look like this:

ngram Doc IDs (postings)
lim 1, 2, …
imi 2, …
mit 1, 2, 3, …
its 2, 3, …

To perform a search, we intersect the results of multiple lookups to give us the list of documents where the string appears. With a trigram index you need four lookups: lim, imi, mit, and its in order to fulfill the query for limits.

Unlike a hashmap though, these indices are too big to fit in memory, so instead, we build iterators for each index we need to access. These lazily return sorted document IDs (the IDs are assigned based on the ranking of each document) and we intersect and union the iterators (as demanded by the specific query) and only read far enough to fetch the requested number of results. That way we never have to keep entire posting lists in memory.

Indexing 45 million repositories

The next problem we have is how to build this index in a reasonable amount of time (remember, this took months in our first iteration). As is often the case, the trick here is to identify some insight into the specific data we’re working with to guide our approach. In our case it’s two things: Git’s use of content addressable hashing and the fact that there’s actually quite a lot of duplicate content on GitHub. Those two insights lead us the the following decisions:

  1. Shard by Git blob object ID which gives us a nice way of evenly distributing documents between the shards while avoiding any duplication. There won’t be any hot servers due to special repositories and we can easily scale the number of shards as necessary.
  2. Model the index as a tree and use delta encoding to reduce the amount of crawling and to optimize the metadata in our index. For us, metadata are things like the list of locations where a document appears (which path, branch, and repository) and information about those objects (repository name, owner, visibility, etc.). This data can be quite large for popular content.

We also designed the system so that query results are consistent on a commit-level basis. If you search a repository while your teammate is pushing code, your results shouldn’t include documents from the new commit until it has been fully processed by the system. In fact, while you’re getting back results from a repository-scoped query, someone else could be paging through global results and looking at a different, prior, but still consistent state of the index. This is tricky to do with other search engines. Blackbird provides this level of query consistency as a core part of its design.

Let’s build an index

Armed with those insights, let’s turn our attention to building an index with Blackbird. This diagram represents a high level overview of the ingest and indexing side of the system.

a high level overview of the ingest and indexing side of the system

Kafka provides events that tell us to go index something. There are a bunch of crawlers that interact with Git and a service for extracting symbols from code, and then we use Kafka, again, to allow each shard to consume documents for indexing at its own pace.

Though the system generally just responds to events like git push to crawl changed content, we have some work to do to ingest all the repositories for the first time. One key property of the system is that we optimize the order in which we do this initial ingest to make the most of our delta encoding. We do this with a novel probabilistic data structure representing repository similarity and by driving ingest order from a level order traversal of a minimum spanning tree of a graph of repository similarity1.

Using our optimized ingest order, each repository is then crawled by diffing it against its parent in the delta tree we’ve constructed. This means we only need to crawl the blobs unique to that repository (not the entire repository). Crawling involves fetching blob content from Git, analyzing it to extract symbols, and creating documents that will be the input to indexing.

These documents are then published to another Kafka topic. This is where we partition2 the data between shards. Each shard consumes a single Kafka partition in the topic. Indexing is decoupled from crawling through the use of Kafka and the ordering of the messages in Kafka is how we gain query consistency.

The indexer shards then take these documents and build their indices: tokenizing to construct ngram indices3 (for content, symbols, and paths) and other useful indices (languages, owners, repositories, etc) before serializing and flushing to disk when enough work has accumulated.

Finally, the shards run compaction to collapse up smaller indices into larger ones that are more efficient to query and easier to move around (for example, to a read replica or for backups). Compaction also k-merges the posting lists by score so relevant documents have lower IDs and will be returned first by the lazy iterators. During the initial ingest, we delay compaction and do one big run at the end, but then as the index keeps up with incremental changes, we run compaction on a shorter interval as this is where we handle things like document deletions.

Life of a query

Now that we have an index, it’s interesting to trace a query through the system. The query we’re going to follow is a regular expression qualified to the Rails organization looking for code written in the Ruby programming language: /arguments?/ org:rails lang:Ruby. The high level architecture of the query path looks a little bit like this:

Architecture diagram of a query path.

In between GitHub.com and the shards is a service that coordinates taking user queries and fanning out requests to each host in the search cluster. We use Redis to manage quotas and cache some access control data.

The front end accepts the user query and passes it along to the Blackbird query service where we parse the query into an abstract syntax tree and then rewrite it, resolving things like languages to their canonical Linguist language ID and tagging on extra clauses for permissions and scopes. In this case, you can see how rewriting ensures that I’ll get results from public repositories or any private repositories that I have access to.

And(
    Owner("rails"),
    LanguageID(326),
    Regex("arguments?"),
    Or(
        RepoIDs(...),
        PublicRepo(),
    ),
)

Next, we fan out and send n concurrent requests: one to each shard in the search cluster. Due to our sharding strategy, a query request must be sent to each shard in the cluster.

On each individual shard, we then do some further conversion of the query in order to lookup information in the indices. Here, you can see that the regex gets translated into a series of substring queries on the ngram indices.

and(
  owners_iter("rails"),
  languages_iter(326),
  or(
    and(
      content_grams_iter("arg"),
      content_grams_iter("rgu"),
      content_grams_iter("gum"),
      or(
        and(
         content_grams_iter("ume"),
         content_grams_iter("ment")
        )
        content_grams_iter("uments"),
      )
    ),
    or(paths_grams_iter…)
    or(symbols_grams_iter…)
  ), 
  …
)

If you want to learn more about a method to turn regular expressions into substring queries, see Russ Cox’s article on Regular Expression Matching with a Trigram Index. We use a different algorithm and dynamic gram sizes instead of trigrams (see below3). In this case the engine uses the following grams: arg,rgu, gum, and then either ume and ment, or the 6 gram uments.

The iterators from each clause are run: and means intersect, or means union. The result is a list of documents. We still have to double check each document (to validate matches and detect ranges for them) before scoring, sorting, and returning the requested number of results.

Back in the query service, we aggregate the results from all shards, re-sort by score, filter (to double-check permissions), and return the top 100. The GitHub.com front end then still has to do syntax highlighting, term highlighting, pagination, and finally we can render the results to the page.

Our p99 response times from individual shards are on the order of 100 ms, but total response times are a bit longer due to aggregating responses, checking permissions, and things like syntax highlighting. A query ties up a single CPU core on the index server for that 100 ms, so our 64 core hosts have an upper bound of something like 640 queries per second. Compared to the grep approach (0.01 QPS), that’s screaming fast with plenty of room for simultaneous user queries and future growth.

In summary

Now that we’ve seen the full system, let’s revisit the scale of the problem. Our ingest pipeline can publish around 120,000 documents per second, so working through those 15.5 billion documents should take about 36 hours. But delta indexing reduces the number of documents we have to crawl by over 50%, which allows us to re-index the entire corpus in about 18 hours.

There are some big wins on the size of the index as well. Remember that we started with 115 TB of content that we want to search. Content deduplication and delta indexing brings that down to around 28 TB of unique content. And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content. This means our total index size including the content is roughly a quarter the size of the original data!

If you haven’t signed up already, we’d love for you to join our beta and try out the new code search experience. Let us know what you think! We’re actively adding more repositories and fixing up the rough edges based on feedback from people just like you.

Notes


  1. To determine the optimal ingest order, we need a way to tell how similar one repository is to another (similar in terms of their content), so we invented a new probabilistic data structure to do this in the same class of data structures as MinHash and HyperLogLog. This data structure, which we call a geometric filter, allows computing set similarity and the symmetric difference between sets with logarithmic space. In this case, the sets we’re comparing are the contents of each repository as represented by (path, blob_sha) tuples. Armed with that knowledge, we can construct a graph where the vertices are repositories and edges are weighted with this similarity metric. Calculating a minimum spanning tree of this graph (with similarity as cost) and then doing a level order traversal of the tree gives us an ingest order where we can make best use of delta encoding. Really though, this graph is enormous (millions of nodes, trillions of edges), so our MST algorithm computes an approximation that only takes a few minutes to calculate and provides 90% of the delta compression benefits we’re going for. 
  2. The index is sharded by Git blob SHA. Sharding means spreading the indexed data out across multiple servers, which we need to do in order to easily scale horizontally for reads (where we are concerned about QPS), for storage (where disk space is the primary concern), and for indexing time (which is constrained by CPU and memory on the individual hosts). 
  3. The ngram indices we use are especially interesting. While trigrams are a known sweet spot in the design space (as Russ Cox and others have noted: bigrams aren’t selective enough and quadgrams take up too much space), they cause some problems at our scale.

    For common grams like for trigrams aren’t selective enough. We get way too many false positives and that means slow queries. An example of a false positive is something like finding a document that has each individual trigram, but not next to each other. You can’t tell until you fetch the content for that document and double check at which point you’ve done a lot of work that has to be discarded. We tried a number of strategies to fix this like adding follow masks, which use bitmasks for the character following the trigram (basically halfway to quad grams), but they saturate too quickly to be useful.

    We call the solution “sparse grams,” and it works like this. Assume you have some function that given a bigram gives a weight. As an example, consider the string chester. We give each bigram a weight: 9 for “ch”, 6 for “he”, 3 for “es”, and so on.

    Click to view slideshow.

    Using those weights, we tokenize by selecting intervals where the inner weights are strictly smaller than the weights at the borders. The inclusive characters of that interval make up the ngram and we apply this algorithm recursively until its natural end at trigrams. At query time, we use the exact same algorithm, but keep only the covering ngrams, as the others are redundant. 

Enabling branch deployments through IssueOps with GitHub Actions

Post Syndicated from Grant Birkinbine original https://github.blog/2023-02-02-enabling-branch-deployments-through-issueops-with-github-actions/

At GitHub, the branch deploy model is ubiquitous and it is the standard way we ship code to production, and it has been for years. We released details about how we perform branch deployments with ChatOps all the way back in 2015.

We are able to use ChatOps to perform branch deployments for most of our repositories, but there are a few situations where ChatOps simply won’t work for us. What if developers want to leverage branch deployments but don’t have a full ChatOps stack integrated with their repositories? We wanted to set out to find a way for all developers to be able to take advantage of branch deployments with ease, right from their GitHub repository, and so the branch-deploy Action was born!

Gif demonstrating how to us the branch-deploy Action.

How Does GitHub use this Action?

GitHub primarily uses ChatOps with Hubot to facilitate branch deployments where we can. If ChatOps isn’t an option, we use this branch-deploy Action instead. The majority of our use cases include Infrastructure as Code (IaC) repositories where we use Terraform to deploy infrastructure changes. GitHub uses this Action in many internal repositories and so does npm. There are also many other public, open source, and corporate organizations adopting this Action, as well, to help ship their code to production!

Understanding the branch deploy model

Before we dive into the branch-deploy Action, let’s first understand what the branch deploy model is and why it is so useful.

To really understand the branch deploy model, let’s first take a look at a traditional deploy → merge model. It goes like this:

  1. Create a branch.
  2. Add commits to your branch.
  3. Open a pull request.
  4. Gather feedback plus peer reviews.
  5. Merge your branch.
  6. A deployment starts from the main branch.
Diagram outlining the steps of the traditional deploy model, enumerated in the numbered list above.

Now, let’s take a look at the branch deploy model:

  1. Create a branch.
  2. Add commits to your branch.
  3. Open a pull request.
  4. Gather feedback plus peer reviews.
  5. Deploy your change.
  6. Validate.
  7. Merge your branch to the main / master branch.
Diagram outlining the steps of the branch deploy model, enumerated in the list above.

The merge deploy model is inherently riskier because the main branch is never truly a stable branch. If a deployment fails, or we need to roll back, we follow the entire process again to roll back our changes. However, in the branch deploy model, the main branch is always in a “good” state and we can deploy it at any time to revert the deployment from a branch deploy. In the branch deploy model, we only merge our changes into main once the branch has been successfully deployed and validated.

Note: this is sometimes referred to as the GitHub flow.

Key concepts

Key concepts of the branch deploy model:

  • The main branch is always considered to be a stable and deployable branch.
  • All changes are deployed to production before they are merged to the main branch.
  • To roll back a branch deployment, you deploy the main branch.

By now you may be sold on the branch deploy methodology. How do we implement it? Introducing IssueOps with GitHub Actions!

IssueOps

The best way to define IssueOps is to compare it to something similar, ChatOps. You may be familiar with the concept, ChatOps, already; if not, here is a quick definition:

ChatOps is the process of interacting with a chat bot to execute commands directly in a chat platform. For example, with ChatOps you might do something like .ping example.org to check the status of a website.

IssueOps adopts the same mindset but through a different medium. Rather than using a chat service (Discord, Slack, etc.) to invoke the commands we use comments on a GitHub Issue or pull request. GitHub Actions is the runtime that executes our desired logic when an IssueOps command is invoked.

GitHub Actions

How does it work? This section will go into detail about how this Action works and hopefully inspire you to leverage it in your own projects. The full source code and further documentation can be found on GitHub.

Let’s walk through the process using the demo configuration of a branch-deploy Action below.

1. Create this file under .github/workflows/branch-deploy.yml in your GitHub repository:

name: "branch deploy demo"

# The workflow will execute on new comments on pull requests - example: ".deploy" as a comment
on:
  issue_comment:
    types: [created]

jobs:
  demo:
    if: ${{ github.event.issue.pull_request }} # only run on pull request comments (no need to run on issue comments)
    runs-on: ubuntu-latest
    steps:
      # Execute IssueOps branch deployment logic, hooray!
      # This will be used to "gate" all future steps below and conditionally trigger steps/deployments
      - uses: github/[email protected] # replace X.X.X with the version you want to use
        id: branch-deploy # it is critical you have an id here so you can reference the outputs of this step
        with:
          trigger: ".deploy" # the trigger phrase to look for in the comment on the pull request

      # Run your deployment logic for your project here - examples seen below

      # Checkout your project repository based on the ref provided by the branch-deploy step
      - uses: actions/[email protected]
        if: ${{ steps.branch-deploy.outputs.continue == 'true' }} # skips if the trigger phrase is not found
        with:
          ref: ${{ steps.branch-deploy.outputs.ref }} # uses the detected branch from the branch-deploy step

      # Do some fake "noop" deployment logic here
      # conditionally run a noop deployment
      - name: fake noop deploy
        if: ${{ steps.branch-deploy.outputs.continue == 'true' && steps.branch-deploy.outputs.noop == 'true' }} # only run if the trigger phrase is found and the branch-deploy step detected a noop deployment
        run: echo "I am doing a fake noop deploy"

      # Do some fake "regular" deployment logic here
      # conditionally run a regular deployment
      - name: fake regular deploy
        if: ${{ steps.branch-deploy.outputs.continue == 'true' && steps.branch-deploy.outputs.noop != 'true' }} # only run if the trigger phrase is found and the branch-deploy step detected a regular deployment
        run: echo "I am doing a fake regular deploy"

2. Trigger a noop deploy by commenting .deploy noop on a pull request.

A noop deployment is detected so this action outputs the noop variable to true. If you have the correct permissions to execute the IssueOps command, the action outputs the continue variable to true as well. The step named fake noop deploy runs, while the fake regular deploy step is skipped.

3. After your noop deploy completes, you would typically run .deploy to execute the actual deployment, fake regular deploy.

Features

The best part about the branch-deploy Action is that it is highly customizable for any deployment targets and use cases. Here are just a few of the features that this Action comes bundled with:

  • 🔍 Detects when IssueOps commands are used on a pull request.
  • 📝 Configurable: choose your command syntax, environment, noop trigger, base branch, reaction, and more.
  • ✅ Respects your branch protection settings configured for the repository.
  • 💬 Comments and reacts to your IssueOps commands.
  • 🚀 Triggers GitHub deployments for you with simple configuration.
  • 🔓 Deploy locks to prevent multiple deployments from clashing.
  • 🌎 Configurable environment targets.

The repository also comes with a usage guide, which can be referenced by you and your team to quickly get familiar with available IssueOps commands and how they work.

Examples

The branch-deploy Action is customizable and suited for a wide range of projects. Here are a few examples of how you can use the branch-deploy Action to deploy to different services:

Conclusion

If you are looking to enhance your DevOps experience, have better reliability in your deployments, or ship changes faster, then branch deployments are for you!

Hopefully, you now have a better understanding of why the branch deploy model is a great option for shipping your code to production.

By using GitHub plus Actions plus IssueOps you can leverage the branch deploy model in any repository!

Source code: GitHub