Tag Archives: Engineering

How to build an enterprise LLM application: Lessons from GitHub Copilot

Post Syndicated from Shuyin Zhao original https://github.blog/2023-09-06-how-to-build-an-enterprise-llm-application-lessons-from-github-copilot/

If you want to build and scale an application using a large language model (LLM), this article’s for you.

It took us three years to develop GitHub Copilot before we officially launched it to the general public. To go from idea to production, we followed three stages—find it, nail it, scale it—loosely based on the “Nail It, Then Scale It” framework for entrepreneurial product development.

Here’s how it breaks down:

  • Find it: Identify an impactful problem space for your LLM application
  • Nail it: Create a smooth AI product experience
  • Scale it: Get your LLM application ready and usable for general availability (GA)

Let’s get started.

Find it: Isolate the problem you want to solve

Sometimes the hardest part about creating a solution is scoping down a problem space. The problem should be focused enough to quickly deliver impact, but also big enough that the right solution will wow users. Additionally, you want to find a problem where the use of an LLM is the right solution (and isn’t integrated to just drive product engagement).

  • Get clear on who you want to help. We saw that AI could drive efficiency, so we wanted to prioritize helping developers who were consistently crunched for time, enabling them to write code faster with less context switching.

  • Focus on a single problem, first. Rather than trying to address all developer problems with AI, we focused on one part of the software development lifecycle: coding functions in the IDE. At the time, most AI coding assistants could only complete a single line of code.

  • Balance product ambition with quality. While the GitHub Copilot team initially explored generating entire commits, the state of LLMs couldn’t support that function at a high enough quality at the time. Through additional testing, the team landed on code suggestions at the “whole function” level.

  • Meet people where they are. When it comes to designing products for developers, an LLM app should amplify an existing tool or integrate into an existing workflow. A mantra of the GitHub Copilot team was, “It’s a bug if you have to change the way you code when using GitHub Copilot.” In practice, this means enabling developers to receive code suggestions without changing how they work.

Nail it: Iterate to create a smooth AI product experience

Product development with emerging tech, like generative AI, is often more of a winding path than a linear journey because so much is unknown and rapid advancements in the field can quickly open new doors. Building quick iteration cycles into the product development process allows teams to fail and learn fast. At GitHub, the main mechanism for us to quickly iterate is an A/B experimentation platform.

According to Idan Gazit, Senior Director of Research for GitHub Next, “We have to design apps not only for models whose outputs need evaluation by humans, but also for humans who are learning how to interact with AI.”

  • Put yourself in users’ shoes. GitHub employees have a culture of putting themselves in the shoes of their end users by “dogfooding” products before—and after—they’re released. In practice, this meant the GitHub Copilot team stood up a simple web interface where they could tinker with foundation models and explore ways to leverage those models in their own developer workflows.

    We quickly found that a web interface was not the right canvas since it meant developers had to switch back and forth between their editor and the web browser. As a result, the team decided to focus on bringing GitHub Copilot to the IDE and making the AI capability modeless—or working in the background.

    The developers on our team also noticed they often referenced multiple open tabs in the IDE while coding. This insight led them to experiment with a technique called neighboring tabs, where GitHub Copilot processes multiple files open in a developer’s IDE instead of just the single one the developer is working on. Neighboring tabs helped to increase the acceptance rates of GitHub Copilot’s suggestions by 5%.

  • Evaluate your testing tools. As our experiment continued, we had to scale our internal testing tools to be more versatile and powerful. While we initially relied on our own tools for testing, we ultimately switched to the Microsoft Experimentation Platform to optimize functionality based on the feedback and interaction at scale.

  • Avoid the sunk cost fallacy. This is when you’re reluctant to abandon a course of action because you’ve heavily invested in it—even when it’s clear switching gears would be more beneficial.

    The GitHub and OpenAI teams initially believed every coding language would require its own fine-tuned AI model. But the field of generative AI was rapidly advancing, and our assumption turned out to be incorrect. In the end, OpenAI’s LLMs significantly improved and one model could handle a wide variety of coding languages and tasks.

  • Make a habit of revisiting old ideas. In a field that’s rapidly advancing, the functions that aren’t feasible with today’s LLMs might be possible with tomorrow’s.

    In the beginning, we explored a chat interface for developers to ask coding questions. However, in early testing, users had much higher expectations for the capabilities and quality of coding suggestions than GitHub Copilot could deliver at the time. As a result, we deprioritized the feature. But as customers became familiar with AI chatbots following the emergence of ChatGPT and LLMs continued to evolve, an iterative chat experience, like GitHub Copilot Chat, became possible.

Scale it: Optimize quality, usability, and responsible use of AI to get to GA

Early feedback and technical previews are key to driving product improvements and getting your application to GA. Below you’ll find the steps we took before launching the GitHub Copilot technical preview, how we managed the technical preview and optimized user feedback, and how we prepared our internal infrastructure to handle demand at scale.

Optimize quality and usability

  • Ensure consistent results. Because LLMs are probabilistic—meaning they don’t always produce the same, predictable outcomes—experimentation with them needs to be statistically based. One solution involves setting up a quality pipeline that addresses this unique challenge of building with LLMs.

    For instance, when the GitHub Copilot team first decided to provide whole function coding suggestions, we also had to ensure output predictability and consistency, where the same prompt and context would produce the same suggestions from the AI model.

    To achieve this, the team applied two strategies: changing the parameters to reduce the randomness of outputs and caching responses. Additionally, using cached responses instead of generating new responses to the same prompt not only reduced variability in suggestions, but it also improved performance. (A small sketch of this caching idea follows this list.)

  • Implement a waitlist for your technical preview. A waitlist allowed the GitHub Copilot team to manage questions, feedback, and comments—and ensure we could address them effectively. This approach also helped ensure we had a diverse set of early adopters across developers of varying experience levels to provide feedback.

  • Take advantage of real user feedback. In one example, developers shared that an update had negatively affected the quality of the model’s coding suggestions. In response, the GitHub Copilot team implemented a new guardrail metric—the percentage of suggestions that are multi-line vs. single line—and tuned the model to ensure customers continued to get high-quality suggestions.

    While the GitHub team actively dogfooded GitHub Copilot to understand what the experience was like for developers, we also benefited from developers outside GitHub adding diverse feedback across real-world use cases. The GitHub Copilot team engaged and interacted with technical preview users early, often, and on the users’ preferred platforms. This allowed us to actively respond to issues and feedback in real time.

  • Commit to iterating as you scale. When GitHub Copilot became generally available, the team not only had to improve the product, but also its infrastructure. While we were experimenting with and quickly iterating on GitHub Copilot, it worked directly with the OpenAI API. As the product grew, we scaled our use of Microsoft Azure’s infrastructure to ensure GitHub Copilot had the quality, reliability, and responsible guardrails of a large-scale, enterprise-grade product.

  • Define the product’s key performance metrics. To optimize GitHub Copilot, we used early developer feedback to identify the right performance metrics, such as code acceptance rate and, eventually, code retention rate (which measures how much of the original code suggestion is kept or edited by a developer).

  • Optimize costs. The team worked to optimize the costs of delivering GitHub Copilot suggestions while balancing developer impact. For instance, before we decided on using ghost text—the gray text that flashes one coding suggestion while you type—the tool would eagerly generate 10 suggestions and display them all at once. This incurred upfront compute costs for suggestions two through 10, when most people choose the first one. But it also created a user experience cost, because those 10 suggestions pulled developers out of their workflow and into an evaluation mindset. “It was like paying to calculate the results that appear on the second page of a search engine—and making that second page grab your attention—even though most folks end up using the top result,” Gazit says.

    Optimizing costs is an ongoing project, and we’re exploring new ideas to reduce costs while improving the user experience.
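
To make the earlier point about consistent results concrete, here is a minimal sketch, in Rust, of the caching strategy described above. It is an illustration rather than GitHub’s implementation: responses are keyed by the prompt plus the decoding parameters, a temperature of zero keeps decoding as deterministic as the model allows, and call_model is a hypothetical stand-in for a real LLM API call.

use std::collections::HashMap;

// Decoding parameters that influence output randomness. Lowering temperature
// (and fixing the other sampling settings) makes outputs more repeatable for
// the same prompt. Temperature is stored in thousandths so the key stays hashable.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Params {
    model: String,
    temperature_milli: u32,
    max_tokens: u32,
}

struct SuggestionCache {
    cache: HashMap<(String, Params), String>,
}

impl SuggestionCache {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    // Return a cached suggestion when the same prompt and parameters were seen
    // before; otherwise call the model and remember the result.
    fn suggest(&mut self, prompt: &str, params: &Params) -> String {
        let key = (prompt.to_string(), params.clone());
        if let Some(hit) = self.cache.get(&key) {
            return hit.clone(); // identical prompt + params => identical suggestion
        }
        let response = call_model(prompt, params);
        self.cache.insert(key, response.clone());
        response
    }
}

// Hypothetical model call; a real system would hit an LLM API here.
fn call_model(prompt: &str, _params: &Params) -> String {
    format!("// suggestion for: {prompt}")
}

fn main() {
    let params = Params {
        model: "example-model".into(),
        temperature_milli: 0, // temperature 0.0: least random decoding
        max_tokens: 256,
    };
    let mut cache = SuggestionCache::new();
    let first = cache.suggest("fn add(a: i32, b: i32) -> i32 {", &params);
    let second = cache.suggest("fn add(a: i32, b: i32) -> i32 {", &params);
    assert_eq!(first, second); // the second call is served from the cache
}

In practice, the cache key would also need to cover the surrounding context sent along with the prompt, since, as noted above, suggestions depend on both.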

Optimize responsible use of AI

  • Prioritize security and trust. Feedback during GitHub Copilot’s technical preview reinforced the importance of suggesting code that is secure. In response, the team integrated code security capabilities to filter out suggestions that could contain security vulnerabilities (e.g., SQL injections and hard-coded credentials) and used natural language filters from Azure OpenAI Service to filter out offensive content.

  • Allow your community to help you. At GitHub, we deeply value our extensive developer community’s feedback on our products and collaborate with them to improve our offerings. With GitHub Copilot, our developer community was critical to understanding the potential around AI—and some concerns, too.

    For instance, the developer community was concerned that GitHub Copilot suggestions might match public code. In response, the GitHub Copilot team created a filter to block suggestions matching public source code in GitHub public repositories that were longer than 150 characters. (A simplified sketch of such a filter follows this list.)

    Based on community input, the GitHub Copilot team also developed a code reference tool that includes links to public code that may match GitHub Copilot suggestions, so developers can review potential matches (and relevant licensing information), and make informed choices.
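
To make the public code filter a bit more concrete, here is a highly simplified sketch in Rust. It is illustrative only: matches_public_code is a hypothetical stand-in for the index of public GitHub code the real filter consults, and the matching and length rules shown here are assumptions rather than the production logic.

// Hypothetical lookup against an index of public source code. The real system
// matches suggestions against code in public GitHub repositories at scale.
fn matches_public_code(snippet: &str) -> bool {
    snippet.contains("fn widely_copied_helper(")
}

// Only suggestions longer than ~150 characters are checked against the index,
// mirroring the length threshold described above.
fn should_block(suggestion: &str, filter_enabled: bool) -> bool {
    const MIN_MATCH_LEN: usize = 150;
    filter_enabled && suggestion.len() > MIN_MATCH_LEN && matches_public_code(suggestion)
}

fn main() {
    let short = "let x = 1;";
    assert!(!should_block(short, true)); // too short to count as a match

    let long = "fn widely_copied_helper(".to_string() + &"x".repeat(150) + ")";
    assert!(should_block(&long, true));
    assert!(!should_block(&long, false)); // nothing is blocked when the filter is not enabled
}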

Develop a go-to-market strategy

  • Launch your product with product evangelists. Before launching the technical preview of GitHub Copilot in 2021, the team presented the prototype to influential members of the software developer community and GitHub Stars. This allowed us to launch the technical preview with an existing base of support and extend the preview’s reach to a broader range of users.

  • Get your product in front of individual users before going after businesses. The team decided to first sell licenses directly to developers, who would clearly benefit from an AI coding assistant. We paired this approach with a free trial program and monthly pricing, based on user survey findings that individuals prefer a simple and predictable subscription. Gaining traction among individual users helped to build a foundation of support and drive adoption at the enterprise level.

Key takeaways

We’re still in the early days of generative AI, so we’re keeping close tabs on the demand and need for this new technology. While each company and product will need to define its own approach to building an LLM app, here are some key learnings from our product journey with GitHub Copilot:

  • Identify a focused problem and thoughtfully discern an AI’s use cases. This will ensure your app has greater impact and a faster time-to-market.

  • Integrate experimentation and tight feedback loops into the design process. This is especially critical when working with LLMs, where outputs are probabilistic and most end users are just learning how to interact with AI models.

  • As you scale, continue to leverage user feedback and prioritize user needs. Doing so will ensure that your product is built to deliver consistent results and real value.

If you’re looking for a problem to solve with an LLM app, check out our post on how companies are boosting productivity with generative AI. You can also take lessons from how GitHub used GitHub Actions to help an AI nonprofit, Ersilia, disseminate AI models to advance pharmaceutical research in low- and middle-income countries.

The post How to build an enterprise LLM application: Lessons from GitHub Copilot appeared first on The GitHub Blog.

Optimize your GitHub Codespaces costs with upgraded virtual machines

Post Syndicated from Craig Peters original https://github.blog/2023-08-31-how-github-reduces-costs-with-upgraded-codespaces/

Since we released GitHub Codespaces in 2021, we’ve made a number of updates aimed at improving usability, controlling cost, and more (for example, free usage for all, one click into templates, and concurrency policies). Now, GitHub has improved our developer experience and reduced usage costs at the same time by taking advantage of new virtual machines that provide all of our users twice the RAM, and approximately 10-30% improved CPU performance after adopting Advanced Micro Devices (AMD)-based hosts. These changes enable you to achieve the same (or better) machine performance for half the cost of the previous machine generation.

How this change helps you

In our previous VM generation, memory intensive workloads often had to overprovision CPUs just to get enough RAM in order to run, particularly when running multiple services. For professional developers this was particularly frustrating because of the increased complexity of their development environments, and their higher expectations for performance. Now, rather than having to choose between paying a premium for larger developer machines or sacrificing developer experience and productivity, you can get the best of both worlds.

For example, at GitHub we use our own software and services to build GitHub itself. GitHub uses Codespaces to build not only Codespaces, but the entire platform. GitHub has a large Ruby monolith that requires significant CPU and RAM to test, and also sets an extremely high bar for developer experience. In order to operate these environments while maximizing developer happiness, GitHub used the largest virtual machines available in Codespaces.

Once the new machine types were available, GitHub’s internal developer experience (DX) team started by moving a few dev teams with RAM-hungry workflows to machines with half the CPU count, but the same RAM, to test whether they would be sufficient. With very little effort, and nearly zero developer impact, testing showed that developers were just as successful on the smaller machines, and GitHub incurred half the cost. As additional teams tried moving to the fewer-core machines, there was only one build process that turned out to be CPU architecture dependent. The fix was simple—to specify the CPU architecture so that QEMU could emulate appropriately. No other negative impacts were identified.

Due to the success of the initial trials, we quickly rolled out the changes to more teams. The result? Approximately 50% savings!

Figure 1: Codespaces cost for GitHub during the introduction of the AMD machines

Since we’ve rolled out the AMD machines for GitHub, we’ve seen no problems and had only happy users.

You can do the same in your organization by working with your development teams using GitHub Codespaces to test smaller machines on your existing development environments. All Codespaces virtual machines have been upgraded, so testing is as simple as having some developers try working in a smaller machine than they usually do. In most cases, no other configuration changes are necessary!

Once you have found the sweet spot for performance and experience, you can set a policy within your organization to restrict machine types, ensuring cost controls while providing environments that allow your developers to do their best work.

Save costs while empowering your developers

Now that these changes are in your hands, we invite you to see how much more you can get out of GitHub Codespaces by taking advantage of the improved processing power and increased headroom the RAM provides. As ever, please reach out to your account team, or participate in the GitHub Codespaces Community Discussions to provide us your feedback.


The post Optimize your GitHub Codespaces costs with upgraded virtual machines appeared first on The GitHub Blog.

Why Rust is the most admired language among developers

Post Syndicated from Sara Verdi original https://github.blog/2023-08-30-why-rust-is-the-most-admired-language-among-developers/


For the eighth year in a row, Rust has topped the chart as “the most desired programming language” in Stack Overflow’s annual developer survey. And with more than 80% of developers reporting that they’d like to use the language again next year, you have to wonder how a language created less than 20 years ago has stolen the hearts of developers around the world.

In this article, we’ll look at the history of Rust, what it’s commonly used for, why developers love it so much, and some resources to help you start learning one of the fastest growing languages on GitHub.

So, what is the Rust programming language?

Rust’s print macro displaying the output “Hello, World!”

Originally intended to serve as a safer alternative to C and C++, Rust is a systems programming language that has gained significant popularity among developers thanks to its emphasis on safety, performance, and productivity. Rust is a statically typed language, so variable and expression types are determined and checked at compile time, which helps enhance memory safety and error detection, resulting in more reliable builds.

In 2006, software developer Graydon Hoare started Rust as a personal project while he was working at Mozilla. According to an interview with MIT Technology Review, the inspiration for Rust came from a broken elevator in Hoare’s apartment building. The software for the lift operation system had crashed, and Hoare understood that issues like this usually came from problems with how a program uses memory.

Quite often, the software for these types of devices is written in C or C++, but these languages require significant memory management, which can lead to errors that would cause the system to crash. So, Hoare set to work on figuring out how to create a programming language that could be both compact and memory bug-free.

He later showed the project to a manager—which led to Mozilla sponsoring it in 2009 as part of a longer-term effort to incorporate the language into the development of an experimental browser engine. In 2010, Mozilla Research officially announced the Rust project and released the source code to the public as an open-source project. After several years of development, Rust reached a stable and mature state—and in May 2015, Rust 1.0 was released. This milestone signaled that Rust was ready for production and provided a foundation for developers to build upon.

Since the 1.0 release, Rust has exploded in popularity and adoption, with top applications, such as Microsoft Windows, utilizing Rust to rewrite core libraries with its memory-safe code. Outside of the tech giants, Rust also has a vibrant community of developers, or “Rustaceans,” that are dedicated to making the Rust experience an active and collaborative one.

Ferris, an orange cartoon crustacean who is the unofficial mascot of Rust.

Meet Ferris, the unofficial mascot for Rust!

According to a recent survey by SlashData, there are roughly 2.8 million Rust developers worldwide in 2023, a number that has nearly tripled over the past two years. With plenty of active forums, documentation, and a supportive community for developers of all skill levels, it’s perhaps unsurprising that Rust keeps topping the most-desired language lists.

What makes Rust special?

So, what are some of Rust’s key features that make it so attractive to developers?

In simple terms, Rust solves some of developers’ most frustrating memory management problems commonly associated with C and C++, but that’s not its only shining capability. One of GitHub’s staff software engineers, Jason Orendorff, who co-authored a book on programming with Rust, said about the language:

“To me, what’s great about Rust is that it’s both fast AND reliable. It lets me write multi-headed programs that run on 16 cores and keep them readable, maintainable, and crash-free. It also lets me write very low-level algorithms requiring control over memory layout and pull in a crate that makes HTTPS requests super simple. It’s the combination of these features that makes Rust so unique.”

Building on that, here’s a few more of its well-loved characteristics and features:

  • Concurrency. Rust has built-in support for concurrent programming through its ownership system, which enforces strict rules for data access, and its borrowing model, which prevents data races by allowing controlled, simultaneous access. This ensures that multiple threads can work on shared data without introducing memory-related issues.

  • No garbage collection. Unlike some programming languages, Rust does not employ garbage collection. Instead, its ownership and borrowing rules manage memory, which helps empower developers to have precise control over memory allocation and deallocation for efficient resource management.

  • Cargo Package Manager. Rust’s built-in package manager, Cargo, streamlines project management, dependency tracking, and building, which helps contribute to efficient and organized development workflows. But this doesn’t make it clear just how bonkers the Cargo ecosystem is. According to Orendorff, “My team takes advantage of high-quality open source packages for hashing, serialization, multithreading, data structures, compression, and a lot more. These are performance-critical libraries. Without some of these, our project to rethink code search on GitHub wouldn’t have been possible.” And here’s a fun fact: Rust was actually the first systems programming language to have a standard package manager, and, as a result, the Rust ecosystem is incredibly robust.

  • Zero-cost abstractions. This feature allows developers to write high-level code abstractions and features without introducing any runtime performance overhead.

  • Pattern matching. This powerful language feature enables developers to concisely and effectively match complex data structures against specific patterns to extract and handle different cases or scenarios in a clean and readable manner (see the short sketch after this list).

  • Type inference. This feature allows Rust’s compiler to automatically detect an expression based on context while you code. “Many programming languages have some type inference,” Orendorff said. “C# and C++ have some, Rust has a little more, and languages like Haskell, Scala, and ML have even more.”
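
To ground a couple of the features above, pattern matching and type inference in particular, here is a small, self-contained illustration (our own example, not taken from the sources quoted above):

fn describe(n: i32) -> &'static str {
    // Pattern matching handles each case explicitly; the compiler checks exhaustiveness.
    match n {
        0 => "zero",
        n if n < 0 => "negative",
        1..=9 => "small",
        _ => "large",
    }
}

fn main() {
    // Type inference: no annotation needed; `numbers` is inferred to be Vec<i32>.
    let numbers = vec![-3, 0, 4, 42];
    for n in numbers {
        println!("{n} is {}", describe(n));
    }
}

Note that `numbers` needed no type annotation at all, and if a new case were left unhandled, the match would fail to compile rather than fail at runtime.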

fn main() {
    break rust;
}

Run this code for an inside joke among Rust developers 😆

What is Rust commonly used for?

Thanks to its direct access to both hardware and memory, Rust is well suited for embedded systems and bare-metal development. And since it’s a general purpose language, it can also be used for a variety of applications.

Let’s explore a few key use cases:

Using Rust to build performance-critical backend systems

Performance-critical backend systems are software components or services that handle tasks that require high-speed processing, low-latency responses, and efficient resource utilization—and Rust’s performance, thread safety, and error handling make it an excellent choice for developing these types of systems. In fact, we use Rust to build some of these systems at GitHub. For example, the backend of our code search feature is written in Rust (and you can read more about the development of GitHub’s newest code search with Rust, too).

Using Rust to develop operating systems

Rust was originally created to solve an operating system issue (remember the elevator problem?)—so, unsurprisingly, it’s often used to build operating systems, kernels, device drivers, or other low-level components where control over memory and performance is crucial. Redox, a Unix-like operating system, was written in Rust, which contributes to its most crucial feature: its security. “Fuchsia is another example that was built at Google,” Orendorff said. “If you have a Google Nest smart speaker, it’s likely running Fuchsia.”

Rust for operating system-adjacent code

Rust is also well-suited for writing code that performs tasks that closely interact with the operating system. For example, the Codespaces team at GitHub is leveraging Rust to enhance the speed of starting up the virtual disk within GitHub Codespaces and optimize the utilization of Azure storage. Coursera also employs Rust in its online grading system, as it operates within Docker and needs a language that compiles to machine code with minimal dependencies.

Using Rust for web development

Rust is increasingly being used for web development—especially on the server side. The async programming model and performance characteristics of Rust make it fitting for building high-performance web servers, APIs, and backend services. Plus, there’s been an influx of web frameworks for Rust, like Rocket, that can help folks get started with writing secure web applications. The emergence of these frameworks underscores Rust’s position as a mature language, and also helps increase the support for folks looking to use Rust in frontend or backend work.

Using Rust for crypto and blockchain development

Rust’s speed, memory management, and security all contribute to its involvement with cryptocurrency and blockchain technologies. For example, Polkadot, which is designed to enable the interoperability and interaction between multiple blockchains to share information and assets in a secure and decentralized manner, utilizes Rust to build its core infrastructure. Polkadot’s runtime logic, which governs the behavior and rules of the blockchain, is also written in Rust. Check out this repository, awesome-blockchain-rust, for some useful components for building your own blockchain applications with Rust.

Using Rust to build CLI tools

Rust’s compilation to efficient machine code and its expressive syntax make it a strong choice for building command line tools and applications. Plus, writing a command line app is a great way to learn and get comfortable with Rust. Take a look at this comprehensive guide on how to build your own CLI application with Rust in 15 minutes!
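
To give a flavor of how little ceremony is involved, here is a minimal, dependency-free command line tool sketched in Rust. A real project would typically reach for a crate like clap for argument parsing; this example sticks to the standard library.

use std::env;
use std::process;

fn main() {
    // Skip the binary name and collect the remaining command line arguments.
    let args: Vec<String> = env::args().skip(1).collect();

    if args.is_empty() {
        eprintln!("usage: shout <words to print loudly>");
        process::exit(1);
    }

    // "Shout" whatever the user typed.
    println!("{}!", args.join(" ").to_uppercase());
}

Running it with cargo run -- hello world prints HELLO WORLD!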

Learn how open source developers are making the command line more friendly—and more powerful.

Using Rust for embedded systems and IoT development

Rust’s minimal runtime and control over memory layout makes it incredibly useful for developing embedded systems and Internet of Things (IoT) devices. Its ability to prevent memory-related bugs, manage concurrency, and generate small, efficient binaries caters to IoT’s security, real-time, and efficiency needs.

Why developers love Rust

While Rust’s user base isn’t nearly as large as Java’s or Python’s, Rust continues to compete with the big hitters in most-admired lists across the internet. There’s even a full website composed of developers’ praises for Rust.

But why exactly is Rust so admired by developers? If you boil it down to just a handful of reasons why developers love Rust so much, they’d have to be the language’s speed, safety, and performance.

Moreover, Rust is continuing to evolve and grow with new frameworks, tools, and resources. You can keep tabs on contributions to the language in the awesome-rust repository, which hosts an impressive list of Rust code and resources.

The bottom line: Admiring Rust isn’t just about adopting a language—it’s embracing a mindset that prioritizes innovation without compromising on the core tenets of stability and security.

How to get started with Rust

We know, there are plenty of resources to sharpen your Rust skills peppered throughout this article—but we have another pro tip for you: take Rust for a test drive with GitHub Copilot. As your AI-powered pair programmer, GitHub Copilot can help you learn and refine the basics of Rust as you go with tailored code suggestions.

Here’s a developer advocate at GitHub experimenting with Rust for the first time with GitHub Copilot. And to all of our seasoned Rustaceans out there, what do you think? Was the suggestion correct?

If you’re ready to begin your coding journey with Rust, GitHub Copilot can jumpstart your progress—all without the need to study documentation for hours at a time.

The post Why Rust is the most admired language among developers appeared first on The GitHub Blog.

Building hyperlocal GrabMaps

Post Syndicated from Grab Tech original https://engineering.grab.com/building-hyperlocal-grabmaps

Introduction

Southeast Asia (SEA) is a dynamic market, very different from other parts of the world. When travelling on the road, you may experience fast-changing road restrictions, new roads appearing overnight, and high traffic congestion. To address these challenges, GrabMaps has adapted to the SEA market by leveraging big data solutions. One of the solutions is the integration of hyperlocal data in GrabMaps.

Hyperlocal information is oriented around very small geographical communities and obtained from the local knowledge that our map team gathers. The map team is spread across SEA, enabling us to define clear specifications (e.g. legal speed limits), and validate that our solutions are viable.

Figure 1 – Map showing detections from images and probe data, and hyperlocal data.

Hyperlocal inputs make our mapping data even more robust, adding to the details collected from our image and probe detection pipelines. Figure 1 shows how data from our detection pipeline is overlaid with hyperlocal data, and then mapped across the SEA region. If you are curious and would like to check out the data yourself, you can download it here.

Processing hyperlocal data

Now let’s go through the process of detecting hyperlocal data.

Download data

GrabMaps is based on OpenStreetMap (OSM). The first step in the process is to download the .pbf file for Asia from geofabrik.de. This .pbf file contains all the data that is available on OSM, such as details of places, trees, and roads. Take a park, for example: the .pbf file would contain data on the park’s name, wheelchair accessibility, and much more.

For this article, we will focus on hyperlocal data related to the road network. For each road, you can obtain data such as the type of road (residential or motorway), direction of traffic (one-way or more), and road name.

Convert data

To take advantage of big data computing, the next step in the process is to convert the .pbf file into Parquet format using a Parquetizer. This will convert the binary data in the .pbf file into a table format. Each road in SEA is now displayed as a row in a table as shown in Figure 2.

Figure 2 – Road data in Parquet format.

Identify hyperlocal data

After the data is prepared, GrabMaps then identifies and inputs all of our hyperlocal data, and delivers a consolidated view to our downstream services. Our hyperlocal data is obtained from various sources, either by looking at geometry, or other attributes in OSM such as the direction of travel and speed limit. We also apply customised rules defined by our local map team, all in a fully automated manner. This enhances the map together with data obtained from our rides and deliveries GPS pings and from KartaView, Grab’s product for imagery collection.

Figure 3 – Architecture diagram showing how hyperlocal data is integrated into GrabMaps.

Benefit of our hyperlocal GrabMaps

GrabNav, a turn-by-turn navigation tool available on the Grab driver app, is one of our products that benefits from having hyperlocal data. Here are some hyperlocal data that are made available through our approach:

  • Localisation of roads: The country, state/county, or city the road is in
  • Language spoken, driving side, and speed limit
  • Region-specific default speed regulations
  • Consistent name usage using language inference
  • Complex attributes like intersection links

To further explain the benefits of this hyperlocal feature, we will use intersection links as an example. In the next section, we will explain how intersection links data is used and how it impacts our driver-partners and passengers.

An intersection link is when two or more roads meet. Figures 4 and 5 illustrate what an intersection link looks like in a GrabMaps mock and in OSM.

Figure 4 – Mock of an intersection link.
Figure 5 – Intersection link illustration from a real road network in OSM.

Locating intersection links in a road network involves some computation. We first combine big data processing (which we do using Spark) with graphs. We use geohash as the unit of processing, and for each geohash, a bi-directional graph is created.

From the resulting graphs, we can determine intersection links if:

  • Road segments are parallel
  • The roads have the same name
  • The roads are one way roads
  • Angles and the shape of the road are in the intervals or requirements we seek

Each intersection link we identify is tagged in the map as intersection_links. Our downstream service teams can then identify them by searching for the tag.
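
The listing below is a highly simplified, single-machine sketch of those checks in Rust. The real pipeline runs on Spark over geohash-partitioned graphs, and the struct fields and angle threshold here are illustrative assumptions rather than Grab’s production rules.

struct RoadSegment {
    name: String,
    one_way: bool,
    bearing_deg: f64, // direction of travel, derived from the segment's geometry
}

// Two one-way segments with the same name and near-parallel bearings are
// candidates to be tagged as an intersection link.
fn is_intersection_link_candidate(a: &RoadSegment, b: &RoadSegment) -> bool {
    const PARALLEL_TOLERANCE_DEG: f64 = 15.0; // illustrative threshold only

    // Smallest angle between the two bearings, treating opposite directions as parallel.
    let mut angle = (a.bearing_deg - b.bearing_deg).abs() % 360.0;
    if angle > 180.0 {
        angle = 360.0 - angle;
    }
    let parallel = angle <= PARALLEL_TOLERANCE_DEG || (180.0 - angle).abs() <= PARALLEL_TOLERANCE_DEG;

    a.one_way && b.one_way && a.name == b.name && parallel
}

fn main() {
    let northbound = RoadSegment { name: "Jalan Example".into(), one_way: true, bearing_deg: 8.0 };
    let southbound = RoadSegment { name: "Jalan Example".into(), one_way: true, bearing_deg: 187.0 };
    assert!(is_intersection_link_candidate(&northbound, &southbound));
}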

Impact

The impact we create with our intersection link can be explained through the following example.

Figure 6 – Longer route, without GrabMaps intersection link feature. The arrow indicates where the route should have suggested a U-turn.
Figure 7 – Shorter route using GrabMaps by taking a closer link between two main roads.

Figure 6 and Figure 7 show two different routes for the same origin and destination. However, you can see that Figure 7 has a shorter route and this is made available by taking an intersection link early on in the route. The highlighted road segment in Figure 7 is an intersection link, tagged by the process we described earlier. The route is now much shorter making GrabNav more efficient in its route suggestion.

There are numerous factors that can impact a driver-partner’s trip, and intersection links are just one example. There are many more features that GrabMaps offers across Grab’s services that allow us to “outserve” our partners.

Conclusion

GrabMaps and GrabNav deliver enriched experiences to our driver-partners. By integrating certain hyperlocal data features, we are also able to provide more accurate pricing for both our driver-partners and passengers. In our mission towards sustainable growth, this is an area that we will keep on improving by leveraging scalable tech solutions.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

10 things you didn’t know you could do with GitHub Projects

Post Syndicated from Kedasha Kerr original https://github.blog/2023-08-28-10-things-you-didnt-know-you-could-do-with-github-projects/

GitHub Projects has been adopted by program managers, OSS maintainers, enterprises, and individual developers alike for its user-friendly design and efficiency. We all know that managing issues and pull requests in our repositories can be challenging.

To help you optimize your usage of GitHub Projects to plan and track your work from start to finish, I’ll be sharing 10 things you can do with GitHub Projects to make it easier to keep track of your issues and pull requests.

1. Manage your projects with the CLI

If you prefer to work from your terminals, we’ve made it more convenient for you to manage and automate your project workflows with the GitHub CLI project command. This essentially allows you to work more collaboratively with your team to keep your projects updated with your existing toolkit.

For example, if I wanted to add a draft issue to my project “Learning Ruby,” I would do this by first ensuring that I have the CLI installed and I’m authenticated with the project scope. Once authenticated, I need to find the number of the project I want to manage with the CLI. You can find the project number by looking at the project URL. For example, in https://github.com/orgs/That-Lady-Dev/projects/4, the project number is “4.” Now that we have the project number, we can use it to add a draft issue to the project! The command will look like this:

gh project item-create 4 --owner That-Lady-Dev --title "Test Adding Draft" --body "I added this draft issue with GitHub CLI"

When we run this, a new draft issue is added to the project:

updating the project from the terminal; seeing the new item added to the board live

You can do a lot more with the GitHub CLI and GitHub projects. Check out our documentation to see all the possibilities of interacting with your projects from the terminal.

2. Export your projects to TSV

If you ever need your project data, you can export your project view to a file, which can then be imported into FigJam, Google Sheets, Excel, or any other platform that supports TSV files.

Go to any view of your project and click the arrow next to the view name, then select Export view data. This will give you a TSV file that you can use.

export project view data as a TSV file

Though TSV offers much better formatting than a CSV file, you can ask GitHub Copilot Chat how to convert a TSV file to a CSV file, copy the code, run it, and get your new CSV document, if CSV is your jam.

GitHub Copilot Chat converts TSV to CSV with Python code

Here’s a quick gist of how I converted a TSV to a CSV with GitHub Copilot Chat!
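
If you would rather not round-trip through chat at all, the conversion itself is small. Here is a rough sketch of the same idea, written in Rust rather than the Python shown above; the file names are placeholders, and it assumes the exported data only needs basic CSV quoting.

use std::fs;

// Quote a field only when it contains characters that are special in CSV.
fn csv_field(field: &str) -> String {
    if field.contains(',') || field.contains('"') || field.contains('\n') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

fn main() -> std::io::Result<()> {
    // Read the exported view data and rewrite each tab-separated row as comma-separated.
    let tsv = fs::read_to_string("project-view.tsv")?;
    let csv: String = tsv
        .lines()
        .map(|line| line.split('\t').map(csv_field).collect::<Vec<_>>().join(","))
        .collect::<Vec<_>>()
        .join("\n");
    fs::write("project-view.csv", csv)
}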

3. Create reusable project templates

If you often find yourself recreating projects with similar content and structure, you can set a project as a template so you and others can use it as a base when creating new projects.

To set your project as a template, navigate to the project “Settings” page, and under the “Templates” section toggle on Make template.

toggle templates on from the settings page showing a green button in the UI

This will turn the project into a template that can be used with the green Use this template button at the top of your project, or when creating a new project. Building a library of templates that can be reused across your organization can help you and your teams share best practices and inspiration when getting started with a project!

4. Make a copy of a project

In addition to making your project a template that can be reused, you can also make a one-time copy of an existing project that will contain the fields, views, any configured workflows, insights, and draft items from the original project!

To copy a project, navigate to the project you want to copy, click the three dots to open the menu, and select Make a copy. This will open up a dialog where you can set the Owner, name the project, and click whether you want draft issues copied over or not. Once that’s all set, your new project is ready to be used!

making a copy of the project and updating data

You can also do this with the CLI. The command will look like this:

gh project copy 1 --source-owner That-Lady-Dev --target-owner Demos-and-Donuts --title "copied project"

5. Automate your project with workflows

If you want an issue to be automatically added to a project or if you want to set the status of an issue to “completed” when it is closed, you can do this automatically with built-in project workflows!

Go to the menu and click “Workflows.” This will show you a list of default workflows you can enable on your projects. To automatically add an issue to your project from a repository, you can enable the “Auto-add to project” workflow. To automatically set the status of a closed issue to “complete,” you can enable the “item closed” workflow.

turning on built-in project workflows from the settings page

Explore more built-in workflows by reading our documentation where you can also learn how to automate your projects with GitHub Actions.

6. Add colors and description to custom fields

Custom fields help you organize and categorize items in your projects, with flexible field types including text, number, date, single select, and iteration. If you want to add a splash of color to your project or more details about a specific field, you can add colors and descriptions to your single select fields!

To add a color and a description to a new single select field, navigate to the project settings, and add a new field. From there, you can add options to the field where you can select colors and add a description so everyone on your team knows what those options in the field mean and how they can be used.

updating project settings with new fields and descriptions from the settings page

You can also update field descriptions and colors directly from the project view by selecting Edit details from the group or column menus.

updating colors and description fields from the main project view

7. Add issues from any organization

If you’re an open source maintainer, or a developer with multiple clients, you may be working across multiple organizations at a time. This means you have multiple issues to keep track of and need a way to combine these issues in one cohesive manner.

This is where GitHub Projects comes in! You can collate issues from any organization onto a single project.

For example, I’m a part of the That-Lady-Dev and the Demos-and-Donuts organizations. I have the issues I want to track on my project board from That-Lady-Dev, but I also want to add the issues I have from the other organization to the same board. I can do this in one of two ways—I can either copy the issue link from the Demos-and-Donuts organization and paste it into the project, or I can search for the Demos-and-Donuts organization and repository from the project using # and select the issues I want to add.

This is a lot to take in—take a look at the gif below.

pasting an issue url from another org onto the project and searching for an issue from another org to add to the project

You can also add an issue or pull request to a project with the CLI. The command will look like this:

gh project item-add 4 --owner That-Lady-Dev --url https://github.com/Demos-and-Donuts/video-to-gif-converter/issues/1

8. Edit multiple items at once

Rather than spending time manually updating individual items, you can edit multiple items in one go with our bulk editing feature on GitHub Projects.

Let’s say you wanted to assign multiple issues to yourself. On the table layout, assign one issue and, with the cell highlighted, copy the contents of the cell. Select all the remaining items you want to be assigned and paste the copied contents. You just assigned yourself to multiple issues at once, and this can be undone at the click of a button or using keyboard commands as well.

This is demonstrated in the gif below.

bulk editing fields by assigning LadyKerr to thirteen fields at the same time

You can also drag and drop multiple items on a project board to different columns.

dragging and dropping four board items to another column at the same time

9. Reorder fields

With a growing list of fields in your project, you’ll want to make sure your fields are organized and you see the most important ones up top. To change the order in how they appear on the side panel and on the issues page, you can rearrange the order of the fields from the project settings by dragging and dropping them in the “Custom fields” list.

putting status field at the top on settings page and showing on the project view that it is now the first field on the issue

10. See what you want to see with slice by

If you find yourself with multiple views and filters to see how items are spread among various teams, labels, or assignees, you can configure a slice field to break down and quickly toggle through your items. You can choose a Slice by field that will pull the field values into a panel on the left of your view, and clicking each value will adjust the items in the project view on the right. See the gif below for how this works.

slicing the project by content type, labels and assignees to demonstrate slice by feature

Try out slicing by different fields to unlock a new way to organize your items!

Bonus tip: Deep linking

Let’s say you want to send a specific issue from your project to a teammate. You can use the Copy link to project button to send them a direct link to that particular issue in the project without having them sift through to find the issue you mentioned. See what I mean in this gif.

using the copy project link to deep link items

Wrap-up

And there you have it—10 things you didn’t know you could do with GitHub Projects. The team is continuing to work on more amazing features to make tracking your issues and pull requests as seamless and painless as possible. GitHub Projects is a powerful, flexible, and efficient way to keep track of your items while staying on top of your work.

Do let me know if you have any questions about GitHub Projects; I’m happy to jump in and assist.

The post 10 things you didn’t know you could do with GitHub Projects appeared first on The GitHub Blog.

Unleashing GitHub Codespaces templates to ignite your development

Post Syndicated from Sneha Natekar original https://github.blog/2023-08-24-unleashing-github-codespaces-templates-to-ignite-your-development/

Ever found yourself struggling to set up a brand-new Integrated Development Environment (IDE) for a project? The overwhelming process of dealing with build errors, dependencies, and configurations can leave you feeling frustrated and short on time. Trust me, I’ve been there, too. As an avid developer, I understand the struggles and challenges firsthand.

That’s when I discovered GitHub Codespaces, and it’s a game-changer. GitHub Codespaces is a cloud-based coding environment for collaborative development accessible through a browser. Templates include specific configurations of tools, libraries, and settings within GitHub Codespaces, enabling developers to quickly create consistent coding environments for various projects without having to set up everything from scratch.

With customizable environments, streamlined workflows, and easy setup sharing, you can be more productive and focus on creating exceptional software.

With the rich offering of existing templates available at my disposal, I quickly realized there are many possibilities. As I was working on my Android app, I was looking for a template that would let me build it. However, because there was no Android template, I decided to build my own.

Together, let’s conquer setup challenges and unlock a world of coding possibilities. With great power comes great coding!

Step 1: Customizing a GitHub Codespaces template and connecting to a repository

While you can find a number of templates for React, Django, and Ruby on Rails in the template library, I couldn’t find the one I needed for Android.

Fortunately, creating a custom template is a breeze! Just head to GitHub Codespaces and select the “Blank” template. Starting with a blank template opens a codespace that I can configure per my needs.

Screenshot showing the "Blank" template listing, with the description, "Start with a blank canvas or import any packages you need."

Clicking on “Use this template” launches a codespace in a separate tab within your browser.

Screenshot of the blank template opened in a browser tab.

My new codespace is called “vigilant potato” as you can see from the URL in the screenshot above. Your codespace will get its own unique, and maybe cute, name.

If you get distracted by something else, and need to come back to your codespace, you can access it from github.com/codespaces by looking for the “vigilant potato” among the list of codespaces—you might need to scroll down to find it if you have other codespaces.

Screenshot showing a list of codespaces owned by the logged-in user, include the recently created "vigilant potato."

Click on the three horizontal dots at the end of the row and select “Publish to a new repository.” This flow seamlessly lets you create a new repository and preserves your development environment and any code you might have added that belongs to your project.

Screenshot of the menu with the option "Publish to a new repository" selected.

Once your repository is created, you’ll be able to see it in your GitHub Settings > your repositories. Or, if you’d like to connect to an existing repository, you can do so through the terminal.

Step 2: Configuring your environment: devcontainer.json/Dockerfile

There are two ways to customize your environment in a template: using either a devcontainer.json file or a Dockerfile. The devcontainer.json file focuses on configuring the development environment within Visual Studio Code, while a Dockerfile allows you to create a custom Docker image that forms the basis of the entire development environment. Both files are essential for defining and customizing the development environment to meet the specific requirements of your project.

For my Android development environment, I have to create a Dockerfile. To harness this power, I simply navigate to the root of my repository and create a new file called Dockerfile. You can define the base image, specify environment variables, copy files and directories, install dependencies, execute commands, expose ports, define the working directory, run services, and set the entrypoint or command. These instructions allow you to tailor the Docker image to include the necessary tools, configurations, and dependencies required for your project.

Watch the magic unfold as a tailored, fully-configured development environment materializes in the cloud. In an instant, you’ll be immersed in a seamless coding experience, equipped with all the necessary tools. Say goodbye to setup hassles and dive straight into your code. With a customized codespace template, you’ll have a portable development sanctuary that follows you everywhere. For my Android repository, I very easily followed these steps and a brand new codespace built from the blank template opened for my repository in a separate browser tab.

Screenshot of the codespace's Dockerfile.

Here are a few snippets from my Android Dockerfile to demonstrate its power.

Automating Android SDK installation

# Update package list and install packages required for Android app development
RUN apt-get update -yqq && \
apt-get install -y \
curl \
expect \
git \
make \
wget \
unzip \
vim \
openssh-client \
locales \
libarchive-tools && \
apt-get clean && rm -rf /var/lib/apt/lists/* && \
localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8

In this example, I am using regular shell commands to install the Android SDK.

Setting environment variables

# Set environment variables used by the Android SDK
ENV ANDROID_SDK_HOME /opt/android-sdk-linux
ENV ANDROID_SDK_ROOT /opt/android-sdk-linux
ENV ANDROID_HOME /opt/android-sdk-linux
ENV ANDROID_SDK /opt/android-sdk-linux

This snippet showcases how the Dockerfile lets you set environment variables that are necessary for development.

This remarkable file guarantees a breeze for anyone cloning my repository, effortlessly setting up a standardized development environment. Say goodbye to manual setup hassles and welcome a seamless and efficient collaborative experience.

Conclusion

By embracing these steps, I’ve discovered that the entire process of creating and using GitHub Codespaces templates can be incredibly easy, enjoyable, and efficient. Leveraging the power of templates and devcontainer.json files means I never have to start from scratch again. What used to be a painstaking, day-plus process of downloading and installing all of the Android SDK and Java components, setting the environment variables, getting the libraries, and maintaining their updates is no more, thanks to my pre-configured Android template. I have created additional templates for work that are Java and Python development environments.

Remember, with great coding comes great responsibility, and a lot fewer debugging sessions! Happy coding, and may the bugs be ever in your favor!

If you’re also an Android developer, try out my template or explore the other templates GitHub offers and customize your own.

The post Unleashing GitHub Codespaces templates to ignite your development appeared first on The GitHub Blog.

Highlights from Git 2.42

Post Syndicated from Taylor Blau original https://github.blog/2023-08-21-highlights-from-git-2-42/

The open source Git project just released Git 2.42 with features and bug fixes from over 78 contributors, 17 of them new. We last caught up with you on the latest in Git back when 2.41 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Faster object traversals with bitmaps

Many long-time readers of these blog posts will recall our coverage of reachability bitmaps. Most notably, we covered Git’s new multi-pack reachability bitmaps back in our coverage of the 2.34 release towards the end of 2021.

If this is your first time here, or you need a refresher on reachability bitmaps, don’t worry. Reachability bitmaps allow Git to quickly determine the result set of a reachability query, like when serving fetches or clones. Git stores a collection of bitmaps for a handful of commits. Each bit position is tied to a specific object, and the value of that bit indicates whether or not it is reachable from the given commit.

This often allows Git to compute the answers to reachability queries using bitmaps much more quickly than without, particularly for large repositories. For instance, if you want to know the set of objects unique to some branch relative to another, you can build up a bitmap for each endpoint (in this case, the branch we’re interested in, along with main), and compute the AND NOT between them. The resulting bitmap has bits set to “1” for exactly the set of objects unique to one side of the reachability query.
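
To make the bitmap arithmetic concrete, here is a toy sketch in Rust. The bit assignments are invented for illustration (assume main reaches C₁ through C₇ and the other refs reach only C₁ through C₅), and Git’s real bitmaps are compressed EWAH bitmaps over far larger object sets.

fn main() {
    // One bit per object: bit i stands for commit C(i+1).
    let want: u8 = 0b0111_1111; // objects reachable from main: C1..C7
    let have: u8 = 0b0001_1111; // objects reachable from every other ref: C1..C5

    // AND NOT leaves exactly the objects that are unique to main.
    let unique_to_main = want & !have;

    assert_eq!(unique_to_main, 0b0110_0000); // only the bits for C6 and C7 remain
    println!("{unique_to_main:#010b}");
}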

But what happens if one side doesn’t have bitmap coverage, or if the branch has moved on since the last time it was covered with a bitmap?

In previous versions of Git, the answer was that Git would build up a complete bitmap for all reachability tips relative to the query. It does so by walking backwards from each tip, assembling its own bitmap, and then stopping as soon as it finds an existing bitmap in history. Here’s an example of the existing traversal routine:

Figure 1: Bitmap-based traversal computing the set of objects unique to `main` in Git 2.41.0.

There’s a lot going on here, but let’s break it down. Above we have a commit graph, with five branches and one tag. Each commit is indicated by a circle, and the references are indicated by squares pointing at their respective referents. Existing bitmaps can be found for both the v2.42.0 tag, and the branch bar.

In the above, we’re trying to compute the set of objects which are reachable from main, but aren’t reachable from any other branch. By inspection, it’s clear that the answer is {C₆, C₇}, but let’s step through how Git would arrive at the same result:

  • For each branch that we want to exclude from the result set (in this case, foo, bar, baz, and quux), we walk along the commit graph, marking each of the corresponding bits in our have’s bitmap in the top-left.
  • If we happen to hit a portion of the graph that we’ve covered already, we can stop early. Likewise, if we find an existing bitmap (like what happens when we try to walk beginning at branch bar), we can OR in the bits from that commit’s bitmap into our have’s set, and move on to the next branch.
  • Then, we repeat the same process for each branch we do want to keep (in this case, just main), this time marking or ORing bits into the want’s bitmap.
  • Finally, once we have a complete bitmap representing each side of the reachability query, we can compute the result by AND NOTing the two bitmaps together, leaving us with the set of objects unique to main.

We can see that in the above, having existing bitmap coverage (as is the case with branch bar) is extremely beneficial, since it allows us to discover the set of objects reachable from a certain point in the graph immediately, without having to open up and parse objects.

But what happens when bitmap coverage is sparse? In that case, we end up having to walk over many objects in order to find an existing bitmap. Oftentimes, the additional overhead of maintaining a series of bitmaps outweighs the benefits of using them in the first place, particularly when coverage is poor.

In this release, Git introduces a new variant of the bitmap traversal algorithm that often outperforms the existing implementation, particularly when bitmap coverage is sparse.

The new algorithm represents the unwanted side of the reachability query as a bitmap built from the query's boundary, instead of the union of bitmap(s) from the individual tips on the unwanted side. The exact definition of a query boundary is slightly technical, but for our purposes you can think of it as the first commit in the wanted set of objects which is also reachable from at least one unwanted object.

In the above example, this is commit C₅, which is reachable from main (which is in the wanted half of the reachability query) as well as from bar and baz (both of which are in the unwanted half). Let's step through computing the same result using the boundary-based approach:

Figure 2: The same traversal as above, instead using the boundary commit-based approach.

The approach here is similar to the above, but not quite the same. Here’s the process:

  • We first discover the boundary commit(s), in this case C₅.
  • We then walk backwards from the set of boundary commit(s) we just discovered until we find a reachability bitmap (or reach the beginning of history). At each stage along the walk, we mark the corresponding bit in the have's bitmap.
  • Then, we build up a complete bitmap on the want's side by starting a walk from main until either we hit an existing bitmap, the beginning of history, or an object marked in the previous step.
  • Finally, as before, we compute the AND NOT between the two bitmaps, and return the results.

When there are bitmaps close to the boundary commit(s), or the unwanted half of the query is large, this algorithm often vastly outperforms the existing traversal. In the toy example above, you can see we compute the answer much more quickly when using the boundary-based approach. And in real-world examples, the new algorithm has been observed to be anywhere from 2 to 15 times faster than the existing one.

You can try out the new algorithm by running:

$ git repack -ad --write-bitmap-index
$ git config pack.useBitmapBoundaryTraversal true

in your repository (using Git 2.42), and then using git rev-list with the --use-bitmap-index flag.
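For instance, a query like the one from the earlier figures (objects reachable from main but not from another branch) could then be run as follows, where other-branch is a placeholder:

$ git rev-list --use-bitmap-index --objects main ^other-branch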

[source]

Exclude references by pattern in for-each-ref

If you’ve ever scripted around Git before, you are likely familiar with its for-each-ref command. If not, you likely won’t be surprised to learn that this command is used to enumerate references in your repository, like so:

$ git for-each-ref --sort='-*committerdate' refs/tags
264b9b3b04610cb4c25e01c78d9a022c2e2cdf19 tag    refs/tags/v2.42.0-rc2
570f1f74dee662d204b82407c99dcb0889e54117 tag    refs/tags/v2.42.0-rc1
e8f04c21fdad4551047395d0b5ff997c67aedd90 tag    refs/tags/v2.42.0-rc0
32d03a12c77c1c6e0bbd3f3cfe7f7c7deaf1dc5e tag    refs/tags/v2.41.0
[...]

for-each-ref is extremely useful for listing references, finding which references point at a given object (with --points-at), which references have been merged into a given branch (with --merged), or which references contain a given commit (with --contains).

Git relies on the same machinery used by for-each-ref across many different components, including the reference advertisement phase of pushes. During a push, the Git server first advertises a list of references that it wants the client to know about, and the client can then exclude those objects (and anything reachable from them) from the packfile they generate during the push.

But what if you have some references that you don't want to advertise to clients during a push? For example, GitHub maintains a pair of references for each open pull request, like refs/pull/NNN/head and refs/pull/NNN/merge, which aren't advertised to pushers. Luckily, Git has a mechanism that allows server operators to exclude groups of references from the push advertisement phase by configuring the transfer.hideRefs variable.

Git implements the functionality configured by transfer.hideRefs by enumerating all references, and then inspecting each one to see whether or not it should advertise that reference to pushers. Here’s a toy example of a similar process:

Figure 3: Running `for-each-ref` while excluding the `refs/pull/` hierarchy.

Here, we want to list every reference that doesn’t begin with refs/pull/. In order to do that, Git enumerates each reference one-by-one, and performs a prefix comparison to determine whether or not to include it in the set.

For repositories that have a small number of hidden references, this isn’t such a big deal. But what if you have thousands, tens of thousands, or even more hidden references? Performing that many prefix comparisons only to throw out a reference as hidden can easily become costly.

In Git 2.42, there is a new mechanism to more efficiently exclude references. Instead of inspecting each reference one-by-one, Git first locates the start and end of each excluded region in its packed-refs file. Once it has this information, it creates a jump list allowing it to skip over whole regions of excluded references in a single step, rather than discarding them one by one, like so:

Figure 4: The same `for-each-ref` invocation as above, this time using a jump list as in Git 2.42.

Like the previous example, we still want to discard all of the refs/pull references from the result set. To do so, Git finds the first reference beginning with refs/pull (if one exists), and then performs a modified binary search to find the location of the first reference after all of the ones beginning with refs/pull.

It can then use this information (indicated by the dotted yellow arrow) to avoid looking at the refs/pull hierarchy entirely, providing a measurable speed-up over inspecting and discarding each hidden reference individually.

In Git 2.42, you can try out this new functionality with git for-each-ref's new --exclude option. This release also uses the new mechanism to speed up the reference advertisement described above, as well as the analogous machinery used when fetching. In extreme examples, this can provide a 20-fold improvement in the CPU cost of advertising references during a push.
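As a rough sketch (see git-for-each-ref(1) for the exact pattern syntax), listing everything outside of the refs/pull/ hierarchy looks something like:

$ git for-each-ref --format='%(refname)' --exclude='refs/pull/*'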

Git 2.42 also comes with a pair of new options in the git pack-refs command, which is responsible for updating the packed-refs file with any new loose references that aren’t stored. In certain scenarios (such as a reference being frequently updated or deleted), it can be useful to exclude those references from ever entering the packed-refs file in the first place.

git pack-refs now understands how to tweak the set of references it packs using its new --include and --exclude flags.
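For example, to pack all references while keeping the frequently updated refs/pull/ hierarchy out of the packed-refs file, an invocation along these lines should work (the pattern is illustrative; consult git-pack-refs(1) for the exact semantics):

$ git pack-refs --all --exclude 'refs/pull/*'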

[source, source]

Preserving precious objects from garbage collection

In our last set of release highlights, we talked about a new mechanism for collecting unreachable objects in Git known as cruft packs. Git uses cruft packs to collect and track the age of unreachable objects in your repository, gradually letting them age out before eventually being pruned from your repository.

But Git doesn’t simply delete every unreachable object (unless you tell it to with --prune=now). Instead, it will delete every object except those that meet one of the below criteria:

  1. The object is reachable, in which case it cannot be deleted ever.
  2. The object is unreachable, but was modified after the pruning cutoff.
  3. The object is unreachable, and hasn’t been modified since the pruning cutoff, but is reachable via some other unreachable object which has been modified recently.

But what do you do if you want to hold onto one or more objects which are both unreachable and haven't been modified since the pruning cutoff?

Historically, the only answer to this question was that you should point a reference at those object(s). That works if you have a relatively small set of objects you want to hold on to. But what if you have more precious objects than you could feasibly keep track of with references?

Git 2.42 introduces a new mechanism to preserve unreachable objects, regardless of whether or not they have been modified recently. Using the new gc.recentObjectsHook configuration, you can configure external program(s) that Git will run any time it is about to perform a pruning garbage collection. Each configured program is allowed to print out a line-delimited sequence of object IDs, each of which is immune to pruning, regardless of its age.

Even if you haven't started using cruft packs yet, this new configuration option also works when unreachable objects are stored loose while they age out of your repository.

This makes it possible to store a potentially large set of unreachable objects which you want to retain in your repository indefinitely using an external mechanism, like a SQLite database. To try out this new feature for yourself, you can run:

$ git config gc.recentObjectsHook /path/to/your/program
$ git gc --prune=<approxidate>
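As a sketch of what such a program might look like (the database path and table name here are hypothetical), a hook that reads precious object IDs out of a SQLite database could be as simple as:

#!/bin/sh
# Print one object ID per line; every object listed here is immune to pruning.
sqlite3 /path/to/precious.db 'SELECT oid FROM precious_objects;'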

[source, source]


  • If you’ve read these blog posts before, you may recall our coverage of the sparse index feature, which allows you to check out a narrow cone of your repository instead of the whole thing.

    Over time, many commands have gained support for working with the sparse index. For commands that lacked support for the sparse index, invoking those commands would cause your repository to expand the index to cover the entire repository, which can be a potentially expensive operation.

    This release, the diff-tree command joined the group of commands with full support for the sparse index, meaning that you can now use diff-tree without expanding your index.

    This work was contributed by Shuqi Liang, one of the Git project’s Google Summer of Code (GSoC) students. You can read more about their project here, and follow along with their progress on their blog.

    [source]

  • If you’ve gotten this far in the blog post and thought that we were done talking about git for-each-ref, think again! This release enhances for-each-ref's --format option with a handful of new ways to format a reference.

    The first set of new options enables for-each-ref to show a handful of GPG-related information about commits at reference tips. You can ask for the GPG signature directly, or individual components of it, like its grade, the signer, key, fingerprint, and so on. For example,

    $ git for-each-ref --format='%(refname) %(signature:key)' \
        --sort=v:refname 'refs/remotes/origin/release-*' | tac
    refs/remotes/origin/release-3.1 4AEE18F83AFDEB23
    refs/remotes/origin/release-3.0 4AEE18F83AFDEB23
    refs/remotes/origin/release-2.13 4AEE18F83AFDEB23
    [...]
    

    This work was contributed by Kousik Sanagavarapu, another GSoC student working on Git! You can read more about their project here, and keep up to date with their work on their blog.

    [source, source]

  • Earlier in this post, we talked about git rev-list, a low-level utility for listing the set of objects contained in some query.

    In our early examples, we discussed a straightforward case of listing objects unique to one branch. But git rev-list supports much more complex modifiers, like --branches, --tags, --remotes, and more.

    In addition to specifying modifiers like these on the command-line, git rev-list has a --stdin mode which allows for reading a line-delimited sequence of commits (optionally prefixed with ^, indicating objects reachable from those commit(s) should be excluded) from the command’s standard input.

    Previously, support for --stdin extended only to referring to commits by their object ID, without support for more complex modifiers like the ones listed earlier. In Git 2.42, git rev-list --stdin can now accept the same set of modifiers given on the command line, making it much more useful when scripting.

    [source]

  • Picture this: you’re working away on your repository, typing up a tag message for a tag named foo. Suppose that in the background, you have some repeating task that fetches new commits from your remote repository. If you happen to fetch a tag foo/bar while writing the tag message for foo, Git will complain that you cannot have both tag foo and foo/bar.

    OK, so far so good: Git does not support this kind of tag hierarchy1. But what happened to your tag message? In previous versions of Git, you’d be out of luck, since your in-progress message at $GIT_DIR/TAG_EDITMSG is deleted before the error is displayed. In Git 2.42, Git delays deleting the TAG_EDITMSG until after the tag is successfully written, allowing you to recover your work later on.

    [source]

  • In other git tag-related news, this release comes with a fix for a subtle bug that appeared when listing tags. git tag can list existing tags with the -l option (or when invoked with no arguments). You can further refine those results to only show tags which point at a given object with the --points-at option.

    But what if you have one or more tags that point at the given object through one or more other tags instead of directly? Previous versions of Git would fail to report those tags. Git 2.42 addresses this by dereferencing tags through multiple layers before determining whether or not they point at the given object.

    [source]

  • Finally, back in Git 2.38, git cat-file --batch picked up a new -z flag, allowing you to specify NUL-delimited input instead of delimiting your input with a standard newline. This flag is useful when issuing queries which themselves contain newlines, like trying to read the contents of some blob by path, if the path contains newlines.

    But the new -z option only changed the rules for git cat-file's input, leaving the output still delimited by newlines. Ordinarily, this won't cause any problems. But if git cat-file can't locate an object, it will echo the query back followed by " missing" and a newline.

    If the given query itself contains a newline, the result is unparseable. To address this, git cat-file has a new mode, -Z (as opposed to its lowercase variant, -z), which changes both the input and output to be NUL-delimited (see the sketch just after this list).

    [source]
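To illustrate the new -Z mode mentioned in the last item above (a sketch; the path in the query is a placeholder), a NUL-delimited round trip through git cat-file might look like:

$ printf 'HEAD:some\npath.txt\0' | git cat-file --batch -Z

Here both the query and the response are NUL-delimited, so a path containing a newline no longer breaks the output parsing.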

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.42, or any previous version in the Git repository.

Notes


  1. Doing so would introduce a directory/file-conflict. Since Git stores loose tags at paths like $GIT_DIR/refs/tags/foo/bar, it would be impossible to store a tag foo, since it would need to live at $GIT_DIR/refs/tags/foo, which already exists as a directory. 

The post Highlights from Git 2.42 appeared first on The GitHub Blog.

mTLS: When certificate authentication is done wrong

Post Syndicated from Michael Stepankin original https://github.blog/2023-08-17-mtls-when-certificate-authentication-is-done-wrong/

Although X.509 certificates have been around for a while, they have become more popular for client authentication in zero-trust networks in recent years. Mutual TLS, or authentication based on X.509 certificates in general, brings advantages compared to passwords or tokens, but you get increased complexity in return.

In this post, I'll take a deep dive into some interesting attacks on mTLS authentication. We won't bother with heavy crypto; instead, we'll look at implementation vulnerabilities and how developers can make their mTLS systems vulnerable to user impersonation, privilege escalation, and information leakages.

We will present some CVEs we found in popular open-source identity servers and ways to exploit them. Finally, we’ll explain how these vulnerabilities can be spotted in source code and how to fix them.

This blog post is based on work that I recently presented at Black Hat USA and DEF CON.

Introduction: What is mutual TLS?

Website certificates are a very widely recognized technology, even to people who don’t work in the tech industry, thanks to the padlock icon used by web browsers. Whenever we connect to Gmail or GitHub, our browser checks the certificate provided by the server to make sure it’s truly the service we want to talk to. Fewer people know that the same technology can be used to authenticate clients: the TLS protocol is also designed to be able to verify the client using public and private key cryptography.

It happens on the handshake level, even before any application data is transmitted:

Excerpt from RFC 5246: "Figure 1. Message flow for a full handshake"

If configured to do so, a server can ask a client to provide a security certificate in the X.509 format. This certificate is just a blob of binary data that contains information about the client, such as its name, public key, issuer, and other fields:

$ openssl x509 -text -in client.crt
Certificate:
    Data:
        Version: 1 (0x0)
        Serial Number:
            d6:2a:25:e3:89:22:4d:1b
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=localhost            //used to locate the issuer's certificate
        Validity
            Not Before: Jun 13 14:34:28 2023 GMT
            Not After : Jul 13 14:34:28 2023 GMT
        Subject: CN=client          //aka "user name"
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                RSA Public-Key: (2048 bit)
                Modulus:
                    00:9c:7c:b4:e5:e9:3d:c1:70:9c:9d:18:2f:e8:a0:

The server checks that this certificate is signed by one of the trusted authorities. This is a bit similar to checking the signature of a JWT token. Next, the client sends a “Certificate verify” message signed with its private key, so that the server can verify that the client actually possesses that key.

How certificates are validated

“Certificate validation” commonly refers to the PKIX certificate validation process defined in RFC 5280.

In short, in order to validate the certificate, the server constructs a certification path (also known as a certificate chain) from the target certificate to a trust anchor. The trust anchor is a self-signed root certificate that is inherently trusted by the validator. The end entity certificate is often signed by an intermediate CA, which is also signed by another intermediate certificate or directly by a trust anchor.

Diagram of certificate chain with three links: Client certificate, Intermediate CA, Root Certificate Authority

Then, for each certificate in the chain, the validator checks the signature, validity period, allowed algorithms and key lengths, key usage, and other properties. There are also a number of optional certificate extensions: if they are included in the certificate, they can be checked as well. This process is quite complicated, so every language or library implements it differently.
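As a point of reference, here is a minimal sketch of what chain validation looks like with the JDK's built-in PKIX machinery (clientCert, intermediateCert, and rootCert are assumed to be X509Certificate instances loaded elsewhere; revocation checking is disabled for brevity):

CertificateFactory cf = CertificateFactory.getInstance("X.509");
CertPath path = cf.generateCertPath(List.of(clientCert, intermediateCert));

PKIXParameters params = new PKIXParameters(Set.of(new TrustAnchor(rootCert, null)));
params.setRevocationEnabled(false); // revocation checking is discussed later in this post

CertPathValidator validator = CertPathValidator.getInstance("PKIX");
validator.validate(path, params);   // throws CertPathValidatorException if the chain is invalid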

Note: in my research I mostly looked at how mTLS is implemented in applications written in Java, but it is likely that the ideas and attacks below apply to other languages as well.

mTLS in a Java web application, an example

Let’s see how to use mTLS in a Java web application. The bare minimum configuration is to enable it in the application settings and specify the location of all trusted root certificates, like this:

$ cat application.properties

…
server.ssl.client-auth=need
server.ssl.trust-store=/etc/spring/server-truststore.p12
server.ssl.trust-store-password=changeit

From the client, such as curl, you need to specify which certificate is sent to the server. The rest of the application code, such as request mappings, is exactly the same as for a normal web application.

$ curl -k -v --cert client.pem https://localhost/hello

This setup works for very simple mTLS configurations, when there is only a single root certificate, and all client certificates are signed by it. You can find this example in various articles on the web and it’s quite secure due to its simplicity. Let’s quickly break down its pros and cons.

Pros:

  • Speed: Authentication happens only during the TLS handshake; all subsequent “keep-alive” HTTP requests are considered authenticated, saving CPU time.
  • Storage: Similar to JWT, the server does not store all client certificates, only the root certificate.

Cons:

  • No granular control: if mTLS is enabled, all requests have to be authenticated, even to /static/style.css.
  • Any certificate signed by a trusted CA can be used to access this HTTP service. Even if the certificate is issued for another purpose, it still can potentially be used for TLS authentication.
  • No host verification by default: client certificates can be accepted from any IP.
  • Certificate issuance process needs to be implemented separately.
  • Certificates expire, so need to be rotated frequently.

As you can see, this approach brings some advantages and disadvantages compared to traditional authentication methods, such as password or tokens.

Previous attacks

Before we dive into the attack section, I’ll briefly mention some previous well-known attacks on certificate parsing and validation:

  • Obviously, the security of the authentication system depends on the strength of the signature. If we can somehow forge the content of the certificate, but keep the same signature, we can completely break the authentication process.
  • Since the X.509 format is quite complex, just parsing these data structures can lead to buffer and heap overflows.
  • Lack of basic constraints checking. The end-entity certificates should not be used to sign additional certificates.

My approach

In Java, most of these attacks are already mitigated in the APIs provided by the JDK. Weak algorithms are intentionally not allowed. Fuzzing of certificate parsing in Java also did not look productive to me, as the vast majority of PKIX code is implemented in memory-safe Java, instead of using native libraries. I had to take a different approach, so I decided to take a deep look at how mTLS is used from the source code perspective. Since the certificate validation process is quite complex, I suspected that someone might implement it in a weird way. After several weeks, this approach yielded some interesting vulnerabilities in popular open source projects.

So, let's move on to the attacks section.

Chapter 1: Improper certificate extraction

In real-life applications, developers often need to access the certificate presented during the TLS handshake. For example, they might need it for authorization purposes, such as checking the current username. In Java, there are two common ways to access it:

// the SSLSession API is declared to return Certificate[], so a cast is needed
X509Certificate[] certificates = (X509Certificate[]) sslSession.getPeerCertificates();

// another way: the servlet attribute is typed as Object and also needs a cast
X509Certificate[] certificates = (X509Certificate[]) request.getAttribute("javax.servlet.request.X509Certificate");

Interestingly, this API returns an array of certificates presented by the client, not a single one. Why? Perhaps because the TLS specification defines that clients may send a full chain of certificates, from the end-entity certificate up to the root CA.

So, I decided to take a look at how different applications use this API. The most common approach I've seen is to take only the first certificate from the array and consider it the client certificate. This is correct, as the TLS RFC explicitly says that the sender's certificate MUST come first in the list.

//way 1 is good
String user = certificates[0].getSubjectX500Principal().getName();

At the same time, I discovered some rare cases when applications disregard this rule and iterate over the array trying to find a certificate that matches some criteria.

//way 2 is dangerous
for (X509Certificate cert : certificates) {
   if (isClientCertificate(cert)) {
      user = cert.getSubjectX500Principal().getName();
   }
}

This is dangerous, as the underlying TLS library in Java only verifies the first certificate in the list. Moreover, it does not require the chain to be sent in a strict order.

Example: CVE-2023-2422 improper certificate validation in Keycloak

One of these examples was a vulnerability I discovered in Keycloak. Keycloak is a popular authorization server that supports OAuth, SAML, and other authorization methods, as well as mutual TLS.

Keycloak iterates over all certificates in the array, searching for the one that matches the client_id form parameter. As soon as it finds a matching certificate, it implicitly trusts it, assuming that its signature has already been checked during the TLS handshake:

X509Certificate[] certs = null;
ClientModel client = null;
try { 
    certs = provider.getCertificateChain(context.getHttpRequest());
    String client_id = null;
    ...
    if (formData != null) {
        client_id = formData.getFirst(OAuth2Constants.CLIENT_ID);
    }
    …
    matchedCertificate = Arrays.stream(certs)
        .map(certificate -> certificate.getSubjectDN().getName())
        .filter(subjectdn -> subjectDNPattern.matcher(subjectdn).matches())
        .findFirst();

In reality, a client can send as many certificates as they want, and the server only verifies the first one.

A potential attacker can exploit this behavior to authenticate under a different username. It is possible to send a list of certificates, where the first one contains one username and is properly chained to a root CA. But the last certificate in the array might be self signed and belong to a different user. The client does not even need to provide a valid private key for it.

Diagram of a certificate list in which the first client certificate is signed by a CA, but the second is self-signed.

Speaking of exploitation, there are a number of endpoints in Keycloak that support mTLS authentication, but we need one that does not require any additional factors, such as tokens or secrets. “client-management/register-node” is a good example, as it mutates the user's data. We can normally use this API with mTLS in the following way:

$ cat client1.crt client1.key > chain1.pem
$ curl --tlsv1.2 --tls-max 1.2 --cert chain1.pem -v -i -s -k "https://127.0.0.1:8443/realms/master/clients-managements/register-node?client_id=client1" -d "client_cluster_host=http://127.0.0.1:1213/"

To demonstrate the vulnerability, we generate a new self-signed certificate using openssl and add it to the end of the array.

$ openssl req -newkey rsa:2048 -nodes -x509 -subj /CN=client2 -out client2-fake.crt
$ cat client1.crt client1.key client2-fake.crt client1.key > chain2.pem
$ curl --tlsv1.2 --tls-max 1.2 --cert chain2.pem -v -i -s -k "https://127.0.0.1:8443/realms/master/clients-managements/register-node?client_id=client2" -d "client_cluster_host=http://127.0.0.1:1213/"

When we send the second curl request, Keycloak performs this action on behalf of the user specified in client2-fake.crt, instead of client1.crt. Therefore, we can mutate data on the server for any client that supports mTLS.

How to fix that? Easy: just use the first certificate from the array. That’s exactly how Keycloak patched this vulnerability. This CVE is a good example of how developers provide methods and interfaces that can be misunderstood or used incorrectly.

Passing certificate as a header

Another common scenario for mTLS deployments is when the TLS connection is terminated on a reverse proxy. In this case, the reverse proxy often checks the certificate and forwards it to a backend server as an additional header. Here is a typical nginx configuration to enable mTLS:

$ cat nginx.conf

http {
    server {
        server_name example.com;
        listen 443 ssl;
        …
        ssl_client_certificate /etc/nginx/ca.pem;
        ssl_verify_client on;

        location / {
            proxy_pass http://host.internal:80;
            proxy_set_header ssl-client-cert $ssl_client_cert;
        }
    }

I’ve seen a number of systems like that, and in most cases the backend servers behind nginx do not perform additional validation, just trusting the reverse proxy. This behavior is not directly exploitable, but it’s not ideal either. Why? Well, first of all, it means that any server in the local network can make a request with this header, so this network segment needs to be carefully isolated from any traffic coming from outside. Additionally, if the backend or reverse proxy is affected by request smuggling or header injection, its exploitation becomes trivial. Over the past few years, we’ve seen a lot of request and header smuggling vulnerabilities, including the latest CVEs in Netty and Nodejs. Be careful when implementing these scenarios and check the certificate’s signature on all servers if possible.
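As one mitigation, a backend behind the proxy above can re-parse the forwarded certificate and at least verify its signature against the trusted CA instead of trusting the header blindly. This is only a sketch: the header name matches the nginx config shown earlier, trustedCa is assumed to be the CA certificate configured in ssl_client_certificate, and depending on the proxy the PEM value may need URL-decoding or newline restoration first.

String pem = request.getHeader("ssl-client-cert");
CertificateFactory cf = CertificateFactory.getInstance("X.509");
X509Certificate forwardedCert = (X509Certificate) cf.generateCertificate(
        new ByteArrayInputStream(pem.getBytes(StandardCharsets.US_ASCII)));

forwardedCert.verify(trustedCa.getPublicKey()); // throws if the certificate was not signed by the CA
forwardedCert.checkValidity();                  // and reject expired or not-yet-valid certificates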

Chapter 2: “Follow the chain, where does it lead you?”

Excerpt from RFC 4158: "Figure 1 - Sample Hierarchical PKI"

In large systems, servers may not store all root and intermediate certificates locally, but use external storage instead. RFC 4387 explains the concept of a certificate store: an interface you can use to lazily access certificates during chain validation. These stores are implemented over different protocols, such as HTTP, LDAP, FTP, or SQL queries.

RFC 3280 defines some X.509 certificate extensions that can contain information about where to find the issuer and CA certificates. For instance, the Authority Information Access (AIA) extension contains a URL pointing to the Issuer’s certificate. If this extension is used for validation, there is a high chance that you can exploit it to perform an SSRF attack. Also, Subject, Issuer, Serial, and their alternative names can be used to construct SQL or LDAP queries, creating opportunities for injection attacks.

Client certificate with an AIA extension, containing a link to http://example.com

When certificate stores are in use, you should think of these values as “untrusted user input” or “Insertion points,” similar to those we have in Burp Suite’s Intruder. And what attackers will really love is that all of these values can be used in queries before the signature is checked.

Example: CVE-2023-33201 LDAP injection in Bouncy Castle

To demonstrate an example of this vulnerability, we’ll use LDAPCertStore from the Bouncy Castle library. Bouncy Castle is one of the most popular libraries for certificate validation in Java. Here is an example of how you can use this store to build and validate a certificate chain.

PKIXBuilderParameters pkixParams = new PKIXBuilderParameters(keystore, selector);

//setup additional LDAP store
X509LDAPCertStoreParameters CertStoreParameters = new X509LDAPCertStoreParameters.Builder("ldap://127.0.0.1:1389", "CN=certificates").build();
CertStore certStore = CertStore.getInstance("LDAP", CertStoreParameters, "BC");
pkixParams.addCertStore(certStore);

// Build and verify the certification chain
try {
   CertPathBuilder builder = CertPathBuilder.getInstance("PKIX", "BC");
   PKIXCertPathBuilderResult result =
           (PKIXCertPathBuilderResult) builder.build(pkixParams);

Under the hood, Bouncy Castle uses the Subject field from the certificate to build an LDAP query. The Subject field is inserted in the filter, without—you guessed it—any escaping.

Client certificate containing the text "Subject: CN=Client*)(userPassword=123"

So, if the Subject contains any special characters, it can change the syntax of the query. In most cases, this can be exploited as a blind LDAP query injection. Therefore, it might be possible to use this vulnerability to extract other fields from the LDAP directory. The exploitability depends on many factors, including whether the application exposes any errors, as well as the structure of the LDAP directory.

In general, whenever you incorporate user-supplied data into an LDAP query, special characters should be properly filtered. That’s exactly how this CVE has been patched in the Bouncy Castle code.
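A minimal sketch of such filtering, following the escaping rules from RFC 4515 (illustrative only; this is not the actual Bouncy Castle patch):

static String escapeLdapFilterValue(String value) {
    StringBuilder sb = new StringBuilder();
    for (char c : value.toCharArray()) {
        switch (c) {
            case '\\': sb.append("\\5c"); break; // escape the backslash itself
            case '*':  sb.append("\\2a"); break;
            case '(':  sb.append("\\28"); break;
            case ')':  sb.append("\\29"); break;
            case '\0': sb.append("\\00"); break;
            default:   sb.append(c);
        }
    }
    return sb.toString();
}

// For example, the malicious subject "Client*)(userPassword=123" becomes
// "Client\2a\29\28userPassword=123", which can no longer change the filter syntax.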

Chapter 3: Certificate revocation and its unintended uses

Similar to JSON web tokens, the beauty of certificate chains is that they can be trusted just based on their signature. But what happens if we need to revoke a certificate, so it can no longer be used?

The PKIX specification (RFC 4387) addresses this problem by proposing a special store for revoked certificates, accessible via HTTP or LDAP protocols. Many developers believe that revocation checking is absolutely necessary, whereas others urge to avoid it for performance reasons or only use offline revocation lists.

Generally speaking, the store location can be hardcoded into the application or taken from the certificate itself. There are two certificate extensions used for that: the Authority Information Access OCSP URL and CRL Distribution Points.

Client certificate containing URLs in its AIA OCSP and CRL Distribution Points extensions.

Looking at it from the hacker's point of view, I think it's incredible that the location of the revocation server can be taken from the certificate. So, if the application takes URLs from the AIA or CRLDP extensions to make a revocation check, this can be abused for SSRF attacks.

Sadly for attackers, this normally happens after the signature checks, but in some cases it’s still exploitable.

Moreover, LDAP is also supported, at least in Java. You probably heard that, in Java, unmarshaling an LDAP lookup response can lead to a remote code execution. A few years back, Moritz Bechler reported this problem and remote code execution via revocation has since been patched in the JDK. You can check out his blog post for more details.

In my research, I decided to check if the Bouncy Castle library is also affected. It turns out that Bouncy Castle can be configured to use the CRLDP extension and make calls to an LDAP server. At the same time, Bouncy Castle only fetches a specific attribute from the LDAP response and does not support references. So, remote code execution is not possible there. HTTP SSRF is still viable though.

private static Collection getCrlsFromLDAP(CertificateFactory certFact, URI distributionPoint) throws IOException, CRLException
{
    Map<String, String> env = new Hashtable<String, String>();

    env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
    env.put(Context.PROVIDER_URL, distributionPoint.toString());

    byte[] val = null;
    try
    {
        DirContext ctx = new InitialDirContext((Hashtable)env);
        Attributes avals = ctx.getAttributes("");
        Attribute aval = avals.get("certificateRevocationList;binary");
        val = (byte[])aval.get();
    }

Example: CVE-2023-28857 credentials leak in Apereo CAS

I also had a quick look at open source projects that support mTLS and perform revocation checking. One of these projects was Apereo CAS. It’s another popular authentication server that is highly configurable. Administrators of Apereo CAS can enable the revocation check using an external LDAP server by specifying its address and password in the settings:

cas.authn.x509.crl-fetcher=ldap
cas.authn.x509.ldap.ldap-url=ldap://example.com:1389/
cas.authn.x509.ldap.bind-dn=admin
cas.authn.x509.ldap.bind-credential=s3cr3taaaaa

If these settings are applied, Apereo CAS performs the revocation check for the certificate, fetching the address from the certificate’s CRLDP extension.

/**
* Validate the X509Certificate received.
*
* @param cert the cert
* @throws GeneralSecurityException the general security exception
*/
private void validate(final X509Certificate cert) throws GeneralSecurityException {
   cert.checkValidity();
   this.revocationChecker.check(cert);

   val pathLength = cert.getBasicConstraints();
   if (pathLength < 0) {
       if (!isCertificateAllowed(cert)) {
           val msg = "Certificate subject does not match pattern " + this.regExSubjectDnPattern.pattern();
           LOGGER.error(msg);

I was afraid that this could lead to remote code execution, but it turns out that Apereo CAS uses a custom library for LDAP connection, which does not support external codebases or object factories needed for RCE.

When I tested this in Apereo CAS, I noticed one interesting behavior. The server prefers the LDAP URL located inside the certificate, instead of the one that is configured in settings. At the same time, Apereo CAS still sends the password from the settings. I quickly set up a testing environment and sent a self-signed certificate in the header. My self-signed certificate had a CRLDP extension with the LDAP URL pointing to a netcat listener. After sending this request to Apereo CAS, I received a request to my netcat listener with the username and password leaked.

Pair of screenshots: the first contains a POST request to Apereo CAS and the second is a terminal running netcat.

After reporting this vulnerability, the application developers issued a fix within just one day. They patched it by clearing the login and password used for LDAP connection if the URL is taken from the CRLDP. Therefore, the password leak is no longer possible. Nevertheless, I would say that using URLs from the CRLDP extension is still dangerous, as it broadens the attack surface.

Summary

If you’re developing an mTLS system or performing a security assessment, I suggest:

  1. Pay attention when extracting usernames from the mTLS chain, as the servers only verify the first certificate in the chain.
  2. Use certificate stores with caution, as they can lead to LDAP and SQL injections.
  3. Certificate revocation can lead to SSRF or even to RCE in the worst case. So, do the revocation check only after all other checks and do not rely on URLs taken from the certificate extensions.

The post mTLS: When certificate authentication is done wrong appeared first on The GitHub Blog.

Streamlining Grab’s Segmentation Platform with faster creation and lower latency

Post Syndicated from Grab Tech original https://engineering.grab.com/streamlining-grabs-segmentation-platform

Launched in 2019, Segmentation Platform has been Grab’s one-stop platform for user segmentation and audience creation across all business verticals. User segmentation is the process of dividing passengers, driver-partners, or merchant-partners (users) into sub-groups (segments) based on certain attributes. Segmentation Platform empowers Grab’s teams to create segments using attributes available within our data ecosystem and provides APIs for downstream teams to retrieve them.

Checking whether a user belongs to a segment (Membership Check) influences many critical flows on the Grab app:

  1. When a passenger launches the Grab app, our in-house experimentation platform will tailor the app experience based on the segments the passenger belongs to.
  2. When a driver-partner goes online on the Grab app, the Drivers service calls Segmentation Platform to ensure that the driver-partner is not blacklisted.
  3. When launching marketing campaigns, Grab’s communications platform relies on Segmentation Platform to determine which passengers, driver-partners, or merchant-partners to send communication to.

This article peeks into the current design of Segmentation Platform and how the team optimised the way segments are stored to reduce read latency thus unlocking new segmentation use cases.

Architecture

Segmentation Platform comprises two major subsystems:

  1. Segment creation
  2. Segment serving

Fig 1. Segmentation Platform architecture

Segment creation

Segment creation is powered by Spark jobs. When a Grab team creates a segment, a Spark job is started to retrieve data from our data lake. After the data is retrieved, cleaned, and validated, the Spark job calls the serving sub-system to populate the segment with users.

Segment serving

Segment serving is powered by a set of Go services. For persistence and serving, we use ScyllaDB as our primary storage layer. We chose to use ScyllaDB as our NoSQL store due to its ability to scale horizontally and meet our <80ms p99 SLA. Users in a segment are stored as rows indexed by the user ID. The table is partitioned by the user ID ensuring that segment data is evenly distributed across the ScyllaDB clusters.

User ID | Segment Name | Other metadata columns
1221    | Segment_A    |
3421    | Segment_A    |
5632    | Segment_B    |
7889    | Segment_B    |

With this design, Segmentation Platform handles up to 12K read and 36K write QPS, with a p99 latency of 40ms.

Problems

The existing system has supported Grab, empowering internal teams to create rich and personalised experiences. However, with the increased adoption and use, certain challenges began to emerge:

  1. As more and larger segments are being created, the write QPS became a bottleneck leading to longer wait times for segment creation.
  2. Grab services requested even lower latency for membership checks.

Long segment creation times

As more segments were created by different teams within Grab, the write QPS was no longer able to keep up with the teams’ demands. Teams would have to wait for hours for their segments to be created, reducing their operational velocity.

Read latency

Further, while the platform already offers sub-40ms p99 latency for reads, this was still too slow for certain services and their use cases. For example, Grab's communications platform needed to check whether a user belongs to a set of segments before sending out communication, and incurring increased latency for every communication request was not acceptable. Another use case was for Experimentation Platform, where checks must have low latency to not impact the user experience.

Thus, the team explored alternative ways of storing the segment data with the goals of:

  1. Reducing segment creation time
  2. Reducing segment read latency
  3. Maintaining or reducing cost

Solution

Segments as bitmaps

One of the main engineering challenges was scaling the write throughput of the system to keep pace with the number of segments being created. As a segment is stored across multiple rows in ScyllaDB, creating a large segment incurs a huge number of writes to the database. What we needed was a better way to store a large set of user IDs. Since user IDs are represented as integers in our system, a natural solution to storing a set of integers was a bitmap.

For example, a segment containing the following user IDs: 1, 6, 25, 26, 89 could be represented with a bitmap as follows:

Fig 2. Bitmap representation of a segment

To perform a membership check, a bitwise operation can be used to check if the bit at the user ID’s index is 0 or 1. As a bitmap, the segment can also be stored as a single Blob in object storage instead of inside ScyllaDB.

However, as the number of user IDs in the system is large, a small and sparse segment would lead to prohibitively large bitmaps. For example, if a segment contains 2 user IDs 100 and 200,000,000, it will require a bitmap containing 200 million bits (25MB) where all but 2 of the bits are just 0. Thus, the team needed an encoding to handle sparse segments more efficiently.

Roaring Bitmaps

After some research, we landed on Roaring Bitmaps, which are compressed uint32 bitmaps. With roaring bitmaps, we are able to store a segment with 1 million members in a Blob smaller than 1 megabyte, compared to 4 megabytes required by a naive encoding.

Roaring Bitmaps achieve good compression ratios by splitting the set into fixed-size chunks of 2^16 integers and using three different data structures (containers) based on the data distribution within the chunk. The most significant 16 bits of the integer are used as the index of the chunk, and the least significant 16 bits are stored in the containers.

Array containers

Array containers are used when data is sparse (<= 4096 values). An array container is a sorted array of 16-bit integers. It is memory-efficient for sparse data and provides logarithmic-time access.

Bitmap containers

Bitmap containers are used when data is dense. A bitmap container is a 2^16-bit container where each bit represents the presence or absence of a value. It is memory-efficient for dense data and provides constant-time access.

Run containers

Finally, run containers are used when a chunk has long consecutive values. Run containers use run-length encoding (RLE) to reduce the storage required for dense bitmaps. Run containers store a pair of values representing the start and the length of the run. It provides good memory efficiency and fast lookups.

The diagram below shows how a dense bitmap container that would have required 91 bits can be compressed into a run container by storing only the start (0) and the length (90). It should be noted that run containers are used only if it reduces the number of bytes required compared to a bitmap.

Fig 3. A dense bitmap container compressed into a run container

By using different containers, Roaring Bitmaps are able to achieve good compression across various data distributions, while maintaining excellent lookup performance. Additionally, as segments are represented as Roaring Bitmaps, service teams are able to perform set operations (union, intersection, difference, etc.) on the segments on the fly, which previously required re-materialising the combined segment into the database.
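As an illustration of these operations (a sketch using the open source org.roaringbitmap Java library; Grab's own services are written in Go, so this is not the platform's actual code):

RoaringBitmap segmentA = RoaringBitmap.bitmapOf(1, 6, 25, 26, 89);
RoaringBitmap segmentB = RoaringBitmap.bitmapOf(6, 200_000_000);

boolean isMember = segmentA.contains(25);                     // membership check
RoaringBitmap both = RoaringBitmap.and(segmentA, segmentB);   // users in both segments
RoaringBitmap either = RoaringBitmap.or(segmentA, segmentB);  // users in either segment

// Serialise to a compact blob suitable for object storage.
ByteArrayOutputStream out = new ByteArrayOutputStream();
segmentA.serialize(new DataOutputStream(out));
byte[] blob = out.toByteArray();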

Caching with an SDK

Even though the segments are now compressed, retrieving a segment from the Blob store for each membership check would incur an unacceptable latency penalty. To mitigate the overhead of retrieving a segment, we developed an SDK that handles the retrieval and caching of segments.

Fig 4. How the SDK caches segments

The SDK takes care of the retrieval, decoding, caching, and watching of segments. Users of the SDK are only required to specify the maximum size of the cache to prevent exhausting the service’s memory. The SDK provides a cache with a least-recently-used eviction policy to ensure that hot segments are kept in the cache. They are also able to watch for updates on a segment and the SDK will automatically refresh the cached segment when it is updated.
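A rough sketch of that eviction policy (illustrative only; the actual SDK is an internal Go library) is a bounded least-recently-used map:

class SegmentCache<K, V> extends java.util.LinkedHashMap<K, V> {
    private final int maxEntries;

    SegmentCache(int maxEntries) {
        super(16, 0.75f, true);   // accessOrder = true keeps entries in least-recently-used order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(java.util.Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict the least-recently-used segment when over capacity
    }
}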

Hero teams

Communications Platform

Communications Platform has adopted the SDK to implement a new feature to control the communication frequency based on which segments a user belongs to. Using the SDK, the team is able to perform membership checks on multiple multi-million member segments, achieving a peak of 15K QPS with a p99 latency of <1ms. With the new feature, they have been able to increase communication engagement and reduce the communication unsubscribe rate.

Experimentation Platform

Experimentation Platform powers experimentation across all Grab services. Segments are used heavily in experiments to determine a user’s experience. Prior to using the SDK, Experimentation Platform limited the maximum size of the segments that could be used to prevent exhausting a service’s memory.

After migrating to the new SDK, they were able to lift this restriction due to the compression efficiency of Roaring Bitmaps. Users are now able to use any segments as part of their experiment without worrying that it would require too much memory.

Closing

This blog post discussed the challenges that Segmentation Platform faced when scaling and how the team explored alternative storage and encoding techniques to improve segment creation time, while also achieving low latency reads. The SDK allows our teams to easily make use of segments without having to handle the details of caching, eviction, and updating of segments.

Moving forward, there are still existing use cases that are not able to use the Roaring Bitmap segments and thus continue to rely on segments from ScyllaDB. Therefore, the team is also taking steps to optimise and improve the scalability of our service and database.

Special thanks to Axel, the wider Segmentation Platform team, and Data Technology team for reviewing the post.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: July 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-08-09-github-availability-report-july-2023/

In July, we experienced one incident that resulted in degraded performance across GitHub services.

July 21 13:07 UTC (lasting 59 minutes)

On July 21 at 13:07 UTC, GitHub experienced a partial power outage in one of our redundant data centers, which resulted in a loss of compute capacity. GitHub updated the status of six services to yellow at 13:12 UTC. The vast majority of customer impact occurred in the first 10 minutes up to 13:17 UTC as requests were internally rerouted to other nodes in the data center, but we elected to keep status at yellow until full capacity was restored out of an abundance of caution. As a result of this incident, we are conducting reviews of all power feeds with each of our datacenter partners. We have also identified improvements to reduce recovery time after power was restored and are evaluating ways to reduce the time to fail over all traffic.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: July 2023 appeared first on The GitHub Blog.

Introducing code referencing for GitHub Copilot

Post Syndicated from Ryan J. Salva original https://github.blog/2023-08-03-introducing-code-referencing-for-github-copilot/

Make more informed decisions about the code you use. In the rare case where a GitHub Copilot suggestion matches public code, this update will show a list of repositories where that code appears and their licenses. Sign up for the private beta today.

Over the course of the last year, GitHub Copilot, the world’s first at-scale AI pair programmer trained on billions of lines of public code, has attracted more than 1 million developers and helped over 27,000 organizations build faster and more productively. During that time, many developers told us they want to see when GitHub Copilot’s suggestions match public code.

Today, we’re announcing a private beta of GitHub Copilot with code referencing that includes an updated filter which detects and shows context of code suggestions matching public code on GitHub. When the filter is enabled, GitHub Copilot checks code suggestions with surrounding code of about 150 characters and compares it against an index of all the public code on GitHub.com. Matches—along with information about every repository in which they appear—are displayed right in the editor. Developers can now choose whether to block suggestions containing matching code, or allow those suggestions with information about matches.

Why? Some want to learn from others’ work, others may want to take a dependency rather than introduce new app logic, and still others want to give or receive credit for similar work. Whatever the reason, it’s nice to know when similar code is out there.

Let’s see how it works.

How GitHub Copilot code referencing works

With billions of files to index and a latency budget of only 10-20ms, it’s a miracle of engineering that this is even possible. Still, if there’s a match, a notification appears in the editor showing: (1) the matching code, (2) the repositories where that code appears, and (3) the license governing each repository.

Why code referencing matters

In our journey to create a code referencing tool, we discovered a few interesting things:

First, our previous research suggests that matches occur in less than one percent of GitHub Copilot suggestions. But that one percent isn’t evenly distributed across all use cases. In the context of an existing application with surrounding code, we almost never see a match. But in an empty or nearly empty file, we see matches far more often.

Suggestions are heavily biased toward the prompt so GitHub Copilot can provide suggestions tailor-made for your current task. That means, in an existing app with lots of context, you’ll get a suggestion customized for your code. But in an empty, or nearly empty file, there’s little to no context. So, you’re more likely to get a suggestion that matches public code.

We’ve also found that when suggestions match public code, those matches frequently appear in dozens, if not hundreds of repositories. In some ways, this isn’t surprising because the models that power GitHub Copilot are akin to giant probability machines. A code fragment that appears in many repositories is more likely to be a “pattern” detected by the model—similar to the patterns we see elsewhere in public code.

For example, research on Java projects finds that up to 11% of repositories may contain code that resembles solutions posted to Stack Overflow, and the vast majority of those snippets appear without attribution. Another study on Python found that many matches are too generic to trace to an original usage. A smaller-scale study found that Stack Overflow answers contain code from Android applications.

Finally, the repositories with matching code are often governed by multiple, sometimes conflicting licenses, which makes attributing a match to its source more challenging. By consulting a list of references, developers can now:

  • Determine whether, what, and who to attribute rather than having matches simply blocked from the outset.
  • Learn from other developers by studying their approaches to similar problems.
  • Take a dependency on an open source library to avoid new business logic.
  • Evaluate the context of code before accepting a matching suggestion.
  • Discover new projects and contribute upstream.

Building a better developer experience and community

This is just the beginning. As GitHub continues to innovate, we will always strive to help developers stay in the flow, build creatively, and maintain a transparent connection to the community. We’re excited for you to try GitHub Copilot with code referencing.

Happy coding.

How we build containerized services at GitHub using GitHub

Post Syndicated from MV Karan original https://github.blog/2023-08-02-how-we-build-containerized-services-at-github-using-github/

The developer experience engineering team at GitHub works on creating safe, delightful, and inclusive solutions for GitHub engineers to efficiently code, ship, and operate software–setting an example for the world on how to build software with GitHub. To achieve this we provide our developers with a paved path–a comprehensive suite of automated tools and applications to streamline our runtime platforms, deployment, and hosting that helps power some of the microservices on the GitHub.com platform and many internal tools. Let’s take a deeper look at how one of our main paved paths works.

Our development ecosystem

GitHub’s main paved path covers everything that’s needed for running software–creating, deploying, scaling, debugging, and running applications. It is an ecosystem of tools like Kubernetes, Docker, load balancers, and many custom apps that work together to create a cohesive experience for our engineers. It isn’t just infrastructure and isn’t just Kubernetes. Kubernetes is our base layer, and the paved path is a mix of conventions, tools, and settings built on top of it.

The kind of services that we typically run using the paved path include web apps, computation pipelines, batch processors, and monitoring systems.

Kubernetes, which is the base layer of the paved path, runs in a multi-cluster, multi-region topology.

Benefits of the paved path

There are hundreds of services at GitHub–from a small internal tool to an external API supporting production workloads. For a variety of reasons, it would be inefficient to spin up virtual machines for each service.

  • Planning and capacity usage across all services wouldn’t be efficient. We would encounter significant overhead in managing both physical and Kubernetes infrastructure on an ongoing basis.
  • Teams would need to build deep expertise in managing their own Kubernetes clusters and would have less time to focus on their application’s unique needs.
  • We would have less central visibility of applications.
  • Security and compliance would be difficult to standardize and enforce.

With the paved path based on Kubernetes and other runtime apps, we’re instead able to:

  • Plan capacity centrally and only for the Kubernetes nodes, so we can optimally use capacity across nodes, as small workloads and large workloads coexist on the same machines.
  • Scale rapidly thanks to central capacity planning.
  • Easily manage configuration and deployments across services in one central control plane.
  • Consistently provide insights into app and deployment performance for individual services.

Onboarding a service

Onboarding a service whose code lives in its own repository is made easy by our ChatOps command service, Hubot, together with GitHub Apps. Service owners can generate the basic scaffolding needed to deploy the service by running a command like:

hubot gh-platform app scaffold monalisa-app

A custom GitHub App installed on the service’s GitHub repository will then automatically generate a pull request to add the necessary configurations, which includes:

  • A deployment.yaml file that defines the service’s deployment environments.
  • Kubernetes manifests that define Deployment and Service objects for deploying the service.
  • A Debian Dockerfile that runs a trivial web server to start off with, which will be used by the Kubernetes manifests.
  • CI builds, set up as GitHub Checks, that build the Docker image on every push and store it in a container registry, ready for deployment.

Each service that is onboarded to the paved path has its unique Kubernetes namespace that is defined by <app-name>-<environment> and generally has a staging and production environment. This helps separate the workloads of multiple services, and also multiple environments for the same service since each environment gets its own Kubernetes namespace.
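
To make the convention concrete, here is a rough sketch of what a scaffolded Deployment manifest for a hypothetical monalisa-app staging environment might look like. This is an illustration only, not GitHub’s actual configuration; the image registry and labels are placeholders.

# Hypothetical manifest for the monalisa-app service's staging environment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: monalisa-app
  namespace: monalisa-app-staging   # <app-name>-<environment>
spec:
  replicas: 2
  selector:
    matchLabels:
      app: monalisa-app
  template:
    metadata:
      labels:
        app: monalisa-app
    spec:
      containers:
        - name: monalisa-app
          image: registry.example.com/monalisa-app:latest   # placeholder image reference
          ports:
            - containerPort: 8080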

Deploying a service

At GitHub, we deploy branches and perform deployments through Hubot ChatOps commands. To deploy a branch named bug-fixes in the monalisa-app repository to the staging environment, a developer would run a ChatOps command like:

hubot deploy monalisa-app/bug-fixes to staging

This triggers a deployment that fetches the Docker image associated with the latest commit in the bug-fixes branch, updates the Kubernetes manifests, and applies those manifests to the clusters in the runtime platform relevant to that environment.

Typically, the Docker image would be deployed to multiple Kubernetes clusters across multiple geographical sites in a region that forms a part of the runtime platform.

To automate pull request merges into the busiest branches and to orchestrate rollouts across environments, we also use merge queue and deployment pipelines, which our engineers can observe and interact with during their deployments.

Securing our services

For any company, the security of the platform itself, along with services running within it, is critical. In addition to our engineering-wide practices, such as requiring two-person reviews on every pull request, we also have Security and Platform teams automating security measures, such as:

  • Pre-built Docker images to be used as base images for the Dockerfiles. These base images contain only the necessary packages/dependencies with security updates, a set of installed software that is auditable and curated according to shared needs.
  • Build-time and periodic scanning of all packages and running container images, for any vulnerabilities or dependencies needing a patch, powered by our own products for software supply chain security like Dependabot.
  • Build time and periodic scanning of GitHub repositories of services for exposed secrets and vulnerabilities, using GitHub’s native security features for advanced security like code scanning and secret scanning.
  • Multiple authentication and authorization mechanisms that allow only the relevant individuals to directly access underlying Kubernetes resources.
  • Comprehensive telemetry for threat detection.
  • Services running in the platform are by default accessible only within GitHub’s internal networks and not exposed on the public internet.
  • Branch protection policies are enforced on all production repositories. These policies prevent merging a pull request until designated automated tests pass and the change has been reviewed by a different developer from the one who proposed the change.

Another key aspect of security for an application is how secrets like keys and tokens are managed. At GitHub, we use a centralized secret store to manage secrets. Each service and each environment within the service has its own vault to store secrets. These secrets are then injected into the relevant pods in Kubernetes, which are then exposed to the containers.
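
Conceptually, the injection step looks like a standard Kubernetes secret reference. The sketch below is a simplified illustration with hypothetical names; the actual vault integration involves more machinery than this.

# Hypothetical example: exposing a vault-backed secret to a container as an env var.
apiVersion: v1
kind: Pod
metadata:
  name: monalisa-app
  namespace: monalisa-app-production
spec:
  containers:
    - name: monalisa-app
      image: registry.example.com/monalisa-app:latest   # placeholder image reference
      env:
        - name: DATABASE_PASSWORD              # env var visible inside the container
          valueFrom:
            secretKeyRef:
              name: monalisa-app-secrets       # Kubernetes Secret populated from the service's vault
              key: database-password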

The deployment flow, from merge to rollout

The whole deployment process would look something like this:

  1. A GitHub engineer merges a pull request to a branch in a repository. In the above example, it is the bug-fixes branch in the monalisa-app repository. This repository would also contain the Kubernetes manifest template files for deploying the application.
  2. The pull request merge triggers relevant CI workflows. One of them is a build of the Docker image, which builds the container image based on the Dockerfile specified in the repository, and pushes the image to an internal artifact registry.
  3. Once all the CI workflows have completed successfully, the engineer initiates a deployment by running a ChatOps command like hubot deploy monalisa-app/bug-fixes to staging. This triggers a deployment to our environments such as Staging.
  4. The build systems fetch the Kubernetes manifest files from the repository branch, substitute in the latest image from the artifact registry, inject app secrets from the secret store, and run some custom operations. At the end of this stage, a ready-to-deploy Kubernetes manifest is available.
  5. Our deployment systems then apply the Kubernetes manifest to relevant clusters and monitor the rollout status of new changes.

Conclusion

GitHub’s internal paved path helps developers at GitHub focus on building services and delivering value to our users, with minimal focus on the infrastructure. We accomplish this by providing a streamlined path to our GitHub engineers that uses the power of containers and Kubernetes; scalable security, authentication, and authorization mechanisms; and the GitHub.com platform itself.

Want to try some of these for yourself? Learn more about all of GitHub’s features on github.com/features. If you have adopted any of our practices for your own development, give us a shout on Twitter!

Scaling merge-ort across GitHub

Post Syndicated from Matt Cooper original https://github.blog/2023-07-27-scaling-merge-ort-across-github/

At GitHub, we perform a lot of merges and rebases in the background. For example, when you’re ready to merge your pull request, we already have the resulting merge assembled. Speeding up merge and rebase performance saves both user-visible time and backend resources. Git has recently learned some new tricks which we’re using at scale across GitHub. This post walks through what’s changed and how the experience has improved.

Our requirements for a merge strategy

There are a few non-negotiable parts of any merge strategy we want to employ:

  • It has to be fast. At GitHub’s scale, even a small slowdown is multiplied by the millions of activities going on in repositories we host each day.
  • It has to be correct. For merge strategies, what’s “correct” is occasionally a matter of debate. In those cases, we try to match what users expect (which is often whatever the Git command line does).
  • It can’t check out the repository. There are both scalability and security implications to having a working directory, so we simply don’t.

Previously, we used libgit2 to tick these boxes: it was faster than Git’s default merge strategy and it didn’t require a working directory. On the correctness front, we either performed the merge or reported a merge conflict and halted. However, because of additional code related to merge base selection, sometimes a user’s local Git could easily merge what our implementation could not. This led to a steady stream of support tickets asking why the GitHub web UI couldn’t merge two files when the local command line could. We weren’t meeting those users’ expectations, so from their perspective, we weren’t correct.

A new strategy emerges

Two years ago, Git learned a new merge strategy, merge-ort. As the author details on the mailing list, merge-ort is fast, correct, and addresses many shortcomings of the older default strategy. Even better, unlike merge-recursive, it doesn’t need a working directory. merge-ort is much faster even than our optimized, libgit2-based strategy. What’s more, merge-ort has since become Git’s default. That meant our strategy would fall even further behind on correctness.
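
For a sense of what a merge with no working directory looks like in practice, recent Git versions expose merge-ort through the git merge-tree plumbing command. The sketch below shows the general shape of such an invocation; it is not necessarily how GitHub calls it internally.

# Merge "feature" into "main" using merge-ort, entirely in the object database (Git 2.38+).
# On a clean merge this prints the OID of the resulting tree; on a conflict it exits
# non-zero and reports the conflicted paths.
git merge-tree --write-tree main feature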

It was clear that GitHub needed to upgrade to merge-ort. We split this effort into two parts: first deploy merge-ort for merges, then deploy it for rebases.

merge-ort for merges

Last September, we announced that we’re using merge-ort for merge commits. We used Scientist to run both code paths in production so we could compare timing, correctness, and more without much risk. The customer still gets the result of the old code path, while the GitHub feature team gets to compare and contrast the behavior of the new code path. Our process was:

  1. Create and enable a Scientist experiment with the new code path.
  2. Roll it out to a fraction of traffic. In our case, we started with some GitHub-internal repositories first before moving to a percentage-based rollout across all of production.
  3. Measure gains, check correctness, and fix bugs iteratively.

We saw dramatic speedups across the board, especially on large, heavily trafficked repositories. For our own github/github monolith, we saw a 10x speedup in both the average and P99 case. Across the entire experiment, our P50 saw the same 10x speedup and the P99 case got nearly a 5x boost.

Chart showing experimental candidate versus control at P50. The candidate implementation fairly consistently stays below 0.1 seconds.

Chart showing experimental candidate versus control at P99. The candidate implementation follows the same spiky pattern as the control, but its peaks are much lower.

Dashboard widgets showing P50 average times for experimental candidate versus control. The control averages 71.07 milliseconds while the candidate averages 7.74 milliseconds.

Dashboard widgets showing P99 average times for experimental candidate versus control. The control averages 1.63 seconds while the candidate averages 329.82 milliseconds.

merge-ort for rebases

Like merges, we also do a huge number of rebases. Customers may choose rebase workflows in their pull requests. We also perform test rebases and other “behind the scenes” operations, so we also brought merge-ort to rebases.

This time around, we powered rebases using a new Git subcommand: git-replay. git replay was written by the original author of merge-ort, Elijah Newren (a prolific Git contributor). With this tool, we could perform rebases using merge-ort without needing a worktree. Once again, the path was pretty similar:

  1. Merge git-replay into our fork of Git. (We were running the experiment with Git 2.39, which didn’t include the git-replay feature.)
  2. Before shipping, leverage our test suite to detect discrepancies between the old and the new implementations.
  3. Write automation to flush out bugs by performing test rebases of all open pull requests in github/github and comparing the results.
  4. Set up a Scientist experiment to measure the performance delta between libgit2-powered and merge-ort-powered rebases, and monitor for unexpected mismatches in behavior.
  5. Measure gains, check correctness, and fix bugs iteratively.
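
For illustration, a rebase driven by git replay looks roughly like the sketch below. This assumes a Git build that ships the subcommand; the branch names are placeholders and the exact invocation may differ from what runs inside GitHub.

# Replay the commits on "topic" onto "main" using merge-ort, without a worktree.
# git replay prints ref updates, which can be applied with `git update-ref --stdin`.
git replay --onto main main..topic | git update-ref --stdin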

Once again, we were amazed at the results. The following is a great anecdote from testing, as relayed by @wincent (one of the GitHub engineers on this project):

Another way to think of this is in terms of resource usage. We ran the experiment over 730k times. In that interval, our computers spent 2.56 hours performing rebases with libgit2, but under 10 minutes doing the same work with merge-ort. And this was running the experiment for 0.5% of actors. Extrapolating those numbers out to 100%, if we had done all rebases during that interval with merge-ort, it would have taken us 2,000 minutes, or about 33 hours. That same work done with libgit2 would have taken 512 hours!

What’s next

While we’ve covered the most common uses, this is not the end of the story for merge-ort at GitHub. There are still other places in which we can leverage its superpowers to bring better performance, greater accuracy, and improved availability. Squashing and reverting are on our radar for the future, as well as considering what new product features it could unlock down the road.

Appreciation

Many thanks to all the GitHub folks who worked on these two projects. Also, GitHub continues to be grateful for the hundreds of volunteer contributors to the Git open source project, including Elijah Newren for designing, implementing, and continually improving merge-ort.

How to build a GPT-3 App with Nextjs, React, and GitHub Copilot

Post Syndicated from Kedasha Kerr original https://github.blog/2023-07-25-how-to-build-a-gpt-3-app-with-nextjs-react-and-github-copilot/

At the beginning of the year, I started working out with a trainer who wanted me to start tracking my food, but I’ve always been super against tracking my meals because it just doesn’t work for me. Instead of tracking my meals, however, I decided to build an application that automagically tells me the nutritional information of any recipe. But to do that, I needed some pretty complex natural language parsing capabilities, so I figured this would be a great opportunity for me to play around with OpenAI and get to use GitHub Copilot a little bit more to help me build the app quickly.

GitHub Copilot is a great example of a product that takes advantage of Large Language Models (LLM) to solve problems for people and improve their productivity. In this blog, I’ll take you through how I created my own application that finds the nutritional information for any recipe using OpenAI’s GPT-3.5-turbo model, GitHub Copilot, Next.js, React, and Material UI.

Let’s dig right into the tutorial.

1. Create a repository and install dependencies

To get started, let’s create a new repository from the GitHub Codespaces Next.js template to get up and running quickly. Go to this repository, and make a copy. To do so, click on the green “Use this template” button then select “Create a new repository” and name your repository whatever you like. I called mine “mealmetrics-copilot.”

Now, clone the repository to your local machine, and open up the repository in your preferred code editor. I’m using VS Code.

Open up your terminal and cd into the project so we can install a few needed dependencies. In your terminal run the following command:

npm i express openai dotenv @material-ui/core @material-ui/icons

Then, install the following as a dev dependency:

npm i --save-dev nodemon

Once everything installs successfully, we’re ready to start building the server, but first, let’s grab our api key from OpenAI.

2. Getting your OpenAI API key

Go to OpenAI’s developer login page and create a new account or sign in if you already have one. Once you’ve logged in, click your name in the upper right-hand corner and select “view API keys.” Click the “Create a new secret key” button. From there, you can name your API key, click the green button to generate the key, and then copy the API key and save it in a secure location (such as a password manager).

Save your API key in a .env file at the root of the project, and add .env to your .gitignore file.

3. Install GitHub Copilot extension

We’ll be using GitHub Copilot as our assistant to build this application. If you’re not familiar with GitHub Copilot, read this blog post to learn more.

In your code editor of choice, go to your extensions panel and search for GitHub Copilot—I’m using VSCode and this is what that looks like.

Click the install button then click the login button to authenticate your access. Once that’s done, you’ll be ready to get started and follow along!

4. Building the server

Now that we have our apikey, have installed dependencies, and have GitHub Copilot in our code editor, let’s dig into building the application!

The first thing we’re going to do is build a simple server with Express.js (if you prefer to use Fastify, NestJS, Koa or something else, feel free to use them!).

In the pages folder, create a folder called api and then a file called server.js. This is where we’ll add our prompting information for GitHub Copilot. Let’s add our first prompt as a comment in the server.js file that says the following:

Create a server with the following specifications:

1. import express and dotenv node modules
2. create the server with express and name it app
3. use port 8080 as default port
4. enable body parser to accept json data
5. state which port the server is listening to and log it to the console

Hit the enter key and GitHub Copilot will start generating suggestions to build the server. To accept the suggestions, hit the tab key. Take a look at this video to see what accepting suggestions looks like.

We can update the package.json file to include the script devserver: "nodemon pages/api/server.js", then run the command in our terminal using npm run devserver. You’ll see that the server has started!
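
For reference, a server generated from a prompt like this might look roughly like the sketch below. Your Copilot suggestions will likely differ; treat this as one plausible outcome rather than the exact output.

// pages/api/server.js: a possible result of the prompt above
const express = require("express");
const dotenv = require("dotenv");

dotenv.config();

const app = express();
const port = process.env.PORT || 8080;

// enable body parser to accept JSON data
app.use(express.json());

app.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});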

After we build the simple server, let’s move on to making this a bit more complex by building the controller for our app.

Let’s create a new file in the api folder called generateInfo.js and add the following comment in the file:

Create a controller with the following specifications:

1. import the Configuration class and the OpenAIApi class from the openai npm module
2. create a new configuration object that includes the api key and uses the Configuration class from the openai module
3. create a new instance of the OpenAIApi class and pass in the configuration object
4. create an async function called generateInfo that accepts a request and response object as parameters
5. use try to make a request to the OpenAI completion API and return the response
6. use catch to catch any errors and return the error include a message to the user
7. export the generateInfo function as a module

As you’ll notice, I’m being very explicit in my instructions to GitHub Copilot. One thing to always remember when working with LLMs is that the magic is in the prompt—the clearer you are in your instructions, the better the results you’ll get.

Hit enter on your keyboard and then hit tab to accept the recommendations that GitHub Copilot provides. In the image below, you’ll notice that Copilot’s suggestions are gray.

Accept the suggestion by hitting tab on your keyboard and let’s do some cleaning up.

Remember, GitHub Copilot is our assistant, so we still need to ensure that the suggestions it provides meet our requirements.

Since we’re building a gpt-3 application, we’ll be using the completion API from OpenAI and the gpt-3.5-turbo model to generate nutritional information for us.

If you look at the suggestion above, we were provided with the davinci engine and parameters that are not needed for this project—we also need the messages parameter to send requests with the 3.5-turbo model.

We also want to add the recipe prompt, add a recipe variable that holds the recipe a user submits, and update the completion object sent to OpenAI. The keys we’ll be using are model, messages, max_tokens, temperature, and n. Learn more about these parameters by reading OpenAI’s API docs.

Let’s also update the error message to include 401 just in case our api key is invalid. So, let’s make these updates.

1. Add recipe prompt

Create a new folder called data at the root of your project, then create a file called prompt.json. This will contain a part of the recipe prompt that we’ll send to OpenAI. Add the following script to the prompt.json file:

{
"recipePrompt": "I want you to act as a Nutrition Facts Generator. I will provide you with a recipe and your role is to generate nutrition facts for that recipe. You should use your knowledge of nutrition science, nutrition facts labels and other relevant information to generate nutritional information for the recipe. Add each nutrition fact to a new line. I want you to only reply with the nutrition fact. Do not provide any other information. My first request is: "
}

Then, import the recipePrompt into the generateInfo.js file and update the main function to grab the recipe submitted by the user.

// add the prompt to the top of the file
const { recipePrompt } = require("../../data/prompt.json");

// update this function to include the recipe before the try
const generateInfo = async (req, res) => {
  const { recipe } = req.body;
};

2. Update the completion function and response

Now, let’s update the try to look a bit more like what we want.

model: "gpt-3.5-turbo",
messages: [{ role: "user", content: `${recipePrompt}${recipe}` }],
max_tokens: 200,
temperature: 0,
n: 1,

And let’s also update the response that we receive.

const response = completion.data.choices[0].message.content;

return res.status(200).json({
  success: true,
  data: response,
});

3. Update the error message

Finally, let’s update the catch to have more explicit error messages.

catch (error) {
  if (error.response.status === 401) {
    return res.status(401).json({
      error: "Please provide a valid API key.",
    });
  }
  return res.status(500).json({
    error:
      "An error occurred while generating recipe information. Please try again later.",
  });
}

Once we’ve updated everything, our controller function should look like this:

const { Configuration, OpenAIApi } = require("openai");
const { recipePrompt } = require("../../data/prompt.json");

const config = new Configuration({
  apiKey: process.env.OPENAI_API_KEY,
});

const openai = new OpenAIApi(config);

const generateInfo = async (req, res) => {
  const { recipe } = req.body;

  try {
    const completion = await openai.createChatCompletion({
      model: "gpt-3.5-turbo",
      messages: [{ role: "user", content: `${recipePrompt}${recipe}` }],
      max_tokens: 200,
      temperature: 0,
      n: 1,
    });
    const response = completion.data.choices[0].message.content;

    return res.status(200).json({
      success: true,
      data: response,
    });
  } catch (error) {
    console.log(error);
    if (error.response.status === 401) {
      return res.status(401).json({
        error: "Please provide a valid API key.",
      });
    }
    return res.status(500).json({
      error:
        "An error occurred while generating recipe information. Please try again later.",
    });
  }
};

module.exports = { generateInfo };

Now, let’s create the router and test this out in Postman. Create a new file called router.js and start typing to allow GitHub Copilot to assist you while typing.

As you can see, GitHub Copilot offered suggestions while I was typing, and I hit the tab key to accept them. Next, add the newly created router to the server.js file so we can test the route in Postman. Add the following to your server.js file:

app.use('/openai', require('./router'));
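
For reference, the router itself can be as small as the sketch below: an Express Router that forwards POST requests to the generateInfo controller. Your generated code may differ.

// pages/api/router.js: a possible shape for the router
const express = require("express");
const router = express.Router();
const { generateInfo } = require("./generateInfo");

router.post("/generateInfo", generateInfo);

module.exports = router;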

Now, let’s test the route in postman with a POST request—make sure your server is still running!

Go to the URL below and add any recipe to the body of the request:

URL: http://localhost:8080/openai/generateinfo

RECIPE:
{
"recipe": "1 cup of all purpose flour, sifted 1 1/2 teaspoon baking powder 1/4 teaspoon salt 2 Tablespoon granulated sugar 1/2 Tablespoon unsalted butter, room temperature Approximately 1/3 cup water"
}

You should receive a successful response that looks like this:

{
"success": true,
"data": "\n\nCalories: 112 \nTotal Fat: 2.3g \nSaturated Fat: 1.3g \nTrans Fat: 0g \nCholesterol: 5.4mg \nSodium: 175.8mg \nTotal Carbohydrates: 21.6g \nDietary Fiber: 0.8g \nSugars: 6.5g \nProtein: 2.6g"
}

Here is the data that we submitted to OpenAI that generated that successful response:

data: '{"model":"text-davinci-003","prompt":"I want you to act as a Nutrition Facts Generator. I will provide you with a recipe and your role is to generate nutrition facts for that recipe. You should use your knowledge of nutrition science, nutrition facts labels and other relevant information to generate nutritional information for the recipe. Add each nutrition fact to a new line. I want you to only reply with the nutrition fact. Do not provide any other information. My first request is: 1 cup of all purpose flour, sifted 1 1/2 teaspoon baking powder 1/4 teaspoon salt 2 Tablespoon granulated sugar 1/2 Tablespoon unsalted butter, room temperature Approximately 1/3 cup water","max_tokens":200,"temperature":0.5,"n":1}',

As you can see, both the prompt script and the recipe entered in the request body were sent to OpenAI, which generated the nutritional data for this Jamaican fried dumpling recipe.

Now, let’s create the frontend of the application to display the info on the web.

5. Building the frontend app

We’ll be using React for the frontend. Delete all the code in the index.js file that currently exists in the project. Then, in a comment, instruct GitHub Copilot to build a simple text area.

Create a text area with the following specifications:
1. a H1 with the text "Find Nutrition Facts for any recipe"
2. a text area for users to upload recipe
3. a button for users to submit the entered recipe
4. a section at the bottom to display nutrition facts
5. Get the data from this link: http://localhost:8080/openai/generateinfo
6. Name the component RecipeInfo

GitHub Copilot quickly generated the code for us.

Let’s hit tab to accept the suggestion and run npm run dev in your terminal to see the app on the web. When you go to localhost:3000, you’ll see the following displayed on the web.

Admittedly, it’s not the most beautiful thing, but we were able to spin this up very quickly.

In less than one minute, we completed a functional frontend MVP of our application with GitHub Copilot.

Let’s enter the recipe we have into the text box and see the response that we get back.

And we have our first error—which is not surprising since we didn’t validate the code that was provided. Let’s look into the console and see if we have any additional details.

Seems it’s a CORS issue—classic. Let’s ask GitHub Copilot how to resolve this.

Add the following questions as a comment anywhere in your file:

q: how do I resolve the CORS error?
q: how do I add Access-Control-Allow-Origin to the header?

The question and response should look something like this:

We can also ask GitHub Copilot Chat how to resolve CORS errors and it gives us a seamless response.

Let’s install the cors middleware and add it to the server.js file.

const cors = require("cors");

// Allow cross-origin requests (CORS)
app.use(cors());

Then, let’s update our router.js file.

router.options("/generateInfo", (req, res) => {
  res.setHeader("Access-Control-Allow-Origin", "*");
  res.setHeader("Access-Control-Allow-Headers", "*");
  res.setHeader("Access-Control-Allow-Methods", "*");
  res.sendStatus(200);
});

Now, let’s try fetching nutritional data again and see what happens.

And we have another error. Progress!

This time there’s no need to debug with GitHub Copilot since we’re being told that the data being returned is an object with the keys success and data. Let’s change the name of the response variable to recipeInfo and update the nutrition state to receive recipeInfo.data.

const recipeInfo = await response.json();
setNutrition(recipeInfo.data);
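
In context, the submit handler might end up looking roughly like the sketch below. The handleSubmit name, the recipe state variable, and the shape of the fetch call are assumptions based on the component spec above, not the exact code Copilot generated.

// Hypothetical submit handler inside the RecipeInfo component.
const handleSubmit = async (event) => {
  event.preventDefault();
  const response = await fetch("http://localhost:8080/openai/generateinfo", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ recipe }), // "recipe" is the text area's state value
  });
  const recipeInfo = await response.json();
  setNutrition(recipeInfo.data); // the API returns { success, data }
};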

Let’s try sending the recipe again and hope for a successful response.

Success!

We just created a GPT-3 app in record time with GitHub Copilot, React, Next.js, and OpenAI. Now that we have the data that we need, let’s make the application more beautiful with Material UI.

6. Styling the app with Material UI

In this section, we’ll be using a GitHub Copilot X feature that’s in technical preview for individuals and in public beta for organizations, Copilot Chat, to improve the appearance of our application. You must have GitHub Copilot access to join the Copilot Chat waitlist, which is currently open. Sign up today if you haven’t yet!

Let’s ask GitHub Copilot chat how we can implement material-ui into the application:

Let’s go ahead and implement the suggestions and see what happens, and also ask GitHub Copilot chat to implement a header for us.

After we implement the header, new text area and center the content, this is what the app is looking like.

Ok, we’re getting somewhere.

Let’s make a few more updates with the assistance of GitHub Copilot Chat. I’ve included the prompt/questions I asked:

  • Make the text area larger and implement Material UI
update the component to use material ui with the content centered and the button positioned below the text area. use Grid from material ui and any other components needed.
  • Add the paper component from Material UI to elevate the look and feel of the app
add the Paper component from material ui to the text area highlighted
  • Add a second button that clears the text area + facts after a recipe is submitted
add a button to the app to clear the text in the textarea
  • Add a loader while waiting for the data to load
add a loader to the highlighted code that checks if the data is loading. If the data is loading, then display the text "Nutrition Facts" and loader, if there is an error, display the error message otherwise, display nothing
  • Add a theme with custom primary and secondary colors
how do I create a custom theme with material ui and where do I create the custom theme?
  • Prevent the text area from going over the paper component from Material UI
in the highlighted code, how do I prevent the text area line from going over the paper component?
  • Add a new component—footer
Create a footer component with the following specifications:
1. The footer must be fixed at the bottom of the page
2. Use the Paper component from Material UI
3. Use the Typography component from Material UI
4. The text must say "Made with ❤ by LadyKerr & GitHub Copilot" and "Powered by OpenAI"
5. The text "GitHub Copilot" must be a link to https://copilot.github.com/ that opens in a new tab with alt text "GitHub Copilot"
6. The text "OpenAI" must be a link to https://openai.com/ that opens in a new tab with alt text "openai api"

Ask GitHub Copilot chat to make the changes above and, once it’s done, your application should be looking and functioning like this:

Now, let’s create a new component that splits each nutrition fact onto a new line. Create a new file called NutritionFacts.js in the components folder, and write the following as a comment so GitHub Copilot can get to work.

Create a component with the following specifications:
1. the component must split the received string data at \n\n or \n and return a Typography component for each string
2. the component must set a unique key for each Typography component
3. the component must return a div with the Typography components
4. the component must return null if the data is not a string
5. Name the component NutritionFacts
6. Use the Paper Component from Material UI
7. Add text above the data that says "Here are the nutrition facts for your recipe:"

From this prompt, GitHub Copilot generated the following code.

import React from "react";
import { Typography, Paper } from "@material-ui/core";

const NutritionFacts = ({ data }) => {
  if (typeof data !== "string") {
    return null;
  }
  const nutritionFacts = data.split(/\n\n|\n/);
  return (
    <Paper>
      <Typography variant="h6">
        Here are the nutrition facts for your recipe:
      </Typography>
      {nutritionFacts.map((fact, index) => (
        <Typography key={index}>{fact}</Typography>
      ))}
    </Paper>
  );
};

export default NutritionFacts;

Brilliant! Let’s import this component into our main index.js file and test the application to see if each fact was split on to a new line as expected.

And it did. Our app is functioning as expected.

Now, let’s move the code for the header into a new file in the components folder called Header.js. Once everything is updated, the final application looks like this and returns the nutritional data for any recipe.

So, there we have it!

We just built an application using GitHub Copilot, OpenAI, React, Next.js, and GitHub Copilot Chat. The next step would be to deploy the application on GitHub Pages and deploy your server on a service like Azure.

You can see the full code here—feel free to clone or fork the project and make it your own. This was a fun little project to build and I hope you learned something new and feel inspired to create your own GPT-3 app!

Learn more about prompting GitHub Copilot by reading How to use GitHub Copilot: Prompts, tips and use cases and A Developer’s Guide to Prompt Engineering and LLMs.

Until next time, happy coding!

Metrics for issues, pull requests, and discussions

Post Syndicated from Zack Koppert original https://github.blog/2023-07-19-metrics-for-issues-pull-requests-and-discussions/

Data-driven insights

At GitHub, we believe that data-driven insights are the keys to success for any software development project. Understanding the health and progress of your issues, pull requests, and discussions is crucial for effective collaboration, maintainership, and project management.

That is why we’re excited to announce the release of the Issue Metrics GitHub Action, a powerful tool that empowers developers and teams to measure key metrics and gain valuable insights into their projects.

With the new Issue Metrics GitHub Action, you can now easily track and monitor important metrics related to issues, pull requests, and discussions, such as time to first response, time to close, and more for any given time period.

Whether you’re an individual developer, a small team, or a large organization, these metrics will help you gauge the overall health, progress, and engagement of your projects.

Sample report

A sample report showing two tables. The first table contains overall metrics, like average time to first response, and a corresponding value of 50 minutes and 44 seconds. The second table contains a list of the issues measured, with links to each issue and the metrics as measured on the individual issue.

Common use cases

Maintainers: ensuring proper attention

As a maintainer, it is essential to give reasonable attention to the issues and pull requests in the repositories you maintain. With the Issue Metrics GitHub Action, you can track metrics, such as the number of open issues, closed issues, open pull requests, and merged pull requests.

These metrics can provide you with a clear overview of the workload for a project over a given week, month, or even year. The action can also help you and your team prioritize time and attention effectively, while highlighting potentially overlooked requests in need of attention.

First responders: timely user contact

As a first responder in a repository, it’s part of the job description to ensure that users receive contact in a reasonable amount of time. By utilizing the Issue Metrics GitHub Action, you can keep track of metrics like the number of discussions awaiting replies, unresolved issues, or pull requests waiting for reviews. These metrics enable you to maintain a high level of responsiveness, fostering a positive user experience and timely problem resolution. These can be used to build a to-do list or retrospectively to reflect on how long users had to wait for a response during a given time period.

Open Source Program Office (OSPO): streamlining open source requests

An important part of what OSPOs do is making the open source release process easy and efficient while adhering to company policy. This process usually involves employees opening an issue, pull request, or discussion. With the Issue Metrics GitHub Action, OSPOs can gain valuable insights into the number of requests, the ratio of open to closed requests, and metrics related to the time it takes to navigate the open-source process to completion.

These metrics empower you to streamline your workflows, optimize response times, and ensure a smooth open-source collaboration experience. Optimizing the open source release process encourages employees to continue to produce open source projects on the organization’s behalf.

Product development teams: optimizing pull request reviews

Product development teams rely heavily on the code review process to collaborate and build high-quality software. By leveraging the Issue Metrics GitHub Action, teams can measure metrics such as the time it takes to get pull request reviews. These insights allow you to reflect on the data during retrospectives, identify areas for improvement, and optimize the review process to enhance team collaboration and accelerate development cycles.

Certain aspects of efficiency and flow may be hard to measure but often it is possible to spot and remove inefficiencies in the value stream.

– Forsgren et al. 2021

Setup and workflow integration

Setting up the Issue Metrics GitHub Action takes a few minutes, compared to the few hours it takes to calculate these metrics manually. You also only need to set up the action once, and it will run on a schedule of your choosing. It integrates into your existing GitHub Actions workflow, or you can create a new workflow specifically for metrics tracking.

The action provides a wide range of customizable options, allowing you to tailor the issues, pull requests, and discussions measured by using GitHub’s powerful search filtering. Ready-to-use configurations have been tested and used internally at GitHub and are now available for you to try out as well.

Here is one such example that runs monthly to report on metrics for issues created last month:

name: Monthly issue metrics
on:
  workflow_dispatch:
  schedule:
    - cron: '3 2 1 * *'

jobs:
  build:
    name: issue metrics
    runs-on: ubuntu-latest

    steps:

    - name: Get dates for last month
      shell: bash
      run: |
        # Get the current date
        current_date=$(date +'%Y-%m-%d')

        # Calculate the previous month
        previous_date=$(date -d "$current_date -1 month" +'%Y-%m-%d')

        # Extract the year and month from the previous date
        previous_year=$(date -d "$previous_date" +'%Y')
        previous_month=$(date -d "$previous_date" +'%m')

        # Calculate the first day of the previous month
        first_day=$(date -d "$previous_year-$previous_month-01" +'%Y-%m-%d')

        # Calculate the last day of the previous month
        last_day=$(date -d "$first_day +1 month -1 day" +'%Y-%m-%d')

        echo "$first_day..$last_day"
        echo "last_month=$first_day..$last_day" >> "$GITHUB_ENV"

    - name: Run issue-metrics tool
      uses: github/issue-metrics@v2
      env:
        GH_TOKEN: ${{ secrets.GH_TOKEN }}
        SEARCH_QUERY: 'repo:owner/repo is:issue created:${{ env.last_month }} -reason:"not planned"'

    - name: Create issue
      uses: peter-evans/create-issue-from-file@v4
      with:
        title: Monthly issue metrics report
        content-filepath: ./issue_metrics.md
        assignees: <YOUR_GITHUB_HANDLE_HERE>
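
The same workflow can measure pull requests or discussions simply by changing the search query. For example, a query along these lines targets last month’s pull requests instead of issues (a sketch; check the action’s README for the exact filters it supports):

        SEARCH_QUERY: 'repo:owner/repo is:pr created:${{ env.last_month }}'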

Ready to start leveling up your GitHub project management?

Head over to the Issue Metrics GitHub Action repository to explore the documentation, installation instructions, and examples. The repository provides a comprehensive README file that guides you through the setup process and showcases the wide range of metrics you can measure. If you need additional help, feel free to open an issue in the repository.

GitHub is committed to providing developers with the best tools to enhance collaboration and productivity. The Issue Metrics GitHub Action is a significant step towards empowering teams to measure key metrics related to issues, pull requests, and discussions. By gaining valuable insights into the pulse of your projects, you can drive continuous improvement and deliver exceptional software. We are using this in several places internally across GitHub to help us continually improve and hope this action can help you as well. Happy coding!

A developer’s guide to prompt engineering and LLMs

Post Syndicated from Albert Ziegler original https://github.blog/2023-07-17-prompt-engineering-guide-generative-ai-llms/


In a blog post authored back in 2011, Marc Andreessen warned that “Software is eating the world.” Over a decade later, we are witnessing the emergence of a new type of technology that’s consuming the world with even greater voracity: generative artificial intelligence (AI). This innovative AI includes a unique class of large language models (LLMs), derived from a decade of groundbreaking research, that are capable of outperforming humans at certain tasks. And you don’t have to have a PhD in machine learning to build with LLMs—developers are already building software with LLMs using basic HTTP requests and natural language prompts.

In this article, we’ll tell the story of GitHub’s work with LLMs to help other developers learn how to best make use of this technology. This post consists of two main sections: the first will describe at a high level how LLMs function and how to build LLM-based applications. The second will dig into an important example of an LLM-based application: GitHub Copilot code completions.

Others have done an impressive job of cataloging our work from the outside. Now, we’re excited to share some of the thought processes that have led to the ongoing success of GitHub Copilot.

Let’s jump in.

Everything you need to know about prompt engineering in 1600 tokens or less

You know when you’re tapping out a text message on your phone, and in the middle of the screen just above the keypad, there’s a button you can click to accept a suggested next word? That’s pretty much what an LLM is doing—but at scale.

A GIF showing autocomplete functionality in iOS.
An example of iMessage’s text prediction feature.

Instead of text on your phone, an LLM works to predict the next best group of letters, which are called “tokens.” And in the same way that you can keep tapping that middle button to complete your text message, the LLM completes a document by predicting the next word. It will continue to do that over and over, and it will only stop once it has reached a maximum threshold of tokens or once it has encountered a special token that signals “Stop! This is the end of the document.”
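
To make that loop concrete, here is a toy sketch of the decoding process described above. The model object and its nextToken call are hypothetical stand-ins, not a real LLM API.

// Toy sketch of LLM document completion: keep predicting tokens until a limit
// or a special stop token is reached. `model.nextToken` is a hypothetical call.
function completeDocument(model, promptTokens, maxNewTokens, stopToken) {
  const tokens = [...promptTokens];
  for (let i = 0; i < maxNewTokens; i++) {
    const next = model.nextToken(tokens); // predict the most likely next token
    if (next === stopToken) break;        // "Stop! This is the end of the document."
    tokens.push(next);
  }
  return tokens;
}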

There’s an important difference, though. The language model in your phone is pretty simple—it’s basically saying, “Based only upon the last two words entered, what is the most likely next word?” In contrast, an LLM produces an output that’s more akin to being “based upon the full content of every document ever known to exist in the public domain, what is the most likely next token in your document?” By training such a large, well-architected model on an enormous dataset, an LLM can almost appear to have common sense such as understanding that a glass ball sitting on a table might roll off and shatter.

A screenshot of ChatGPT answering a question about the danger of setting a round glass ball on a small table.
Example of an LLM’s awareness or “common sense” due to its training.

But be warned: LLMs will also sometimes confidently produce information that isn’t real or true, which are typically called “hallucinations” or “fabulations.” LLMs can also appear to learn how to do things they weren’t initially trained to do. Historically, natural language models have been created for one-off tasks, like classifying the sentiment of a tweet, extracting the business entities from an email, or identifying similar documents, but now you can ask AI tools like ChatGPT to perform a task that it was never trained to do.

A screenshot of ChatGPT answering a prompt to create a chicken-based limerick.
John conversing with ChatGPT about serious things.

Building applications using LLMs

A document completion engine is a far cry from the amazing proliferation of LLM applications that are springing up every day, running the gamut from conversational search and writing assistants to automated IT support and code completion tools, like GitHub Copilot. But how is it possible that all of these tools can come from what is effectively a document completion tool? The secret is that any application that uses an LLM is actually mapping between two domains: the user domain and the document domain.

A graphic showing how LLMs work and the processes behind them to determine context before giving an answer.
Diagram of the user flow when communicating with an LLM, in this case, Dave’s user flow.

On the left is the user. His name is Dave, and he has a problem. It’s the day of his big World Cup watch party, and the Wi-Fi is out. If they don’t get it fixed soon, he’ll be the butt of his friends’ jokes for years. Dave calls his internet provider and gets an automated assistant. Ugh! But imagine that we are implementing the automated assistant as an LLM application. Can we help him?

The key here is to figure out how to convert from user domain into document domain. For one thing, we will need to transcribe the user’s speech into text. As soon as the automated support agent says “Please state the nature of your cable-related emergency,” Dave blurts out:

Oh it’s awful! It’s the World Cup finals. My TV was connected to my Wi-Fi, but I bumped the counter and the Wi-Fi box fell off and broke! Now, we can’t watch the game.

At this point, we have text, but it’s not of much use. Maybe you would imagine that this was part of a story and continue it, “I guess, I’ll call up my brother and see if we can watch the game with him.” An LLM with no context will similarly create the continuation of Dave’s story. So, let’s give the LLM some context and establish what type of document this is:

### ISP IT Support Transcript:

The following is a recorded conversation between an ISP customer, Dave Anderson, and Julia Jones, IT support expert. This transcript serves as an example of the excellent support provided by Comcrash to its customers.

*Dave: Oh it's awful! This is the big game day. My TV was connected to my Wi-Fi, but I bumped the counter and the Wi-Fi box fell off and broke! Now we can't watch the game.
*Julia:

Now, if you found this pseudo document on the ground, how would you complete it? Based on the extra context, you would see that Julia is an IT support expert, and apparently a really good one. You would expect the next words to be sage advice to help Dave with his problem. It doesn’t matter that Julia doesn’t exist, and this wasn’t a recorded conversation—what matters is that these extra words offer more context for what a completion might look like. An LLM does the same exact thing. After reading this partial document, it will do its best to complete Julia’s dialogue in a helpful manner.

But there’s more we can do to make the best document for the LLM. The LLM doesn’t know a whole lot about cable TV troubleshooting. (Well, it has read every manual and IT document ever published online, but stay with me here). Let’s assume that its knowledge is lacking in this particular domain. One thing we can do is search for extra content that might help Dave and place it into the document. Let’s assume that we have a complaints search engine that allows us to find documentation that has been helpful in similar situations in the past. Now, all we have to do is weave this information into our pseudo document in a natural place.

Continuing from above:

*Julia:(rifles around in her briefcase and pulls out the perfect documentation for Dave's request)
Common internet connectivity problems ...
<...here we insert 1 page of text that comes from search results against our customer support history database...>
(After reading the document, Julia makes the following recommendation)
*Julia:

Now, given this full body of text, the LLM is conditioned to make use of the implanted documentation, and in the context of “a helpful IT expert,” the model will generate a response. This reply takes into account the documentation as well as Dave’s specific request.

The last step is to move from the document domain into the user’s problem domain. For this example, that means just converting text to voice. And since this is effectively a chat application, we would go back and forth several times between the user and the document domain, making the transcript longer each time.

This, at the core of the example, is prompt engineering. In the example, we crafted a prompt with enough context for the AI to produce the best possible output, which in this case was providing Dave with helpful information to get his Wi-Fi up and running again. In the next section, we’ll take a look at how we at GitHub have refined our prompt engineering techniques for GitHub Copilot.

The art and science of prompt engineering

Converting between the user domain and document domain is the realm of prompt engineering—and since we’ve been working on GitHub Copilot for over two years, we’ve started to identify some patterns in the process.

These patterns have helped us formalize a pipeline, and we think it is an applicable template to help others better approach prompt engineering for their own applications. Now, we’ll demonstrate how this pipeline works by examining it in the context of GitHub Copilot, our AI pair programmer.

The prompt engineering pipeline for GitHub Copilot

From the very beginning, GitHub Copilot’s LLMs have been built on AI models from OpenAI that have continued to get better and better. But what hasn’t changed is the answer to the central question of prompt engineering: what kind of document is the model trying to complete?

The OpenAI models we use have been trained to complete code files on GitHub. Ignoring some filtering and stratification steps that don’t really change the prompt engineering game, this distribution is pretty much that of individual file contents according to the most recent commit to main at data collection time.

The document completion problem the LLM solves is about code, and GitHub Copilot’s task is all about completing code. But the two are very different.

Here are some examples:

  • Most files committed to main are finished. For one, they usually compile. Most of the time the user is typing, the code does not compile because of incompletions that will be fixed before a commit is pushed.
  • The user might even write their code in hierarchical order, method signatures first, then bodies rather than line by line or in a mixed style.
  • Writing code means jumping around. In particular, people’s edits often require them to jump up in the document and make a change there, for example, adding a parameter to a function. Strictly speaking, if Codex suggests using a function that has not been imported yet, no matter how much sense it might make, that’s a mistake. But as a GitHub Copilot suggestion, it would be useful.

The issue is that merely predicting the most likely continuation based on the text in front of the cursor to make a GitHub Copilot suggestion would be a wasted opportunity. That’s because it ignores an incredible wealth of context. We can use that context to guide the suggestion, like metadata, the code below the cursor, the content of imports, the rest of the repository, or issues, and create a strong prompt for the AI assistant.

Software development is a deeply interconnected, multimodal challenge, and the more of that complexity we can tame and present to the model, the better your completions are going to be.

Step 1: Gathering context

GitHub Copilot lives in the context of an IDE such as Visual Studio Code (VS Code), and it can use whatever it can get the IDE to tell it—only if the IDE is quick about it though. In an interactive environment like GitHub Copilot, every millisecond matters. GitHub Copilot promises to take care of the common coding tasks, and if it wants to do that, it needs to display its solution to the developer before they have started to write more code in their IDE. Our rough heuristics say that for every additional 10 milliseconds we take to come up with a suggestion, the chance it’ll arrive in time decreases by one percent.

So, what can we say quickly? Well, here’s an example. Consider this suggestion to a simple piece of Python:

A developer prompting GitHub Copilot to write a simple function in Python to compute Fibonacci numbers.

Wrong! Turns out the user actually wanted to write Ruby, like this:

A developer using GitHub Copilot to write a simple function to compute Fibonacci numbers in Ruby.

The two languages have similar enough syntax so that only a couple of lines can be ambiguous, especially when it’s toward the beginning of the file where much of what we encounter are boilerplate comments. But modern IDEs such as VS Code typically know what language the user is writing in. That makes language mix ups especially annoying to the user because they break the implicit expectation that “the computer should know” (after all, most IDEs highlight language syntax).

So, let’s put the language metadata into our pile of context we might want to include. In fact, let’s add the whole filename too. If it’s available, it usually implies the language through its extension, and additionally sets the tone for what to expect in that file—small, easy pieces of information that won’t turn the tide but are helpful to include.

On the other end of the spectrum, there’s the rest of the repository. Say you’ve got a file that defines an abstract class DataReader. And you have another that defines a subclass CsvReader. And you’re now writing a new file defining another subclass SqlReader. Chances are that to write the new file, you’ll want to check out both existing files as well because they communicate useful background into what you need to implement and how to do it. Typically, developers keep such files open in different tabs and switch to remind themselves of definitions, examples, similar patterns, or tests.

If the content of those two files is useful to you, chances are it would be useful to the AI as well. So, let’s add it as context! After all, the IDE knows what other files from the repository are open as tabs in the same window. The repository might have hundreds or even thousands of files, but only some will be open, and that is a strong hint that they might be useful to what they’re doing right now. Of course, “some” can mean a lot of things, so we don’t consider any more than the 20 most recent tabs.

Step 2: Snippeting

Irrelevant information in an LLM’s context decreases its accuracy. Additionally, source code tends to be long, so even a single file is not guaranteed to fit completely into an LLM’s context window (a problem that occurs roughly a fifth of the time). So, unless the user is very frugal about their tab usage, we simply cannot include all the tabs.

It’s important to be selective about what code to include from other files, so we cut files into (hopefully) natural, overlapping snippets that are no longer than 60 lines. Of course, we don’t want to actually include all overlapping snippets—that’s why we score them and take only the best. In this case, the “score” is meant to reflect relevance. To determine a snippet’s score, we use the Jaccard similarity, a stat that can be used to gauge the similarity or diversity of sample sets. (It’s also super fast to compute, which is great for reducing latency.)

Step 3: Dressing them up

Now we have some context we’d like to pass on to the model. But how? Codex and other models don’t offer an API where you can add other files, or where you can specify the document’s language and filename for that matter. They complete one single document. As mentioned above, you’ll need to inject your context into that document in a natural way.

The path and name might be easiest. Many files start with a preamble that gives some metadata, like author, project name, or filename. So, we’ll pretend this is happening here as well, and add a line at the very top that reads something like # filepath: foo/bar.py or // filepath: foo.bar.js, depending on comment syntax in the file’s language.

Sometimes the path isn’t known, like with new files that haven’t yet been saved. Even then, we could try to at least specify the language, provided the IDE is aware of it. For many languages, we have the opportunity to include shebang lines like #!/usr/bin/python or #!/usr/bin/node. That’s a neat trick that works pretty well at warding against mistaken language identity. But it’s also a bit dangerous since files with shebang lines are a biased subpopulation of all code. So, let’s do it for short files where the danger of mistaken language identity is high, and avoid it for larger or named files.

If comments work as a delivery system for tiny nuggets of information, like path or language, we can also make them work as delivery systems for the chunky deep dives that are 60 lines of related code.

Comments are versatile, and commented-out code exists all over GitHub. Let’s look at some of the most common examples:

  1. Old code that doesn’t apply anymore
  2. Deleted features
  3. Earlier versions of current code
  4. Example code specifically left there for documentation purposes
  5. Code lifted from other parts of the codebase

Let’s take our inspiration from the last two groups. Familiarity with groups (1) – (3) makes things a bit easier on the model, but our snippets aim to emulate groups (4) and (5):

# compare this snippet from utils/concatenate.py:
# def crazy_concat(a, b):
#     return str(a) + str(b)[::-1]

Note that including the file name and path of the snippet source can be useful. And combined with the current file’s path, this might guide completions referencing imports.
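
As a rough sketch, the wrapping step could look something like this (the helper below is illustrative, not the actual prompt-assembly code):

def snippet_as_comment(snippet, source_path, comment_prefix="#"):
    # Wrap a snippet lifted from a neighboring file into a block of comments,
    # prefixed with a line that names its source file.
    lines = [f"{comment_prefix} compare this snippet from {source_path}:"]
    lines += [f"{comment_prefix} {line}".rstrip() for line in snippet.splitlines()]
    return "\n".join(lines)

print(snippet_as_comment(
    "def crazy_concat(a, b):\n    return str(a) + str(b)[::-1]",
    "utils/concatenate.py",
))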

Step 4: Prioritization

So far, we have grabbed many pieces of context from many sources: the text directly above the cursor, text below the cursor, text in other files, and metadata like language and file path.

In the vast majority of cases (around 95%), we have to make the tough choice of what we can or cannot include.

We make that choice by thinking of the items we might include as “wishes.” Each time we uncover a piece of context, like a commented out snippet from an open tab, we make a wish. Wishes come with some priority attached, for example, the shebang lines have rather low priorities. Snippets with a low similarity score are barely higher. In contrast, the lines directly above the cursor have maximum priority. Wishes also come with a desired position in the document. The shebang line needs to be the very first item, while the text directly above the cursor comes last—it should directly precede the LLM’s completion.

The fastest way of selecting which wishes to fill and which ones to discard is by sorting that wishlist by priority. Then, we can keep deleting the lowest priority wishes until what remains fits in the context window. We then sort again by the intended order in the document and paste everything together.
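
In code, that selection boils down to two sorts and a greedy cut. Here is a minimal sketch (the Wish fields and the token-counting function are simplified assumptions, not the real prompt library):

from dataclasses import dataclass

@dataclass
class Wish:
    text: str      # the piece of context we would like to include
    priority: int  # higher = more important to keep
    position: int  # lower = earlier in the assembled document

def assemble_prompt(wishes, budget, count_tokens=lambda s: len(s.split())):
    # Keep dropping the lowest-priority wish until everything left fits,
    # then lay the survivors out in their intended document order.
    kept = sorted(wishes, key=lambda w: w.priority, reverse=True)
    while kept and sum(count_tokens(w.text) for w in kept) > budget:
        kept.pop()
    kept.sort(key=lambda w: w.position)
    return "\n".join(w.text for w in kept)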

Step 5: The AI does its thing

Now that we’ve assembled an informative prompt, it’s time for the AI to come up with a useful completion. We have always faced a very delicate tradeoff here—GitHub Copilot needs to use a highly capable model because quality makes all the difference between a useful suggestion and a distraction. But at the same time, it needs to be a model capable of speed, because latency makes all the difference between a useful suggestion and not being able to provide a suggestion at all.

So, which AI should we choose to “do its thing” on the completion task: the fastest or the most accurate one? It’s hard to know in advance, so OpenAI developed a fleet of models in collaboration with GitHub. We put two different models in front of developers but found that people got the most mileage (in terms of accepted and retained completions) out of the much faster model. Since then, further optimizations have increased model speed significantly, so that the current version of GitHub Copilot is backed by an even more capable model.

Step 6: Now, over to you!

The generative AI produces a string and, if it’s not stopped, it keeps producing until it predicts the end of the file. That would waste time and compute resources, so you need to set up “stop” criteria.

The most common stop criterion is actually looking for the first line break. In many situations, it seems likely that a software developer wants the current line to be finished, but not more. But some of the most magical contributions by GitHub Copilot are when it suggests multiple lines of code all at once.

Multi-line completions feel natural when they’re about a single semantic unit, such as the body of a function, an if-branch, or a class. GitHub Copilot looks for cases where such a block is being started, either because the developer has just written the start, such as the header, if guard, or class declaration, or is currently writing the start. If the block body appears to be empty, it will attempt to make a suggestion for it, and only stop when the block appears to be done.
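
As a simplified picture of that stop decision (the block tracking below is a stand-in for the real detection logic, which inspects the syntax of the surrounding code):

def should_stop(completion_so_far, multi_line, open_block_depth):
    # Single-line mode: cut the stream at the first line break.
    if not multi_line:
        return "\n" in completion_so_far
    # Multi-line mode: keep streaming until the block we set out to fill
    # (a function body, an if-branch, a class, ...) appears to be closed again.
    return "\n" in completion_so_far and open_block_depth == 0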

This is the point when the suggestion gets surfaced to the coder. And the rest, as they say, is ~~history~~ 10x development.

If you’re interested in learning more about prompt engineering in general and how you can refine your own techniques, check out our guide on getting started with GitHub Copilot.

GitHub Availability Report: June 2023

Post Syndicated from Jakub Oleksy original https://github.blog/2023-07-12-github-availability-report-june-2023/

In June, we experienced two incidents that resulted in degraded performance across GitHub services. 

June 7 16:11 UTC (lasting 2 hours 28 minutes)

On June 7 at 16:11 UTC, GitHub started experiencing increasing delays in an internal job queue used to process Git pushes. Our monitoring systems alerted our first responders after 19 minutes. During this incident, customers experienced GitHub Actions workflow run and webhook delays as long as 55 minutes, and pull requests did not accurately reflect new commits.

We immediately began investigating and found that the delays were caused by a customer making a large number of pushes to a repository with a specific data shape. The jobs processing these pushes became throttled when communicating with the Git backend, leading to increased job execution times. These slow jobs exhausted a worker pool, starving the processing of pushes for other repositories. Once the source was identified and temporarily disabled, the system gradually recovered as the backlog of jobs was completed. To prevent a recurrence, we updated the Git backend’s throttling behavior to fail faster and reduced the Git client timeout within the job to prevent it from hanging. We have additional repair items in place to reduce the time it takes to detect, diagnose, and recover from similar issues.

June 29 14:50 UTC (lasting 32 minutes)

On June 29, starting from 17:39 UTC, GitHub was down in parts of North America, particularly the US East coast and South America, for approximately 32 minutes.

GitHub takes measures to ensure that we have redundancy in our system for various disaster scenarios. We have been working on building redundancy to an earlier single point of failure in our network architecture at a second Internet edge facility. This facility was completed in January and has been actively routing production traffic since then in a high availability (HA) architecture alongside the first edge facility. As part of the facility validation steps, we performed a live failover test in order to verify that we could use this second Internet edge facility if the primary were to fail. Unfortunately, during this failover test we inadvertently caused a production outage.

The test exposed a network path configuration issue on the secondary side that prevented it from properly functioning as a primary, which resulted in the outage. This has since been fixed. We were immediately notified of the issue and, within two minutes of being alerted, we reverted the change and brought the primary facility back online. Once online, it took time for traffic to be rebalanced and for our border routers to reconverge, restoring public connectivity to GitHub systems.

This failover test helped expose the configuration issue, and we are addressing the gaps in both configuration and our failover testing, which will help make GitHub more resilient. We recognize the severity of this outage and the importance of keeping GitHub available. Moving forward, we will continue our commitment to high availability, improving these tests and scheduling them in a way where potential customer impact is minimized as much as possible.


Please follow our status page for real-time updates on status changes. To learn more about what we’re working on, check out the GitHub Engineering Blog.

GitHub CLI project command is now generally available!

Post Syndicated from Ariel Deitcher original https://github.blog/2023-07-11-github-cli-project-command-is-now-generally-available/

Effective planning and tracking is essential for developer teams of all shapes and sizes. Last year, we announced the general availability of GitHub Projects, connecting your planning directly to the work your teams are doing in GitHub. Today, we’re making GitHub Projects faster and more powerful. The project command for the gh CLI is now generally available!

In this blog, we’ll take a look at how to get started with the new command, share some examples you can try on the command line and in GitHub Actions, and list the steps to upgrade from the archived gh-projects extension. Let’s take a look at how you can conveniently manage and collaborate on GitHub Projects from the command line.

The components of GitHub Projects

Let’s start by familiarizing ourselves with the key components of GitHub Projects. A project is made up of three components—the Project, Project field(s), and Project item(s).

A Project belongs to an owner (which can be either a user or an organization), and is identified by a project number. As an example, the GitHub public roadmap project is number 4247 in the github organization. We’ll use this project in some of our examples later on.

Project fields belong to a Project and have a type such as Status, Assignee, or Number, while field values are set on an item. See understanding fields for more details.

Project items are one of type draft issue, issue, or pull request. An item of type draft issue belongs to a single Project, while items of type issue and pull request can be added to multiple projects.

These three components make up the subcommands of gh project, for example:

  • Project subcommands include: create, copy, list, and view.
  • Project field subcommands include: field-create, field-list, and field-delete.
  • Project item subcommands include: item-add, item-edit, item-archive, and item-list.

For the full list of project commands, check out the manual.

Permissions check

In order to get started with the new command, you’ll need to ensure you have the right permissions. The project command requires the project auth scope, which isn’t part of the default scopes of the gh auth token.

In your terminal, you can check your current scopes with this command:

$ gh auth status
github.com
✓ Logged in to github.com as mntlty (keyring)
✓ Git operations for github.com configured to use https protocol.
✓ Token: gho_************************************
✓ Token scopes: gist, read:org, repo, workflow

If you don’t see project in the list of token scopes, you can add it by following the interactive prompts from this command:

$ gh auth refresh -s project

In GitHub Actions, you must choose one of the options from the documentation to make a token with the project scope available.

Running project commands

Now that you have the permissions you need, let’s look at some examples of running project commands using my user and the GitHub public roadmap project, which you can adapt to your team’s use cases.

List the projects owned by the current user (note that no --owner flag is set):

$ gh project list
NUMBER TITLE STATE ID
1 my first project open PVT_kwxxx
2 @mntlty's second project open PVT_kwxxx

Create a project owned by mntlty:

$ gh project create --owner mntlty --title 'my project'

View the GitHub public roadmap project:

$ gh project view --owner github 4247

Title
GitHub public roadmap
## Description
--
## Visibility
Public
## URL
https://github.com/orgs/github/projects/4247
## Item count
208
## Readme
--
## Field Name (Field Type)
Title (ProjectV2Field)
Assignees (ProjectV2Field)
Status (ProjectV2SingleSelectField)
Labels (ProjectV2Field)
Repository (ProjectV2Field)
Milestone (ProjectV2Field)
Linked pull requests (ProjectV2Field)
Reviewers (ProjectV2Field)
Tracks (ProjectV2Field)
Tracked by (ProjectV2Field)

List the items in the GitHub public roadmap project:

$ gh project item-list --owner github 4247

TYPE   TITLE                                                                    NUMBER  REPOSITORY      ID
Issue  Kotlin security analysis support in CodeQL code scanning (public beta)  207     github/roadmap  PVTI_lADNJr_NE13OAALQgw
Issue  Swift security analysis support in CodeQL code scanning (beta)          206     github/roadmap  PVTI_lADNJr_NE13OAALQhA
Issue  Fine-grained PATs (v2 PATs) - [Public Beta]                             184     github/roadmap  PVTI_lADNJr_NE13OAALQmw

Copy the GitHub public roadmap project structure to a new project owned by mntlty:

$ gh project copy 4247 --source-owner github --target-owner mntlty --title 'my roadmap'

https://github.com/users/mntlty/projects/1

Note that if you are using a TTY and do not pass a --owner flag or the project number argument to a command which requires those values, an interactive prompt will be shown from which you can select those values.

JSON format

Now, let’s look at how to format the command output in JSON, which displays more information for use in scripting, automation, and piping into other commands. Every project subcommand supports outputting to JSON format by setting the --format=json flag:

$ gh project view --owner github 4247 --format=json
{"number":4247,"url":"https://github.com/orgs/github/projects/4247","shortDescription":"","public":true,"closed":false,"title":"GitHub public roadmap","id":"PVT_kwDNJr_NE10","readme":"","items":{"totalCount":208},"fields":{"totalCount":10},"owner":{"type":"Organization","login":"github"}}

Combining JSON formatted output with a tool such as jq enables you to unlock even more capabilities. For example, you can create a list of the URLs from all of the Issues on the GitHub public roadmap project that have status “Future”:

$ gh project item-list --owner github 4247 --format=json | jq '.items[] | select(.status=="Future" and .content.type == "Issue") | .content.url'

"https://github.com/github/roadmap/issues/188"
"https://github.com/github/roadmap/issues/187"
"https://github.com/github/roadmap/issues/166"

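If jq isn’t your tool of choice, the same filtering can be done from a short script. Here is a minimal sketch that shells out to gh and filters in Python (it assumes gh is installed and authenticated with the project scope, and that the item limit is raised high enough to cover the whole project):

import json
import subprocess

result = subprocess.run(
    ["gh", "project", "item-list", "--owner", "github", "4247",
     "--format=json", "--limit", "500"],
    capture_output=True, text=True, check=True,
)
items = json.loads(result.stdout)["items"]

# URLs of all issues on the roadmap whose status is "Future".
future_issue_urls = [
    item["content"]["url"]
    for item in items
    if item.get("status") == "Future" and item["content"]["type"] == "Issue"
]
print("\n".join(future_issue_urls))
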
GitHub Actions

You can also level up your team’s usage of GitHub Projects with project commands in your GitHub Actions workflows to enhance automation, generate on demand reports, and react to events such as when a project item is modified. For example, you can create a workflow which is triggered by a workflow_dispatch event and will close all projects that are owned by mntlty and which have no items:

on: 
  workflow_dispatch:

jobs:
  close_empty:
    runs-on: ubuntu-latest
    env:
      GH_TOKEN: ${{ secrets.PROJECT_TOKEN }}
    steps:
      - run: |
          gh project list --owner mntlty --format=json \
          | jq '.projects[] | select(.items.totalCount == 0) | .number' \
          | xargs -n1 gh project close --owner mntlty 

The latest version of gh is automatically available in the GitHub Actions environment. For more information on using GitHub Actions, see https://docs.github.com/en/actions.

Upgrading from the gh-projects extension

Now that the project command is officially part of the CLI, the gh-projects extension repository has been archived. If you’re currently using the extension, you don’t need to change anything. You can continue installing and using the gh-projects extension; however, it won’t receive any future enhancements. Fortunately, it’s very simple to make the transition from the gh-projects extension to the project command:

  • Upgrade to the latest version of gh.
  • Replace flags for --user and --org with --owner in project commands. owner is the login of the project owner, which is either a user or an organization.
  • Replace gh projects with gh project.

To avoid confusion, I also recommend removing the extension by running the following command:

$ gh ext remove gh-projects

Thank you to the community, @mislav, @samcoe, and @vilmibm for providing invaluable feedback and support on gh-projects!

Get started with GitHub CLI project command today

If you’re interested in learning more or giving us feedback, check out these links:

Upgrade to the latest version of the gh CLI to level up your usage of GitHub Projects!

Zero traffic cost for Kafka consumers

Post Syndicated from Grab Tech original https://engineering.grab.com/zero-traffic-cost

Introduction

Coban, Grab’s real-time data streaming platform team, has been building an ecosystem around Kafka, serving all Grab verticals. Along with stability and performance, one of our priorities is also cost efficiency.

In this article, we explain how the Coban team has substantially reduced Grab’s annual cost for data streaming by enabling Kafka consumers to fetch from the closest replica.

Problem statement

The Grab platform is primarily hosted on AWS cloud, located in one region, spanning over three Availability Zones (AZs). When it comes to data streaming, both the Kafka brokers and Kafka clients run across these three AZs.

Figure 1 – Initial design, consumers fetching from the partition leader

Figure 1 shows the initial design of our data streaming platform. To ensure high availability and resilience, we configured each Kafka partition to have three replicas. We have also set up our Kafka clusters to be rack-aware (i.e. 1 “rack” = 1 AZ) so that all three replicas reside in three different AZs.

The problem with this design is that it generates staggering cross-AZ network traffic. This is because, by default, Kafka clients communicate only with the partition leader, which has a 67% probability of residing in a different AZ from the client: with replicas spread evenly across three AZs, the leader lands in the client’s own AZ only one-third of the time.

This is a concern as we are charged for cross-AZ traffic as per AWS’s network traffic pricing model. With this design, our cross-AZ traffic amounted to half of the total cost of our Kafka platform.

The Kafka cross-AZ traffic for this design can be broken down into three components as shown in Figure 1:

  • Producing (step 1): Typically, a single service produces data to a given Kafka topic. Cross-AZ traffic occurs when the producer does not reside in the same AZ as the partition leader it is producing data to. This cross-AZ traffic cost is minimal, because the data is transferred to a different AZ at most once (excluding retries).
  • Replicating (step 2): The ingested data is replicated from the partition leader to the two partition followers, which reside in two other AZs. The cost of this is also relatively small, because the data is only transferred to a different AZ twice.
  • Consuming (step 3): Most of the cross-AZ traffic occurs here because there are many consumers for a single Kafka topic. Similar to the producers, the consumers incur cross-AZ traffic when they do not reside in the same AZ as the partition leader. However, on the consuming side, cross-AZ traffic can occur as many times as there are consumers (on average, two-thirds of the number of consumers). The solution described in this article addresses this particular component of the cross-AZ traffic in the initial design.

Solution

Kafka 2.3 introduced the ability for consumers to fetch from partition replicas. This opens the door to a more cost-efficient design.

Figure 2 – Target design, consumers fetching from the closest replica

Step 3 of Figure 2 shows how consumers can now consume data from the replica that resides in their own AZ. Implementing this feature requires rack-awareness and extra configurations for both the Kafka brokers and consumers. We will describe this in the following sections.

The Coban journey

Kafka upgrade

Our journey started with the upgrade of our legacy Kafka clusters. We decided to upgrade them directly to version 3.1, in favour of capturing bug fixes and optimisations over version 2.3. This was a safe move as version 3.1 was deemed stable for almost a year and we projected no additional operational cost for this upgrade.

To perform an online upgrade with no disruptions for our users, we broke down the process into three stages.

  • Stage 1: Upgrading Zookeeper. All versions of Kafka are tested by the community with a specific version of Zookeeper. To ensure stability, we followed this same process. The upgraded Zookeeper would be backward compatible with the pre-upgrade version of Kafka which was still in use at this early stage of the operation.
  • Stage 2: Rolling out the upgrade of Kafka to version 3.1 with an explicit backward-compatible inter-broker protocol version (inter.broker.protocol.version). During this progressive rollout, the Kafka cluster is temporarily composed of brokers with heterogeneous Kafka versions, but they can communicate with one another because they are explicitly set up to use the same inter-broker protocol version. At this stage, we also upgraded Cruise Control to a compatible version, and we configured Kafka to import the updated cruise-control-metrics-reporter JAR file on startup.
  • Stage 3: Upgrading the inter-broker protocol version. This last stage makes all brokers use the most recent version of the inter-broker protocol. During the progressive rollout of this change, brokers with the new protocol version can still communicate with brokers on the old protocol version.

Configuration

Enabling Kafka consumers to fetch from the closest replica requires a configuration change on both Kafka brokers and Kafka consumers. They also need to be aware of their AZ, which is done by leveraging Kafka rack-awareness (1 “rack” = 1 AZ).

Brokers

In our Kafka brokers’ configuration, we already had broker.rack set up to distribute the replicas across different AZs for resiliency. Our Ansible role for Kafka automatically sets it with the AZ ID that is dynamically retrieved from the EC2 instance’s metadata at deployment time.

- name: Get availability zone ID
  uri:
    url: http://169.254.169.254/latest/meta-data/placement/availability-zone-id
    method: GET
    return_content: yes
  register: ec2_instance_az_id

Note that we use AWS AZ IDs (suffixed az1, az2, az3) instead of the typical AWS AZ names (suffixed 1a, 1b, 1c) because the latter’s mapping is not consistent across AWS accounts.

Also, we added the new replica.selector.class parameter, set with value org.apache.kafka.common.replica.RackAwareReplicaSelector, to enable the new feature on the server side.

Consumers

On the Kafka consumer side, we mostly rely on Coban’s internal Kafka SDK in Golang, which streamlines how service teams across all Grab verticals utilise Coban Kafka clusters. We have updated the SDK to support fetching from the closest replica.

Our users only have to export an environment variable to enable this new feature. The SDK then dynamically retrieves the underlying host’s AZ ID from the host’s metadata on startup, and sets a new client.rack parameter with that information. This is similar to what the Kafka brokers do at deployment time.
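
Outside of the SDK, the same idea can be applied directly with a Kafka client library. For instance, a rough sketch with the confluent-kafka Python client (broker address, group, and topic are placeholders) could look like this:

import requests
from confluent_kafka import Consumer

# Retrieve the host's AZ ID from the EC2 instance metadata service,
# mirroring what the brokers do at deployment time.
az_id = requests.get(
    "http://169.254.169.254/latest/meta-data/placement/availability-zone-id",
    timeout=1,
).text

consumer = Consumer({
    "bootstrap.servers": "kafka.example.internal:9092",  # placeholder
    "group.id": "my-consumer-group",                     # placeholder
    "client.rack": az_id,  # lets brokers serve fetches from the same-AZ replica
})
consumer.subscribe(["my-topic"])  # placeholder topic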

We have also implemented the same logic for our non-SDK consumers, namely Flink pipelines and Kafka Connect connectors.

Impact

We rolled out fetching from the closest replica at the turn of the year and the feature has been progressively rolled out on more and more Kafka consumers since then.

Figure 3 – Variation of our cross-AZ traffic before and after enabling fetching from the closest replica

Figure 3 shows the relative impact of this change on our cross-AZ traffic, as reported by AWS Cost Explorer. AWS charges cross-AZ traffic on both ends of the data transfer, thus the two data series. On the Kafka brokers’ side, less cross-AZ traffic is sent out, thereby causing the steep drop in the dark green line. On the Kafka consumers’ side, less cross-AZ traffic is received, causing the steep drop in the light green line. Hence, both ends benefit by fetching from the closest replica.

Throughout the observation period, we maintained a relatively stable volume of data consumption. However, after three months, we observed a substantial 25% drop in our cross-AZ traffic compared to December’s average. This translated directly into cost savings, since our cross-AZ cost scales linearly with cross-AZ traffic volume.

Caveats

Increased end-to-end latency

After enabling fetching from the closest replica, we observed an increase of up to 500ms in end-to-end latency from producer to consumers. Though this is expected by design, it makes the new feature unsuitable for Grab’s most latency-sensitive use cases. For those, we retained the traditional design whereby consumers fetch directly from the partition leaders, even when they reside in different AZs.

Figure 4 – End-to-end latency (99th percentile) of one of our streams, before and after enabling fetching from the closest replica

Inability to gracefully isolate a broker

We have also verified the behaviour of Kafka clients during a broker rotation; a common maintenance operation for Kafka. One of the early steps of our corresponding runbook is to demote the broker that is to be rotated, so that all of its partition leaders are drained and moved to other brokers.

In the traditional architecture design, Kafka clients only communicate with the partition leaders, so demoting a broker gracefully isolates it from all of the Kafka clients. This ensures that the maintenance is seamless for them. However, by fetching from the closest replica, Kafka consumers still consume from the demoted broker, as it keeps serving partition followers. When the broker effectively goes down for maintenance, those consumers are suddenly disconnected. To work around this, they must handle connection errors properly and implement a retry mechanism.

Potentially skewed load

Another caveat we have observed is that the load on the brokers is directly determined by the location of the consumers. If they are not well balanced across all three AZs, the load on the brokers is similarly skewed. At times, new brokers can be added to support an increasing load in one AZ. However, it is undesirable to remove brokers from the less loaded AZs, as more consumers can suddenly relocate there at any time. Carrying these additional brokers while brokers in the other AZs sit underutilised also hurts cost efficiency.

Figure 5 – Average CPU utilisation by AZ of one of our critical Kafka clusters

Figure 5 shows the CPU utilisation by AZ for one of our critical Kafka clusters. The skew is visible after 01/03/2023. To better manage this skew in load across AZs, we have updated our SDK to expose the AZ as a new metric. This allows us to monitor the skewness of the consumers and take measures proactively, for example, moving some of them to different AZs.

What’s next?

We have implemented the feature to fetch from the closest replica on all our Kafka clusters and all Kafka consumers that we control. This includes internal Coban pipelines as well as the managed pipelines that our users can self-serve as part of our data streaming offering.

We are now evangelising and advocating for more of our users to adopt this feature.

Beyond Coban, other teams at Grab are also working to reduce their cross-AZ traffic, notably, Sentry, the team that is in charge of Grab’s service mesh.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Accessibility considerations behind code search and code view

Post Syndicated from milemons original https://github.blog/2023-07-06-accessibility-considerations-behind-code-search-and-code-view/

GitHub prides itself on being the home for all developers, including developers with disabilities. Accessibility is a core priority for all new projects at GitHub, so it was top of mind when we started our project to rework code search and the code view at GitHub. With the old code view, some developers preferred to look at raw code rather than code view due to accessibility barriers, and that’s what we set out to fix. This blog post will shed light on our process and three major tasks that required thoughtful design and implementation—code search and query builder, the file tree, and navigating code by keyboard.

Process

We worked with a team of experts at GitHub dedicated to accessibility, since while we are confident in our development skills, accessibility has many nuances and requires a large depth of knowledge to get right. This team helped us immensely in three main ways:

  1. External auditing
    In this project, we performed accessibility auditing on all of our features. This meant that another team of auditors reviewed our webpage with all the changes implemented to find accessibility errors. They used tools including screen readers, color contrast tools, and more, to identify areas that users with disabilities may find hard to use. Once those issues were identified, the accessibility team reviewed them and suggested a proper solution for each. Then, it was up to us to implement the solution and collaborate with the team where we needed additional assistance.
  2. Design reviews
    The user interface for code search and code view were both entirely redesigned to support our new goals—to allow users to search, navigate, and understand code in a way they weren’t able to before. As a part of the design process, a team of accessibility designers reviewed the Figma mockups to determine proper HTML markup and tab order. Then, they included detailed explanations of what interactions should look like in order to meet our accessibility goals.
  3. Office hours
    The accessibility team at GitHub hosts weekly meetings where engineers can sign up and discuss how to solve problems with their features with the team and a consultant. The consultant is incredibly helpful and knowledgeable about how to properly address issues because he has lived experience with screen readers and accessibility. During these meetings, we were able to discuss complicated issues such as proper filtering for code search, making an accessible file tree, and navigating code by keyboard, along with other issues like tab order and focus management across the whole feature.

Implementation details

This has been written about frequently on our blog—from one post on the details of QueryBuilder, our implementation of an accessible finder component, to details about what navigating code search accessibly looks like. Those two posts are great reads and I strongly recommend checking them out, but they’re not the only things we’ve worked on. Thanks to @lindseywild, @smockle, @kendallgassner, @owenniblock, and @khiga8 for their guidance and help resolving issues with code search. This is still a work in progress, especially in areas like the filtering behavior on the search pages.

Two areas that also required careful consideration were managing focus after changing the sort order of search results and how to announce the result of the search for screen reader users.

Changing sort—changing focus?

When we first implemented this sort dropdown for non-code search types, if you navigated to it with your keyboard and then selected a different sort option, the whole page would reload and your focus moved back to the header. Our preferred, accessible behavior is that, when a dropdown menu is closed, focus returns to the button that opened the menu. However, our problem was that the “Sort” dropdown didn’t merely perform client-side operations like most dropdowns where you select an option. In this case, once a user selects an option, we do a full page navigation to perform the new search with the sort order added. This meant that the focus was being placed back on the header by default after the page navigation, instead of returning to the sort dropdown. For sighted mouse users, this is a nonissue; for sighted keyboard users, it is unexpected and annoying; for blind users, this is confusing and makes the product hard to use. To fix this, we had to make sure we returned focus to the dropdown after reloading the page.

Screenshot of search results. I have searched “react” and the “Repositories” filter by item is selected. There are 2.2M results, found in 1 second. The "Sort By" dropdown is open with “Best match” selected. The other options are “Most stars”, “Fewest stars”, “Most forks”, “Fewest forks”, “Recently updated”, and “Least recently updated”.

Announcing search results

When a sighted user types in the search bar and presses ‘Enter’ they quickly receive feedback that the search has completed and whether or not they found any results by looking at the page. For screen reader users, though, how to give the feedback that there were results, how many, or if there was an error requires more thought. One solution could be to place focus on the first result. That has a number of problems though.

  1. The user will not receive feedback about the number of search results and may think there’s only one.
  2. The user may miss important context like the Single sign on banner.
  3. The user will have to tab through more elements if they want to get back.

Another solution could be to use aria-live announcements to read out the number of results and success. This has its own problems.

  1. We already have an announcement on page navigation to read out the new title and these two announcements would compete or cause a race condition.
  2. There isn’t a way to programmatically force announcements to be read in order on page load.
  3. Some screen reader users have aria-live announcements turned off since they can be annoying.

After some discussion, we decided to focus and read out the header with the number of search results after a search and allow users to explore for errors, banners, or results on their own once they know how many results they received.

A screenshot of search results showing 0 results and an error message saying, “Invalid repository name.” I have searched “repo:asdfghijkl." The “Code” filter item is selected. There is also some help text on the page that says “Your search did not match any code. However, we found 544k packages that matched your search query. Alternatively, try one of the tips below.

Tree navigation

We knew when redesigning the code viewing experience that we wanted to include the tree panel to allow users to quickly switch between folders or files, because understanding code often requires looking in multiple places. We started the project by building our own tree view to the ARIA spec, but it was too verbose. To fix this, our accessibility and design teams created the TreeView, which is open source. It supports generic list elements and uses proper HTML structure to make navigating through the tree a breeze, including typeahead (the ability to type a bit and have focus move to the first matching element), proper announcements for asynchronous loading of deeply nested items, and proper IDs for all elements that are guaranteed to be unique (that is, not constructed from their contents, which may be the same). The valid markup for a tree view is very specific, and developing it requires careful reviews for common issues, like invalid child item types under a role="group" element or nested list elements used without considering screen reader support, issues that most TreeView implementations on the internet get wrong. For more information about the design details for the tree view, check out the markdown documentation. Thanks to @colebemis, @mperrotti, and @ericwbailey for the work on this.

Reading and navigating code by keyboard

Before the new code view, reading code on GitHub wasn’t a great experience for screen reader users. The old experience presented code as a table—a hidden detail for sighted users but a major hurdle for screen readers, since code isn’t a table, and the semantics didn’t make sense in that context. For example, in code, whitespace has meaning. Whether stylistic or functional, that whitespace gets lost for screen reader users when the code is presented as a table. In addition, table semantics force users into line-by-line navigation, instead of character-by-character or word-by-word navigation. For many users, these hurdles meant that they mostly used GitHub just enough to be able to access the raw code or use a code editor for reading code. Since we want to support all our users, we knew we needed to totally rethink the way we structured code in the DOM.

This problem became even more complicated when we introduced symbols and the symbols panel. Mouse users are able to click on a “symbol” in the code—a special code element, like a function name—to see extra details about it, including references and definitions, in the symbol panel. They can then explore the code more deeply, navigating between lines of code where that symbol is found to investigate and understand the code better. This ability to dive deep has been game changing for many developers. However, for keyboard users, it doesn’t work. At best, a keyboard user can use the “Open symbols panel” button in the code header and then filter all symbol definitions for one they are interested in, but this doesn’t allow users to access symbol references when no definitions are found in a file. In addition, this flow really isn’t the same—if we want to support all developers, then we need to offer keyboard users a way to navigate through the code and select symbols they are interested in.

In addition, for many performance reasons mentioned in the post “Crafting a better, faster code view,” we introduced virtualization to the code viewer which created its own accessibility problems—not having all elements in the DOM can interfere with screen readers and overriding the cmd / ctrl + f shortcut is generally bad practice for screen readers as well. In addition, virtualization posed a problem for selecting text outside of the virtualization window.

This is when we came up with the solution of using a cursor from a <textarea> or a <div> that has the property contentEditable. This solution is modeled after Monaco, the text editor that powers VS Code. <textarea> elements, when not marked as readOnly, and contentEditable <div> elements have a built-in cursor that allows screen reader users to navigate code in their preferred manner using their built-in screen reader settings. While a contentEditable <div> would support syntax highlighting and the deep interactions we needed, screen readers don’t support them well,1 which defeats the purpose of the cursor. As a result, we decided to go with the <textarea>. However, <textarea> elements do not support syntax highlighting or deeper interactions like selecting symbols, which meant we needed to use both the <textarea> as a hidden element and syntax-highlighted elements aligned perfectly above it. Since we hide the text element, we need to add a visual “fake” cursor and keep it in sync with the “real” <textarea> cursor.

While the <textarea> and cursor help with our goals of allowing screen reader users and keyboard users to navigate code and symbols easily, they also ended up helping us with some of the problems we had run into with virtualization. One example of this was the cmd + f problem that we talk about more in depth in the blog post here. Another problem this solved was the drag and select behavior (or select all) for long files. Since the <textarea> is just one DOM node, we are able to load the whole file contents and select the contents directly from the <textarea> instead of the virtualized syntax highlighted version.

Unfortunately, while the <textarea> solved many problems we had, it also introduced a few other tricky ones. Since we have two layers of text, one hidden unless selected and one visual, we need to make sure that the scroll states are aligned. To do this, we have written observers to watch when one scrolls and mimic the scroll on the other. In addition, we often need to override the default <textarea> behaviors for some events—such as the middle click on mouse which was taking users all the way to the bottom of the code. Along with that issue, different browsers handle <textarea>s differently and making sure our solution behaves properly on all browsers has proven to be time intensive to say the least. In addition, we found that some browsers, like Firefox, allow users to customize their font size using Text zoom, which would apply to the formatted text but not the <textarea>. This led to “ghost text” issues with selection. We were able to resolve that by measuring the height of the text that is rendered and passing that to the <textarea>, though there are still some issues with certain plugins that modify text. We are still working to resolve these specific issues as well. In addition, the <textarea> currently does not work with our Wrap lines view option, which we are working to fix. Thanks especially to @joycezhu, @andrialexandrou, @adamshwert, and @jbrown1618 who put in a ton of work here.

Always iterating

We have taken huge strides to improve accessibility for all developers, but we recognize that we aren’t all the way there. It can be challenging to accommodate all developers, but we are committed to improving and iterating on what we have until we get it right. We want our tools to support everyone to access and explore code. We still have work—a lot of work—to do across GitHub, but we are excited for this journey.

To learn more about accessibility at GitHub, go to accessibility.github.com.


  1. For more information about contentEditable and screen readers, see this article written by @joycezhu from our Accessibility Engineering team.