Crafting a better, faster code view

Post Syndicated from Joshua Brown original https://github.blog/2023-06-21-crafting-a-better-faster-code-view/

Reading code is not as simple as reading the text of a file end-to-end. It is a non-linear, sometimes chaotic process of jumping between files to follow a trail, building a mental picture of how code relates to its surrounding context. GitHub’s mission is to be the home for all developers, and reading code is one of the core experiences we offer. Every day, millions of users use GitHub to view and interact with code. So, about a year ago we set out to create a new code view that supports the entire code reading experience with features like a file tree, symbol navigation, code search integration, sticky lines, and code section folding. The new code view is powerful, intelligent, and interactive, but it is not an attempt to turn the repository browsing experience into an IDE.

While building the new code view, our team had a few guiding principles on which we refused to compromise:

  • It must add powerful new features that transform how users read code on GitHub.
  • It must be intuitive and easy to use for all of GitHub’s millions of users.
  • It must be fast.

Initial efforts

The first step was to build out the features we wanted in a natural, straightforward way, taking the advice that “premature optimization is the root of all evil.”1 After all, if simple code satisfactorily solves our problems, then we should stop there. We knew we wanted to build a highly interactive and stateful code viewing experience, so we decided to use React to enable us to iterate more quickly on the user interface. Our initial implementation for the code blob was dead-simple: our syntax highlighting service converted the raw file contents to a list of HTML strings corresponding to the lines of the file, and each of these lines was added to the document.

There was one key problem: our performance scaled badly with the number of lines in the file. In particular, our LCP and TTI times measurably increased at around 500 lines, and this increase became noticeable at around 2,000 lines. Around those same thresholds, interactions like highlighting a line or collapsing a code section became similarly sluggish. We take these performance metrics seriously for a number of reasons. Most importantly, they are user-centric—that is, they are meant to measure aspects of the quality of a user’s experience on the page. On top of that, they are also part of how search engines like Google determine where to rank pages in their search results; fast pages get shown first, and the code view is one of the many ways GitHub’s users can show their work to the world.

As we dug in, we discovered that there were a few things at play:

  • When there are many DOM nodes on the page, style calculations and paints take longer.
  • When there are many DOM nodes on the page, DOM queries take longer, and the results can have a significant memory footprint.
  • When there are many React nodes on the page, renders and DOM reconciliation both take longer.

It’s worth noting that none of these are problems with React specifically; any page with a very large DOM would experience the first two problems, and any solution where a large DOM is created and managed by JavaScript would experience the third.

We mitigated these problems considerably by ensuring that we were not running these expensive operations more than necessary. Typical React optimization techniques like memoization and debouncing user input, as well as some less common solutions like pulling in an observer pattern, went a long way toward ensuring that React state updates, and therefore DOM updates, occurred only as needed.

Mitigating the problem, however, is not solving the problem. Even with all of these optimizations in place, the initial render of the page remained a fundamentally expensive operation for large files. In the repository that builds GitHub.com, for example, we have a CODEOWNERS file that is about 18,000 lines long, and pushes the 2MB size limit for displaying files in the UI. With no optimizations besides the ones described above, React’s first pass at building the DOM for this page takes nearly 27 seconds.2 Considering more than half of users will abandon a page if it loads for more than three seconds, there was obviously lots of work left to do.

A promising but incomplete solution

Enter virtualization. Virtualization is a performance optimization technique that examines the scroll state of the page to determine what content to include in the DOM. For example, if we are viewing a 10,000 line file but only about 75 lines fit on the screen at a time, we can save lots of time by only rendering the lines that fit in the viewport. As the user scrolls, we add any lines that need to appear, and remove any lines that can disappear, as illustrated by this demo.3
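The core of this technique is simple windowing arithmetic. Here is a minimal sketch (in Python for brevity), assuming a fixed line height; the function name and parameters are illustrative, not GitHub's actual implementation:

```python
# A minimal sketch of the windowing arithmetic behind virtualization,
# assuming every line has the same fixed pixel height.

def visible_window(scroll_top, viewport_height, line_height, total_lines, overscan=10):
    """Return the (start, end) range of line indices to render.

    `overscan` renders a few extra lines above and below the viewport
    so fast scrolling doesn't briefly reveal blank space.
    """
    first = max(0, scroll_top // line_height - overscan)
    last = min(total_lines, (scroll_top + viewport_height) // line_height + 1 + overscan)
    return first, last

# For a 10,000-line file with 20px lines and a 1,500px viewport,
# only about 95 lines are in the DOM at any moment instead of 10,000.
start, end = visible_window(scroll_top=40_000, viewport_height=1_500,
                            line_height=20, total_lines=10_000)
```

On every scroll event, the window is recomputed and only the lines whose indices changed are added to or removed from the DOM.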

This satisfies the most basic requirements of the page with flying colors. It loads on average more quickly than the existing experience, and the experience of scrolling through the file is nearly indistinguishable from the non-virtualized case. Remember that 27 second initial render? Virtualizing the file content gets that time down to under a second, and that number does not increase substantially even if we artificially remove our file size limit and pull in hundreds of megabytes of text.

Unfortunately, virtualization is not a cure-all. While our initial implementation added features to the page at the expense of performance, naïvely virtualizing the code lines delivers a fast experience at the expense of vital functionality. The biggest problem was that without the entire text of the file on the page at once, the browser’s built-in find-in-file only surfaced results that are visible in the viewport. Breaking users’ ability to find text on the page breaks our hard requirement that the page remain intuitive and easy to use. Before we could ship any of this to real users, we had to ensure that this use case would be covered.

The immediate solution was to implement our own version of find-in-file by implementing a custom handler for the Ctrl+F shortcut (⌘+F on Mac). We added a new piece of UI in the sidebar to show results as part of our integration with symbol navigation and code search.

Screenshot of the "find" sidebar, showing a search bar with the term "isUn" and a list of five lines of code from the current file that contain that string, the second of which is highlighted as selected.

There is precedent for overriding this native browser feature to allow users to find text in virtualized code lines. Monaco, the text editor behind VS Code, does exactly this to solve the same problem, as do many other online code editors, including Repl.it and CodePen. Some other editors like the official Ruby playground ignore the problem altogether and accept that Ctrl+F will be partially broken within their virtualized editor.

At the time, we felt confident leaning on this precedent. These examples are applications that run in a browser window, and as users, we expect applications to implement their own controls. Writing our own way to find text on the page was a step toward making GitHub’s Code View less of a web page and more of a web application.

When we released the new code view experience as a private beta at GitHub Universe, we received clear feedback that our users think of GitHub as a page, not as an app. We tried to rework the experience to be as similar as possible to the native implementation, both in terms of user experience and performance. But ultimately, there are plenty of good reasons not to override this kind of native browser behavior.

  • Users of assistive technologies often use Ctrl+F to locate elements on a page, so restricting the scope to the contents of the file broke these workflows.
  • Users rely heavily on specific muscle memory for common actions, and we followed a deep rabbit hole to get the custom control to support all of the shortcuts used by various browsers.
  • Finally, the native browser implementation is simply faster.

Despite plenty of precedent for an overridden find experience, this user feedback drove us to dig deeper into how we could lean on the browser for something it already does well.

Virtualization has an important role to play in our final product, but it is only one piece of the puzzle.

How the pieces fit together

Our complete solution for the code view features two pieces:

  1. A textarea that contains the entire text of the raw file. The contents are accessible, keyboard-navigable, copyable, and findable, yet invisible.
  2. A virtualized, syntax-highlighted overlay. The contents are visible, yet hidden from both mouse events and the browser’s find.

Together, these pieces deliver a code view that supports the complete code reading experience with many new features. Despite the added complexity, this new experience is faster to render than the static HTML page that has displayed code on GitHub for more than a decade.

A textarea and a read-only cursor

The first half of this solution came to us from an unexpected angle.

Beyond adding functionality to the code view, we wanted to improve the code reading experience for users of assistive technologies like screen readers. The previous code view was minimally accessible; a code document was displayed as a table, which created a very surprising experience for screen reader users. A code document is not a table, but likewise it is not a paragraph of text. To support a familiar interface for interacting with the code on the page, we added an invisible textarea underneath the virtualized, syntax-highlighted code lines so that users can move through the code with the keyboard in a familiar way. And for the browser, rendering a textarea is much simpler than using JavaScript to insert syntax-highlighted HTML. Browsers can render megabytes of text in a textarea with ease.

Since this textarea contains the entire text of the raw file, it is not just an accessibility feature, but an opportunity to remove our custom implementation of Ctrl+F in favor of native browser implementations.

Hiding text from Ctrl+F

With the addition of the textarea, we now have two copies of every line that is visible in the viewport: one in the textarea, and another in the virtualized, syntax-highlighted overlay. In this state, searching for text yields duplicated results, which is more confusing than a slow or unfamiliar experience.

The question, then, is how to expose only one copy of the text to the browser’s native Ctrl+F. That brings us to the next key part of our solution: how we hid the syntax-highlighted overlay from find.

For a code snippet like this line of Python:

print("Hello!")

the old code view created a bit of HTML that looks like this:

<span class="pl-en">print</span>(<span class="pl-s">"Hello!"</span>)

But the text nodes containing print, (, "Hello!", and ) are all findable. It took two iterations to arrive at a format that looks identical but is consistently hidden from Ctrl+F in all major browsers. And as it turns out, this is not an easy question to research!

The first approach we tried relied on the fact that :before pseudoelements are not part of the DOM, and therefore do not appear in find results. With a bit of a change to our HTML format that moves all text into a data- attribute, we can use CSS to inject the code text into the page without any findable text nodes.

HTML

<span class="pl-en" data-code-text="print"></span>
<span data-code-text="("></span>
<span class="pl-s" data-code-text="&quot;Hello!&quot;"></span>
<span data-code-text=")"></span>

CSS

[data-code-text]:before {
   content: attr(data-code-text);
}
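A sketch of how a token might be serialized into this format follows; the helper name is hypothetical, but it highlights the one subtlety the markup above depends on: quotes in the code text must be escaped so they can live inside an HTML attribute.

```python
# Sketch of emitting the data-attribute markup for one token.
# The function name is illustrative, not GitHub's actual code.
from html import escape

def token_to_span(text, css_class=None):
    attr = escape(text, quote=True)  # " becomes &quot;
    cls = f' class="{css_class}"' if css_class else ""
    return f'<span{cls} data-code-text="{attr}"></span>'

# token_to_span('"Hello!"', "pl-s") yields:
# <span class="pl-s" data-code-text="&quot;Hello!&quot;"></span>
```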

But that’s not the end of the story, because the major browsers do not agree on whether text in :before pseudoelements should be findable; Firefox in particular has a powerful Ctrl+F implementation that is not fooled by our first trick.

Our second attempt relied on a fact on which all browsers seem to agree: that text in adjacent pseudoelements is not treated as a contiguous block of text.4 So, even though Firefox would find print in the first example, it would not find print(. The solution, then, is to break up the text character-by-character:

<span class="pl-en">
   <span data-code-text="p"></span>
   <span data-code-text="r"></span>
   <span data-code-text="i"></span>
   <span data-code-text="n"></span>
   <span data-code-text="t"></span>
</span>
<span data-code-text="("></span>
<span class="pl-s">
   <span data-code-text="&quot;"></span>
   <span data-code-text="H"></span>
   <span data-code-text="e"></span>
   <span data-code-text="l"></span>
   <span data-code-text="l"></span>
   <span data-code-text="o"></span>
   <span data-code-text="!"></span>
   <span data-code-text="&quot;"></span>
</span>
<span data-code-text=")"></span>

At first glance, this might seem to complicate the DOM so much that it outweighs the performance gains for which we worked so hard. But since these lines are virtualized, we create this overlay for at most a few hundred lines at a time.

Syntax highlighting in a compact format

The path we took to build a faster code view with more features was, like the path one might follow when reading code in a new repository, highly non-linear. Performance optimizations led us to fix behaviors which were not quite right, and those behavior fixes led us to need further performance optimizations. Knowing how we wanted the HTML for the syntax-highlighted overlay to look, we had a few options for how to make it happen. After a number of experiments, we completed our puzzle with a performance optimization that ended this cycle without causing any behavior changes.

Our syntax-highlighting service previously gave us a list of HTML strings, one for each line of code:

[
   "<span class=\"pl-en\">print</span>(<span class=\"pl-s\">\"Hello!\"</span>)"
]

To let us display the code in whatever format we need, we introduced a new format that simply gives the locations and CSS classes of the highlighted segments:

[
   [
       {"start": 0, "end": 5, "cssClass": "pl-en"},
       {"start": 6, "end": 14, "cssClass": "pl-s"}
   ]
]

From here, we can easily generate whatever HTML we want. And that brings us to our final optimization: within our syntax-highlighted overlay, we save React the trouble of managing the code lines by generating the HTML strings ourselves. This can deliver a surprisingly large performance boost in certain cases, like scrolling all the way through the 18,000-line CODEOWNERS file mentioned earlier. With React managing the entire DOM, we hit the “end” key to move all the way to the end of the file, and it takes the browser 870 milliseconds to finish handling the “keyup” event, followed by 3,700 milliseconds of JavaScript blocking the main thread. When we generate the code lines as HTML strings, handling the “keyup” event takes only 80 milliseconds, followed by about 700 milliseconds of blocking JavaScript.5
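To illustrate how the compact segment format and the per-character overlay fit together, here is a hedged sketch (in Python for brevity) of generating one line's overlay HTML; the function names are hypothetical and the real implementation is more involved:

```python
# Sketch: render one code line as per-character overlay markup from the
# compact segment format. Illustrative only, not GitHub's actual code.
from html import escape

def line_to_overlay_html(line, segments):
    """`segments` is a list of {"start", "end", "cssClass"} dicts.

    Characters outside any segment get a bare span; characters inside a
    segment are wrapped in a span carrying the segment's CSS class.
    """
    def char_span(ch):
        return f'<span data-code-text="{escape(ch, quote=True)}"></span>'

    out, i = [], 0
    for seg in segments:
        out.extend(char_span(c) for c in line[i:seg["start"]])  # unhighlighted gap
        chars = "".join(char_span(c) for c in line[seg["start"]:seg["end"]])
        out.append(f'<span class="{seg["cssClass"]}">{chars}</span>')
        i = seg["end"]
    out.extend(char_span(c) for c in line[i:])  # trailing unhighlighted text
    return "".join(out)

html = line_to_overlay_html('print("Hello!")',
                            [{"start": 0, "end": 5, "cssClass": "pl-en"},
                             {"start": 6, "end": 14, "cssClass": "pl-s"}])
```

Because the output is a plain string, it can be assigned to the overlay in one step rather than reconciled node by node.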

In summary

GitHub’s mission is to be the home for all developers. Developers spend a substantial amount of their time reading code, and reading code is hard! We spent the past year building a new code view that supports the entire code reading experience because we are passionate about bringing great tools to the developers of the world to make their lives a bit easier.

After a lot of difficult work, we have created a code view that introduces tons of new features for understanding code in context, and those features can be used by anyone. And we did it all while also making the page faster!

We’re proud of what we built, and we would love for everyone to try it out and send us feedback!

Notes


  1. This quote, popularized by and often attributed to Donald Knuth, was first said by Sir Tony Hoare, the developer of the quicksort algorithm. 
  2. All performance metrics generated for this post use a development build of React in order to better compare apples to apples. 
  3. Check out the source code for this virtualization demo here! 
  4. The fact that browsers do not treat adjacent :before elements as part of the same block of text also introduces another complication: it resets the tab stop location for each node, which means that tabs are not rendered with the correct width! We need the syntax-highlighted overlay to align exactly with the text content underneath because any discrepancy creates a highly confusing user experience. Luckily, since the overlay is neither findable nor copyable, we can modify it however we like. The tab width problem is solved neatly by converting tabs to the appropriate number of spaces in the overlay. 
  5. Although code on GitHub is often nested deeply, the syntax information for a line of code can still be described linearly much of the time—we have a keyword followed by some plain text and then a string literal, etc. But sometimes it is not so simple—we might have a Markdown document with a code section. That code section might be an HTML document with a script tag. That script tag might contain JavaScript. That JavaScript might contain doc comments on a function. Those doc comments might contain @param tags which are rendered as keywords. We can handle this kind of arbitrarily nested syntax tree with a recursive React component. But that means the shape of our tree of React nodes, and therefore the amount of time it takes to perform DOM reconciliation, is determined by the code our users have chosen to write. On top of that, React adds DOM nodes one-at-a-time, and our overlay uses one DOM node per character of code. These are the main reasons that sidestepping React for this part of the page gives us such a dramatic performance boost. 

Bringing code navigation to communities

Post Syndicated from Patrick Thomson original https://github.blog/2022-04-29-bringing-code-navigation-to-communities/

We’re proud to announce the availability of search-based code navigation for the Elixir programming language. Yet this is more than just new functionality. It’s the first example of a language community writing and submitting their own code for search-based code navigation.

Since GitHub introduced code navigation, we’ve received enthusiastic feedback from users. However, we hear one question over and over again: “When will my favorite language be supported?” We believe that every programming language deserves best-in-class support in GitHub. However, GitHub hosts code written in thousands of different programming languages, and we at GitHub can commit to supporting only a small subset of these languages ourselves. To this end, we’ve been working hard to empower language communities to integrate with our code navigation systems. After all, nobody understands a programming language better than the people who build it!

Community contributions welcome!

Would you like to develop and contribute search-based code navigation for a language? If a Tree-sitter grammar exists for your language, you can do so using Tree-sitter’s tree query language. This language describes how our code navigation systems scan through syntax trees and extract the important parts of a declaration or reference. This process is known as tagging, and it’s integrated with the tree-sitter command-line tool so that you can run and test tag queries locally. When a user pushes new commits or creates a pull request, the GitHub code navigation systems use tag queries to extract jump-to-definition and find-all-references information. Note: If you’re interested in how search-based code navigation works, you can read a technical report that describes the architecture and evolution of GitHub’s implementation of code navigation.

How do I get started?

Complete documentation, including how to write unit tests for your tag queries, can be found here. For examples of tag queries, you can check out the Elixir tags implementation, or those written for Python or Ruby. Once you’ve implemented tag queries for a language, have written unit tests, and are satisfied with the output from tree-sitter tags, you can submit a request on the Code Search and Navigation Feedback discussion page in the GitHub feedback repository. The Code Navigation team will add this to their roadmap. When we’re confident in the quality of the yielded tags and the load on the back-end systems that handle code navigation, we can enable it in testing for a subset of contributors, eventually rolling it out to all GitHub users, on both private and public repositories.

Please note that search-based code navigation is distinct from GitHub code search. Code search provides a view across the entire corpus of code on GitHub, whereas search-based code navigation is a part of the experience of reading code within a single repository. Search-based code navigation is also distinct from our support for precise code navigation. However, we hope to empower language communities to inform and contribute to the development of and support for these and other features in the future.

We’re committed to working with language maintainers and contributors to keep these rules as useful and up-to-date as possible. Whether you’re a longtime language contributor or someone seeking to enter the community for the first time, we encourage you to take a look at adding support for search-based code navigation, and join the community on the GitHub Discussions page.

Introducing stack graphs

Post Syndicated from Douglas Creager original https://github.blog/2021-12-09-introducing-stack-graphs/

Today, we announced the general availability of precise code navigation for all public and private Python repositories on GitHub.com. Precise code navigation is powered by stack graphs, a new open source framework we’ve created that lets you define the name binding rules for a programming language using a declarative, domain-specific language (DSL). With stack graphs, we can generate code navigation data for a repository without requiring any configuration from the repository owner, and without tapping into a build process or other CI job. In this post, I’ll dig into how stack graphs work, and how they achieve these results.

(This post is a condensed version of a talk that I gave at Strange Loop in October 2021. Please check out the video of that talk if you’d like to learn even more!)

What is code navigation?

Code navigation is a family of features that let you explore the relationships in your code and its dependencies at a deep level. The most basic code navigation features are “jump to definition” and “find all references.” Both build on the fact that names are pervasive in the code that we write. Programming languages let us define things — functions, classes, modules, methods, variables, and more. Those things have names so that we can refer back to them in other parts of our code.

A picture (even a simple one) is worth a thousand words:

a simple Python module

In this Python module, the reference to broil at the end of the file refers to the function definition earlier in the file. (Throughout this post, I’ll highlight definitions in red and references in blue.)

Our goal, then, is to collect information about the lists of definitions and references, and to be able to determine which definitions each reference maps to, for all of the code hosted on GitHub.

Why is this hard?

In the above example, the definition and reference were close to each other, and it was easy to visually see the relationship between them. But it won’t always be that easy!

names can shadow each other in Python but not in Rust

For instance, what if there are multiple definitions with the same name? In Python, names can shadow each other, which means that the broil reference should refer to the latter of the two definitions.

But these rules are language-specific! In Rust, top-level definitions are not allowed to shadow each other, but local variables are. So, this transliteration of my example from Python to Rust is an error according to the Rust language spec. If we were writing a Rust compiler, we would want to surface this error for the programmer to fix. But what about for an exploration feature like code navigation? We might want to show some result even for erroneous code. We’re only human, after all!
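The Python side of this difference is easy to see in a runnable example: the later top-level definition silently replaces the earlier one, with no error.

```python
# Python allows top-level definitions to shadow each other.

def broil():
    return "first definition"

def broil():  # shadows the definition above; no error is raised
    return "second definition"

# A reference to `broil` resolves to the latter definition.
print(broil())  # prints "second definition"
```

The equivalent Rust program, with two top-level `fn broil()` items, is rejected by the compiler as a duplicate definition.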

code can live in multiple packages

Up to now, I’ve only shown you examples consisting of a single file. But when was the last time you worked on a software project consisting of a single file? It’s much more likely that your code will be split across multiple files, multiple packages, and multiple repositories. Programming languages give us the ability to refer to definitions that might be quite far away. But as you might expect, the rules for how you refer to things in other files are different for different languages.

In the above example, I’ve split everything up into three files living in two separate packages or repositories. (I’m using emoji to represent the package names.) In Python, import statements let us refer to names defined in other modules, and the name of a module is determined by the name of the file containing its code. Together, this lets us see that the broil reference in chef.py in the “chef” package refers to the broil definition in stove.py in the “frying pan” package.

code can change

Code changes and evolves over time. What happens when one of your dependencies changes the implementation of a function that you’re calling? Here, the maintainers of the “frying pan” package have added some logging to the broil function. As a result, the broil reference in chef.py now refers to a different definition. Insidiously, it was an intermediate file that changed — not the file containing the reference, nor the file containing the original definition! If we’re not careful, we’ll have to reanalyze every file in the repository, and in all its dependencies, whenever any file changes! This makes the amount of work we must do quadratic in the number of changed files, rather than linear, which is especially problematic at GitHub’s scale.

Our last difficulty is one of scale. As mentioned above, we want to provide this feature for all of the code hosted on GitHub. Moreover, we don’t want to require any manual configuration on the part of each repository owner. You shouldn’t have to figure out how to produce code navigation data for your language and project, or have to configure a CI build to generate that data. Code navigation should Just Work.

At GitHub’s scale, this poses two problems. The first is the sheer amount of code that comes in every minute of every day. In each commit that we receive, it’s very likely that only a small number of files have been modified. We must be able to rely on incremental processing and storage, reusing the results that we’ve already calculated and saved for the files that haven’t changed.

The second challenge is the number of programming languages that we need to (eventually) support. GitHub hosts code written in every programming language imaginable. Git itself doesn’t care what language you use for your project — to Git, everything is just bytes. But for a feature like code navigation, where the name binding rules are different for each language, we must know how to parse and interpret the content of those files. To support this at scale, it must be as easy as possible for GitHub engineers and external language communities to describe the name binding rules for a language.

To summarize:

  • Different languages have different name binding rules.
  • Some of those rules can be quite complex.
  • The result might depend on intermediate files.
  • We don’t want to require manual per-repository configuration.
  • We need incremental processing to handle our scale.

Stack graphs

After examining the problem space, we created stack graphs to tackle these challenges, based on the scope graphs framework from Eelco Visser’s research group at TU Delft. Below I’ll discuss what stack graphs are and how they work.

Because we must rely on incremental results, it’s important that at index time (that is, when we receive pushes containing new commits), we look at each file completely in isolation. Our goal is to extract “facts” about each file that describe the definitions and references in the file, and all possible things that each reference could resolve to.

For instance, consider this example:

two Python files

Our final result must be able to encode the fact that the broil reference and definition live in different files. But to be incremental, our analysis must look at each file separately. I’m going to step into each file to show you what information GitHub can extract in isolation.

the stack graph for stove.py

Looking first at stove.py, we can see that it contains a definition of broil. From the name of the file, we know that this definition lives in a module called stove, giving a fully qualified name of stove.broil. We can create a graph structure representing this fact (along with information about the other symbols in the file). Each definition (including the module itself) gets a red, double-bordered definition node. The other nodes, and the pattern of how we’ve connected these nodes with edges, define the scoping and shadowing rules for these symbols.

the stack graph for kitchen.py

We can do the same thing for kitchen.py. The broil reference is represented by a blue, single-bordered reference node. The import statement also appears in the graph, as a gadget of nodes involving the broil and stove symbols.

Because we are looking at this file in isolation, we don’t yet know what the broil reference resolves to. The import statement means that it might resolve to stove.broil, defined in some other file — but that depends on whether there is a file defining that symbol. This example does in fact contain such a file (we just looked at it!), but we must ignore that while extracting incremental facts about kitchen.py.

At query time, however, we’re able to bring together the data from all files in the commit that you’re looking at. We can load the graphs for each of the files, producing a single “merged” graph for the entire commit:

the merged stack graph

Within this merged graph, every valid name binding is represented by a path from a reference node to a definition node.

However, not every path in the graph represents a valid name binding! For instance, looking only at the graph structure, there are perfectly fine paths from the broil reference node to the saute and bake definition nodes. To rule out those paths, we also maintain a symbol stack while searching for paths. Each blue node pushes a symbol onto the stack, and each red node pops a symbol from the stack. Importantly, we are not allowed to move into a “pop” node if its symbol does not match the top of the stack.

We’ve shown the contents of the symbol stack at a handful of places in the path that’s highlighted above. Most importantly, when we reach the portion of the graph containing the saute, broil, and bake definition nodes, the symbol stack contains ⟨broil⟩, ensuring that the only valid path that we discover is the one that ends at the broil definition.

We can also use different graph structures to handle my other examples. For example:

the stack graph for shadowed Python definitions

In this graph, we annotate some of the graph edges with a precedence value. Paths that include edges with a higher precedence value are preferred over those with lower precedences. This lets us correctly handle Python’s shadowing behavior.

For other programming languages, which don’t implement the same shadowing behavior as Python, we’d use a different pattern of edges to connect everything. For instance, the stack graph for my Rust example from earlier would be:

[Figure: the stack graph for conflicting Rust definitions]

To model Rust’s rule that top-level definitions with the same name are conflicts, we have a single node that all definitions hang off of. We can use precedences to choose whether to show all conflicting definitions (by giving them all the same precedence value), or just the first one (by assigning precedences sequentially).

With a stack graph available to us, we can implement “jump to definition”:

  1. The user clicks on a reference.
  2. We load in the stack graphs for each file in the commit, and merge them together.
  3. We perform a path-finding search starting from the reference node corresponding to the symbol that the user clicked on, considering symbol stacks and precedences to ensure that we don’t create any invalid paths.
  4. Any valid paths that we find represent the definitions that the reference refers to. We display those in a hover card.
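The merge in step 2 can be sketched simply. This is an illustration only, with an invented storage shape: each file's graph is a `(nodes, edges)` pair, and cross-file paths connect through nodes (such as a shared root) that appear in several files' graphs under the same identity.

```python
# A hedged sketch of merging per-file stack graphs for a commit: union
# the node and edge sets, letting shared nodes (e.g. "root") coincide.
def merge_graphs(per_file_graphs):
    nodes, edges = {}, set()
    for file_nodes, file_edges in per_file_graphs.values():
        nodes.update(file_nodes)
        edges.update(file_edges)
    return nodes, edges

kitchen = ({"root": "scope", "broil.ref": "push"}, {("broil.ref", "root")})
stove   = ({"root": "scope", "broil.def": "pop"},  {("root", "broil.def")})
nodes, edges = merge_graphs({"kitchen.py": kitchen, "stove.py": stove})
# The merged graph now contains a root-mediated path
# from broil.ref in kitchen.py to broil.def in stove.py.
```

Because the per-file graphs never reference each other directly, any subset of files can be re-indexed without touching the rest; the merge happens only at query time.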

Creating stack graphs using Tree-sitter

I’ve described how to use stack graphs to perform code navigation lookups, but I haven’t mentioned how to create stack graphs from the source code that you push to GitHub.

For that, we turned to Tree-sitter, an open source parsing framework. The Tree-sitter community has already written parsers for a wide variety of programming languages, and we already use Tree-sitter in many places across GitHub. This makes it a natural choice to build stack graphs on.

Tree-sitter’s parsers already let us efficiently parse the code that our users upload. For instance, the Tree-sitter parser for Python produces a concrete syntax tree (CST) for our stove.py example file:

$ tree-sitter parse stove.py
(module [0, 0] - [10, 0]
  (function_definition [0, 0] - [1, 8]
    name: (identifier [0, 4] - [0, 8])
    parameters: (parameters [0, 8] - [0, 10])
    body: (block [1, 4] - [1, 8]
      (pass_statement [1, 4] - [1, 8])))
  (function_definition [3, 0] - [4, 8]
    name: (identifier [3, 4] - [3, 9])
    parameters: (parameters [3, 9] - [3, 11])
    body: (block [4, 4] - [4, 8]
      (pass_statement [4, 4] - [4, 8])))
  (function_definition [6, 0] - [7, 8]
    name: (identifier [6, 4] - [6, 9])
    parameters: (parameters [6, 9] - [6, 11])
    body: (block [7, 4] - [7, 8]
      (pass_statement [7, 4] - [7, 8]))))

Tree-sitter also provides a query language that lets us look for patterns within the CST:

(function_definition
  name: (identifier) @name) @function

This query would locate all three of our example method definitions, annotating each definition as a whole with a @function label and the name of each method with a @name label.
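What the query's captures give us can be mimicked with a plain tree walk. The nested-dict CST shape below is invented for illustration (the real Tree-sitter API exposes typed tree cursors); the point is simply that each `function_definition` node yields a `@function` match whose `name` field is the `@name` capture.

```python
# A hedged stand-in for the Tree-sitter query above: walk a CST and
# collect each function_definition's name child, i.e. the @name captures.
def find_function_names(node):
    names = []
    if node.get("type") == "function_definition":
        name_child = next(c for c in node["children"] if c.get("field") == "name")
        names.append(name_child["text"])
    for child in node.get("children", []):
        names.extend(find_function_names(child))
    return names

# A drastically simplified CST for stove.py:
cst = {"type": "module", "children": [
    {"type": "function_definition", "children": [
        {"type": "identifier", "field": "name", "text": name, "children": []}]}
    for name in ("saute", "broil", "bake")]}

find_function_names(cst)  # → ["saute", "broil", "bake"]
```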

As part of developing stack graphs, we’ve added a new graph construction language to Tree-sitter, which lets you construct arbitrary graph structures (including but not limited to stack graphs) from parsed CSTs. You use stanzas to define the gadget of graph nodes and edges that should be created for each occurrence of a Tree-sitter query, and how the newly created nodes and edges should connect to graph content that you’ve already created elsewhere. For instance, the following snippet would create the stack graph definition node for my example Python method definitions:

(function_definition
  name: (identifier) @name) @function
{
    node @function.def
    attr (@function.def) kind = "definition"
    attr (@function.def) symbol = @name
    edge @function.containing_scope -> @function.def
}

This approach lets us create stack graphs incrementally for each source file that we receive, while only having to analyze the source code content, and without having to invoke any language-specific tooling or build systems. (The only language-specific part is the set of graph construction rules for that language!)
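One consequence of only needing the file's content, sketched here with an invented storage layout: a blob's graph can be keyed by a hash of its content, so files that are unchanged between commits never need to be re-analyzed.

```python
import hashlib

# A hedged sketch of content-addressed caching for per-file graphs.
cache = {}

def index_file(content, build_graph):
    key = hashlib.sha1(content.encode()).hexdigest()
    if key not in cache:
        cache[key] = build_graph(content)  # only runs for new blobs
    return cache[key]

calls = []
def build_graph(content):
    calls.append(content)
    return {"nodes": [], "edges": []}

index_file("def broil(): pass", build_graph)
index_file("def broil(): pass", build_graph)  # cache hit: no re-analysis
len(calls)  # → 1
```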

But wait, there’s more!

This post is already quite long, and I’ve only scratched the surface. You might be wondering:

  • Performing a full path-finding search for every “jump to definition” query seems wasteful. Can we precalculate more information at index time while still being incremental?

  • All the examples we’ve shown are pretty trivial. Can we handle more complex examples?

    For instance, how about the following Python file, where we need to use dataflow to trace what particular value was passed in as a parameter to passthrough to correctly resolve the reference to one on the final line?

    def passthrough(x):
      return x
    
    class A:
      one = 1
    
    passthrough(A).one
    

    Or the following Java file, where we have to trace inheritance and generic type parameters to see that the reference to length should resolve to String.length from the Java standard library?

    import java.util.HashMap;
    
    class MyMap extends HashMap<String, String> {
      int firstLength() {
          return this.entrySet().iterator().next().getKey().length();
      }
    }
    
  • Why aren’t we using the Language Server Protocol (LSP) or Language Server Index Format (LSIF)?

To dig even deeper and learn more, I encourage you to check out my Strange Loop talk and the stack-graphs crate: our open source Rust implementation of these ideas. And in the meantime, keep navigating!